prometheus

tony2099 included in monitor micro

2021-07-24 684 words 4 minutes

Contents

know the basic knowledage of promethueus
create a basic server to monitor redis

1. what is ？

prometheus:

prom: 主要是存储采集的数据
exporter: 采集数据
grafana: 前端，展示数据

exporter:

采集和转换数据
不存储数据，只有 metrics 调用时候才采集和转换

Time Series Data

what:

相比mysql, 只有 2-3 列， key, value, timestamp
数据是基于timestamp 产生的
适用于监控

Prometheus time series data:

meteria, label: 相当于 key
value
timestamp

1
2


- `http_requests_total{method="GET", status="200"} 1023 @ 1670000000`
- `http_requests_total{method="POST", status="200"} 978 @ 1670000000`

promql

语法:

1

function(metric{lable})[time range ]

basic:

request_total: 瞬时值；当前最新的值；
request_total[5m]: 区间值， 5m 所有上报的值

聚合操作

sum
min
count

storage

can store in a remote storag;

config

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31


# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    metrics_path: /metrics
    scheme: http

    static_configs:
      - targets: ["localhost:9090"]

metadata;

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


{
  "discoveredLabels": {
      "__address__": "localhost:9090",
      "__metrics_path__": "/metrics",
      "__scheme__": "http",
      "job": "prometheus"
  },
  "labels": {
      "instance": "localhost:9090",
      "job": "prometheus"
  },
  "scrapePool": "prometheus",
  "scrapeUrl": "http://localhost:9090/metrics",
  "globalUrl": "http://f9cdf3cbf0ac:9090/metrics",
  "lastError": "out of bounds",
  "lastScrape": "2021-07-26T15:03:04.589192767Z",
  "lastScrapeDuration": 0.0173774,
  "health": "down"
}

get http:localhost:9090/metrics;

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


- job_name: baidu_http2xx_probe
  params:
    module:
    - http_2xx
    target:
    - baidu.com
  metrics_path: /probe
  static_configs:
  - targets:
    - 127.0.0.1:9115

get 127.0.0.1:9115/probe?params=xxxx;

targeType:
1. static
2. dynamic
  1. json file
  2. dicvoery center;

job_name:“ab” static_config:
- targets:[“localhost:300”]
job_name: ‘file_ds’ file_sd_configs:
- files:
  - targets.json

targets: the

pomethus

指标类型

在 pomethus 数据库里存储的是 metrics and value and timestamp;

api_request_count: 100; 2023-12-05 11:00:00

在客户端，可以分成几个metrics type

counter:
1. 只会增加；
2. 使用场景: 请求数
3. 服务重启后会归零

1
2


request_count  1 2023-12-05 11:00:00
request_count 1 2023-12-05 12:00:00

gauge:
1. 可以增加和减少;
2. 可能会少采集数据，需要设置合理间隔；
3. 使用场景: 日常的指标

1
2


temperuate  20 2023-12-05 11:00:00
temperature 15 2023-12-05 12:00:00

Histogram, Summary:

使用三个metric 来暴露一组 historygram, summay

1
2
3
4
5


prometheus_tsdb_compaction_chunk_range_bucket{le="25600"} 100

prometheus_tsdb_compaction_chunk_range_bucket{le="102400"} 200

prometheus_tsdb_compaction_chunk_range_bucket{le="409600"} 300

1
2

request_duration_seconds_bucket{le="0.5"} 2 2023-12-05 11:00:00 request_duration_seconds_bucket{le="1.0"} 5 2023-12-05 11:00:00

使用

increse the counter
alert if the error

pomethus

the total process:

alert –> router –> receiver.

alert rules

groups:

帮助组织rules, 使得更可读
不影响routing

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


groups:
	- name:
	. rules:
		- alert: ApiErrorCountIncreased
		  expr: increase(api_error_count[1m]) >= 10
		  for:
		  lables:	
			  servitiry:
		  annotation:

    - name:
	  rules:

lable: 分组和

annotation:

alert manager

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


route:
	group_by:
	group_wait:
	group_interval:

	routes:
	 - mathch:
		 severity:"eeror"
	  receiver:xx
     - match:
	

receivers:
- name: 
	

rules

alert managrer

route:

what: send alert to

alert first group by gropy by
use group rules to when to fire the notificaition

receiver:

define the.

group_wait: 30s group_interval: 5m repeat_interval: 1h

group_wait: 1ms: 等待同一组的其他通知1m，看是否有新的通知，有的话一起发出
group_interval: 5m：两个同组的通知至少间隔 5m
repeat_interval: 1h: 如果持续触发通知，则至少一小时后才再发一次

increaese(request_count) > 10[1m];

10:00, 触发， 10:01 等待没有新的同组通知：发出通知持续到 10:02结束； 10:04:00 新的触发，由于 < 5m，如果持续激活 >=5: 10:07 发出通知一直持续到 11:04：在发出通知

alert manager config

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


groups:
- name: example_alerts
  rules:
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency of job myjob
      description: This alert fires when the mean request latency of myjob is above 0.5s for 10 minutes.
      dashboard: http://example.com/dashboard/myjob

label: the metadata for a alert,

routing
grouping annotation: added data for notify, used in notificaiton

alert status:

pending: wait for time
firing：after pending, not actualy firing, may be wait for in group
resolved: the metric that cause the alert not exist

group by: