Installing a Prometheus + Grafana + Alertmanager monitoring stack
Prometheus is written in Go and provides client libraries for many languages.
Prometheus downloads: https://prometheus.io/download/

1. Install Prometheus
~]# wget https://github.com/prometheus/prometheus/releases/download/v2.37.6/prometheus-2.37.6.linux-amd64.tar.gz
~]# mkdir -p /approot1/prometheus/
~]# tar xf prometheus-2.37.6.linux-amd64.tar.gz -C /approot1/prometheus/
Edit the configuration file prometheus.yml:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.53.180:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "conf/rules/*.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'node-exporter'
    file_sd_configs:
      - files:
          - 'conf/json/node-exporter-*.json'
  - job_name: 'redis-exporter'
    file_sd_configs:
      - files:
          - 'conf/json/redis-exporter-*.json'
  - job_name: 'mysql-exporter'
    file_sd_configs:
      - files:
          - 'conf/json/mysql-exporter-*.json'
  - job_name: 'nginx-exporter'
    file_sd_configs:
      - files:
          - 'conf/json/nginx-exporter-*.json'
  - job_name: 'blackbox-exporter'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - files:
          - 'conf/json/blackbox-exporter-*.json'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.53.181:9006 # IP:port of the Blackbox Exporter
  - job_name: 'jmx-exporter'
    file_sd_configs:
      - files:
          - 'conf/json/jmx-exporter-*.json'
  - job_name: 'docker-exporter'
    file_sd_configs:
      - files:
          - 'conf/json/docker-exporter-*.json'
  - job_name: 'api-exporter'
    scrape_interval: 15s
    metrics_path: /actuator/prometheus
    file_sd_configs:
      - files:
          - 'conf/json/api-exporter-*.json'
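The three relabel_configs steps of the blackbox-exporter job are easy to misread, so here is a minimal Python sketch that mimics them on a plain dictionary. It is an illustration only, not how Prometheus implements relabeling; the function names and the sample target 192.168.53.181:3306 are assumptions made up for the example.

```python
from urllib.parse import urlencode


def apply_blackbox_relabel(labels, exporter_addr="192.168.53.181:9006"):
    """Mimic the three relabel steps of the blackbox-exporter job."""
    out = dict(labels)
    # Step 1: copy the discovered target address into the ?target= URL parameter.
    out["__param_target"] = out["__address__"]
    # Step 2: expose the probed target as the `instance` label.
    out["instance"] = out["__param_target"]
    # Step 3: send the actual scrape to the Blackbox Exporter itself.
    out["__address__"] = exporter_addr
    return out


def probe_url(labels, module="tcp_connect"):
    """Build the URL that ends up being scraped (metrics_path is /probe)."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"


# Hypothetical target taken from a blackbox-exporter-*.json file.
relabeled = apply_blackbox_relabel({"__address__": "192.168.53.181:3306"})
print(relabeled["instance"])
print(probe_url(relabeled))
```

The point of the sketch is that the address listed in the JSON file is never scraped directly: it becomes the probe target, while the scrape itself goes to the exporter.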
Alert rule files:
General alert rule templates:
https://awesome-prometheus-alerts.grep.to/rules#host-and-hardware
Kubernetes alert rule templates:
https://awesome-prometheus-alerts.grep.to/rules#kubernetes
[root@k8s-master1 prometheus-2.37.6.linux-amd64]# mkdir -p conf/{json,rules}
[root@k8s-master1 prometheus-2.37.6.linux-amd64]# cd conf/rules/
[root@k8s-master1 rules]# vim node-rule.yml
node-rule.yml
groups:
  - name: example
    rules:
      - alert: HighNginxServerRequests
        expr: sum(irate(nginx_server_requests{instance="181-nginx", code="2xx"}[5m])) by (code) > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "High Nginx Server Requests"
          description: "The Nginx 2xx request rate has stayed above 1000 requests/s for at least 2s"
  - name: 物理節點狀態-監控告警  # physical node status alerts
    rules:
      - alert: 物理節點cpu使用率  # physical node CPU usage
        expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100 > 10
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} CPU usage is too high"
          description: "CPU usage on {{ $labels.instance }} is above 10% (current: {{ $value }}); please investigate"
      - alert: 物理節點內存使用率  # physical node memory usage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 20
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} memory usage is too high"
          description: "Memory usage on {{ $labels.instance }} is above 20% (current: {{ $value }}); please investigate"
      - alert: InstanceDown
        expr: up == 0
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: server is down"
          description: "{{ $labels.instance }}: the target has been unreachable for more than 2 seconds"
      - alert: 物理節點磁盤的IO性能  # physical node disk IO
        # Fires on busy time above 60% so that the expression matches the description.
        expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100 > 60
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} disk IO utilization is too high!"
          description: "{{ $labels.instance }} disk IO utilization is above 60% (current: {{ $value }})"
      - alert: 入網流量帶寬  # inbound network bandwidth
        expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} inbound network bandwidth is too high!"
          description: "{{ $labels.instance }} inbound traffic averaged over 5m exceeds 10240000 bytes/s (about 82 Mbit/s). RX value (bytes/s divided by 100): {{ $value }}"
      - alert: 出網流量帶寬  # outbound network bandwidth
        expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} outbound network bandwidth is too high!"
          description: "{{ $labels.instance }} outbound traffic averaged over 5m exceeds 10240000 bytes/s (about 82 Mbit/s). TX value (bytes/s divided by 100): {{ $value }}"
      - alert: TCP會話  # TCP sessions
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has too many established TCP connections!"
          description: "{{ $labels.instance }} has more than 1000 established TCP connections (current: {{ $value }})"
      - alert: 磁盤容量  # disk capacity
        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} disk partition usage is too high!"
          description: "{{ $labels.mountpoint }} disk partition usage is above 80% (current: {{ $value }}%)"
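Two of the expressions above hide a little arithmetic: the network rules divide the byte rate by 100 before comparing it with 102400, and the disk rule turns free and total bytes into a used percentage. The Python sketch below only reworks that arithmetic so the thresholds are easier to reason about; the function names are made up for the example and nothing here talks to Prometheus.

```python
def network_threshold_bytes_per_sec(divisor=100, limit=102400):
    """(rate / divisor) > limit is equivalent to rate > limit * divisor."""
    return limit * divisor


def to_mbit_per_sec(bytes_per_sec):
    """Convert a byte rate to megabits per second (1 Mbit = 1,000,000 bits)."""
    return bytes_per_sec * 8 / 1_000_000


def disk_used_percent(free_bytes, size_bytes):
    """Same formula as the disk capacity rule: 100 - free / size * 100."""
    return 100 - free_bytes / size_bytes * 100


threshold = network_threshold_bytes_per_sec()
print(threshold, "bytes/s")
print(to_mbit_per_sec(threshold), "Mbit/s")
# A hypothetical 100 GiB filesystem with 15 GiB free.
print(disk_used_percent(15 * 2**30, 100 * 2**30), "% used")
```

So the network rules fire at a sustained rate of 10240000 bytes/s, which is somewhat below a full 100 Mbit/s link; adjust the 102400 constant if a different limit is intended.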
Create the systemd unit file for Prometheus and start the service:
/usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
ExecStart=/approot1/prometheus/prometheus-2.37.6.linux-amd64/prometheus \
--config.file=/approot1/prometheus/prometheus-2.37.6.linux-amd64/prometheus.yml \
--web.listen-address=:9090 \
--web.enable-lifecycle \
--storage.tsdb.path=/approot1/prometheus/prometheus-2.37.6.linux-amd64/data \
--storage.tsdb.retention.time=15d
Restart=always
[Install]
WantedBy=multi-user.target
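# To hot-reload the Prometheus configuration, the --web.enable-lifecycle flag must be set; it enables the lifecycle API, without which the /-/reload endpoint cannot trigger a reload
Reload the systemd configuration
# systemctl daemon-reload
Start the prometheus service
# systemctl start prometheus
Enable start on boot
# systemctl enable prometheus.service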
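Once the service is running, the management endpoints can confirm that it is up and trigger a reload. The following is a small Python sketch using only the standard library; it assumes Prometheus listens on 192.168.53.180:9090 (the address used for the reload call later in this post) and that --web.enable-lifecycle is set. The helper names are invented for the example.

```python
import urllib.request

PROMETHEUS = "http://192.168.53.180:9090"  # assumed address of the Prometheus server


def management_url(base, action):
    """Return the URL of a Prometheus management endpoint such as /-/ready."""
    if action not in ("healthy", "ready", "reload"):
        raise ValueError(f"unsupported action: {action}")
    return f"{base.rstrip('/')}/-/{action}"


def is_ready(base=PROMETHEUS, timeout=5):
    """True if /-/ready answers with HTTP 200, False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(management_url(base, "ready"), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def reload_config(base=PROMETHEUS, timeout=5):
    """POST to /-/reload; this only works when --web.enable-lifecycle is set."""
    req = urllib.request.Request(management_url(base, "reload"), data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status == 200
```

Calling is_ready() right after systemctl start is a quick way to tell a failed start from a slow one before opening the web UI.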

2. Install node_exporter (the default system metrics) on each monitored host; install other exporters as your monitoring needs require
Download packages from the official site: https://prometheus.io/download/
Install node_exporter and start it
[root@k8s-node1 ~]# mkdir exporter
[root@k8s-node1 ~]# cd exporter
[root@k8s-node1 exporter]# tar xvf node_exporter-1.1.2.linux-amd64.tar.gz
[root@k8s-node1 exporter]# cd node_exporter-1.1.2.linux-amd64/
[root@k8s-node1 node_exporter-1.1.2.linux-amd64]# ./node_exporter --web.listen-address=:9004 &
# Start command explained
./node_exporter #start node_exporter
--web.listen-address=:9004 #port Prometheus scrapes node_exporter on; set explicitly here (the default is 9100)
After starting it, run ps -ef | grep node_exporter to confirm the process exists
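Besides checking the process, it can help to confirm that the exporter actually serves metrics on port 9004. Below is a minimal Python sketch that fetches /metrics and parses the text format into a dictionary. It is a simplification: it assumes each sample line ends with the value and carries no trailing timestamp, and the helper names are made up for the example.

```python
import urllib.request


def parse_metrics(text):
    """Parse exposition-format text into {series: value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE lines
        # Assumption: the last whitespace-separated token is the sample value.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def fetch_metrics(url="http://192.168.53.181:9004/metrics", timeout=5):
    """Download and parse the metrics page of one exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


sample = """# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""
print(parse_metrics(sample))
```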
Create the JSON file for Prometheus file-based service discovery
The file name must match the file_sd_configs.files pattern in prometheus.yml: conf/json/node-exporter-*.json
[root@k8s-master1 ~]# cd /approot1/prometheus/prometheus-2.37.6.linux-amd64/conf/json
[root@k8s-master1 json]# cat node-exporter-test-2025.json
[
  {
    "labels": {
      "env": "test",
      "name": "k8s-node1",
      "instance": "181-node"
    },
    "targets": [
      "192.168.53.181:9004"
    ]
  },
  {
    "labels": {
      "env": "test",
      "name": "k8s-node2",
      "instance": "182-node"
    },
    "targets": [
      "192.168.53.182:9004"
    ]
  }
]
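When the host list grows, writing this JSON by hand gets error-prone. As a rough sketch, the Python below generates the same structure from a list of (name, IP) pairs. The naming scheme for the instance label (last IP octet plus "-node") is inferred from the two entries above, and the function names are assumptions for the example. The file is written to a temporary name and then renamed so Prometheus never reads a half-written file.

```python
import json
import os


def build_file_sd(hosts, env="test", port=9004):
    """Build file_sd entries from (name, ip) pairs, one target per entry."""
    entries = []
    for name, ip in hosts:
        last_octet = ip.rsplit(".", 1)[-1]
        entries.append({
            "labels": {"env": env, "name": name, "instance": f"{last_octet}-node"},
            "targets": [f"{ip}:{port}"],
        })
    return entries


def write_file_sd(path, entries):
    """Write the JSON atomically: dump to a temp file, then rename over the target."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(entries, fh, indent=2)
    os.replace(tmp_path, path)


hosts = [("k8s-node1", "192.168.53.181"), ("k8s-node2", "192.168.53.182")]
print(json.dumps(build_file_sd(hosts), indent=2))
```

Calling write_file_sd("conf/json/node-exporter-test-2025.json", build_file_sd(hosts)) from the Prometheus directory would reproduce the file shown above.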
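Check the monitored targets in the web UI. I only have one node running; 182 is not started (it is listed only to demonstrate a multi-node configuration)

Remove the 182 entry from node-exporter-test-2025.json

Check the web UI again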


3. Install Grafana
The built-in Prometheus UI is not very intuitive; Grafana is installed to provide clearer dashboards
Download link 1: https://grafana.com/grafana/download?pg=graf&plcmt=deploy-box-1
Download link 2: https://mirrors.tuna.tsinghua.edu.cn/grafana/yum/rpm/Package/
[root@k8s-master1 prometheus]# yum install -y ./grafana-9.3.6-1.x86_64.rpm
Start Grafana
[root@k8s-master1 ~]# systemctl start grafana-server
[root@k8s-master1 ~]# systemctl enable grafana-server
Default port: 3000
Default username/password: admin/admin
Add a data source


Import a dashboard
Dashboard downloads: https://grafana.com/grafana/dashboards/


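Notes on adding labels for the dashboard:
After importing the dashboard, the following selectors appear at the top:
interval, environment, hostname, node
Their values are read from the labels in node-exporter-test-2025.json

If you do not know which label keys the dashboard expects, open the dashboard JSON file and search for them, for example:
Environment:

Hostname:

Then edit node-exporter-test-2025.json and set the matching labels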
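Searching the dashboard JSON by hand works, but the template variables can also be listed with a few lines of Python. This sketch assumes the usual Grafana export layout, where variables live under templating.list with name, label and query fields; the exact shape of query differs between Grafana versions, so treat the output as a hint rather than a guarantee. The function name and the sample dashboard are made up for the example.

```python
def list_template_variables(dashboard):
    """Return (name, display label, query) for each dashboard template variable."""
    variables = []
    for var in dashboard.get("templating", {}).get("list", []):
        query = var.get("query", "")
        # Some exports store the query as an object with its own "query" key.
        if isinstance(query, dict):
            query = query.get("query", "")
        variables.append((var.get("name"), var.get("label"), query))
    return variables


# Hypothetical excerpt of a dashboard JSON, already loaded with json.load().
dashboard = {
    "templating": {
        "list": [
            {"name": "env", "label": "Environment", "query": "label_values(node_uname_info, env)"},
            {"name": "name", "label": "Hostname", "query": {"query": "label_values(node_uname_info, name)"}},
        ]
    }
}
for name, label, query in list_template_variables(dashboard):
    print(name, label, query, sep="\t")
```

The label names inside label_values(...) are the keys that need to exist under "labels" in node-exporter-test-2025.json.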
4. Configure Alertmanager to send alert notifications
Alertmanager downloads: https://github.com/prometheus/alertmanager/releases
Enable SMTP on the 163 mailbox

Create an authorization code
Install Alertmanager
[root@k8s-master1 prometheus]# tar xvf alertmanager-0.25.0.linux-386.tar.gz
[root@k8s-master1 prometheus]# cd alertmanager-0.25.0.linux-386/
[root@k8s-master1 alertmanager-0.25.0.linux-386]# cp alertmanager.yml{,.bak}
[root@k8s-master1 alertmanager-0.25.0.linux-386]# vim alertmanager.yml
alertmanager.yml
global:
  resolve_timeout: 1m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'sender@163.com'           # placeholder: your 163 address
  smtp_auth_username: 'sender@163.com'  # placeholder: same as smtp_from
  smtp_auth_password: 'authorization-code'  # placeholder: the 163 authorization code
  smtp_require_tls: false
route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: default-receiver
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'recipient address'  # placeholder
        send_resolved: true
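global (global sender settings)
| Field | Value | Meaning |
|---|---|---|
| resolve_timeout | 1m | If an alert carries no end time and is not re-sent within this window, Alertmanager treats it as resolved. Alerts sent by Prometheus always include an end time, so this setting has little effect on them |
| smtp_smarthost | 'smtp.163.com:25' | SMTP server and port for 163 mail; 25 = plaintext; 465 = SSL; 587 = STARTTLS (recommended, harder to intercept) |
| smtp_from | 'sender@163.com' | Sender address (must be an address in the same domain as your 163 account) |
| smtp_auth_username | 'sender@163.com' | SMTP login account (163 requires it to equal the sender) |
| smtp_auth_password | '******************' | 163 authorization code (not the login password; generate it in 163 Mail → Settings → POP3/SMTP) |
| smtp_require_tls | false | Do not require TLS (port 25 is often blocked or rate-limited; consider 587 with true) |
route (routing rules)
| Field | Value | Meaning |
|---|---|---|
| group_by | [alertname] | Merge alerts with the same name into one notification (avoids flooding) |
| group_wait | 10s | After the first alert of a new group arrives, wait 10 seconds for further alerts of the same group and send them together |
| group_interval | 10s | Minimum wait before sending a notification about new alerts added to a group that has already been notified |
| repeat_interval | 10m | If a group stays unresolved, resend the notification every 10 minutes |
| receiver | default-receiver | Reference to a receivers.name entry below |
receivers (recipient list)
| Field | Value | Meaning |
|---|---|---|
| name | default-receiver | The name referenced by route.receiver |
| email_configs.to | recipient@163.com | Final recipient (can be any mailbox) |
| send_resolved | true | Also send a "resolved" email once the alert clears |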
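Before relying on Alertmanager for delivery, it can save time to confirm that the SMTP account and authorization code work at all. The Python sketch below sends one plain test mail with the same host, port and credentials as the global section; the addresses and the helper names are placeholders invented for the example, and it connects without TLS to mirror smtp_require_tls: false.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Create a minimal test mail using the same sender/recipient as alertmanager.yml."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Alertmanager SMTP test"
    msg.set_content("If you can read this, the SMTP settings are usable.")
    return msg


def send_test_message(msg, auth_code, host="smtp.163.com", port=25):
    """Log in with the authorization code (not the login password) and send the mail."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.login(msg["From"], auth_code)
        smtp.send_message(msg)


# Placeholder addresses; replace them before calling send_test_message().
message = build_test_message("sender@163.com", "recipient@example.com")
print(message["Subject"])
```

If send_test_message() raises an authentication error, the authorization code is the first thing to re-check, since Alertmanager would fail in the same way.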
Create the alertmanager.service systemd unit file
The port must match the one configured under alerting in prometheus.yml
cat /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network.target
[Service]
User=root
Group=root
ExecStart=/approot1/prometheus/alertmanager-0.25.0.linux-386/alertmanager \
--config.file=/approot1/prometheus/alertmanager-0.25.0.linux-386/alertmanager.yml \
--storage.path=/approot1/prometheus/alertmanager-0.25.0.linux-386/alertmanager-0.25.0.linux-386/ \
--web.listen-address=":9093"
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Reload the systemd configuration and start the service
systemctl daemon-reload
systemctl start alertmanager
Enable start on boot:
systemctl enable alertmanager
Verify in a browser: http://192.168.53.180:9093/
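To check the notification path independently of any Prometheus rule, a synthetic alert can be pushed straight to Alertmanager's v2 API. This is a minimal Python sketch, not part of the original setup: the alert name ManualTest and the helper names are made up, and the address is the one used in the browser check above.

```python
import json
import urllib.request


def build_test_alert(alertname="ManualTest", instance="181-node", severity="critical"):
    """Build the JSON body for POST /api/v2/alerts (a list of alert objects)."""
    return [{
        "labels": {"alertname": alertname, "instance": instance, "severity": severity},
        "annotations": {"summary": "Synthetic alert to test the notification path"},
    }]


def post_alerts(alerts, base="http://192.168.53.180:9093", timeout=5):
    """Send the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


print(json.dumps(build_test_alert(), indent=2))
```

Because the payload carries no end time, an alert posted this way should be treated as resolved once resolve_timeout passes without it being sent again, which also exercises the send_resolved mail.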

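5. Trigger an alert and send a notification
Because the server is low-spec, the CPU alert has already fired
The alert email was sent successfully

The firing alert is also visible in the Prometheus UI

Manually simulating an alert
Edit node-rule.yml
so that the disk capacity alert fires when usage exceeds 10%
      - alert: 磁盤容量
        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 10
Restart the prometheus service so the modified rule is picked up right away

Prometheus now shows that the alert has fired

Alert email

Alert recovery test
Change the value back to 80 and hot-reload the configuration: curl -X POST http://192.168.53.180:9090/-/reload

Recovery email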

