Learning Prometheus Operator

Reference documentation and videos

https://www.bilibili.com/video/BV194421U7Sn/?spm_id_from=333.1387.collection.video_card.click&vd_source=0372d3f32c3f19a6a2676a7529d6698a

https://www.bilibili.com/video/BV13Q4y1C7hS?spm_id_from=333.788.videopod.episodes&vd_source=0372d3f32c3f19a6a2676a7529d6698a&p=183

https://github.com/prometheus-operator/kube-prometheus

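Prometheus Operator: introduction and main components

Prometheus Operator is a tool built on Kubernetes custom resources (defined through CRDs) that simplifies deploying, configuring, and managing Prometheus and its related monitoring components in a Kubernetes cluster. It automates the operation of the monitoring stack through a declarative API and greatly reduces the complexity of configuring Prometheus by hand.

I. Core functions of Prometheus Operator

Automated deployment: Prometheus servers, alerting rules, service discovery, and so on are declared through custom resources, and the Operator creates and updates the corresponding Kubernetes objects automatically.
Dynamic configuration: when monitoring targets (such as Pods and Services) change in Kubernetes, the Operator updates the Prometheus configuration automatically, with no need to edit prometheus.yml by hand.
High availability: Prometheus can be deployed with multiple replicas, and persistent storage keeps the data across Pod restarts.
Lifecycle management: Prometheus version upgrades and rolling configuration updates are handled automatically, which reduces manual intervention.

II. Main components and custom resources (CRDs)

Prometheus Operator manages the monitoring stack through the following custom resources, each of which covers a specific function:

1. Prometheus resource

Prometheus is the central CRD. It defines the deployment configuration of a Prometheus server instance.

Purpose: declares the specification of the Prometheus server, including the replica count, storage configuration, resource limits, and the selection of monitoring targets.

Key configuration example: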

yaml

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example-prometheus
  namespace: monitoring
spec:
  replicas: 2  # replica count for high availability
  retention: 15d  # data retention period
  storage:  # persistent storage (the CRD field is "storage", not "storageSpec")
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 100Gi
  serviceAccountName: prometheus  # ServiceAccount that carries the RBAC permissions
  serviceMonitorSelector:  # selects the ServiceMonitors to load
    matchLabels:
      team: frontend

 

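2. ServiceMonitor resource

ServiceMonitor defines how Prometheus discovers and scrapes a Kubernetes Service and the Pods behind it.

Purpose: matches target Services through a label selector and generates the Prometheus scrape_configs automatically, so no service discovery rules have to be written by hand.

Key configuration example: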

yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: frontend-monitor
  namespace: monitoring
  labels:
    team: frontend  # matched by the serviceMonitorSelector of the Prometheus resource
spec:
  selector:  # matches the labels of the target Service
    matchLabels:
      app: frontend
  namespaceSelector:  # restricts the namespaces searched for Services (optional)
    matchNames:
      - app-namespace
  endpoints:  # port and path to scrape
    - port: web  # name of the port as defined in the Service
      path: /metrics  # path where metrics are exposed
      interval: 30s  # scrape interval
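
For reference, a Service that the ServiceMonitor above would match might look like the following sketch. The Service name, the Pod label, and the port number 8080 are assumptions made for illustration. The two points that matter are that the Service carries the label app: frontend, and that its port is named web, because endpoints[].port refers to the Service port name rather than its number.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend              # hypothetical Service name
  namespace: app-namespace    # listed in the ServiceMonitor's namespaceSelector
  labels:
    app: frontend             # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: frontend             # assumed label on the backing Pods
  ports:
    - name: web               # referenced by endpoints[].port in the ServiceMonitor
      port: 8080              # assumed port number
      targetPort: 8080
```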

 

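3. PodMonitor resource

PodMonitor is similar to ServiceMonitor, but it targets Pods directly without going through a Service, which suits Pods that are not exposed by any Service.

Purpose: matches target Pods through a label selector and defines the scrape rules (port, path, interval, and so on).
Use case: scraping metrics directly from Pods, for example node-level monitoring agents deployed as a DaemonSet.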
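A minimal PodMonitor sketch is shown below, mirroring the ServiceMonitor example. The resource name, the app: node-agent Pod label, and the container port name metrics are assumptions for illustration. Note that the scrape targets are listed under podMetricsEndpoints, and port refers to the name of a container port.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-agent-monitor    # hypothetical name
  namespace: monitoring
  labels:
    team: frontend            # label a Prometheus podMonitorSelector could match
spec:
  selector:                   # matches the labels of the target Pods
    matchLabels:
      app: node-agent         # assumed Pod label
  namespaceSelector:
    matchNames:
      - app-namespace
  podMetricsEndpoints:
    - port: metrics           # assumed container port name
      path: /metrics
      interval: 30s
```

A Prometheus resource would select this object through spec.podMonitorSelector, in the same way that spec.serviceMonitorSelector selects ServiceMonitors.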

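4. PrometheusRule resource

PrometheusRule defines Prometheus alerting rules and recording rules.

Purpose: replaces the hand-written rules.yml of a traditional Prometheus setup. The alerting logic is managed declaratively, and the Operator synchronizes the rules to the Prometheus server automatically.

Key configuration example: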

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: high-cpu-alert
  namespace: monitoring
  labels:
    prometheus: example-prometheus  # matched by the ruleSelector of the Prometheus resource
spec:
  groups:
  - name: cpu-alerts
    rules:
    - alert: HighCPUUsage
      # non-idle CPU fraction per instance; averaging the non-idle modes directly would never reach 0.8
      expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
      for: 5m  # the alert fires after the condition has held for 5 minutes
      labels:
        severity: critical
      annotations:
        summary: "Instance {{ $labels.instance }} high CPU usage"

 

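5. Alertmanager resource

Alertmanager defines the deployment configuration of an Alertmanager instance, which processes the alerts sent by Prometheus (deduplication, grouping, and routing to receivers such as email or DingTalk).

Purpose: declares the replica count, storage, and configuration of Alertmanager. The Operator deploys it automatically; Prometheus is pointed at it through spec.alerting.alertmanagers in the Prometheus resource.

Key configuration example: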

yaml

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example-alertmanager
  namespace: monitoring
spec:
  replicas: 3  # replica count for high availability
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
  configSecret: alertmanager-config  # Secret that holds the alert routing configuration

 

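6. ThanosRuler resource (optional)

ThanosRuler integrates with Thanos to manage alerting rules and recording rules centrally across several Prometheus instances or clusters. It suits large-scale monitoring setups.

Purpose: evaluates the rules from selected PrometheusRule objects against Thanos query endpoints instead of inside a single Prometheus, which allows unified alerting rules across clusters.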
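A minimal ThanosRuler sketch could look like the following. The resource name, the role: thanos-rules label, and the Thanos Query address are assumptions for illustration; a real deployment would point queryEndpoints at its own Thanos Query service.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ThanosRuler
metadata:
  name: example-thanos-ruler    # hypothetical name
  namespace: monitoring
spec:
  replicas: 2
  ruleSelector:                 # selects the PrometheusRule objects to evaluate
    matchLabels:
      role: thanos-rules        # assumed label on the PrometheusRule objects
  queryEndpoints:               # Thanos Query endpoints the rules are evaluated against
    - dnssrv+_http._tcp.thanos-querier.monitoring.svc.cluster.local  # assumed address
```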

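III. The Operator controller itself

Besides the CRDs above, the core of Prometheus Operator is a controller process that runs inside the Kubernetes cluster, usually as a Deployment.

Purpose: watches create, update, and delete events for the CRD resources above and reconciles the actual state with the declared specification (Spec). For example:
- When a Prometheus resource is created, the controller generates the corresponding StatefulSet, Service, and other objects.
- When a ServiceMonitor is updated, the controller regenerates the Prometheus scrape_configs and the configuration is hot-reloaded.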
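To make the second point concrete, the scrape job generated for the frontend-monitor ServiceMonitor shown earlier has roughly the shape below. This is an abbreviated sketch rather than real Operator output: the actual file contains a long list of relabel_configs, and the exact contents depend on the Operator version.

```yaml
# Abbreviated sketch of a generated scrape job (not verbatim Operator output).
scrape_configs:
  - job_name: serviceMonitor/monitoring/frontend-monitor/0  # serviceMonitor/<namespace>/<name>/<endpoint index>
    scrape_interval: 30s          # from endpoints[0].interval
    metrics_path: /metrics        # from endpoints[0].path
    kubernetes_sd_configs:
      - role: endpoints           # newer Operator versions may use endpointslice instead
        namespaces:
          names:
            - app-namespace       # from the ServiceMonitor's namespaceSelector
    # relabel_configs omitted here: the generated rules keep only targets whose
    # Service carries app=frontend and whose port is named web
```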

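IV. Summary

Prometheus Operator makes the configuration of the monitoring stack Kubernetes-native through custom resources. The core components relate to each other as follows:

Prometheus: the core service that collects and stores monitoring data.
ServiceMonitor/PodMonitor: define the monitoring targets and scrape rules.
PrometheusRule: defines alerting and recording rules.
Alertmanager: processes and routes alerts.
Operator controller: coordinates the lifecycle of all components and keeps the actual state consistent with what is declared.

This architecture makes managing the Prometheus monitoring stack on Kubernetes more efficient and scalable, especially for large clusters or when several teams work together.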

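Question 1: How is a PrometheusRule associated with Alertmanager?

Answer 1:

A PrometheusRule is associated with Alertmanager through Prometheus, which acts as the bridge between them. The flow is as follows:

I. Overview of the flow

plaintext

PrometheusRule → Prometheus → Alertmanager

PrometheusRule: defines the alerting rules (for example, fire an alert when CPU usage exceeds 80%).
Prometheus: evaluates the rules, generates alerts when the conditions are met, and sends them to the configured Alertmanager.
Alertmanager: receives the alerts, deduplicates, groups, and routes them, and sends the notifications (email, DingTalk, and so on).

II. Key configuration steps

1. PrometheusRule configuration (defining the alerting rules)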

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: monitoring
  labels:
    prometheus: example-prometheus  # matched by the ruleSelector of the Prometheus resource
spec:
  groups:
  - name: example
    rules:
    - alert: HighCPUUsage
      # non-idle CPU fraction per instance; averaging the non-idle modes directly would never reach 0.8
      expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
      for: 5m  # the alert fires after the condition has held for 5 minutes
      labels:
        severity: warning  # label used by Alertmanager for routing
      annotations:
        summary: "Instance {{ $labels.instance }} CPU usage is high"

 

2. Prometheus configuration (pointing at Alertmanager)

Specify the Alertmanager endpoint in the Prometheus resource:

yaml

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example-prometheus
  namespace: monitoring
spec:
  replicas: 2
  alerting:
    alertmanagers:
    - name: alertmanager-example  # name of the Service in front of Alertmanager
      namespace: monitoring
      port: web  # port name in that Service
      pathPrefix: "/"
  ruleSelector:  # selects the PrometheusRule objects to load
    matchLabels:
      prometheus: example-prometheus
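
The name under alerting.alertmanagers refers to a Kubernetes Service, not to the Alertmanager resource itself. A sketch of such a Service for the example-alertmanager instance used in this post is shown below. It assumes the Operator labels the Alertmanager Pods with alertmanager: <resource name>, as the upstream examples do. Alternatively, the Service that the Operator creates itself, alertmanager-operated, can be referenced.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-example    # must equal alerting.alertmanagers[].name
  namespace: monitoring
spec:
  selector:
    alertmanager: example-alertmanager  # assumed Pod label set by the Operator
  ports:
    - name: web                 # must equal alerting.alertmanagers[].port
      port: 9093
      targetPort: web
```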

 

3. Alertmanager configuration (defining routes and receivers)

Reference the Secret that contains the Alertmanager configuration through configSecret:

yaml

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example-alertmanager
  namespace: monitoring
spec:
  replicas: 3
  configSecret: alertmanager-config  # references the Secret

 

Create the Secret that contains the routing rules (example alertmanager-config.yaml):

yaml

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:  # stringData accepts plain text; values under "data" would have to be base64-encoded
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default'
      routes:
      - match:  # matches the labels set in the PrometheusRule
          severity: critical
        receiver: 'pagerduty'
    receivers:
    - name: 'default'
      email_configs:
      - to: 'monitoring@example.com'
    - name: 'pagerduty'
      pagerduty_configs:
      - routing_key: <key>

 

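III. Verifying that the association works

1. Check that Prometheus has loaded the rules

Open the Prometheus UI, usually through kubectl port-forward. Replace the Service name with the one used in your cluster; the Operator itself creates a Service named prometheus-operated.

bash

kubectl port-forward -n monitoring svc/prometheus-example 9090:9090

Open http://localhost:9090/rules in a browser and check that the rules defined in the PrometheusRule are listed.

2. Check that Alertmanager has received alerts

Open the Alertmanager UI:

bash

kubectl port-forward -n monitoring svc/alertmanager-example 9093:9093

Open http://localhost:9093/#/alerts in a browser and check whether any firing alerts are shown.

3. Test alert firing

Trigger an alert by hand (for example by lowering the rule threshold) and verify that Alertmanager receives it and routes it correctly.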
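Instead of editing a real threshold, one option is a temporary rule that is always active. The sketch below uses vector(1), which always returns a sample, so the alert fires as soon as the rule is loaded. The rule and alert names are made up for illustration, and the labels assume the ruleSelector and the severity: critical route shown earlier in this post.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: always-firing-test          # hypothetical name; delete the object after testing
  namespace: monitoring
  labels:
    prometheus: example-prometheus  # must match the Prometheus ruleSelector
spec:
  groups:
  - name: pipeline-test
    rules:
    - alert: AlwaysFiring
      expr: vector(1)               # always returns one sample, so the alert is always active
      labels:
        severity: critical          # exercises the severity: critical route
      annotations:
        summary: "Test alert for verifying the Prometheus to Alertmanager path"
```

Because the rule has no for clause, the alert should appear in the Alertmanager UI shortly after the next rule evaluation.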

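IV. Troubleshooting common problems

1. The alert does not fire

Check that the expression in the PrometheusRule is correct (it can be tested in the Prometheus UI).
Confirm that the labels of the PrometheusRule match the ruleSelector of the Prometheus resource.

2. The alert does not reach Alertmanager

Check the Prometheus logs for errors when sending alerts to Alertmanager.
Check that the Alertmanager Pods are running:
bash
kubectl get pods -n monitoring | grep alertmanager

3. Alertmanager does not send notifications

Check that the Alertmanager configuration (alertmanager.yaml) is correct.
Confirm that the receivers (such as email or PagerDuty) are configured correctly.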

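V. Summary

A PrometheusRule is associated with Alertmanager in the following steps:

The PrometheusRule defines the alerting rules and attaches labels (such as severity: critical).
Prometheus points at Alertmanager through alerting.alertmanagers and loads the rules through ruleSelector.
Alertmanager matches the alert labels through route.routes.match, so alerts of different severities can be handled differently.

This layered design decouples rule definition, alert generation, and notification delivery, which makes the setup easier to manage and extend.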

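Question 2: How can Prometheus monitor targets when Prometheus and the ServiceMonitor are in different namespaces?

Answer 2:

When Prometheus and the ServiceMonitor are in different namespaces, cross-namespace monitoring requires the following configuration:

I. Core configuration

1. Configure the cross-namespace selector in Prometheus

Add serviceMonitorNamespaceSelector to the Prometheus resource so that ServiceMonitors in other namespaces can be discovered: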

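yaml

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring  # namespace of Prometheus
spec:
  serviceMonitorNamespaceSelector:  # label selector over namespaces; matchNames is not supported here
    matchLabels:
      kubernetes.io/metadata.name: apps  # namespace of the ServiceMonitor (label set automatically on recent Kubernetes versions)
  serviceMonitorSelector:
    matchLabels:
      team: frontend  # matches the labels of the ServiceMonitor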
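
Since serviceMonitorNamespaceSelector is an ordinary Kubernetes label selector evaluated against Namespace objects, another common pattern is an opt-in label on each namespace that should be monitored. The sketch below assumes a label monitoring: enabled, which is a made-up convention for illustration and not something the Operator defines.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: apps
  labels:
    monitoring: enabled       # hypothetical opt-in label
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled     # picks up ServiceMonitors in every namespace carrying the label
  serviceMonitorSelector:
    matchLabels:
      team: frontend
```

With this layout, onboarding a new namespace only requires labeling it; the Prometheus resource stays unchanged.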

 

2. Reference the target Service correctly in the ServiceMonitor

Make sure the ServiceMonitor references the target Service correctly, including its namespace:

yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: frontend-monitor
  namespace: apps  # namespace of the ServiceMonitor
  labels:
    team: frontend  # label selected by Prometheus
spec:
  selector:
    matchLabels:
      app: frontend
  namespaceSelector:
    matchNames:
      - apps  # namespace of the target Service
  endpoints:
    - port: metrics  # name of the port exposed by the Service
      path: /metrics

 

II. Permissions (critical!)

The Prometheus ServiceAccount must be authorized to access resources in other namespaces. Extend its permissions with a ClusterRoleBinding:

yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-crb  # custom name
subjects:
  - kind: ServiceAccount
    name: prometheus  # ServiceAccount used by Prometheus
    namespace: monitoring  # namespace of Prometheus
roleRef:
  kind: ClusterRole
  name: prometheus  # an existing ClusterRole for Prometheus
  apiGroup: rbac.authorization.k8s.io
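
The binding above assumes that a ClusterRole named prometheus already exists. If it does not, a sketch modeled on the example in the Prometheus Operator documentation is shown below; treat the exact rule list as an assumption and adjust it to what your Prometheus actually needs to discover.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus            # referenced by roleRef.name in the ClusterRoleBinding
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]   # needed for Kubernetes service discovery
  - apiGroups: ["discovery.k8s.io"]
    resources: ["endpointslices"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]                    # allows scraping /metrics endpoints served by the API server
```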

 

III. Verifying the configuration

1. Check that Prometheus has discovered the ServiceMonitor

Check through the Prometheus UI:

bash

kubectl port-forward -n monitoring svc/prometheus 9090:9090

Open http://localhost:9090/targets and confirm that the target Service appears in the list.

2. Inspect the Prometheus configuration

Check the scrape configuration that was generated for Prometheus:

bash

kubectl exec -n monitoring <prometheus-pod> -- cat /etc/prometheus/config_out/prometheus.env.yaml

Confirm that it contains scrape_configs for the targets in the other namespace.

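IV. Common problems and solutions

1. Prometheus does not discover the ServiceMonitor in another namespace

Cause: serviceMonitorNamespaceSelector is not configured (in which case only the namespace of the Prometheus resource is searched), or the labels do not match.

Solution:

yaml

spec:
  serviceMonitorNamespaceSelector: {}  # empty selector: discover ServiceMonitors in all namespaces
  # or use matchLabels to select specific namespaces by label

2. Insufficient permissions

Symptom: the targets never appear, and the Prometheus logs contain Kubernetes API "forbidden" errors, for example that the ServiceAccount cannot list endpoints in the target namespace.
Solution: make sure Prometheus has been granted cross-namespace permissions through a ClusterRoleBinding (see section II).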

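V. Best practices

Filter by label: use labels in serviceMonitorSelector so that unrelated ServiceMonitors are not loaded.
Least privilege: grant Prometheus access only to the namespaces it needs, rather than selecting every namespace with an empty selector ({}) and cluster-wide permissions.
Verify cross-namespace discovery: check regularly, for example on the /targets page, that the ServiceMonitors in other namespaces are actually picked up and the configuration has taken effect.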
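
As a sketch of the least-privilege point, the cluster-wide binding can be replaced by a Role and RoleBinding inside each monitored namespace. The example below assumes the target namespace is apps and the Prometheus ServiceAccount is prometheus in monitoring, as in the rest of this post; the name prometheus-discovery is made up for illustration.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-discovery    # hypothetical name
  namespace: apps               # permissions apply to this namespace only
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["discovery.k8s.io"]
    resources: ["endpointslices"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-discovery
  namespace: apps
subjects:
  - kind: ServiceAccount
    name: prometheus            # ServiceAccount used by Prometheus
    namespace: monitoring
roleRef:
  kind: Role
  name: prometheus-discovery
  apiGroup: rbac.authorization.k8s.io
```

The trade-off is one Role and RoleBinding per monitored namespace, in exchange for Prometheus having no read access to the rest of the cluster.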

With the configuration above, Prometheus can monitor services in other namespaces, which allows flexible isolation of monitoring across teams and environments.

posted @ 2025-07-10 12:18 呆瓜小賊66
