Alerting based on metrics

2024-09-08

In this tutorial, we will set up alerts on the ping_request_count metric that we instrumented in the tutorial on monitoring an HTTP server written in Go.

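For the purposes of this tutorial, we will alert when the ping_request_count metric is greater than 5. See the real-world best practices to learn more about alerting principles.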

Download the latest Alertmanager release for your operating system.

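Alertmanager supports various receivers, such as email, webhook, pagerduty, and slack, through which it can notify users when an alert fires. The list of receivers and the details of how to configure them can be found in the Alertmanager configuration documentation. This tutorial uses webhook as the receiver: go to webhook.site and copy the webhook URL, which we will use later to configure Alertmanager.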

First, let's set up Alertmanager with a webhook receiver.

alertmanager.yml

global:
  resolve_timeout: 5m
route:
  receiver: webhook_receiver
receivers:
  - name: webhook_receiver
    webhook_configs:
      - url: '<INSERT-YOUR-WEBHOOK>'
        send_resolved: false

Replace <INSERT-YOUR-WEBHOOK> in the alertmanager.yml file with the webhook URL we copied earlier, then run Alertmanager with the following command.

alertmanager --config.file=alertmanager.yml

Once Alertmanager is up and running, you can access it at http://localhost:9093.
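Before wiring up Prometheus, it can be useful to confirm that Alertmanager really forwards notifications to the webhook. The following is a minimal sketch, not part of the original tutorial, that builds a hand-made alert and posts it to Alertmanager's v2 API at /api/v2/alerts. The alert name ManualTestAlert is made up for this check, and the sketch assumes Alertmanager is listening on localhost:9093 as configured above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// postableAlert holds the subset of the Alertmanager v2 alert fields used here.
type postableAlert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations,omitempty"`
}

// buildAlerts returns the JSON body for a single alert with the given alertname.
func buildAlerts(name string) ([]byte, error) {
	alerts := []postableAlert{{Labels: map[string]string{"alertname": name}}}
	return json.Marshal(alerts)
}

// postAlerts sends the body to the Alertmanager reachable at baseURL.
func postAlerts(baseURL string, body []byte) error {
	resp, err := http.Post(baseURL+"/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// "ManualTestAlert" is a hypothetical name used only for this check.
	body, err := buildAlerts("ManualTestAlert")
	if err != nil {
		fmt.Println("could not build alert:", err)
		return
	}
	fmt.Println(string(body))

	// Assumes Alertmanager runs locally on its default port 9093.
	if err := postAlerts("http://localhost:9093", body); err != nil {
		fmt.Println("could not post alert:", err)
	}
}
```

If the post succeeds, the test alert should appear in the Alertmanager UI, and a notification should reach the webhook URL after Alertmanager's group wait has elapsed.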

Now that the webhook receiver is configured, we can add alerting rules to the Prometheus configuration.

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 10s
rule_files:
  - rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: simple_server
    static_configs:
      - targets: ["localhost:8090"]

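As the configuration above shows, evaluation_interval, rule_files, and the alerting section have been added to the Prometheus configuration. evaluation_interval defines how often rules are evaluated, rule_files accepts a list of YAML rule files, and the alerting section tells Prometheus which Alertmanager instances to send alerts to. As mentioned at the start of this tutorial, we will create a basic rule that fires an alert when the value of ping_request_count is greater than 5.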

rules.yml

groups:
  - name: Count greater than 5
    rules:
      - alert: CountGreaterThan5
        expr: ping_request_count > 5
        for: 10s
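To make the interplay between evaluation_interval and the rule's for clause more concrete, here is a small simulation. It is a simplified model written for this tutorial, not Prometheus's actual implementation: it assumes one sample value per evaluation tick and tracks how long the condition has held. The names simulate, evalInterval, and holdFor are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// State mirrors the three alert states shown on the Prometheus alerts page.
type State string

const (
	Inactive State = "inactive"
	Pending  State = "pending"
	Firing   State = "firing"
)

// Values taken from the tutorial's prometheus.yml and rules.yml.
const (
	evalInterval = 10 * time.Second // evaluation_interval
	holdFor      = 10 * time.Second // for
)

// simulate takes one sample value per evaluation tick and returns the alert
// state after each tick for the rule "value > threshold" with a hold duration.
func simulate(values []float64, threshold float64, interval, hold time.Duration) []State {
	states := make([]State, 0, len(values))
	active := false           // whether the condition held at the previous tick
	var elapsed time.Duration // time since the condition first became true
	for _, v := range values {
		if v <= threshold {
			// Condition is false: the alert resets to inactive.
			active = false
			states = append(states, Inactive)
			continue
		}
		if !active {
			active = true
			elapsed = 0
		} else {
			elapsed += interval
		}
		if elapsed >= hold {
			states = append(states, Firing)
		} else {
			states = append(states, Pending)
		}
	}
	return states
}

func main() {
	// Hypothetical ping_request_count values seen at five consecutive evaluations.
	values := []float64{3, 5, 6, 7, 8}
	fmt.Println(simulate(values, 5, evalInterval, holdFor))
	// Prints: [inactive inactive pending firing firing]
}
```

In this model, a value of exactly 5 does not satisfy the condition, the first evaluation above 5 puts the alert into the pending state, and the alert fires at the next evaluation 10 seconds later because the condition has then held for the full for duration.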

Now let's run Prometheus with the following command.

prometheus --config.file=./prometheus.yml

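Open http://localhost:9090/rules in your browser to view the rules. Next, run the instrumented ping server, visit the http://localhost:8090/ping endpoint, and refresh the page at least 6 times. You can check the ping count at the http://localhost:8090/metrics endpoint. To see the status of the alert, visit http://localhost:9090/alerts. The alert first appears as PENDING; once the condition ping_request_count > 5 has held for 10 seconds, its state changes to FIRING. Now go back to your webhook.site URL and you will see the alert message.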
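The alert state can also be checked programmatically through the Prometheus HTTP API, which lists active alerts at /api/v1/alerts. The sketch below is an illustration under stated assumptions: Prometheus is reachable at localhost:9090, the struct models only the fields used here, and sampleBody is a hand-written example response rather than captured output. If Prometheus cannot be reached, the sketch falls back to that sample so it still runs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// sampleBody is a hand-written example of an /api/v1/alerts response.
const sampleBody = `{"status":"success","data":{"alerts":[{"labels":{"alertname":"CountGreaterThan5","job":"simple_server"},"annotations":{},"state":"firing","activeAt":"2024-09-08T10:00:00Z","value":"6e+00"}]}}`

// alertsResponse models only the parts of the response used in this sketch.
type alertsResponse struct {
	Status string `json:"status"`
	Data   struct {
		Alerts []struct {
			Labels map[string]string `json:"labels"`
			State  string            `json:"state"`
		} `json:"alerts"`
	} `json:"data"`
}

// alertState returns the state of the named alert. The API lists only
// pending and firing alerts, so a missing alert is reported as "inactive".
func alertState(body []byte, name string) (string, error) {
	var resp alertsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if resp.Status != "success" {
		return "", fmt.Errorf("unexpected API status %q", resp.Status)
	}
	for _, a := range resp.Data.Alerts {
		if a.Labels["alertname"] == name {
			return a.State, nil
		}
	}
	return "inactive", nil
}

// fetchAlerts downloads the raw response from the Prometheus at baseURL.
func fetchAlerts(baseURL string) ([]byte, error) {
	resp, err := http.Get(baseURL + "/api/v1/alerts")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	body, err := fetchAlerts("http://localhost:9090")
	if err != nil {
		// Fall back to the sample so the sketch runs without Prometheus.
		body = []byte(sampleBody)
	}
	state, err := alertState(body, "CountGreaterThan5")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("CountGreaterThan5 is", state)
}
```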

Following the same principles and configuration, Alertmanager can also work with other receivers to notify users in other ways when an alert fires.
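As an illustration of what a webhook receiver gets, the sketch below decodes the JSON message that Alertmanager posts to a webhook and prints one line per alert. It is a minimal example, not the tutorial's own code: the struct covers only a few fields of the payload, samplePayload is hand-written rather than captured output, and summarize and handleWebhook are illustrative names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// samplePayload is a hand-written example of an Alertmanager webhook message.
const samplePayload = `{"version":"4","status":"firing","receiver":"webhook_receiver","alerts":[{"status":"firing","labels":{"alertname":"CountGreaterThan5","job":"simple_server"},"annotations":{},"startsAt":"2024-09-08T10:00:00Z"}]}`

// webhookMessage models only the fields of the payload used in this sketch.
type webhookMessage struct {
	Status   string `json:"status"`
	Receiver string `json:"receiver"`
	Alerts   []struct {
		Status string            `json:"status"`
		Labels map[string]string `json:"labels"`
	} `json:"alerts"`
}

// summarize decodes a webhook message and returns one line per alert.
func summarize(body []byte) ([]string, error) {
	var msg webhookMessage
	if err := json.Unmarshal(body, &msg); err != nil {
		return nil, err
	}
	lines := make([]string, 0, len(msg.Alerts))
	for _, a := range msg.Alerts {
		line := fmt.Sprintf("%s via %s: %s", a.Labels["alertname"], msg.Receiver, a.Status)
		lines = append(lines, line)
	}
	return lines, nil
}

// handleWebhook is an HTTP handler that prints a summary of each message.
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	lines, err := summarize(body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Decode the sample so the sketch runs without Alertmanager.
	lines, err := summarize([]byte(samplePayload))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

To use this as a local stand-in for webhook.site, handleWebhook could be registered with http.HandleFunc and served with http.ListenAndServe on an address of your choosing, with the url field in alertmanager.yml pointing at that address.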

This document is based on a translation of the official Prometheus documentation.