Alertmanager Notification Timing - 超凡’s Blog

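Alertmanager receives alerts from clients such as Prometheus, groups and deduplicates them, and then routes them to the correct receiver.

If you are not yet familiar with Alertmanager, read up on Alertmanager first.

Parameter reference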

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]
 
# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]
 
# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
# Note that this parameter is implicitly bound by Alertmanager's
# `--data.retention` configuration flag. Notifications will be resent after either
# repeat_interval or the data retention period have passed, whichever
# occurs first. `repeat_interval` should not be less than `group_interval`.
[ repeat_interval: <duration> | default = 4h ]

Notification timing example

Assume Alertmanager is configured as follows:

route:
  receiver: "ice"
  routes:
  - receiver: "ice"
    group_wait: "10s"
    group_interval: "2m"
    repeat_interval: "5m"
    match:
      groupTag: "ice"
receivers:
- name: "ice"
  webhook_configs:
  - send_resolved: true
    url: "http://172.16.182.162:8080/ice"

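Key moments

The figure below shows when Alertmanager sends notifications under the configuration above.

How it works

After receiving a new alert, Alertmanager waits for group_wait, during which it groups, updates, and silences alerts as needed. Once the first alert has passed group_wait, Alertmanager checks it every group_interval to decide whether anything needs to be done with it. Only when, after n such checks, n × group_interval first exceeds repeat_interval does Alertmanager send the alert to its receiver again.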
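The timeline this implies can be sketched in a few lines of Kotlin. This is a simplified model, not Alertmanager's actual code: it assumes a single alert that keeps firing, treats every check as instantaneous, and works in whole seconds. The function name `notificationOffsets` and its parameters are made up for illustration.

```kotlin
// Offsets in seconds, counted from the arrival of the first alert, at which
// the group would notify the receiver under the simplified model.
fun notificationOffsets(
    groupWaitSec: Long,
    groupIntervalSec: Long,
    repeatIntervalSec: Long,
    horizonSec: Long,
): List<Long> {
    require(groupIntervalSec > 0) { "group_interval must be positive" }
    val sent = mutableListOf<Long>()
    var check = groupWaitSec          // the first notification goes out after group_wait
    var lastSent: Long? = null
    while (check <= horizonSec) {
        val last = lastSent
        // Notify the first time, then only once more than repeat_interval
        // has elapsed since the previous notification.
        if (last == null || check - last > repeatIntervalSec) {
            sent += check
            lastSent = check
        }
        check += groupIntervalSec     // next check one group_interval later
    }
    return sent
}
```

For the example configuration (group_wait 10s, group_interval 2m, repeat_interval 5m) and a 15-minute horizon, `notificationOffsets(10, 120, 300, 900)` returns `[10, 370, 730]`: the alert is repeated every 6 minutes, the first multiple of group_interval that exceeds repeat_interval.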

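After a group's alerts are sent for the first time, the group enters a sleep/wake cycle. Each sleep lasts group_interval, during which the group sends nothing (although alerts are still inserted into or updated in the group, keyed by fingerprint). When the sleep ends, the group wakes up and checks whether it needs to send new alerts or repeat ones already sent (resolved alerts are removed from the group once they have been sent). This is what group_interval does.

Only on each wake-up does the aggregation group check whether more than repeat_interval has passed since the alert was last sent; if so, it sends the alert again. repeat_interval is therefore not the actual gap between repeats: when repeat_interval has elapsed since the first notification, the group may still be asleep, so the real gap is longer than repeat_interval but no longer than repeat_interval + group_interval. For this reason, do not set group_interval too large in production.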
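Under the wake-up model described above, the real gap between repeats has a closed form: the smallest multiple of group_interval that is strictly greater than repeat_interval. The helper below is only a sketch of that arithmetic; `effectiveRepeatSeconds` is a hypothetical name, and it ignores the small timing jitter a real Alertmanager would add.

```kotlin
// Smallest multiple of group_interval that is strictly greater than
// repeat_interval, i.e. the gap between repeats in the simplified model.
fun effectiveRepeatSeconds(groupIntervalSec: Long, repeatIntervalSec: Long): Long {
    require(groupIntervalSec > 0) { "group_interval must be positive" }
    require(repeatIntervalSec >= 0) { "repeat_interval must not be negative" }
    // Number of wake-ups needed before the elapsed time exceeds repeat_interval.
    val wakeUps = repeatIntervalSec / groupIntervalSec + 1
    return wakeUps * groupIntervalSec
}
```

With group_interval 2m and repeat_interval 5m, `effectiveRepeatSeconds(120, 300)` is 360, so the alert repeats every 6 minutes rather than every 5. When repeat_interval is shorter than group_interval, the result is simply group_interval, so every wake-up sends a notification.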

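Exceptions

Ideally every firing alert has a matching resolved notification, and every resolved notification has a matching firing alert, but there are exceptions.

Why do some resolved alerts have no matching firing alert?

Because the firing alert reached Alertmanager while its group was asleep, and the matching resolved message arrived within the same sleep period. The resolved message overwrote the firing one in the group, so on wake-up the receiver got only the resolved message.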
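The overwrite can be illustrated with a toy model of an aggregation group. This is a sketch under assumptions, not Alertmanager's implementation: the names `Status`, `TrackedAlert`, and `AggregationGroup` are hypothetical, the fingerprint is a plain string, and `flush` stands in for the wake-up that sends the group.

```kotlin
enum class Status { FIRING, RESOLVED }

data class TrackedAlert(val fingerprint: String, val status: Status)

class AggregationGroup {
    // One slot per fingerprint: a later state replaces an earlier one.
    private val pending = LinkedHashMap<String, TrackedAlert>()

    // Called whenever an alert arrives, including while the group is asleep.
    fun insert(alert: TrackedAlert) {
        pending[alert.fingerprint] = alert
    }

    // Called on wake-up: returns what would be sent, then drops resolved alerts.
    fun flush(): List<TrackedAlert> {
        val snapshot = pending.values.toList()
        pending.values.removeAll { it.status == Status.RESOLVED }
        return snapshot
    }
}
```

If a firing and then a resolved state for the same fingerprint are inserted between two flushes, the next `flush()` returns only the resolved entry, which is the "resolved without firing" case. The same slot-per-fingerprint behaviour explains a lost resolved message when an alert flips state several times within one sleep period.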

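Why do some firing alerts have no matching resolved alert?

For a similar reason. Suppose the firing message arrived during the n-th sleep period and, within the (n+1)-th sleep period, the alert went through resolved → firing → resolved and so on. The first resolved message is then overwritten by the second resolved message of period n+1, so the firing alert appears to have no resolved message of its own.

Why are duplicate resolved alerts received?

This comes down to a feature of the Prometheus rule component: after an alert goes from firing to resolved, the resolved alert is not sent to Alertmanager just once. It is kept in memory for 15 minutes and sent to Alertmanager repeatedly, as the following snippet shows: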


// resolvedRetention is the duration for which a resolved alert instance
// is kept in memory state and consequently repeatedly sent to the AlertManager.
const resolvedRetention = 15 * time.Minute

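Duplicate resolved alerts arise as follows: during the n-th sleep period, Alertmanager receives the first resolved alert and updates it into the group, then sends the group on wake-up and removes the resolved alert from it. During the (n+1)-th sleep period, however, Prometheus is still sending that resolved alert to Alertmanager, so the group sent on the next wake-up contains it again.

Why is a firing alert sent several times within a short period?

This one is easy to understand. As described above, Alertmanager sends notifications per group. If a new alert is inserted into the group during the sleep period that follows a send, the whole group is sent again on the next wake-up, so the same firing alert appears several times within a short period.

Best practice

If you need a notification strictly every 20 minutes, use a configuration like the one below. Because repeat_interval is shorter than group_interval, every group_interval wake-up sends a notification.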


group_wait: "5s"
group_interval: "20m"
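repeat_interval: "19m"  # any non-zero value below group_interval (20m)

Test programs

You can simulate the alert sender (such as Prometheus) and the receiver to observe what Alertmanager actually sends under a given configuration.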

Sender

import cn.hutool.http.HttpUtil
import cn.hutool.log.LogFactory
import cn.hutool.log.dialect.console.ConsoleColorLogFactory
import java.time.LocalDateTime
import java.util.*
import kotlin.system.exitProcess


fun main() {
  // Note: this is the Alertmanager v1 API path; recent releases may only serve /api/v2/alerts.
  val alertUrl = "http://test1:8693/api/v1/alerts"
  LogFactory.setCurrentLogFactory(ConsoleColorLogFactory())
  val log = LogFactory.get("main")

  var index = 0
  // Run the task every 20 s (see the period argument below); stop after 5 alerts
  val timer = Timer()
  timer.schedule(object : TimerTask() {
    override fun run() {
      if (index >= 5) exitProcess(0)
      // Current time in the same layout as 0001-01-01T00:00:00Z
      val time = getISO8601Time()
      // Send the alert
      HttpUtil.post(alertUrl, getBody(++index, time))
      log.info("send A${index}")
    }
  }, 0, 20_000)
}

fun getISO8601Time(): String = LocalDateTime.now().toString().substring(0, 19) + ".000+08:00"

fun getBody(index: Int, time: String): String = """
    [{
      "labels": {
        "alertname": "A$index",
        "groupTag": "ice",
        "instance": "test1:9090",
        "job": "ice",
        "severity": "critical"
      },
      "annotations": {
        "description": "test$index",
        "summary": "test$index"
      },
      "startsAt": "$time",
      "endsAt": "2023-09-01T10:59:00.000+08:00",
      "generatorURL": "http://test1:9090/graph?g0.expr=up&g0.tab=1"
    }]
  """.trimIndent()

Receiver

import cn.hutool.core.date.DateUtil
import cn.hutool.core.lang.ConsoleTable
import org.slf4j.Logger
import org.slf4j.LoggerFactory
import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.boot.runApplication
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RequestBody
import org.springframework.web.bind.annotation.RequestMapping
import org.springframework.web.bind.annotation.RestController
import kotlin.concurrent.thread
import kotlin.concurrent.timer

@SpringBootApplication
class ReceiverApplication

fun main(args: Array<String>) {
  runApplication<ReceiverApplication>(*args)
}

@RestController
class AlertController {
  private val log: Logger = LoggerFactory.getLogger("alert")
  private val table: ConsoleTable = ConsoleTable.create()
    .addHeader("time", "interval", "firing", "resolved", "group_interval", "repeat_interval")
  private val globalTimer = DateUtil.timer()
  private val row = mutableMapOf<String, String>()
  private var groupThread: Thread = thread(name = "group_interval", start = false) {
    timer(name = "group", period = 120_000) {
      row["group_interval"] = "1"
    }
  }
  private var repeatThread: Thread = thread(name = "repeat_interval", start = false) {
    timer(name = "repeat", period = 300_000) {
      row["repeat_interval"] = "1"
    }
  }

  @GetMapping("/print")
  fun print() = table.toString()

  @RequestMapping("ice")
  fun webhook(@RequestBody notify: Notify) {
    log.info("receiver: ${notify.alerts.size} alerts")
    if (groupThread.state == Thread.State.NEW) {
      groupThread.start()
    }
    if (repeatThread.state == Thread.State.NEW) {
      repeatThread.start()
    }
    val actives = mutableListOf<String>()
    val releases = mutableListOf<String>()
    notify.alerts.forEach {
      when (it.status) {
        "firing" -> {
          actives.add(it.labels["alertname"]!!)
        }
        "resolved" -> {
          releases.add(it.labels["alertname"]!!)
        }
      }
    }
    table.addBody(
      DateUtil.now(),
      globalTimer.intervalSecond().toString(),
      actives.joinToString(","),
      releases.joinToString(","),
      row["group_interval"] ?: "",
      row["repeat_interval"] ?: "",
    )
    row.remove("group_interval")
    row.remove("repeat_interval")
  }
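}

// Minimal payload model. This is an assumption: the original post does not show Notify.
// Only the webhook fields used above are mapped.
data class Notify(val alerts: List<Alert> = emptyList())

data class Alert(val status: String = "", val labels: Map<String, String> = emptyMap())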