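Alertmanager receives alerts from clients such as Prometheus, processes them through grouping, deduplication, and similar steps, and then routes them to the correct receiver.

If you are not yet familiar with Alertmanager, please read up on Alertmanager first.

Parameters explained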

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
# Note that this parameter is implicitly bound by Alertmanager's
# `--data.retention` configuration flag. Notifications will be resent after either
# repeat_interval or the data retention period have passed, whichever
# occurs first. `repeat_interval` should not be less than `group_interval`.
[ repeat_interval: <duration> | default = 4h ]

Notification timing example

First, assume Alertmanager is configured as follows:

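route:
  # The root route needs a default receiver and must not carry matchers,
  # so the matching rule lives in a child route.
  receiver: "ice"
  routes:
    - receiver: "ice"
      group_wait: "10s"
      group_interval: "2m"
      repeat_interval: "5m"
      match:
        groupTag: "ice"
receivers:
  - name: "ice"
    webhook_configs:
      - send_resolved: true
        url: "http://172.16.182.162:8080/ice"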

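Key moments

The figure below illustrates when Alertmanager sends notifications under the configuration above.

How it works

After receiving a new alert, Alertmanager waits for group_wait, during which the alert is grouped, updated, and checked against silences. Once group_wait has elapsed for the first alert, Alertmanager checks the alert every group_interval to decide whether any action is needed. Only after n such checks, when n*group_interval first exceeds repeat_interval, does Alertmanager send the alert to the corresponding receiver again.

After the alerts of a group have been sent for the first time, the group enters a sleep/wake cycle. Each sleep lasts group_interval. While asleep, the group sends nothing, although alerts are still inserted into or updated in the group (keyed by fingerprint). When the sleep ends, the group wakes up and checks whether it needs to send new alerts or repeat alerts it has already sent (resolved alerts are removed from the group once they have been sent). This is what group_interval is for.

An aggregation group checks whether repeat_interval has elapsed since the last notification only when it wakes up, and resends the alert if so. repeat_interval is therefore not the actual repeat interval: when repeat_interval has elapsed since the first notification, the group may still be asleep, so the actual interval falls between repeat_interval and repeat_interval+group_interval. For this reason, group_interval should not be set too large in production.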
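The interplay between group_interval and repeat_interval described above can be sketched as a small calculation. This is only a minimal sketch of the simplified model used in this article, not Alertmanager's actual implementation. The function name effectiveRepeatInterval is hypothetical, and the sketch assumes that a repeat goes out at the first wake-up at which the elapsed time strictly exceeds repeat_interval.

```kotlin
import java.time.Duration

// Hypothetical helper: time between two repeated notifications under the
// article's simplified model (assumption: the group wakes every groupInterval
// and repeats once the elapsed time is strictly greater than repeatInterval).
fun effectiveRepeatInterval(groupInterval: Duration, repeatInterval: Duration): Duration {
    require(!groupInterval.isZero && !groupInterval.isNegative) { "groupInterval must be positive" }
    require(!repeatInterval.isNegative) { "repeatInterval must not be negative" }
    // Smallest n such that n * groupInterval > repeatInterval.
    val n = repeatInterval.toMillis() / groupInterval.toMillis() + 1
    return groupInterval.multipliedBy(n)
}

fun main() {
    // Values from the example configuration: group_interval 2m, repeat_interval 5m.
    println(effectiveRepeatInterval(Duration.ofMinutes(2), Duration.ofMinutes(5))) // PT6M
}
```

With the example configuration, the repeat goes out at the third wake-up, 6 minutes after the previous notification rather than after 5 minutes.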

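Exceptions

Ideally, every firing alert has a matching resolved notification and every resolved notification has a matching firing alert, but there are exceptions.

Some resolved alerts have no matching firing alert?

Why do some resolved alerts have no matching firing alert? Because the firing alert reached Alertmanager while its group was asleep, and the matching resolved message also arrived within the same sleep cycle. On receiving the resolved message, the group overwrote the firing entry with it, so only the resolved message was received at wake-up.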
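The overwrite described above can be reproduced with a toy model of a group. This is a heavily simplified sketch: the AlertGroup class, its insert and flush methods, and the use of the alert name as the fingerprint are all hypothetical and only mirror the behaviour described in this article, not Alertmanager's internals.

```kotlin
data class Alert(val fingerprint: String, val status: String)

// Hypothetical stand-in for an aggregation group (assumption: one entry per fingerprint).
class AlertGroup {
    private val alerts = LinkedHashMap<String, Alert>()

    // While the group sleeps, an incoming alert replaces the entry with the same fingerprint.
    fun insert(alert: Alert) {
        alerts[alert.fingerprint] = alert
    }

    // On wake-up the current contents are sent; resolved alerts are then dropped from the group.
    fun flush(): List<Alert> {
        val sent = alerts.values.toList()
        alerts.values.removeAll { it.status == "resolved" }
        return sent
    }
}

fun main() {
    val group = AlertGroup()
    // Both messages arrive within the same sleep cycle.
    group.insert(Alert("A1", "firing"))
    group.insert(Alert("A1", "resolved"))
    println(group.flush().map { "${it.fingerprint}:${it.status}" }) // [A1:resolved]
    println(group.flush()) // []
}
```

The receiver only ever sees A1 as resolved, because the firing entry was replaced before the group woke up.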

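Some firing alerts have no matching resolved alert?

Similarly, why do some firing alerts have no matching resolved alert? Suppose the firing message occurs in sleep cycle n, and during sleep cycle n+1 the alert goes through state changes such as resolved-firing-resolved… The resolved message that corresponds to the firing alert is then overwritten by the second resolved message within cycle n+1, so the firing alert appears to have no matching resolved message.

Multiple duplicate resolved alerts received?

Why are some resolved messages received more than once? This comes from a feature of the Prometheus rule component: when an alert changes from firing to resolved, the resolved alert is not sent to Alertmanager just once. It is kept in memory for 15 minutes and sent to Alertmanager repeatedly, as the following code snippet shows: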

// resolvedRetention is the duration for which a resolved alert instance
// is kept in memory state and consequently repeatedly sent to the AlertManager.
const resolvedRetention = 15 * time.Minute

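Multiple resolved notifications arise as follows: during sleep cycle n, Alertmanager receives the first resolved alert and updates the group with it; at wake-up it sends the group and removes the resolved alert from the group. During sleep cycle n+1, however, Prometheus is still sending the resolved alert to Alertmanager, so the group sent at the next wake-up contains this resolved alert again.

A firing alert sent several times within a short period?

This one is easy to understand. As described above, Alertmanager sends notifications per group. If a new alert is inserted into the group during the sleep cycle that follows a send, the group is sent again at the next wake-up, so the same firing alert appears to be sent several times within a short period.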
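The effect can be traced on a timeline. The sketch below is a simplified model under stated assumptions: the first flush happens group_wait after the first alert arrives, later flushes follow every group_interval, every flush carries all alerts that have arrived so far, and resolved alerts and repeat_interval are ignored. The names Flush and simulateFlushes are hypothetical; the arrival times mirror five alerts A1 to A5 sent 20 seconds apart.

```kotlin
data class Flush(val atSec: Int, val alerts: List<String>, val hasNew: Boolean)

// Hypothetical timeline model: arrivals maps alert name -> arrival time in seconds.
fun simulateFlushes(
    arrivals: Map<String, Int>,
    groupWaitSec: Int,
    groupIntervalSec: Int,
    count: Int,
): List<Flush> {
    val first = arrivals.values.minOrNull() ?: return emptyList()
    var seen = 0
    return (0 until count).map { i ->
        val at = first + groupWaitSec + i * groupIntervalSec
        // The whole group is sent, not only the alerts that are new.
        val present = arrivals.filterValues { it <= at }.keys.toList()
        val flush = Flush(at, present, present.size > seen)
        seen = present.size
        flush
    }
}

fun main() {
    val arrivals = (1..5).associate { "A$it" to (it - 1) * 20 } // A1 at 0s ... A5 at 80s
    simulateFlushes(arrivals, groupWaitSec = 10, groupIntervalSec = 120, count = 2)
        .forEach { println("t=${it.atSec}s -> ${it.alerts}") }
    // t=10s -> [A1]
    // t=130s -> [A1, A2, A3, A4, A5]
}
```

A1 is delivered at 10s and again at 130s, because the group gained new members during the sleep cycle in between.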

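Best practices

If you need a notification strictly every 20 minutes, use a configuration like the following, so that every group_interval wake-up sends a notification. Here [0,20m) is range notation rather than a literal value: it stands for any duration from 0 up to, but not including, 20m.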

group_wait: "5s"
group_interval: "20m"
repeat_interval: [0,20m)

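Test program

You can observe how Alertmanager actually sends notifications under a given configuration by simulating the alert sender (such as Prometheus) and the receiver.

Sender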

import cn.hutool.http.HttpUtil
import cn.hutool.log.LogFactory
import cn.hutool.log.dialect.console.ConsoleColorLogFactory
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.*
import kotlin.system.exitProcess

fun main() {
    // API v1 endpoint; recent Alertmanager releases only serve /api/v2/alerts
    val alertUrl = "http://test1:8693/api/v1/alerts"
    LogFactory.setCurrentLogFactory(ConsoleColorLogFactory())
    val log = LogFactory.get("main")

    var index = 0
    // Run the task every 20 s and stop after five alerts
    val timer = Timer()
    timer.schedule(object : TimerTask() {
        override fun run() {
            if (index >= 5) exitProcess(0)
            // Current time in ISO 8601 format, e.g. 0001-01-01T00:00:00Z
            val time = getISO8601Time()
            // Send the alert
            HttpUtil.post(alertUrl, getBody(++index, time))
            log.info("send A${index}")
        }
    }, 0, 20_000)
}

// Format with the system offset instead of slicing LocalDateTime.toString(),
// which drops the seconds field when seconds and nanoseconds are both zero.
fun getISO8601Time(): String =
    OffsetDateTime.now().truncatedTo(ChronoUnit.SECONDS).format(DateTimeFormatter.ISO_OFFSET_DATE_TIME)

// Note: endsAt is a fixed timestamp. Set it to a time in the future,
// otherwise Alertmanager treats the alert as already resolved.
fun getBody(index: Int, time: String): String = """
    [{
        "labels": {
            "alertname": "A$index",
            "groupTag": "ice",
            "instance": "test1:9090",
            "job": "ice",
            "severity": "critical"
        },
        "annotations": {
            "description": "test$index",
            "summary": "test$index"
        },
        "startsAt": "$time",
        "endsAt": "2023-09-01T10:59:00.000+08:00",
        "generatorURL": "http://test1:9090/graph?g0.expr=up&g0.tab=1"
    }]
""".trimIndent()

Receiver

import cn.hutool.core.date.DateUtil
import cn.hutool.core.lang.ConsoleTable
import org.slf4j.Logger
import org.slf4j.LoggerFactory
import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.boot.runApplication
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RequestBody
import org.springframework.web.bind.annotation.RequestMapping
import org.springframework.web.bind.annotation.RestController
import java.util.concurrent.ConcurrentHashMap
import kotlin.concurrent.thread
import kotlin.concurrent.timer

@SpringBootApplication
class ReceiverApplication

fun main(args: Array<String>) {
    runApplication<ReceiverApplication>(*args)
}

// Assumed minimal shape of the webhook payload (these classes were missing
// from the original listing); only the fields used below are declared.
data class Alert(val status: String, val labels: Map<String, String>)
data class Notify(val alerts: List<Alert>)

@RestController
class AlertController {
    private val log: Logger = LoggerFactory.getLogger("alert")
    private val table: ConsoleTable = ConsoleTable.create()
        .addHeader("time", "elapsed (s)", "firing", "resolved", "group_interval", "repeat_interval")
    private val globalTimer = DateUtil.timer()

    // Written by the timer threads and read by request threads, hence a concurrent map
    private val row = ConcurrentHashMap<String, String>()

    // Marks each elapsed group_interval (2m), started on the first notification
    private var groupThread: Thread = thread(name = "group_interval", start = false) {
        timer(name = "group", period = 120_000) {
            row["group_interval"] = "1"
        }
    }

    // Marks each elapsed repeat_interval (5m), started on the first notification
    private var repeatThread: Thread = thread(name = "repeat_interval", start = false) {
        timer(name = "repeat", period = 300_000) {
            row["repeat_interval"] = "1"
        }
    }

    // Returns the table of all notifications received so far
    @GetMapping("/print")
    fun print() = table.toString()

    @RequestMapping("ice")
    fun webhook(@RequestBody notify: Notify) {
        log.info("receiver: ${notify.alerts.size} alerts")
        if (groupThread.state == Thread.State.NEW) {
            groupThread.start()
        }
        if (repeatThread.state == Thread.State.NEW) {
            repeatThread.start()
        }
        val actives = mutableListOf<String>()
        val releases = mutableListOf<String>()
        notify.alerts.forEach {
            when (it.status) {
                "firing" -> {
                    actives.add(it.labels["alertname"]!!)
                }
                "resolved" -> {
                    releases.add(it.labels["alertname"]!!)
                }
            }
        }
        table.addBody(
            DateUtil.now(),
            globalTimer.intervalSecond().toString(),
            actives.joinToString(","),
            releases.joinToString(","),
            row["group_interval"] ?: "",
            row["repeat_interval"] ?: "",
        )
        // Clear the interval markers once they have been recorded in a row
        row.remove("group_interval")
        row.remove("repeat_interval")
    }
}

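Once the configuration is in place, reload Alertmanager,

then start the receiver program and finally the sender program. After 10 alerts have been sent, pause for a while, then send one more alert and watch the log output.

Finally

This article does not yet cover Alertmanager clusters; it considers only a single Alertmanager instance.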