Prometheus has this line in its docs for recording rules:
Recording and alerting rules exist in a rule group. Rules within a group are run sequentially at a regular interval, with the same evaluation time.
(from the Recording Rules documentation page)
I read that a while ago, but at the time it wasn’t clear why it mattered. It seemed that groups were mostly intended to give a collection of recording rules a name. Why it matters became clear recently, when I tried to set up a recording rule in one group that used a metric produced by a recording rule in another group.
The expression for the first recording rule was something like this:
(
sum(rate(http:requests[5m]))
-
sum(rate(http:low_latency[5m]))
)
/
(
sum(rate(http:requests[5m]))
)
The result:
It’s showing a ratio of “slow” requests as a value from 0 to 1. Compare that graph to one that’s based on the raw metrics, not the pre-calculated ones:
The expression is:
(
sum(rate(http_requests_seconds_count{somelabel="filter"}[5m]))
-
sum(rate(http_requests_seconds_bucket{somelabel="filter", le="1"}[5m]))
)
/
(
sum(rate(http_requests_seconds_count{somelabel="filter"}[5m]))
)
The metrics used here correspond to the pre-calculated ones above. That is, http:requests is http_requests_seconds_count{somelabel="filter"}, and http:low_latency is http_requests_seconds_bucket{somelabel="filter", le="1"}. The graphs are similar, but the one using raw metrics doesn’t have the strange sharp spikes and drops.
I’m not sure exactly what’s going on here, but based on the explanation from the docs it’s probably a race between the evaluations of the two groups, resulting in an inconsistent number of samples being used for http:requests and http:low_latency. Maybe one has one sample fewer than the other at the time the first group’s expression is evaluated, which I think could show up as spikes.
Whatever the cause, the solution is simple: if one recording rule uses metrics produced by another, make sure they’re in the same group.
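In rule-file terms that means something like this, with the dependent rule listed after the ones it reads from, so all three are evaluated sequentially with the same evaluation time (names are placeholders again):

groups:
  - name: http                       # placeholder group name
    rules:
      - record: http:requests
        expr: 'http_requests_seconds_count{somelabel="filter"}'
      - record: http:low_latency
        expr: 'http_requests_seconds_bucket{somelabel="filter", le="1"}'
      # the dependent rule now runs in the same pass, right after
      # the two rules it uses
      - record: http:slow_request_ratio    # placeholder name
        expr: >
          (sum(rate(http:requests[5m])) - sum(rate(http:low_latency[5m])))
          / sum(rate(http:requests[5m]))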