ruler: add "user" and "reason" labels to ruler's queries metrics #10536

narqo · 2025-01-29T20:29:23Z

What this PR does

In this PR I'm adding the user and reason labels to the cortex_ruler_queries_failed_total and cortex_ruler_write_requests_failed_total metrics. The idea is to let tenants separate and track failures caused by server-side issues from those caused by client-side issues (e.g. a bad rule).

The exact list of "reasons" is up for discussion. I've noticed that Loki uses error and upstream_error (code). I'm not sure those map well onto what we want. Opinions are welcome.

For now, I've chosen to have only two reasons:

error for everything else
4xx for client-side errors in the remote querier. I think this will create the most value to start with, and we can expand or change the groups in the future. What do you think?

Note: for consistency, I also added the user label to cortex_ruler_write_requests_total and cortex_ruler_queries_total.
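
To make the proposed grouping concrete, here is a minimal, stdlib-only sketch of how a failure could be mapped to a reason and counted per tenant. It is not the code from this PR: failureReason, labelKey and failedQueries are hypothetical names, the map is only a stand-in for a counter with user and reason labels, and the reason values are the ones proposed above, which may still change.

```go
package main

import "fmt"

// Reason values as proposed in this description; the final names may differ.
const (
	reasonError     = "error" // any failure not classified below
	reasonClientErr = "4xx"   // the remote querier returned a client-side HTTP error
)

// failureReason is a hypothetical helper that maps an HTTP status code
// returned by the remote querier to a "reason" label value.
func failureReason(statusCode int) string {
	if statusCode >= 400 && statusCode < 500 {
		return reasonClientErr
	}
	return reasonError
}

// labelKey stands in for the (user, reason) label pair of a counter vector.
type labelKey struct{ user, reason string }

// failedQueries is a toy stand-in for cortex_ruler_queries_failed_total.
type failedQueries map[labelKey]int

// inc records one failed query for the given tenant.
func (f failedQueries) inc(user string, statusCode int) {
	f[labelKey{user, failureReason(statusCode)}]++
}

func main() {
	failed := failedQueries{}
	failed.inc("tenant-a", 400) // e.g. a bad rule: client-side
	failed.inc("tenant-a", 503) // server-side
	failed.inc("tenant-b", 422) // client-side
	for k, v := range failed {
		fmt.Printf("user=%s reason=%s count=%d\n", k.user, k.reason, v)
	}
}
```

With this split, an alert on server-side failures can select only the error reason, while tenants can watch their own 4xx series.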

TODO

Decide the set of "reason" values
Update the existing alerts that currently expect only server-side failures

Checklist

Tests updated.
Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
about-versioning.md updated with experimental features.

pkg/ruler/compat.go

dimitarvdimitrov

sorry for the scattered review comments

dimitarvdimitrov · 2025-02-06T09:24:15Z

...metamonitoring-values-generated/mimir-distributed/templates/metamonitoring/mixin-alerts.yaml

@@ -452,7 +452,8 @@ spec:
              runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulertoomanyfailedpushes
            expr: |
              100 * (
-              sum by (cluster, namespace, pod) (rate(cortex_ruler_write_requests_failed_total[1m]))
+              # Here it matches on empty "reason" for backwards compatibility, with when the metric didn't have this label.
+              sum by (cluster, namespace, pod) (rate(cortex_ruler_write_requests_failed_total{reason=~"(error|$^)"}[1m]))


The regex uses $^ (end, then start) instead of ^$ (start, then end). Is this what you intended?

Oops, thank you.
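
For anyone who wants to check how the two spellings behave, here is a small Go sketch. It assumes that a =~ matcher is fully anchored and emulates that by wrapping the pattern in ^(?:...)$; matchesAnchored is a hypothetical helper, not a Prometheus function. With Go's regexp package (RE2 syntax), both spellings should accept the empty value, so ^$ is mainly the more conventional and readable form.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesAnchored emulates a fully anchored regex label matcher:
// the pattern has to match the whole label value.
// The ^(?:...)$ wrapping is an assumption of this sketch.
func matchesAnchored(pattern, value string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(value)
}

func main() {
	patterns := []string{"(error|$^)", "(error|^$)"}
	values := []string{"", "error", "4xx"}
	for _, p := range patterns {
		for _, v := range values {
			fmt.Printf("%-12s value=%q match=%v\n", p, v, matchesAnchored(p, v))
		}
	}
}
```

The empty value corresponds to series written before the reason label existed, which is what the backwards-compatible alert needs to keep matching.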

dimitarvdimitrov · 2025-02-06T09:26:04Z

pkg/ruler/compat.go

-	})
-	failedWrites := promauto.With(reg).NewCounter(prometheus.CounterOpts{
+	}, []string{"user"})
+	failedWrites := promauto.With(reg).NewCounterVec(prometheus.CounterOpts{


don't forget to make this change in GEM's codebase too

Oh, I didn't know there was an override. I will open a PR with the update there right after merging this one.

Signed-off-by: Vladimir Varankin <[email protected]>

ruler: add "user" and "reason" labels to cortex_ruler_write and corte…

47b0bc6

…x_ruler_queries metrics Signed-off-by: Vladimir Varankin <[email protected]>

narqo force-pushed the vldmr/ruler-metrics-labels branch from 83d36f9 to 47b0bc6 on January 30, 2025 13:37

fix integration tests

9aa955b

Signed-off-by: Vladimir Varankin <[email protected]>

narqo changed the title ~~wip! ruler: add "user" and "reason" labels to cortex_ruler_write and cortex_ruler_queries metrics~~ ruler: add "user" and "reason" labels to ruler's queries metrics Jan 31, 2025

narqo marked this pull request as ready for review January 31, 2025 11:46

narqo requested review from a team as code owners January 31, 2025 11:46

narqo added 9 commits January 31, 2025 14:31

update alerting rule

af3159a

Signed-off-by: Vladimir Varankin <[email protected]>

refactor internals

e03d921

Signed-off-by: Vladimir Varankin <[email protected]>

fixup! update alerting rule

a5b090b

rebuild assets

b1020bb

Signed-off-by: Vladimir Varankin <[email protected]>

update CHANGELOG

685efa6

Signed-off-by: Vladimir Varankin <[email protected]>

fixup! rebuild assets

71e6e76

update alerts to be backwards compatible

6379c8d

Signed-off-by: Vladimir Varankin <[email protected]>

rebuild assets

7504e11

Signed-off-by: Vladimir Varankin <[email protected]>

fixup! rebuild assets

4f0f7c7

dimitarvdimitrov reviewed Feb 6, 2025

narqo added 3 commits February 6, 2025 14:22

rename default reason labels

6a0bba3

Signed-off-by: Vladimir Varankin <[email protected]>

fix alerting rule

8a1c73e

Signed-off-by: Vladimir Varankin <[email protected]>

rebuild assets

c0fd9e6

Signed-off-by: Vladimir Varankin <[email protected]>

narqo requested a review from dimitarvdimitrov February 6, 2025 14:28
