Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ruler: add "user" and "reason" labels to ruler's queries metrics #10536

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

narqo
Copy link
Contributor

@narqo narqo commented Jan 29, 2025

What this PR does

In this PR I'm adding the user and reason labels to the cortex_ruler_queries_failed_total and cortex_ruler_write_requests_failed_total metrics. The idea behind the change is to allow tenants segregate and track separately the failures, that happened due to issues on the server vs. those happened due to issues on the client-side (e.g. bad rule).

The exact list of "reasons" is up for a discussion. I've noticed that Loki uses error and upstream_error (code). I'm not sure if those are well translated into what we want. Opinions are welcome.

For now, I've chosen to have only two reasons:

  • error for everything
  • 4xx for the case when there is a client-side error in the remote-querier — I think, this will create the most value to start with. I think, we can expand/change the groups in the future. What do you think?

Note, for consistency, I also added the user labels to cortex_ruler_write_requests_total and cortex_ruler_queries_total.


TODO

  • Decide the set of "reason" values
  • Update the existing alerts, that currently expect only server-side failures

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

…x_ruler_queries metrics

Signed-off-by: Vladimir Varankin <[email protected]>
@narqo narqo force-pushed the vldmr/ruler-metrics-labels branch from 83d36f9 to 47b0bc6 Compare January 30, 2025 13:37
Signed-off-by: Vladimir Varankin <[email protected]>
@narqo narqo changed the title wip! ruler: add "user" and "reason" labels to cortex_ruler_write and cortex_ruler_queries metrics ruler: add "user" and "reason" labels to ruler's queries metrics Jan 31, 2025
@narqo narqo marked this pull request as ready for review January 31, 2025 11:46
@narqo narqo requested review from a team as code owners January 31, 2025 11:46
narqo added 9 commits January 31, 2025 14:31
Signed-off-by: Vladimir Varankin <[email protected]>
Signed-off-by: Vladimir Varankin <[email protected]>
Signed-off-by: Vladimir Varankin <[email protected]>
Signed-off-by: Vladimir Varankin <[email protected]>
Signed-off-by: Vladimir Varankin <[email protected]>
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the scattered review comments

@@ -452,7 +452,8 @@ spec:
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulertoomanyfailedpushes
expr: |
100 * (
sum by (cluster, namespace, pod) (rate(cortex_ruler_write_requests_failed_total[1m]))
# Here it matches on empty "reason" for backwards compatibility, with when the metric didn't have this label.
sum by (cluster, namespace, pod) (rate(cortex_ruler_write_requests_failed_total{reason=~"(error|$^)"}[1m]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the regex matches EndStart instead of StartEnd is this what you intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, thank you.

})
failedWrites := promauto.With(reg).NewCounter(prometheus.CounterOpts{
}, []string{"user"})
failedWrites := promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to make this change in GEM's codebase too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I didn't know there is an override. I will open a PR with the update there right after merging this one.

narqo added 3 commits February 6, 2025 14:22
Signed-off-by: Vladimir Varankin <[email protected]>
Signed-off-by: Vladimir Varankin <[email protected]>
Signed-off-by: Vladimir Varankin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants