ruler: add "user" and "reason" labels to ruler's queries metrics #10536
…x_ruler_queries metrics Signed-off-by: Vladimir Varankin <[email protected]>
83d36f9 to 47b0bc6
Signed-off-by: Vladimir Varankin <[email protected]>
sorry for the scattered review comments
@@ -452,7 +452,8 @@ spec:
 runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulertoomanyfailedpushes
 expr: |
 100 * (
-sum by (cluster, namespace, pod) (rate(cortex_ruler_write_requests_failed_total[1m]))
+# Here it matches on empty "reason" for backwards compatibility, with when the metric didn't have this label.
+sum by (cluster, namespace, pod) (rate(cortex_ruler_write_requests_failed_total{reason=~"(error|$^)"}[1m]))
The regex matches EndStart (`$^`) instead of StartEnd (`^$`). Is this what you intended?
Ooops, thank you.
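For readers following this thread, here is a small sketch of how the two spellings behave. It assumes that label matchers are fully anchored (modelled here by wrapping the pattern in `^(?:...)$`) and that Go's standard `regexp` package is a fair stand-in for the engine evaluating the matcher; the `anchored` helper is hypothetical and only exists for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// anchored is a hypothetical helper that wraps a pattern so that the
// whole label value must match, which is how label matchers are
// assumed to be evaluated in this sketch.
func anchored(pattern string) *regexp.Regexp {
	return regexp.MustCompile("^(?:" + pattern + ")$")
}

func main() {
	endStart := anchored(`error|$^`) // spelling used in the diff
	startEnd := anchored(`error|^$`) // conventional "empty string" spelling

	// Compare both spellings on an empty label value, the "error"
	// reason, and a value that should be excluded.
	for _, v := range []string{"", "error", "4xx"} {
		fmt.Printf("%q: $^ -> %v, ^$ -> %v\n",
			v, endStart.MatchString(v), startEnd.MatchString(v))
	}
}
```

Under these assumptions both spellings accept the empty value and `error` and reject `4xx`, so the difference is one of readability rather than behaviour.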
-})
-failedWrites := promauto.With(reg).NewCounter(prometheus.CounterOpts{
+}, []string{"user"})
+failedWrites := promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
don't forget to make this change in GEM's codebase too
Oh. I didn't know there is an override. I will open a PR with the update there right after merging this one.
What this PR does
In this PR I'm adding the `user` and `reason` labels to the `cortex_ruler_queries_failed_total` and `cortex_ruler_write_requests_failed_total` metrics. The idea behind the change is to let tenants separate and track failures caused by issues on the server from those caused by issues on the client side (e.g. a bad rule).

The exact list of "reasons" is up for discussion. I've noticed that Loki uses `error` and `upstream_error` (code). I'm not sure those translate well into what we want. Opinions are welcome.

For now, I've chosen to have only two reasons:
- `error` for everything
- `4xx` for the case when there is a client-side error in the remote-querier

I think this will create the most value to start with, and we can expand or change the groups in the future. What do you think?

Note: for consistency, I also added the `user` label to `cortex_ruler_write_requests_total` and `cortex_ruler_queries_total`.
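
The two-reason split described above could be sketched roughly as follows. This is not the code from the PR: `failureReason` is a hypothetical helper, and the sketch assumes the remote-querier failure comes with an HTTP status code and that `error` acts as the catch-all for anything that is not a 4xx response.

```go
package main

import (
	"fmt"
	"net/http"
)

// Label values for the "reason" label, as named in the PR description.
const (
	reasonError = "error"
	reason4xx   = "4xx"
)

// failureReason is a hypothetical helper that maps an HTTP status code
// to a "reason" label value. Client-side errors (400..499) get "4xx";
// everything else falls back to "error" (an assumption of this sketch).
func failureReason(statusCode int) string {
	if statusCode >= http.StatusBadRequest && statusCode < http.StatusInternalServerError {
		return reason4xx
	}
	return reasonError
}

func main() {
	// Print the reason chosen for a client error, a rate-limit
	// response, and a server error.
	for _, code := range []int{
		http.StatusBadRequest,
		http.StatusTooManyRequests,
		http.StatusInternalServerError,
	} {
		fmt.Printf("status %d -> reason %q\n", code, failureReason(code))
	}
}
```

Keeping the set of values this small bounds the cardinality added by the `reason` label, which matters more once the per-tenant `user` label is also present.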
TODO
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.