WIP: bindata/alerts/slo: improve burnrate calculation #1744

dgrisonnet · 2024-09-26T16:06:00Z

The problem I recently noticed with the existing expression is that when we compute the overall burnrate from write and read requests, we take the ratio of failed read requests and add it to that of write requests. But each of these ratios is calculated against its own request type, not the total number of requests. This is only correct when the proportion of write and read requests is equal.

For example, let's imagine a scenario where 40% of requests are write requests and only 50% of them succeed during a disruption, whilst roughly 90% of read requests succeed.

Taking 10 requests as a base (2 of 4 writes fail and, rounding up, 1 of 6 reads fails), apiserver_request:burnrate1h{verb="write"} would be equal to 2/4 and apiserver_request:burnrate1h{verb="read"} would be 1/6.
The sum of these, as used by the alert today, would be equal to 2/4+1/6=2/3, when in reality the ratio of failed requests should be 2/10+1/10=3/10. So there is quite a large difference today because we don't account for the total number of requests.
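
The arithmetic in this example can be checked with a short Python sketch. The function names and the base of 10 requests (4 writes with 2 failures, 6 reads with 1 failure) are assumptions chosen to reproduce the 2/4 and 1/6 burn rates above; this is not code from the PR.

```python
from fractions import Fraction

def summed_burnrate(failed, total):
    """Sum of per-verb failure ratios, each against its own request type."""
    return sum(Fraction(failed[v], total[v]) for v in total)

def weighted_burnrate(failed, total):
    """Per-verb failure ratios weighted by each verb's share of all requests."""
    all_requests = sum(total.values())
    return sum(
        Fraction(failed[v], total[v]) * Fraction(total[v], all_requests)
        for v in total
    )

# Hypothetical counts matching the example: 40% writes, 60% reads.
total = {"write": 4, "read": 6}
failed = {"write": 2, "read": 1}

print(summed_burnrate(failed, total))    # 2/3
print(weighted_burnrate(failed, total))  # 3/10
```

Weighting each ratio by its share of traffic is equivalent to dividing all failed requests by all requests, which is why the second value matches 3/10.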

The only problem we will face with this change is that we won't be able to use the recording rules to set up different SLOs depending on the type of request.
But this could always be addressed by changing the burn rate alert expression to the following instead of modifying the recording rules:

        sum(
          apiserver_request:burnrate1h{verb="read"}
          *
          (
            sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[1h]))
            /
            sum by (cluster) (rate(apiserver_request_total{job="apiserver"}[1h]))
          )
          +
          apiserver_request:burnrate1h{verb="write"}
          *
          (
            sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h]))
            /
            sum by (cluster) (rate(apiserver_request_total{job="apiserver"}[1h]))
          )
        ) > (14.40 * 0.01000)
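
As a rough model of what the expression above evaluates, the following Python sketch applies the same weighting and threshold to plain numbers. The function and parameter names are hypothetical, and reading 14.40 as a burn-rate factor and 0.01000 as the error budget is an assumption about the alert, not something stated in this PR.

```python
def alert_fires(read_burn, write_burn, read_share, write_share,
                factor=14.4, error_budget=0.01):
    """Weight each burn rate by its share of traffic, then compare to the threshold."""
    weighted = read_burn * read_share + write_burn * write_share
    return weighted > factor * error_budget

# Values from the example in the description: 60% reads, 40% writes.
print(alert_fires(read_burn=1/6, write_burn=2/4, read_share=0.6, write_share=0.4))
# A much smaller burn rate on both verbs stays under the threshold.
print(alert_fires(read_burn=0.01, write_burn=0.02, read_share=0.6, write_share=0.4))
```

The first call lands at about 0.3, above the 0.144 threshold; the second lands at 0.014, below it.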

Signed-off-by: Damien Grisonnet <[email protected]>

openshift-ci · 2024-09-26T16:11:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet

Needs approval from an approver in each of these files:

~~OWNERS~~ [dgrisonnet]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vrutkovs · 2024-09-26T17:49:33Z

/cc

openshift-ci · 2024-09-26T20:21:28Z

@dgrisonnet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-upgrade | `56d01d8` | link | true | `/test e2e-aws-ovn-upgrade` |
| ci/prow/e2e-aws-ovn-serial | `56d01d8` | link | true | `/test e2e-aws-ovn-serial` |
| ci/prow/e2e-gcp-operator-single-node | `56d01d8` | link | false | `/test e2e-gcp-operator-single-node` |
| ci/prow/e2e-aws-operator-disruptive-single-node | `56d01d8` | link | false | `/test e2e-aws-operator-disruptive-single-node` |

vrutkovs · 2024-10-14T14:16:53Z

That makes sense to me. The other burnrates (burnrate6h etc.) should be updated as well.

openshift-bot · 2025-01-13T01:00:43Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

bindata/alerts/slo: fix burnrate calculation

56d01d8

Signed-off-by: Damien Grisonnet <[email protected]>

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2024

openshift-ci bot requested review from benluddy and deads2k September 26, 2024 16:11

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2024

openshift-ci bot requested a review from vrutkovs September 26, 2024 17:49

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2025
