WIP: bindata/alerts/slo: improve burnrate calculation #1744
base: master
Signed-off-by: Damien Grisonnet <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgrisonnet
That makes sense to me, other burnrates ( |
The problem I recently noticed with the existing expression is that when we compute the overall burnrate from write and read requests, we take the ratio of failed read requests and add it to the ratio of failed write requests. But each of these ratios is calculated against its own request type, not against the total number of requests. This is only correct when the proportion of write and read requests is equal.
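A minimal sketch of the difference in Python, assuming hypothetical failed/total counts per request type. The function names are illustrative and are not part of the recording rules or the alert expression:

```python
from fractions import Fraction


def summed_burnrate(failed_write, total_write, failed_read, total_read):
    """Sum of per-type error ratios, each relative to its own request type."""
    return Fraction(failed_write, total_write) + Fraction(failed_read, total_read)


def overall_burnrate(failed_write, total_write, failed_read, total_read):
    """Error ratio relative to the total number of requests."""
    return Fraction(failed_write + failed_read, total_write + total_read)


# Hypothetical counts: 4 writes (2 failed) and 6 reads (1 failed).
counts = (2, 4, 1, 6)
print(summed_burnrate(*counts))   # 2/3
print(overall_burnrate(*counts))  # 3/10
```

Exact fractions are used here only to keep the arithmetic easy to follow. The real rules operate on request rates, not raw counts.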
For example, let's imagine a scenario where 40% of requests are write requests and only 50% of them succeed during a disruption, while 5 out of 6 read requests (about 83%) succeed. Out of 10 requests, that is 2 failed writes out of 4 and 1 failed read out of 6.

`apiserver_request:burnrate1h{verb="write"}` would be equal to `2/4` and `apiserver_request:burnrate1h{verb="read"}` would be `1/6`. The sum of these, as used by the alert today, would be equal to `2/4+1/6=2/3`, when in reality the ratio of failed requests should be `2/10+1/10=3/10`. So there is quite a large difference today when we don't account for the total number of requests.

The only problem we will face with this change is that we won't be able to use the recording rules to set up different SLOs depending on the type of request.
But this could always be addressed by changing the burn rate alert expression to the following instead of modifying the recording rules: