-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MaxSurge
and MaxUnavailable
strategy to all Loki k8 workloads.
#5227
Conversation
This fixes couple of issues. 1. By default these configs are 25% in k8, meaning during rollout 25% of pods are allowed to shutdown immediately. 2. Due to (1), during graceful shutdown process, 25% of all the pods access consul to `unregister()` from shared key value. (2) makes CAS rate of underlying KV store high (leads to lots of retry and failing) sometimes failing to unregister leaving the ring "unhealthy" Also this PR make these configs consistent across all k8 workloads. More details: grafana/dskit#117
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
I'll have a look at our helmcharts to have those there as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to rollback statefulset changes, deployment changes makes sense to me.
statefulSet.mixin.spec.strategy.rollingUpdate.withMaxSurge(0) + | ||
statefulSet.mixin.spec.strategy.rollingUpdate.withMaxUnavailable(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maxSurge and maxUnavailable are not supported by statefulset. There is an open proposal to add support for maxUnavailable to statefulsets and maxSurge would not work with statefulset as per comment which makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch @sandeepsukhani . Fixed 👍
Signed-off-by: Kaviraj <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What this PR does / why we need it:
This PR makes two changes.
MaxSurge:5
andMaxUnavailable:1
for all the stateless workloadsMaxSurge:0
andMaxUnavailable:1
for all the stateful workloadsThis fixes couple of issues.
unregister()
from shared key value.(2) makes CAS rate of underlying KV store high (leads to lots of retry and failing) sometimes failing to unregister leaving the ring "unhealthy"
Also this PR make these configs consistent across all k8 workloads.
More details: grafana/dskit#117
Which issue(s) this PR fixes:
Fixes #5191
Special notes for your reviewer:
Checklist
CHANGELOG.md
about the changes.