Add `MaxSurge` and `MaxUnavailable` strategy to all Loki k8 workloads. #5227

kavirajk · 2022-01-24T20:34:09Z

What this PR does / why we need it:
This PR makes two changes.

1. Have `MaxSurge: 5` and `MaxUnavailable: 1` for all the stateless workloads.
2. Have `MaxSurge: 0` and `MaxUnavailable: 1` for all the stateful workloads.

This fixes a couple of issues.

1. By default, both settings are 25% in k8s, meaning that during a rollout 25% of the pods are allowed to shut down at once.
2. Due to (1), during the graceful shutdown process, 25% of all the pods access Consul at the same time to `unregister()` from the shared key-value store.

(2) drives up the CAS rate on the underlying KV store, which leads to many retries and failures. Sometimes a pod fails to unregister, leaving the ring "unhealthy".
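
To make the arithmetic in (1) concrete, here is a minimal Python sketch of how a percentage `maxUnavailable` resolves to a pod count. It assumes the documented Kubernetes behaviour that a percentage `maxUnavailable` is rounded down to an absolute number. The helper name `resolve_max_unavailable` and the replica counts are hypothetical and not taken from Loki or this PR.

```python
def resolve_max_unavailable(value, replicas):
    """Resolve an absolute number or an 'N%' string to a pod count.

    Assumption: percentages are rounded down, as described in the
    Kubernetes Deployment documentation for maxUnavailable.
    """
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return replicas * percent // 100  # integer math, rounds down
    return int(value)


# Hypothetical replica counts, chosen only for illustration.
for replicas in (4, 20, 100):
    default = resolve_max_unavailable("25%", replicas)
    pinned = resolve_max_unavailable(1, replicas)
    print(f"{replicas} replicas: default 25% -> {default} pod(s), pinned -> {pinned} pod(s)")
```

Under these assumptions, a 100-replica workload on the default lets 25 pods enter shutdown together, while pinning the value to 1 lets only one pod unregister at a time.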

This PR also makes these settings consistent across all k8s workloads.

More details: grafana/dskit#117

Which issue(s) this PR fixes:
Fixes #5191

Special notes for your reviewer:

Checklist

Documentation added
Tests updated
Add an entry in the CHANGELOG.md about the changes.

DylanGuedes

Looks great!

I'll have a look at our helmcharts to have those there as well.

sandeepsukhani

I think you need to roll back the StatefulSet changes. The Deployment changes make sense to me.

sandeepsukhani · 2022-01-25T08:43:07Z

production/ksonnet/loki/boltdb_shipper.libsonnet

+    statefulSet.mixin.spec.strategy.rollingUpdate.withMaxSurge(0) +
+    statefulSet.mixin.spec.strategy.rollingUpdate.withMaxUnavailable(1)


`maxSurge` and `maxUnavailable` are not supported by StatefulSets. There is an open proposal to add `maxUnavailable` support to StatefulSets and, as per the linked comment, `maxSurge` would not work with StatefulSets, which makes sense.

Good catch, @sandeepsukhani. Fixed 👍
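
The field mismatch raised in this thread can be sketched as a small lookup. This is only an illustration: it assumes a Kubernetes version in which a StatefulSet's `spec.updateStrategy.rollingUpdate` accepts only `partition`, while a Deployment's `spec.strategy.rollingUpdate` accepts `maxSurge` and `maxUnavailable`. The table `ROLLING_UPDATE_FIELDS` and the helper `unsupported_fields` are hypothetical names, not part of any Kubernetes or ksonnet API.

```python
# Assumed rolling-update fields per workload kind:
# (path to the rollingUpdate block, set of accepted field names).
ROLLING_UPDATE_FIELDS = {
    "Deployment": ("spec.strategy.rollingUpdate", {"maxSurge", "maxUnavailable"}),
    "StatefulSet": ("spec.updateStrategy.rollingUpdate", {"partition"}),
}


def unsupported_fields(kind, fields):
    """Return the requested fields that the given kind does not accept."""
    _, allowed = ROLLING_UPDATE_FIELDS[kind]
    return sorted(set(fields) - allowed)


requested = ["maxSurge", "maxUnavailable"]
print("Deployment:", unsupported_fields("Deployment", requested))
print("StatefulSet:", unsupported_fields("StatefulSet", requested))
```

Under that assumption, both requested fields are flagged for the StatefulSet and neither for the Deployment.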

Signed-off-by: Kaviraj <[email protected]>

sandeepsukhani

LGTM

kavirajk requested a review from a team as a code owner January 24, 2022 20:34

pull-request-size bot added the size/M label Jan 24, 2022

DylanGuedes approved these changes Jan 24, 2022

sandeepsukhani reviewed Jan 25, 2022

Remove it from statefulset workloads

c604e2f

Signed-off-by: Kaviraj <[email protected]>

pull-request-size bot added size/S and removed size/M labels Jan 25, 2022

sandeepsukhani approved these changes Jan 25, 2022

kavirajk merged commit db283da into main Jan 25, 2022

kavirajk deleted the k8-maxsurge-maxunavailable-consistent branch January 25, 2022 09:31
