Add MaxSurge and MaxUnavailable strategy to all Loki k8 workloads. #5227

Merged
merged 2 commits into from
Jan 25, 2022

Conversation

kavirajk

What this PR does / why we need it:
This PR makes two changes.

  1. Have MaxSurge:5 and MaxUnavailable:1 for all the stateless workloads
  2. Have MaxSurge:0 and MaxUnavailable:1 for all the stateful workloads

This fixes a couple of issues.

  1. By default these settings are 25% in Kubernetes, meaning that during a rollout 25% of the pods are allowed to shut down at once.
  2. Due to (1), during graceful shutdown 25% of all the pods access Consul to unregister() from the shared key-value store.
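
To make the 25% default concrete, here is a rough sketch of how many pods may be taken down at once during a rollout. It assumes the documented Kubernetes rule that a percentage maxUnavailable is rounded down to an absolute pod count; the function name and the replica counts are illustrative only and do not come from this PR.

```python
def max_unavailable_pods(replicas: int, max_unavailable) -> int:
    """Absolute number of pods that may be unavailable during a rollout.

    Accepts either an int (absolute count) or a string such as "25%".
    Percentages are rounded down (assumed to match the Deployment rule).
    """
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        return replicas * percent // 100
    return int(max_unavailable)


# Illustrative replica counts, not taken from any real cluster.
for replicas in (8, 40, 100):
    default = max_unavailable_pods(replicas, "25%")
    fixed = max_unavailable_pods(replicas, 1)
    print(f"{replicas} replicas: 25% default -> {default} pods, maxUnavailable=1 -> {fixed} pod")
```

With a fixed maxUnavailable of 1, the number of pods unregistering at the same time no longer grows with the size of the workload.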

(2) drives up the CAS rate on the underlying KV store, which leads to many retries and failures. Sometimes a pod fails to unregister, leaving the ring "unhealthy".
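
A toy model of why this hurts, assuming a worst case in which every shutting-down pod reads the same version of the shared key and only one compare-and-swap can win per round. This is not how Consul or dskit is implemented; `cas_attempts` is a hypothetical helper that only illustrates how retries grow with concurrency.

```python
from typing import Tuple


def cas_attempts(concurrent_pods: int) -> Tuple[int, int]:
    """Toy model of compare-and-swap contention on one shared key.

    Each round, every pending pod reads the same version and tries to write.
    Exactly one write wins; the rest hold a stale version and must retry.
    Returns (total_attempts, failed_attempts).
    """
    version = 0
    pending = concurrent_pods
    attempts = 0
    failures = 0
    while pending > 0:
        read_version = version  # all pending pods read this version
        for _ in range(pending):
            attempts += 1
            if read_version == version:
                version += 1  # first writer wins the CAS
            else:
                failures += 1  # stale version: CAS rejected, pod retries
        pending -= 1  # the winner has unregistered
    return attempts, failures


for pods in (1, 10, 25):
    total, failed = cas_attempts(pods)
    print(f"{pods} pods unregistering together: {total} attempts, {failed} failed")
```

In this model the failed attempts grow quadratically with the number of pods shutting down together, whereas a single pod at a time never conflicts with another unregistering pod.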

Also, this PR makes these configs consistent across all k8s workloads.

More details: grafana/dskit#117

Which issue(s) this PR fixes:
Fixes #5191

Special notes for your reviewer:

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

@kavirajk kavirajk requested a review from a team as a code owner January 24, 2022 20:34

@DylanGuedes DylanGuedes left a comment


Looks great!

I'll have a look at our Helm charts to add these there as well.


@sandeepsukhani sandeepsukhani left a comment


I think you need to roll back the statefulset changes; the deployment changes make sense to me.

Comment on lines 74 to 75
statefulSet.mixin.spec.strategy.rollingUpdate.withMaxSurge(0) +
statefulSet.mixin.spec.strategy.rollingUpdate.withMaxUnavailable(1)

maxSurge and maxUnavailable are not supported by StatefulSets. There is an open proposal to add support for maxUnavailable to StatefulSets, and maxSurge would not work with StatefulSets as per the comment, which makes sense.
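
For reference, a sketch of where the rollout settings live on each workload kind, written as Python dicts that mirror the manifest fields. The Deployment values are the ones from this PR's description. The StatefulSet fragment shows only `partition`, which is assumed here to be the rolling-update field available at the time of this PR; its value of 0 is a placeholder, and the variable names are illustrative.

```python
import json

# Deployment: spec.strategy.rollingUpdate accepts maxSurge and maxUnavailable.
deployment_patch = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": 5, "maxUnavailable": 1},
        }
    }
}

# StatefulSet: the field is spec.updateStrategy (not spec.strategy), and
# maxSurge / maxUnavailable are deliberately absent, per the review comment.
statefulset_patch = {
    "spec": {
        "updateStrategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"partition": 0},  # placeholder value
        }
    }
}

print(json.dumps(deployment_patch, indent=2))
print(json.dumps(statefulset_patch, indent=2))
```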

Contributor Author


Good catch, @sandeepsukhani. Fixed 👍

@pull-request-size pull-request-size bot added size/S and removed size/M labels Jan 25, 2022

@sandeepsukhani sandeepsukhani left a comment


LGTM

@kavirajk kavirajk merged commit db283da into main Jan 25, 2022
@kavirajk kavirajk deleted the k8-maxsurge-maxunavailable-consistent branch January 25, 2022 09:31
Successfully merging this pull request may close these issues.

Distributor ring not removing it's key from kvstore during shutdown.