Rollout operator should stop rollout if a zone produces a high error rate #178

Open
armandgrillet opened this issue Oct 30, 2024 · 0 comments

During previous rollouts, we saw that customer-facing errors started when the rollout operator rolled the second zone (zone-b).

If we had stopped the rollout after the first zone because we observed a high error rate on ingesters, we could have avoided customer impact.

We also need to pause for a minute or so after the first zone to check for errors. We have seen errors starting immediately as zone-a goes online and zone-b is terminated.
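To make the "pause, then check" idea concrete, here is a minimal sketch of such a gate. It is not how rollout-operator works today; the function name, the 1% threshold, the 60-second soak, and the `read_error_rate` callback are all assumptions chosen for illustration.

```python
import time
from typing import Callable


def gate_after_first_zone(
    read_error_rate: Callable[[], float],  # hypothetical: returns the current error ratio (0.0 to 1.0)
    max_error_rate: float = 0.01,          # assumed threshold, not taken from the issue
    soak_seconds: float = 60.0,            # "a minute or so" after the first zone
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the rollout may continue to the next zone, False to stop it."""
    waited = 0.0
    while True:
        # Stop as soon as the error rate crosses the threshold.
        if read_error_rate() > max_error_rate:
            return False
        # The soak period passed without a high error rate: continue.
        if waited >= soak_seconds:
            return True
        step = min(poll_seconds, soak_seconds - waited)
        sleep(step)
        waited += step
```

Polling during the soak period, rather than checking once at the end, lets the gate stop the rollout as soon as errors appear instead of waiting out the full minute.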

Some more thoughts from a meeting where we talked about this issue:

  • The problem might not be in ingesters at all, so they might not even know that something's broken and will continue to roll out.
  • Not sure if rollout-operator should be able to query and understand Prometheus metrics.
  • rollout-operator already checks whether all pods are ready/healthy, so another proposal: what if we add another annotation? E.g. grafana.com/rollout-operator/must-be-healthy: deploy/foo, which would indicate that deploy/foo must have all its pods ready and healthy before the rollout proceeds.

Then we can run a cell-health-check deployment that would do the necessary checks (read status from memberlist, run PromQL, etc.) and would just expose everything through its readiness/healthiness endpoint. We could also use that deployment to export metrics that would unblock the CD process and roll out the next cell.
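A rough sketch of what such a cell-health-check could look like, using only the Python standard library. The `/ready` path, the check names, and the helper functions are hypothetical; the real memberlist and PromQL checks are left as callables to be filled in.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def evaluate(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Run every check; return (200, "ok") only if all of them pass."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises is treated as failing, so the pod goes unready.
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, "failing checks: " + ", ".join(sorted(failed))
    return 200, "ok"


def make_handler(checks: Dict[str, Check]):
    """Build a request handler that serves the aggregated result on /ready."""

    class ReadyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/ready":
                self.send_error(404)
                return
            status, body = evaluate(checks)
            payload = body.encode()
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return ReadyHandler


def serve(checks: Dict[str, Check], port: int = 8080) -> None:
    """Serve the readiness endpoint until the process is stopped."""
    HTTPServer(("", port), make_handler(checks)).serve_forever()
```

With a Kubernetes readiness probe pointed at `/ready`, the pod would report unready whenever any check fails, which is the signal the proposed must-be-healthy annotation would have rollout-operator wait on. Calling `serve({"memberlist": ..., "promql_error_rate": ...})` with real check functions would start the endpoint.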

Original authors of this issue: @krajorama, @bboreham, @colega.
