During previous rollouts, we saw that the errors towards customers started when the second zone (zone-b) was rolled by the rollout operator.
If we had stopped the rollout after the first zone because we observed a high error rate on ingesters, we could have avoided customer impact.
We also need to pause for a minute or so after the first zone to check for errors. We have seen errors starting immediately as zone-a goes online and zone-b is terminated.
Some more thoughts from a meeting where we talked about this issue:
The problem might not be in the ingesters at all, so they might not even know that something is broken and will continue to roll out.
Not sure if rollout-operator should be able to query and understand Prometheus metrics.
Something that rollout-operator already does is check whether all pods are ready/healthy, hence another proposal: what if we add another annotation? E.g. grafana.com/rollout-operator/must-be-healthy: deploy/foo, which would indicate that deploy/foo must have all its pods ready and healthy before the rollout proceeds.
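To make the annotation proposal concrete, here is a rough Go sketch of how the gate could be evaluated. None of this exists in rollout-operator today: the annotation key is taken from the proposal above, while `Target`, `ParseTarget`, `CanProceed`, and the `<kind>/<name>` value format are assumptions made for illustration. The ready and desired replica counts are passed in directly so the sketch does not depend on any Kubernetes client.

```go
package main

import (
	"fmt"
	"strings"
)

// MustBeHealthyAnnotation is the annotation key proposed in this issue.
// It is a proposal only, not an existing rollout-operator feature.
const MustBeHealthyAnnotation = "grafana.com/rollout-operator/must-be-healthy"

// Target identifies the workload that gates the rollout (hypothetical type).
type Target struct {
	Kind string // e.g. "deploy"
	Name string // e.g. "foo"
}

// ParseTarget splits a value such as "deploy/foo" into kind and name.
// The "<kind>/<name>" format is an assumption based on the example value.
func ParseTarget(value string) (Target, error) {
	kind, name, ok := strings.Cut(strings.TrimSpace(value), "/")
	if !ok || kind == "" || name == "" || strings.Contains(name, "/") {
		return Target{}, fmt.Errorf("invalid %s value %q, want <kind>/<name>",
			MustBeHealthyAnnotation, value)
	}
	return Target{Kind: kind, Name: name}, nil
}

// CanProceed reports whether the rollout may continue to the next zone.
// Without the annotation the gate is open. With it, the target must have at
// least one desired replica and all desired replicas ready.
func CanProceed(annotations map[string]string, desired, ready int32) (bool, error) {
	value, present := annotations[MustBeHealthyAnnotation]
	if !present {
		return true, nil
	}
	if _, err := ParseTarget(value); err != nil {
		return false, err
	}
	return desired > 0 && ready >= desired, nil
}

func main() {
	annotations := map[string]string{MustBeHealthyAnnotation: "deploy/foo"}
	// Two of three pods ready: the gate stays closed.
	ok, err := CanProceed(annotations, 3, 2)
	fmt.Println(ok, err)
}
```

Treating a target with zero desired replicas as unhealthy is a deliberate choice in this sketch: a gate that nobody is running should block the rollout rather than silently let it through.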
Then we can run a cell-health-check deployment that would do the necessary checks (read status from memberlist, run PromQL, etc.) and would expose the result through its readiness/healthiness endpoint. We could also use that deployment to export metrics that would unblock the CD process and roll out the next cell.
Original authors of this issue: @krajorama, @bboreham, @colega.