Commit
fix: daily reprocess job interferes with NPM follower (#731)
The daily reprocessing job enqueues 60,000 packages all at once every day to be reprocessed. This completely overloads the ECS cluster, with the following consequences:

* On-demand work (like the NPM follower) is drowned out by 60k work items, and our latency on processing package updates spikes to 30 minutes or sometimes even more than an hour, triggering alarms.
* Since there is no backpressure, we keep hammering the ECS cluster trying to start tasks, retrying until it succeeds. Sometimes this still fails eventually and messages end up in the DLQ, triggering alarms and requiring human intervention.

This PR mellows the reprocessing driver out a bit by not having it fill the worker queue as fast as it can: instead, we feed the system work at about the pace we know it can process it, with a little extra margin for on-demand work. We do that by sleeping in between the redrive batches.

Based on estimates I've done, it takes on average about 4 minutes to process a single package. With 1000 ECS workers, that means processing 60,000 items takes 4 hours. This PR raises the delay between batches of 1000 jobs to 5 minutes (20% margin). This will make us process the 60k items in 5 hours instead of 4, but at least we won't be triggering as many alarms as we used to.

Backpressure and fair queueing would have been a better solution, but that requires a more thorough rearchitecting of the system. In the meantime, this solution will alleviate the worst pressure.

The parameters of the wait calculation are directly available in the code, and will be easy to change once this is deployed and we can see how it behaves.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
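The pacing arithmetic above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: every name here (`PacingParams`, `batchCount`, `totalHours`, `headroom`, `redriveInBatches`) is hypothetical, and only the numbers (60,000 items, batches of 1000, about 4 minutes per package, a 5-minute delay) come from the description.

```typescript
// Hypothetical sketch of the wait calculation; names are assumptions,
// values are taken from the commit description.
interface PacingParams {
  totalItems: number;     // packages enqueued per day (60,000)
  batchSize: number;      // jobs per redrive batch (1000, matching the worker count)
  minutesPerItem: number; // estimated average processing time (about 4)
  delayMinutes: number;   // sleep between batches (5)
}

// Number of batches needed to feed all items.
function batchCount(p: PacingParams): number {
  return Math.ceil(p.totalItems / p.batchSize);
}

// Approximate wall-clock duration of the whole run, in hours.
function totalHours(p: PacingParams): number {
  return (batchCount(p) * p.delayMinutes) / 60;
}

// Fraction of each delay window left free for on-demand work.
function headroom(p: PacingParams): number {
  return 1 - p.minutesPerItem / p.delayMinutes;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Feed work in batches, sleeping between them instead of enqueueing everything at once.
// `enqueue` stands in for whatever actually sends a batch to the worker queue.
async function redriveInBatches<T>(
  items: T[],
  batchSize: number,
  delayMs: number,
  enqueue: (batch: T[]) => Promise<void>,
  sleepFn: (ms: number) => Promise<void> = sleep,
): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    await enqueue(items.slice(i, i + batchSize));
    // No need to wait after the final batch.
    if (i + batchSize < items.length) {
      await sleepFn(delayMs);
    }
  }
}

const params: PacingParams = {
  totalItems: 60_000,
  batchSize: 1000,
  minutesPerItem: 4,
  delayMinutes: 5,
};
```

With these values, `batchCount(params)` is 60 and `totalHours(params)` is 5, versus 4 hours if the delay equalled the 4-minute processing estimate. `headroom(params)` is about 0.2, which is the 20% margin mentioned above.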