Commit
fix(state-parts-dump-check): bound process_part by timeout (#10215)
Problem: Occasionally the monitoring gets stuck. For a few parts, process_part() is called but never finishes: it is not retried, does not succeed, and does not fail. My guess is that get_part().await somehow waits forever, and since process_part() had no timeout, the tokio task running it could hang forever, leaving the monitoring app stuck. To check this, I made the task for part_id = 100 sleep for 1000000 seconds; the program finished all other parts and then hung on this one. This matches what I see on the monitoring node.

Solution: bound process_part() by a timeout and initiate a retry when the timeout expires. This way the task retries itself no matter why it hangs.

Note: the timeout is set to 10 minutes because it starts counting as soon as the task is spawned and measures wall-clock time rather than the task's CPU time, i.e. time still elapses while the task is not running. So we need to make sure that all parts can be processed within timeout * MAX_RETRIES.
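The timeout-and-retry pattern described above can be sketched as follows. This is a minimal illustration, not the code from this commit: it uses std threads and a channel with recv_timeout so that it runs without dependencies, whereas async code on tokio would typically wrap the future in tokio::time::timeout instead. The values of MAX_RETRIES and PART_TIMEOUT, the body of process_part, and the helper process_part_with_retry are all assumptions made for the example.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

// Hypothetical values for illustration; the commit itself uses a 10-minute timeout.
const MAX_RETRIES: u32 = 3;
const PART_TIMEOUT: Duration = Duration::from_millis(200);

/// Hypothetical stand-in for process_part(): the first attempt on part 100
/// hangs, mimicking a get_part() call that never returns.
fn process_part(part_id: u64, attempt: u32) -> Result<(), String> {
    if part_id == 100 && attempt == 0 {
        thread::sleep(Duration::from_secs(5));
    }
    Ok(())
}

/// Runs process_part under a wall-clock bound and retries on timeout or error.
/// On success, returns the number of attempts that were needed.
fn process_part_with_retry(part_id: u64) -> Result<u32, String> {
    for attempt in 0..MAX_RETRIES {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            // Ignore send errors: the receiver may already have given up.
            let _ = tx.send(process_part(part_id, attempt));
        });
        // The clock starts when we begin waiting, regardless of whether
        // the worker has actually been scheduled yet.
        match rx.recv_timeout(PART_TIMEOUT) {
            Ok(Ok(())) => return Ok(attempt + 1),
            Ok(Err(e)) => eprintln!("part {part_id} failed: {e}; retrying"),
            Err(RecvTimeoutError::Timeout) => {
                eprintln!("part {part_id} timed out; retrying")
            }
            Err(RecvTimeoutError::Disconnected) => {
                eprintln!("part {part_id} worker exited without a result; retrying")
            }
        }
    }
    Err(format!("part {part_id} failed after {MAX_RETRIES} attempts"))
}

fn main() {
    for part_id in [1, 100] {
        println!("part {part_id}: {:?}", process_part_with_retry(part_id));
    }
}
```

One difference worth keeping in mind: in this thread-based sketch a hung worker is abandoned and keeps running in the background, while a timed-out future on tokio is dropped. In both cases the caller is no longer blocked by the hung attempt, which is the property the fix relies on.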