fix(state-parts-dump-check): bound process_part by timeout #10215
Conversation
I don't know why it would hang, but restricting the execution time of those tasks is a good idea.
- const MAX_RETRIES: u32 = 3;
+ const MAX_RETRIES: u32 = 5;
Please add a comment explaining why we shouldn't set this number to a very large value, like 1_000_000.
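A hedged sketch of the kind of comment the reviewer is asking for; the wording and rationale are assumptions derived from the PR description, not text from the actual code:

```rust
/// Maximum number of times a part is re-queued after a failure or timeout.
/// Keep this small: each retry can hold a tokio task for up to the full
/// per-part timeout, so a huge value (e.g. 1_000_000) could keep a
/// permanently broken part "in flight" essentially forever and hide real
/// dump problems, while a handful of retries is enough to ride out
/// transient hangs.
const MAX_RETRIES: u32 = 5;
```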
Force-pushed from aa16ab2 to da2e312.
Codecov Report
Attention:
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10215      +/-   ##
==========================================
- Coverage   71.88%   71.85%   -0.04%
==========================================
  Files         707      707
  Lines      141788   141796       +8
==========================================
- Hits       101927   101888      -39
- Misses      35147    35196      +49
+ Partials     4714     4712       -2

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Problem:
Occasionally, the monitoring gets stuck. For a few parts, process_part() is called but never finishes: it is never retried and never reports success or failure.
My guess is that get_part().await somehow ends up waiting forever, and since process_part() had no timeout bound, the tokio task running it could hang forever, leaving the monitoring app stuck.
As a reproduction, I made the task for part_id = 100 sleep for 1000000 seconds: the program finished every other part and then hung on this one, which matches what I see on the monitoring node.
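A minimal sketch of the kind of hang described above, assuming a tokio runtime; the part count and the sleeping task are illustrative stand-ins, not the actual dump-check code:

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut handles = Vec::new();
    for part_id in 0..4u64 {
        handles.push(tokio::spawn(async move {
            if part_id == 3 {
                // Stand-in for a get_part().await that never returns.
                tokio::time::sleep(Duration::from_secs(1_000_000)).await;
            }
            println!("part {part_id} processed");
        }));
    }
    // Awaiting every handle blocks forever on the stuck part, so the run
    // never completes -- the behaviour seen on the monitoring node.
    for h in handles {
        h.await.unwrap();
    }
}
```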
Solution:
Bound process_part() by a timeout, and retry when the timeout expires. That way, no matter why a task hangs, it will retry itself.
Note: the timeout is set to 10 minutes because it starts counting as soon as the task is spawned and measures wall-clock time rather than CPU time for the task, i.e. time keeps elapsing even while the task is not running. So we need to make sure all parts can finish processing within timeout * MAX_RETRIES.
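A rough sketch of the timeout-plus-retry pattern described above, assuming tokio and the anyhow crate; process_part, PER_PART_TIMEOUT, and process_part_with_retries are stand-in names chosen to mirror the PR description, not the PR's exact code:

```rust
use std::time::Duration;
use tokio::time::timeout;

const MAX_RETRIES: u32 = 5;
// Wall-clock bound per attempt; generous because the clock starts when the
// task is spawned, not when it actually gets to run.
const PER_PART_TIMEOUT: Duration = Duration::from_secs(10 * 60);

// Stand-in for the real per-part work (fetching and validating a state part).
async fn process_part(part_id: u64) -> anyhow::Result<()> {
    let _ = part_id;
    Ok(())
}

async fn process_part_with_retries(part_id: u64) -> anyhow::Result<()> {
    for attempt in 1..=MAX_RETRIES {
        match timeout(PER_PART_TIMEOUT, process_part(part_id)).await {
            Ok(Ok(())) => return Ok(()),
            Ok(Err(err)) => eprintln!("part {part_id} attempt {attempt} failed: {err}"),
            // Elapsed: the attempt hung past the bound; drop it and retry.
            Err(_) => eprintln!("part {part_id} attempt {attempt} timed out"),
        }
    }
    anyhow::bail!("part {part_id} did not finish after {MAX_RETRIES} attempts")
}
```

Wrapping each attempt in tokio::time::timeout means even a future that never resolves is dropped after the bound, so the retry loop keeps making progress instead of hanging the whole run.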