fix(operator): Operator panics when reconnect fails after max retries #1692
Conversation
core/retry.go (Outdated)

```go
RespondToTaskV2MaxElapsedTime = 0 // Maximum time all retries may take. `0` corresponds to no limit on the time of the retries.
RespondToTaskV2NumRetries     = 0 // Total number of retries attempted. If 0, retries indefinitely until maxElapsedTime is reached.

// Retry Parameters for SubscribeToNewTasksV3
```
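As an aside, here is a minimal sketch of how constants like these are typically consumed, assuming a cenkalti/backoff-style exponential backoff configuration; the helper name and wiring below are illustrative, not the repository's exact code:

```go
package retry

import (
	"time"

	backoff "github.com/cenkalti/backoff/v4"
)

const (
	RespondToTaskV2MaxInterval    = 60 * time.Second // Cap on the delay between consecutive retries.
	RespondToTaskV2MaxElapsedTime = 0                // 0 means no limit on the total time spent retrying.
	RespondToTaskV2NumRetries     = 0                // 0 means retry indefinitely (bounded only by MaxElapsedTime).
)

// retryWithBackoff retries fn with exponential backoff, treating 0 values
// as "unbounded", matching the comments on the constants above.
func retryWithBackoff(fn func() error) error {
	b := backoff.NewExponentialBackOff()
	b.MaxInterval = RespondToTaskV2MaxInterval
	b.MaxElapsedTime = RespondToTaskV2MaxElapsedTime // 0 disables the overall deadline.

	if RespondToTaskV2NumRetries > 0 {
		return backoff.Retry(fn, backoff.WithMaxRetries(b, RespondToTaskV2NumRetries))
	}
	return backoff.Retry(fn, b)
}
```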
Suggested change:

```go
- // Retry Parameters for SubscribeToNewTasksV3
+ // Retry Parameters for SubscribeToNewTasks
```
addressed!
Works on my machine! Maybe we should add logs for each retry to warn that we are trying to reconnect. Otherwise, it might seem like nothing is happening.
Agreed! @JuArce, should we do this in a separate PR?
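As a rough illustration of that suggestion (the function names and wiring are hypothetical, not the repository's code), a backoff notify callback can log a warning on every reconnect attempt, assuming a cenkalti/backoff-style `RetryNotify` helper:

```go
package operator

import (
	"log"
	"time"

	backoff "github.com/cenkalti/backoff/v4"
)

// subscribeWithRetryLogging retries subscribe with exponential backoff and logs
// a warning before each new attempt, so the operator output makes it clear that
// a reconnect is in progress rather than silently spinning.
func subscribeWithRetryLogging(subscribe func() error) error {
	b := backoff.NewExponentialBackOff()
	b.MaxInterval = 60 * time.Second
	b.MaxElapsedTime = 0 // 0 means keep retrying with no overall deadline.

	notify := func(err error, next time.Duration) {
		log.Printf("WARN: task subscription failed, retrying in %s: %v", next, err)
	}
	return backoff.RetryNotify(subscribe, b, notify)
}
```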
I'm not convinced that not shutting down is the best solution. I'm putting this on hold until the description explains in which scenario the subscription keeps retrying for a long time while the node is still doing useful work.
Closing in favor of #1729
Fix: Operator panic when reconnect fails after max retries
Description
Closes #960.
`SubscribeToNewTasksV2Retryable` and `SubscribeToNewTasksV3Retryable` are called when the operator starts. Both spawn a goroutine to handle errors from the subscriptions to two RPCs, a primary and a fallback. When a subscription errors, the goroutine attempts to subscribe again, up to a retry limit.

When `SubscribeToNewTasksV2Retryable` or `SubscribeToNewTasksV3Retryable` exhausts the maximum number of retries, the subscribe function panics due to dereferencing a `nil` reference within the auto-generated contract bindings.

If one of the subscriptions errors and then fails to reconnect, the operator panics even though the other subscription (the fallback) is still available and the operator could continue to function as normal.

If both connections error and fail to reconnect, the operator should shut down.

If the operator fails to subscribe initially, the operator should shut down.
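A purely illustrative sketch of that desired behavior (this is not the PR's implementation; the channel names and structure are hypothetical): the operator keeps running while at least one subscription is alive and only shuts down once both have failed for good.

```go
package operator

import (
	"errors"
	"log"
)

// superviseSubscriptions is a hypothetical supervision loop: primaryErrs and
// fallbackErrs each receive a final, non-recoverable error once the
// corresponding subscription has exhausted its retries.
func superviseSubscriptions(primaryErrs, fallbackErrs <-chan error) error {
	var primaryDown, fallbackDown bool
	for {
		select {
		case err := <-primaryErrs:
			log.Printf("primary subscription lost for good: %v", err)
			primaryDown = true
		case err := <-fallbackErrs:
			log.Printf("fallback subscription lost for good: %v", err)
			fallbackDown = true
		}
		if primaryDown && fallbackDown {
			return errors.New("both task subscriptions failed; shutting down operator")
		}
		// One subscription is still healthy, so the operator keeps processing tasks.
	}
}
```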
This PR addresses the issue by:

1. Adding a `defer` to the goroutine that converts the panic into an error before the operator process exits (see the sketch after this list).
2. Changing the retry parameters for `SubscribeToNewTasksV2Retryable` and `SubscribeToNewTasksV3Retryable` to retry indefinitely, with retry intervals capped at 60 seconds.
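A minimal sketch of the recover-in-goroutine pattern described in point 1 (the function and channel names are illustrative, not the exact code in this PR):

```go
package operator

import "fmt"

// handleSubscriptionErrors runs the error-handling loop for one subscription in
// its own goroutine. The deferred recover turns a panic (for example, a nil
// dereference inside the contract bindings once retries are exhausted) into an
// error on errChan instead of killing the whole operator process.
func handleSubscriptionErrors(subErrs <-chan error, resubscribe func() error, errChan chan<- error) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				errChan <- fmt.Errorf("subscription goroutine panicked: %v", r)
			}
		}()
		for err := range subErrs {
			if err == nil {
				continue
			}
			// Try to re-establish the subscription; give up and report if it fails.
			if retryErr := resubscribe(); retryErr != nil {
				errChan <- retryErr
				return
			}
		}
	}()
}
```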
To Test:
1. Install nginx: `brew install nginx`
2. Update the config-files within the aligned-layer repo to use the `:8082/main/` port and path.
3. Start anvil: `make anvil_start_with_block_time`
4. Start the nginx proxy: `docker compose -f nginx-docker-compose.yaml up -d`
5. Stop the nginx proxy: `docker compose -f nginx-docker-compose.yaml stop`
6. Bring the proxy back up and restart the aggregator: `docker compose -f nginx-docker-compose.yaml up -d && make aggregator_start`
Type of change
Please delete options that are not relevant.
Checklist
`testnet`, everything else to `staging`