Graceful error handling #1

Open
3 tasks
w568w opened this issue Sep 30, 2024 · 2 comments
Assignees
Labels
complexity: high Requires fundamental changes or thorough insight into the whole project. enhancement New feature or request

Comments


w568w commented Sep 30, 2024

We have not yet designed error handling, although an error struct, UError, already exists. In general, we believe that:

  1. Any error on either end should be reported to the client and the controller whenever possible, unless the client or the controller can no longer be contacted.
  2. In interactive tasks, any error on either end should terminate the entire task. The controller is responsible for coordinating communication with the compute nodes and the client.

As for how to handle database errors in the controller (e.g., the database is no longer accessible or cannot be written to), it is currently undetermined.

Personally, I think the controller should either keep running or treat the failure as an error for the current task only, but terminating the entire controller immediately also seems reasonable.


See #1 (comment) for details on these parts.

w568w added the enhancement New feature or request label Sep 30, 2024
w568w self-assigned this Sep 30, 2024
w568w added the complexity: high Requires fundamental changes or thorough insight into the whole project. label Sep 30, 2024

w568w commented Sep 30, 2024

On the one hand, the error handling scheme I have in mind is as follows:

Case 1. When an error occurs at the controller…

When an error occurs at the controller, it should immediately terminate the entire task. If the task is interactive, the controller should attempt to notify the client about the error before doing so.

Case 2. When an error occurs at the compute node…

The node should notify the controller about the error, and the controller should terminate the entire task at once. If the task is interactive, both the controller and the node should try to inform the client about the error before termination.

Case 3. When an error occurs at the client…

No action is required; the client should simply die. If the task is interactive, an error will then occur at the compute node and/or the controller. Whichever of them calls onTaskErrored first wins.
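The "first caller wins" rule can be sketched as an idempotent handler. This is only an illustration under assumptions: the `Task` class, the fields of `UError`, and the Python spelling `on_task_errored` are hypothetical stand-ins, not the project's actual types.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UError:
    """Hypothetical stand-in for the project's UError struct."""
    source: str   # "controller", "node", or "client"
    message: str


class Task:
    """Tracks the terminal error of a task; only the first report is kept."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self._lock = threading.Lock()
        self._error: Optional[UError] = None

    def on_task_errored(self, error: UError) -> bool:
        """Record the error and return True only for the first caller."""
        with self._lock:
            if self._error is not None:
                return False  # an earlier report already won
            self._error = error
            return True

    @property
    def error(self) -> Optional[UError]:
        with self._lock:
            return self._error


# The client died, so both the node and the controller observe a failure.
task = Task("demo")
first = task.on_task_errored(UError("node", "client connection lost"))
second = task.on_task_errored(UError("controller", "client connection lost"))
```

Here the node's report is kept and the controller's later report is ignored, so the task ends with exactly one recorded error regardless of how many parties notice the failure.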


But on the other hand, we can consider the mechanism from the perspective of the CAP theorem. When an error occurs:

  1. Instant Consistency: any active nodes involved (controller, client, compute node) should be aware of the error and respond appropriately.
  2. Eventual Consistency: nodes that have failed and subsequently come back online should eventually become aware of the error and handle it correctly.

I think Eventual Consistency is a separate feature. It would be nice to have, but we won't discuss it in this issue.

The problem is that our current scheme does not ensure Instant Consistency either. Consider the case where the compute node cannot connect to the controller and an error occurs at the compute node. Since the client does not pass anything to the controller (Case 3), the controller will never learn about the actual error.
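The gap can be reproduced with a toy model. All class and method names below are hypothetical and exist only for this illustration; the `link_up` flag models the partition between the compute node and the controller.

```python
from typing import Optional


class Controller:
    def __init__(self) -> None:
        self.known_error: Optional[str] = None

    def report_error(self, message: str) -> None:
        self.known_error = message


class ComputeNode:
    def __init__(self, controller: Controller, link_up: bool) -> None:
        self.controller = controller
        self.link_up = link_up  # False models a node-to-controller partition

    def fail(self, message: str) -> bool:
        """Try to report an error; return whether the controller was reached."""
        if not self.link_up:
            return False
        self.controller.report_error(message)
        return True


class Client:
    def __init__(self) -> None:
        self.alive = True

    def on_node_error(self, message: str) -> None:
        # Under the current scheme the client relays nothing to the
        # controller; it simply dies.
        self.alive = False


controller = Controller()
node = ComputeNode(controller, link_up=False)
client = Client()

delivered = node.fail("out of memory")   # the report cannot reach the controller
client.on_node_error("out of memory")    # the client hears about it, then dies
```

After this run the node has failed and the client is gone, yet `controller.known_error` is still `None`: an active participant is unaware of the error, which is exactly the Instant Consistency violation described above.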


w568w commented Dec 12, 2024

I have been thinking about this question recently and found that our core problem is the mixed error pattern:

  • For each 1-to-1 RPC, the caller is responsible for handling the error returned by the remote callee. This is the common synchronous path.
  • When a network partition occurs, errors have to be relayed by another node. Likewise, when the client can no longer connect to the compute node, the client needs to report the error to the controller. In both cases the errors arrive through the error handling function, so they are asynchronous errors.
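The two paths can be sketched side by side. This is a minimal sketch, not the project's API: `RpcError`, `call_sync`, and `relay_async` are hypothetical names, and the only point is that both kinds of error end up in the same handler.

```python
from typing import Callable, List, Optional, Tuple

# Handler signature assumed for illustration: (path, message).
ErrorHandler = Callable[[str, str], None]


class RpcError(Exception):
    """Hypothetical error raised when a remote callee fails."""


def call_sync(callee: Callable[[], str], handler: ErrorHandler) -> Optional[str]:
    """Synchronous path: the caller handles the error its own RPC returned."""
    try:
        return callee()
    except RpcError as exc:
        handler("sync", str(exc))
        return None


def relay_async(message: str, handler: ErrorHandler) -> None:
    """Asynchronous path: a third party relays an error it observed."""
    handler("async", message)


received: List[Tuple[str, str]] = []


def record(path: str, message: str) -> None:
    received.append((path, message))


def failing_rpc() -> str:
    raise RpcError("node crashed")


result = call_sync(failing_rpc, record)
relay_async("client lost the node", record)
```

Both reports land in `received`, one tagged `sync` and one tagged `async`. Because a single task can receive either kind, or both, the handler has to decide which one determines the final state.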

This pattern presents some technical challenges:

  1. How to determine the final state of a task when multiple errors arrive synchronously and/or asynchronously?
  2. How to broadcast an asynchronous error report among the nodes?
  3. How to synchronize the lagging state of a node (client, computing node, or control node) when it comes back online?

This thread is worth reading; it discusses how asynchronous errors should be handled.

I think it would be better to break this issue into several parts to work on.

w568w changed the title from "Gracefully error handling" to "Graceful error handling" Dec 18, 2024