Render a root-cause exception for dependency and join errors (#3717)
# Description

This PR reworks two exception types, DependencyError and JoinError. Both
of these exceptions report that a task failed because some other
task/future failed - in the dependency case, because a task dependency
failed, and in the join case because one of the tasks/futures being
joined failed.

This PR introduces a common superclass `PropagatedException` to
acknowledge that the meaning and behaviour of these two exceptions are
very similar.

`PropagatedException` has a new implementation for reporting the
failures that are being propagated. Parsl has tried a couple of ways to
do this in the past:

* The implementation immediately before this PR reports only the
immediate task IDs (or future reprs, for non-tasks) in the exception
message. For details of the chain of exceptions and the
original (non-propagated) exception, the user can examine the exception
object via the `dependent_exceptions_tids` attribute.

* Prior to PR #1802, the repr/str (and so the printed form) of
dependency exceptions rendered the entire exception. In the case of deep
dependency chains or where a dependency graph has many paths to a root
cause, this resulted in extremely voluminous output with a lot of
boilerplate dependency exception text.

This PR attempts a fusion of these two approaches:

* The user will often be waiting only on the final task of a dependency
chain (because the DFK will be managing everything in between) - so they
will often get a dependency exception.
* When they get a dependency exception, they are likely to be
interested in the root cause at the earliest part of the chain. So this
PR makes dependency exceptions traverse the chain and discover a root
cause.
* When there are multiple root causes, or multiple paths to the same
root cause, the user should not be overwhelmed with output. So this PR
picks a single root cause exception to report fully, and when there are
other causes/paths adds a small annotation `(+ others)`.
* The user is sometimes interested in the path from that root cause
exception to the current failure, but often not. That path is rendered
roughly as it was immediately before this PR, as a sequence of task IDs
(or Future reprs for non-tasks).
* Python has a native mechanism for indicating that an exception is
caused by another exception, the `__cause__` magic attribute which is
usually populated by `raise e1 from e2`. This PR populates that magic
attribute at construction so that displaying the exception will show the
cause using Python's native format.
* The user may want to ask other Parsl-relevant questions about the
exception chain, so this PR keeps the `dependent_exceptions_tids`
attribute for such introspection.
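
As a rough illustration of the traversal and the `(+ others)` annotation described above, here is a minimal sketch. `PropagatedSketch` is a hypothetical stand-in written for this example, not the Parsl class, and the task labels are made up; it only mirrors the "always follow the first dependency" rule.

```python
from typing import List, Sequence, Tuple


class PropagatedSketch(Exception):
    """Hypothetical stand-in for PropagatedException (not the Parsl class)."""

    def __init__(self, deps: Sequence[Tuple[BaseException, str]], task_id: int) -> None:
        super().__init__()
        self.dependent_exceptions_tids = deps
        self.task_id = task_id
        # Record the representative root cause at construction time.
        self.__cause__, self._cause_sequence = self._find_any_root_cause()

    def _find_any_root_cause(self) -> Tuple[BaseException, List[str]]:
        e: BaseException = self
        path: List[str] = []
        # Always follow the first dependency; flag when siblings were skipped.
        while isinstance(e, PropagatedSketch) and e.dependent_exceptions_tids:
            exc, label = e.dependent_exceptions_tids[0]
            if len(e.dependent_exceptions_tids) > 1:
                label += " (+ others)"
            path.append(label)
            e = exc
        return e, path

    def __str__(self) -> str:
        sequence_text = " <- ".join(self._cause_sequence)
        return (f"Dependency failure for task {self.task_id}. "
                f"The representative cause is via {sequence_text}")


# Made-up graph: task 3 depends on task 1 (which depends on task 0) and task 2.
root_a = RuntimeError("example root failure")
root_b = ValueError("a second root failure that is not reported in full")
mid = PropagatedSketch([(root_a, "task 0")], task_id=1)
final = PropagatedSketch([(mid, "task 1"), (root_b, "task 2")], task_id=3)

print(final)
# Dependency failure for task 3. The representative cause is via task 1 (+ others) <- task 0
print(repr(final.__cause__))
```

Only `root_a` becomes the `__cause__`; `root_b` is still reachable through `dependent_exceptions_tids`, and the `(+ others)` marker signals that it exists.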

A dependency or join error is now rendered by Python as exactly two
exceptions next to each other:

```

Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 922, in _unwrap_futures
    new_args.extend([self.dependency_resolver.traverse_to_unwrap(dep)])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dependency_resolvers.py", line 48, in _
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 339, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 603, in _unwrap_remote_exception_wrapper
    result.reraise()
  File "/home/benc/parsl/src/parsl/parsl/app/errors.py", line 114, in reraise
    raise v
  File "/home/benc/parsl/src/parsl/parsl/app/errors.py", line 138, in wrapper
    return func(*args, **kwargs)
^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/taskchain.py", line 13, in failer
    raise RuntimeError("example root failure")
        ^^^^^^^^^^^^^^^^^
RuntimeError: example root failure

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/taskchain.py", line 16, in <module>
    inter(inter(inter(inter(inter(failer()))))).result()
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 339, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 601, in _unwrap_remote_exception_wrapper
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
parsl.dataflow.errors.DependencyError: Dependency failure for task 5. The representative cause is via task 4 <- task 3 <- task 2 <- task 1 <- task 0
```
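
The two-part rendering above relies only on standard Python behaviour: the traceback machinery prints `__cause__` first whenever it is set, whether or not `raise ... from ...` was used. A small stdlib-only sketch of that, where the `Propagated` class is a hypothetical example and not Parsl code:

```python
import traceback


class Propagated(Exception):
    """Hypothetical exception that records its cause when constructed."""

    def __init__(self, message: str, cause: BaseException) -> None:
        super().__init__(message)
        # Sets the same attribute that `raise err from cause` would set.
        self.__cause__ = cause


root = RuntimeError("example root failure")
err = Propagated("Dependency failure for task 5", root)

# The three-argument form is accepted by both older and newer Python 3 releases.
rendered = "".join(traceback.format_exception(type(err), err, err.__traceback__))
print(rendered)
```

The printed text contains the root failure, then the "direct cause" separator line, then the propagated error, which matches the shape of the output shown above.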



# Changed Behaviour

DependencyErrors and JoinErrors will render differently.
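
Because the message text changes, code that needs more than the representative cause is probably better off reading `dependent_exceptions_tids` than parsing the message. The following sketch collects every distinct root cause; `PropagatedSketch` and `all_root_causes` are hypothetical names for this example, and real code would more likely test `isinstance` against the new `PropagatedException` rather than use `getattr`.

```python
from typing import List, Sequence, Tuple


class PropagatedSketch(Exception):
    """Hypothetical stand-in carrying only the attribute this example needs."""

    def __init__(self, deps: Sequence[Tuple[BaseException, str]]) -> None:
        super().__init__()
        self.dependent_exceptions_tids = deps


def all_root_causes(exc: BaseException) -> List[BaseException]:
    """Depth-first walk returning each distinct non-propagated exception."""
    roots: List[BaseException] = []
    seen = set()
    stack = [exc]
    while stack:
        e = stack.pop()
        if id(e) in seen:
            continue  # already visited via another path
        seen.add(id(e))
        deps = getattr(e, "dependent_exceptions_tids", None)
        if deps:
            stack.extend(dep for dep, _label in deps)
        else:
            roots.append(e)
    return roots


# Made-up graph with two paths to root_a and one path to root_b.
root_a = RuntimeError("example root failure")
root_b = ValueError("another root failure")
shared = PropagatedSketch([(root_a, "task 0")])
top = PropagatedSketch([(shared, "task 1"), (shared, "task 1"), (root_b, "task 2")])

roots = all_root_causes(top)
print([repr(r) for r in roots])
```

Each root cause is listed once even when several paths lead to it.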

## Type of change

- Update to human readable text: Documentation/error messages/comments
benclifford authored Jan 27, 2025
1 parent 47e60f0 commit 49f2e10
Showing 2 changed files with 66 additions and 18 deletions.
78 changes: 60 additions & 18 deletions parsl/dataflow/errors.py
@@ -1,4 +1,4 @@
-from typing import Optional, Sequence, Tuple
+from typing import List, Sequence, Tuple

 from parsl.errors import ParslError

@@ -29,35 +29,77 @@ def __str__(self) -> str:
         return self.reason


-class DependencyError(DataFlowException):
-    """Error raised if an app cannot run because there was an error
-       in a dependency.
+class PropagatedException(DataFlowException):
+    """Error raised if an app fails because there was an error
+       in a related task. This is intended to be subclassed for
+       dependency and join_app errors.

     Args:
-      - dependent_exceptions_tids: List of exceptions and identifiers for
-        dependencies which failed. The identifier might be a task ID or
-        the repr of a non-DFK Future.
+      - dependent_exceptions_tids: List of exceptions and brief descriptions
+        for dependencies which failed. The description might be a task ID or
+        the repr of a non-AppFuture.
       - task_id: Task ID of the task that failed because of the dependency error
     """

-    def __init__(self, dependent_exceptions_tids: Sequence[Tuple[Exception, str]], task_id: int) -> None:
+    def __init__(self,
+                 dependent_exceptions_tids: Sequence[Tuple[BaseException, str]],
+                 task_id: int,
+                 *,
+                 failure_description: str) -> None:
         self.dependent_exceptions_tids = dependent_exceptions_tids
         self.task_id = task_id
+        self._failure_description = failure_description
+
+        (cause, cause_sequence) = self._find_any_root_cause()
+        self.__cause__ = cause
+        self._cause_sequence = cause_sequence

     def __str__(self) -> str:
-        deps = ", ".join(tid for _exc, tid in self.dependent_exceptions_tids)
-        return f"Dependency failure for task {self.task_id} with failed dependencies from {deps}"
+        sequence_text = " <- ".join(self._cause_sequence)
+        return f"{self._failure_description} for task {self.task_id}. " \
+               f"The representative cause is via {sequence_text}"
+
+    def _find_any_root_cause(self) -> Tuple[BaseException, List[str]]:
+        """Looks recursively through self.dependent_exceptions_tids to find
+        an exception that caused this propagated error, that is not itself
+        a propagated error.
+        """
+        e: BaseException = self
+        dep_ids = []
+        while isinstance(e, PropagatedException) and len(e.dependent_exceptions_tids) >= 1:
+            id_txt = e.dependent_exceptions_tids[0][1]
+            assert isinstance(id_txt, str)
+            # if there are several causes for this exception, label that
+            # there are more so that we know that the representative fail
+            # sequence is not the full story.
+            if len(e.dependent_exceptions_tids) > 1:
+                id_txt += " (+ others)"
+            dep_ids.append(id_txt)
+            e = e.dependent_exceptions_tids[0][0]
+        return e, dep_ids
+

+class DependencyError(PropagatedException):
+    """Error raised if an app cannot run because there was an error
+       in a dependency. There can be several exceptions (one from each
+       dependency) and DependencyError collects them all together.
+
+    Args:
+      - dependent_exceptions_tids: List of exceptions and brief descriptions
+        for dependencies which failed. The description might be a task ID or
+        the repr of a non-AppFuture.
+      - task_id: Task ID of the task that failed because of the dependency error
+    """
+    def __init__(self, dependent_exceptions_tids: Sequence[Tuple[BaseException, str]], task_id: int) -> None:
+        super().__init__(dependent_exceptions_tids, task_id,
+                         failure_description="Dependency failure")

-class JoinError(DataFlowException):
+
+class JoinError(PropagatedException):
     """Error raised if apps joining into a join_app raise exceptions.
        There can be several exceptions (one from each joining app),
        and JoinError collects them all together.
     """
-    def __init__(self, dependent_exceptions_tids: Sequence[Tuple[BaseException, Optional[str]]], task_id: int) -> None:
-        self.dependent_exceptions_tids = dependent_exceptions_tids
-        self.task_id = task_id
-
-    def __str__(self) -> str:
-        dep_tids = [tid for (exception, tid) in self.dependent_exceptions_tids]
-        return "Join failure for task {} with failed join dependencies from tasks {}".format(self.task_id, dep_tids)
+    def __init__(self, dependent_exceptions_tids: Sequence[Tuple[BaseException, str]], task_id: int) -> None:
+        super().__init__(dependent_exceptions_tids, task_id,
+                         failure_description="Join failure")
6 changes: 6 additions & 0 deletions parsl/tests/test_python_apps/test_fail.py
@@ -39,6 +39,9 @@ def test_fail_sequence_first():
     assert isinstance(t_final.exception().dependent_exceptions_tids[0][0], DependencyError)
     assert t_final.exception().dependent_exceptions_tids[0][1].startswith("task ")

+    assert hasattr(t_final.exception(), '__cause__')
+    assert t_final.exception().__cause__ == t1.exception()
+

 def test_fail_sequence_middle():
     t1 = random_fail(fail_prob=0)
@@ -50,3 +53,6 @@ def test_fail_sequence_middle():

     assert len(t_final.exception().dependent_exceptions_tids) == 1
     assert isinstance(t_final.exception().dependent_exceptions_tids[0][0], ManufacturedTestFailure)
+
+    assert hasattr(t_final.exception(), '__cause__')
+    assert t_final.exception().__cause__ == t2.exception()
