
DM-44488: handle new pipe_base exception types in executors #303

Merged: 7 commits into main, Aug 28, 2024

Conversation

TallJimbo (Member) commented Aug 23, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes


codecov bot commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 94.23077% with 3 lines in your changes missing coverage. Please review.

Project coverage is 89.00%. Comparing base (4571279) to head (2b03758).
Report is 8 commits behind head on main.

Files | Patch % | Lines
python/lsst/ctrl/mpexec/singleQuantumExecutor.py | 72.72% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #303      +/-   ##
==========================================
+ Coverage   88.95%   89.00%   +0.04%     
==========================================
  Files          50       50              
  Lines        4393     4439      +46     
  Branches      728      733       +5     
==========================================
+ Hits         3908     3951      +43     
- Misses        345      347       +2     
- Partials      140      141       +1     


@TallJimbo force-pushed the tickets/DM-44488 branch 2 times, most recently from 2584189 to c46bfe6, on August 23, 2024 at 19:48
parejkoj (Contributor) left a comment:

I don't know what executor AP uses, but if they use this, they'll have to reconfigure the new kwarg to be True, since they want the other behavior.

Check your DO NOT MERGE commit.

Review comments (now resolved) were left on:
  • python/lsst/ctrl/mpexec/cli/script/run.py
  • python/lsst/ctrl/mpexec/cli/script/run_qbb.py
  • python/lsst/ctrl/mpexec/separablePipelineExecutor.py
  • python/lsst/ctrl/mpexec/simple_pipeline_executor.py (two threads)
  • python/lsst/ctrl/mpexec/singleQuantumExecutor.py
Comment on lines +490 to +500
"Incorrect use of AnnotatedPartialOutputsError: no chained exception found.",
task_node.label,
quantum.dataId,
)
parejkoj (Contributor):
Oh, I like that. I think it should maybe be an error, though, not a warning, so it's more visible? That seems like a serious coding problem that should be loud.

TallJimbo (Member Author):

I don't really like ERROR or higher log messages that don't correspond to an exception or some other actual execution stoppage. I thought about making this a hard failure, but I figured we'd be much more frustrated if it brought down processing that was otherwise working as intended.

parejkoj (Contributor):

There are ERROR, FAILURE, EXCEPTION log types; I wish we'd used FAILURE for what you're describing, so that we can use ERROR for other things ("be louder than warning, but I (maybe) didn't stop running").

Member:

Note that exception is not a distinct log level; it's just error with automatic stack trace printing.
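The point above is easy to check with the standard-library `logging` module (a minimal sketch, not the LSST logging wrappers): `Logger.exception(...)` emits a record at ERROR level with `exc_info` attached, exactly like `Logger.error(..., exc_info=True)`.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("demo")

# Capture emitted records in memory so we can inspect level and exc_info.
records = []


class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record)


logger.addHandler(ListHandler())

try:
    raise ValueError("boom")
except ValueError:
    # These two calls produce records at the same level (ERROR);
    # "exception" is just "error" with exc_info=True implied.
    logger.exception("via logger.exception")
    logger.error("via logger.error", exc_info=True)

assert records[0].levelno == records[1].levelno == logging.ERROR
assert records[0].exc_info is not None and records[1].exc_info is not None
print("both records are ERROR with traceback info attached")
```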

else:
error = caught.__cause__
if self.raise_on_partial_outputs:
# This is a real edge case that required some experimentation:
parejkoj (Contributor):

I think I'd put a NOTE: or something at the start of this, as it's sort of a meta-comment.
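The control flow under discussion can be sketched with a stand-in exception class (hypothetical names; the real AnnotatedPartialOutputsError lives in pipe_base and the real executor logic differs in detail): the executor unwraps `__cause__` and either re-raises the original error or logs it and proceeds, depending on `raise_on_partial_outputs`.

```python
import logging

_LOG = logging.getLogger("demo")


class AnnotatedPartialOutputsError(Exception):
    """Stand-in for the pipe_base exception; the real class differs."""


def handle_partial_outputs(caught, raise_on_partial_outputs):
    """Sketch of an executor's handling of a partial-outputs failure."""
    error = caught.__cause__
    if error is None:
        # A chained exception is expected here, so its absence is a
        # coding problem in the task; warn rather than halt processing.
        _LOG.warning(
            "Incorrect use of AnnotatedPartialOutputsError: "
            "no chained exception found."
        )
        return "warned"
    if raise_on_partial_outputs:
        # Re-raise the original failure, not the annotation wrapper.
        raise error
    _LOG.error(
        "Task exited with partial outputs; considering this a "
        "qualified success and proceeding."
    )
    return "proceeded"


# Usage: a task failure wrapped via "raise ... from original".
try:
    try:
        raise RuntimeError("task failed mid-write")
    except RuntimeError as original:
        raise AnnotatedPartialOutputsError("partial outputs saved") from original
except AnnotatedPartialOutputsError as caught:
    outcome = handle_partial_outputs(caught, raise_on_partial_outputs=False)

print(outcome)  # proceeded
```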

Comment on lines +507 to +515
"Task '%s' on quantum %s exited with partial outputs; "
"considering this a qualified success and proceeding.",
parejkoj (Contributor):

That's a good way to phrase it. I do wonder if it shouldn't be error here, or exception maybe (we're allowed to log an error even if the task doesn't halt execution, right?), so it doesn't get lost in our massive field of warnings, but I guess that's up to DRP to decide.

TallJimbo (Member Author):

I'm not sure if we're allowed to log error if we don't halt, but I don't like doing it because I think there's no common mental model of what it means.

parejkoj (Contributor):

In fact, we explicitly can! Per the only place I know of that we've written any such recommendation:
https://developer.lsst.io/stack/logging.html#log-levels

> ERROR: for errors that may still allow the execution to continue.
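The cited guideline matches how Python's own `logging` behaves: logging at ERROR is just a severity label on a record, not a control-flow construct, so execution continues normally afterwards. A trivial stdlib demonstration:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")

# Logging at ERROR neither raises nor halts; the call returns normally
# and the program keeps running.
log.error("recoverable problem encountered; continuing")
reached_after_error = True
print(reached_after_error)  # True
```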

TallJimbo (Member Author):

Ok, I'll switch to ERROR here.

task_node.label,
quantum.dataId,
)
_LOG.warning(error, exc_info=error)
parejkoj (Contributor):

Or make this one log.exception? Not sure.

TallJimbo (Member Author):

(see other threads)

Commit message: "Logging it right before it's re-raised leads to duplication in logs at higher levels."
TallJimbo (Member Author) commented Aug 26, 2024

> I don't know what executor AP uses, but if they use this, they'll have to reconfigure the new kwarg to be True, since they want the other behavior.

The Prompt Processing framework uses SeparablePipelineExecutor, and yes, @kfindeisen et al. will probably want to start passing raise_on_partial_outputs=True at construction there after this merges.

> Check your DO NOT MERGE commit.

Yup, this is needed to keep the rest of the GitHub Actions green, so our workflow is to drop it only after merging the branch for the upstream package.

kfindeisen (Member) commented:

> The Prompt Processing framework uses SeparablePipelineExecutor, and yes, @kfindeisen et al. will probably want to start passing raise_on_partial_outputs=True at construction there after this merges.

If you're making a behavioral and/or breaking change to the API, will this be announced/documented anywhere? ATM I have no basis for a decision either way.

TallJimbo (Member Author) commented:

It's establishing behavior that I think was basically ill-defined before; RFC-958 defined two options but did not state which would be the default, IIRC, and this ticket should really be considered a belated part of that RFC's implementation.

As to whether Prompt Processing actually wants to raise when calibrateImage produces only partial outputs, @parejkoj has been adamant that the answer is "yes", while I'm a little less certain, but am willing to defer to the AP team on this. (It's not that I expect the AP pipeline to be able to do much with those partial outputs, but failing a little later when a downstream task doesn't find what it needs - as DRP often would in these cases - seems fine.)

I'll make a community post before I merge this in any case; there is some chance we'll want to do a bit of work on some downstream tasks as we get used to the full implications of this RFC.

parejkoj (Contributor) commented:

Maybe better to leave this for discussion on Community, but I do wonder whether the default for this should be to fail on partial output errors, with DRP changing it in their pipelines? I don't know what we should train the users to expect, when running pipelines themselves: "reproducible errors halt" or "reproducible errors continue and may or may not cause problems further down the line"?

TallJimbo (Member Author) commented:

> I do wonder whether the default for this should be to fail on partial output errors, with DRP changing it in their pipelines? I don't know what we should train the users to expect, when running pipelines themselves: "reproducible errors halt" or "reproducible errors continue and may or may not cause problems further down the line"?

The vast majority of reproducible errors still halt, regardless of the new option - it's just the ones that also have a potentially-useful partial output that now default to proceeding. I think that's consistent with us wanting to process as far as we can in a lot of contexts (at least Rapid Analysis as well as DRP, and I think it's plausible alert production in special programs could go this direction, too, even if it doesn't in WFD). This does put the burden on the downstream tasks to handle those partial outputs, and I do suspect some of them are not ready for that responsibility yet (in particular, we'll get some confusing error messages from them at first). But I also think that's the only place that responsibility can realistically be.

Commit message: "This reverts commit 7b1fce0."

Commit message: "This logging was duplicative when seen from STDERR, which also gets exception tracebacks that propagate up, but we need error messages and tracebacks to appear in the saved logs, too, and this is the only place we can do that."
@TallJimbo TallJimbo merged commit b13e38d into main Aug 28, 2024
14 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-44488 branch August 28, 2024 14:29
3 participants