
DM-44488: handle new pipe_base exception types in executors #303

Merged: 7 commits into main, Aug 28, 2024

Conversation

TallJimbo (Member) commented Aug 23, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes


codecov bot commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 94.23077% with 3 lines in your changes missing coverage. Please review.

Project coverage is 89.00%. Comparing base (4571279) to head (2b03758).
Report is 8 commits behind head on main.

Files | Patch % | Lines
python/lsst/ctrl/mpexec/singleQuantumExecutor.py | 72.72% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #303      +/-   ##
==========================================
+ Coverage   88.95%   89.00%   +0.04%     
==========================================
  Files          50       50              
  Lines        4393     4439      +46     
  Branches      728      733       +5     
==========================================
+ Hits         3908     3951      +43     
- Misses        345      347       +2     
- Partials      140      141       +1     


@TallJimbo force-pushed the tickets/DM-44488 branch 2 times, most recently from 2584189 to c46bfe6, on August 23, 2024 at 19:48
parejkoj (Contributor) left a comment:

I don't know what executor AP uses, but if they use this, they'll have to reconfigure the new kwarg to be True, since they want the other behavior.

Check your DO NOT MERGE commit.

Review comments (now resolved) were left on:
  • python/lsst/ctrl/mpexec/cli/script/run.py
  • python/lsst/ctrl/mpexec/cli/script/run_qbb.py
  • python/lsst/ctrl/mpexec/separablePipelineExecutor.py
  • python/lsst/ctrl/mpexec/simple_pipeline_executor.py (two threads)
  • python/lsst/ctrl/mpexec/singleQuantumExecutor.py
Comment on lines +490 to +500
"Incorrect use of AnnotatedPartialOutputsError: no chained exception found.",
task_node.label,
quantum.dataId,
)
parejkoj (Contributor):
Oh, I like that. I think it should maybe be an error, though, not a warning, so it's more visible? That seems like a serious coding problem that should be loud.

TallJimbo (Member Author):

I don't really like ERROR or higher log messages that don't correspond to an exception or some other actual execution stoppage. I thought about making this a hard failure, but I figured we'd be much more frustrated if it brought down processing that was otherwise working as intended.

parejkoj (Contributor):

There are ERROR, FAILURE, EXCEPTION log types; I wish we'd used FAILURE for what you're describing, so that we can use ERROR for other things ("be louder than warning, but I (maybe) didn't stop running").

Member:

Note that exception is not a distinct log level; it's just error with automatic stack trace printing.
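The point above is easy to check with the standard-library `logging` module (a minimal sketch, not the LSST logging wrappers): `Logger.exception(...)` emits a record at ERROR level with `exc_info` attached, exactly like `Logger.error(..., exc_info=True)`.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("demo")

# Capture emitted records in memory so we can inspect level and exc_info.
records = []


class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record)


logger.addHandler(ListHandler())

try:
    raise ValueError("boom")
except ValueError:
    # These two calls produce records at the same level (ERROR);
    # "exception" is just "error" with exc_info=True implied.
    logger.exception("via logger.exception")
    logger.error("via logger.error", exc_info=True)

assert records[0].levelno == records[1].levelno == logging.ERROR
assert records[0].exc_info is not None and records[1].exc_info is not None
print("both records are ERROR with traceback info attached")
```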

else:
error = caught.__cause__
if self.raise_on_partial_outputs:
# This is a real edge case that required some experimentation:
parejkoj (Contributor):

I think I'd put a NOTE: or something at the start of this, as it's sort of a meta-comment.
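The control flow under discussion can be sketched with a stand-in exception class (hypothetical names; the real AnnotatedPartialOutputsError lives in pipe_base and the real executor logic differs in detail): the executor unwraps `__cause__` and either re-raises the original error or logs it and proceeds, depending on `raise_on_partial_outputs`.

```python
import logging

_LOG = logging.getLogger("demo")


class AnnotatedPartialOutputsError(Exception):
    """Stand-in for the pipe_base exception; the real class differs."""


def handle_partial_outputs(caught, raise_on_partial_outputs):
    """Sketch of an executor's handling of a partial-outputs failure."""
    error = caught.__cause__
    if error is None:
        # A chained exception is expected here, so its absence is a
        # coding problem in the task; warn rather than halt processing.
        _LOG.warning(
            "Incorrect use of AnnotatedPartialOutputsError: "
            "no chained exception found."
        )
        return "warned"
    if raise_on_partial_outputs:
        # Re-raise the original failure, not the annotation wrapper.
        raise error
    _LOG.error(
        "Task exited with partial outputs; considering this a "
        "qualified success and proceeding."
    )
    return "proceeded"


# Usage: a task failure wrapped via "raise ... from original".
try:
    try:
        raise RuntimeError("task failed mid-write")
    except RuntimeError as original:
        raise AnnotatedPartialOutputsError("partial outputs saved") from original
except AnnotatedPartialOutputsError as caught:
    outcome = handle_partial_outputs(caught, raise_on_partial_outputs=False)

print(outcome)  # proceeded
```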

Comment on lines +507 to +515
"Task '%s' on quantum %s exited with partial outputs; "
"considering this a qualified success and proceeding.",
parejkoj (Contributor):

That's a good way to phrase it. I do wonder if it shouldn't be error here, or exception maybe (we're allowed to log an error even if the task doesn't halt execution, right?), so it doesn't get lost in our massive field of warnings, but I guess that's up to DRP to decide.

TallJimbo (Member Author):

I'm not sure if we're allowed to log error if we don't halt, but I don't like doing it because I think there's no common mental model of what it means.

parejkoj (Contributor):

In fact, we explicitly can! Per the only place I know of that we've written any such recommendation:
https://developer.lsst.io/stack/logging.html#log-levels

> ERROR: for errors that may still allow the execution to continue.
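The cited guideline matches how Python's own `logging` behaves: logging at ERROR is just a severity label on a record, not a control-flow construct, so execution continues normally afterwards. A trivial stdlib demonstration:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")

# Logging at ERROR neither raises nor halts; the call returns normally
# and the program keeps running.
log.error("recoverable problem encountered; continuing")
reached_after_error = True
print(reached_after_error)  # True
```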

TallJimbo (Member Author):

Ok, I'll switch to ERROR here.

task_node.label,
quantum.dataId,
)
_LOG.warning(error, exc_info=error)
parejkoj (Contributor):

Or make this one log.exception? Not sure.

TallJimbo (Member Author):

(see other threads)

Commit message: "Logging it right before it's re-raised leads to duplication in logs at higher levels."
TallJimbo (Member Author) commented Aug 26, 2024

> I don't know what executor AP uses, but if they use this, they'll have to reconfigure the new kwarg to be True, since they want the other behavior.

The Prompt Processing framework uses SeparablePipelineExecutor, and yes, @kfindeisen et al. will probably want to start passing raise_on_partial_outputs=True at construction there after this merges.

> Check your DO NOT MERGE commit.

Yup, this is needed to keep the rest of the GitHub Actions green, so our workflow is to drop it only after merging the branch for the upstream package.

kfindeisen (Member) commented:

> The Prompt Processing framework uses SeparablePipelineExecutor, and yes, @kfindeisen et al. will probably want to start passing raise_on_partial_outputs=True at construction there after this merges.

If you're making a behavioral and/or breaking change to the API, will this be announced/documented anywhere? ATM I have no basis for a decision either way.

TallJimbo (Member Author) commented:

It's establishing behavior that I think was basically ill-defined before; RFC-958 defined two options but did not state which would be the default, IIRC, and this ticket should really be considered a belated part of that RFC's implementation.

As to whether Prompt Processing actually wants to raise when calibrateImage produces only partial outputs, @parejkoj has been adamant that the answer is "yes", while I'm a little less certain, but am willing to defer to the AP team on this. (It's not that I expect the AP pipeline to be able to do much with those partial outputs, but failing a little later when a downstream task doesn't find what it needs - as DRP often would in these cases - seems fine.)

I'll make a community post before I merge this in any case; there is some chance we'll want to do a bit of work on some downstream tasks as we get used to the full implications of this RFC.

parejkoj (Contributor) commented:

Maybe better to leave this for discussion on Community, but I do wonder whether the default for this should be to fail on partial output errors, with DRP changing it in their pipelines? I don't know what we should train the users to expect, when running pipelines themselves: "reproducible errors halt" or "reproducible errors continue and may or may not cause problems further down the line"?

TallJimbo (Member Author) commented:

> I do wonder whether the default for this should be to fail on partial output errors, with DRP changing it in their pipelines? I don't know what we should train the users to expect, when running pipelines themselves: "reproducible errors halt" or "reproducible errors continue and may or may not cause problems further down the line"?

The vast majority of reproducible errors still halt, regardless of the new option - it's just the ones that also have a potentially-useful partial output that now default to proceeding. I think that's consistent with us wanting to process as far as we can in a lot of contexts (at least Rapid Analysis as well as DRP, and I think it's plausible alert production in special programs could go this direction, too, even if it doesn't in WFD). This does put the burden on the downstream tasks to handle those partial outputs, and I do suspect some of them are not ready for that responsibility yet (in particular, we'll get some confusing error messages from them at first). But I also think that's the only place that responsibility can realistically be.

Commit message: "This reverts commit 7b1fce0."

Commit message: "This logging was duplicative when seen from STDERR, which also gets exception tracebacks that propagate up, but we need error messages and tracebacks to appear in the saved logs, too, and this is the only place we can do that."
@TallJimbo TallJimbo merged commit b13e38d into main Aug 28, 2024
14 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-44488 branch August 28, 2024 14:29
3 participants