update multi-user signaling to track flux-security 0.13.0 IMP changes #6408
This can't be merged until a flux-security v0.12 is tagged which supports SIGUSR1 in the IMP. (SIGUSR1 will cause the current IMP to terminate immediately.)
Right, sorry, probably should've been a WIP.
LGTM!
I do notice a few places where flux-imp kill is still referenced:
- flux-exec.c still has an imp_kill() function
- flux-exec(1) mentions that it uses flux-imp kill to signal processes
- bulk-exec still has imp_kill_ops and associated functions
- t/job-exec/imp-fail.sh has an unnecessary imp kill implementation
Wow, don't know how I missed all that. OK, updated, and since the flux-security 0.12 tag was just pushed, this actually has a chance of passing CI so we'll see.
Some
I can recreate this environment on my test system by creating a shell script that runs sleep, configuring it in the IMP, and manually running I did observe:
When I run the same script directly with Still pondering this one - just wanted to post an update!
Possibly the IMP's internal Another observation is that
After chatting with @grondo, I opened flux-framework/flux-security#194 to change the signal forwarding behavior of
Updated to require flux-security 0.13, which should address the test failure. 🤞
I'll set MWP on this one too.
Problem: flux core now requires the IMP signal forwarding features of flux-security 0.13.0, but configure only checks for >= 0.9.0. Modify configure to require >= 0.13.0.
Problem: flux core now requires the IMP signal forwarding features of flux-security 0.13.0, but CI specifies 0.11.0. Modify docker-run-checks.sh to require the newer version.
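The version bump has to be compared numerically, component by component: 0.13.0 is newer than 0.9.0 even though it sorts lower as a plain string. A minimal Python sketch of that kind of minimum-version check is shown here. The helper names are hypothetical, and the real checks live in configure and docker-run-checks.sh, not in Python.

```python
def parse_version(text):
    """Split a dotted version such as '0.13.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(found, required="0.13.0"):
    """Return True if `found` is at least `required`, compared numerically."""
    return parse_version(found) >= parse_version(required)


if __name__ == "__main__":
    # Print whether each candidate version satisfies the 0.13.0 minimum
    for candidate in ("0.9.0", "0.11.0", "0.12.0", "0.13.0"):
        print(candidate, meets_minimum(candidate))
```

Comparing integer tuples avoids the string-ordering trap where "0.9.0" appears greater than "0.13.0".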
Problem: job-exec uses 'flux imp kill' to deliver signals to multi-user jobs but that command is deprecated. Don't call bulk_exec_set_imp_path().
Problem: RFC 15 states that the IMP handles SIGUSR1 by sending SIGKILL to the entire cgroup. For multi-user, send the IMP SIGUSR1 rather than SIGKILL after shell signaling mechanisms have failed to clean up. Update the faux IMP shell script used in the test.
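The IMP-side behavior described in this commit can be modeled as a small decision function. This is a hypothetical Python sketch, not the flux-security implementation and not the faux IMP script from the test suite: SIGUSR1 maps to SIGKILL for the whole cgroup, as the commit message summarizes RFC 15, and every other signal is assumed to be forwarded unchanged to the child.

```python
import signal


def imp_signal_action(signum):
    """Model what an IMP-like supervisor does with a received signal.

    Returns a (target, signal) pair, where target is "cgroup" or "child".
    The function name and the target labels are illustrative only.
    """
    if signum == signal.SIGUSR1:
        # SIGUSR1 acts as a proxy for SIGKILL: kill everything in the cgroup
        return ("cgroup", signal.SIGKILL)
    # Assumption: any other signal is forwarded as-is to the child process
    return ("child", signal.Signals(signum))
```

A proxy signal is needed because SIGKILL cannot be caught, so a supervisor that receives SIGKILL directly dies without a chance to clean up its children.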
Problem: housekeeping and perilog use 'flux imp kill' to send signals to housekeeping and prolog/epilog processes, but the IMP will now forward signals and 'flux imp kill' is deprecated. Don't call bulk_exec_set_imp_path() in housekeeping and perilog. Fixes flux-framework#6409
Problem: flux-exec uses 'flux imp kill' to send signals to remote process if the IMP started them, but the IMP will now forward signals and 'flux imp kill' is deprecated. Don't call 'flux imp kill'. When the IMP starts a process, translate SIGKILL to SIGUSR1 per RFC 15. Drop note about 'flux imp kill' from the flux-exec(1) man page.
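On the sending side, the translation described in this commit amounts to a one-line mapping. The Python sketch below is illustrative only: flux-exec itself is written in C, and `signal_for_target` and `started_by_imp` are hypothetical names. A SIGKILL aimed at a process started by the IMP becomes SIGUSR1, and everything else passes through unchanged.

```python
import signal


def signal_for_target(signum, started_by_imp):
    """Pick the signal to deliver to a remote process.

    When the IMP started the process, SIGKILL is replaced by SIGUSR1 so
    that the IMP, which cannot catch SIGKILL, is the one to deliver the kill.
    """
    if started_by_imp and signum == signal.SIGKILL:
        return signal.SIGUSR1
    return signum
```

For example, `signal_for_target(signal.SIGKILL, True)` yields `signal.SIGUSR1`, while the same call with `started_by_imp=False` leaves SIGKILL untouched.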
Problem: bulk_exec_set_imp_path() and associated code no longer has any users. Remove it. Update unit test.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6408       +/-   ##
===========================================
+ Coverage   53.96%   83.62%   +29.65%
===========================================
  Files         476      524       +48
  Lines       80288    87603     +7315
===========================================
+ Hits        43328    73254    +29926
+ Misses      36960    14349    -22611
Problem: job-exec uses flux imp kill to deliver SIGKILL to the flux-shell when shell signaling methods fail to clean up a multi-user job, but the flux imp kill sub-command is being deprecated in favor of having the IMP forward signals (per RFC 15).
This changes job-exec to send SIGUSR1 (which RFC 15 defines as a proxy for SIGKILL) directly to the IMP in that case.
To make it easier to coordinate the flux-core and flux-security changes, we'll add the #6409 fix here as well.