Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update multi-user signaling to track flux-security 0.13.0 IMP changes #6408

Merged
merged 7 commits into from
Nov 6, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented Nov 1, 2024

Problem: job-exec uses flux imp kill to deliver SIGKILL to the flux-shell when shell signaling methods fail to clean up a multi-user job, but the flux imp kill sub-command is being deprecated in favor of having the IMP forward signals (per RFC 15).

This changes job-exec to send SIGUSR1 (which RFC 15 defines as a proxy for SIGKILL) directly to the IMP in that case.

To make it easier to coordinate the flux-core and flux-security changes, we'll add the #6409 fix here as well.

@grondo
Copy link
Contributor

grondo commented Nov 1, 2024

This can't be merged until a flux-security v0.12 is tagged which supports SIGUSR1 in the IMP. (SIGUSR1 will cause the current IMP to terminate immediately)

@garlick
Copy link
Member Author

garlick commented Nov 1, 2024

Right, sorry, probably should've been a WIP.

@garlick garlick changed the title job-exec: send SIGUSR1 to IMP instead using imp kill SIGKILL WIP job-exec: send SIGUSR1 to IMP instead using imp kill SIGKILL Nov 1, 2024
@garlick garlick changed the title WIP job-exec: send SIGUSR1 to IMP instead using imp kill SIGKILL update multi-user signaling to track flux-security 0.12.0 IMP changes Nov 3, 2024
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I do notice a few places where flux-imp kill is still referenced:

  • flux-exec.c still has an imp_kill() function
  • flux-exec(1) mentions that it uses flux-imp kill to signal processes
  • bulk-exec still has imp_kill_ops and associated functions
  • t/job-exec/imp-fail.sh has unnecessary imp kill implementation

@garlick
Copy link
Member Author

garlick commented Nov 4, 2024

Wow, don't know how I missed all that. OK, updated, and since the flux-security 0.12 tag was just pushed, this actually has a chance of passing CI so we'll see.

@garlick
Copy link
Member Author

garlick commented Nov 5, 2024

Some t9000-system.t tests are failing where we check that flux exec --with-imp forwards signals. Here's an excerpt from one failing test that uses the IMP to run a shell script that calls sleep:

sending signal 2 to 1 running processes
sudo timed out after 30.0s
test_expect_code: command exited with 137, we wanted 130 run_timeout 30 sudo -u flux ./test_signal.sh INT
not ok 8 - 0002-exec-with-imp.t: flux exec --with-imp forwards signals

I can recreate this environment this on my test system by creating a shell script that runs sleep, configuring it in the IMP, and manually running flux exec --with-imp, and sure enough, SIGINT to flux-exec appears to have no effect, and sending a SIGTERM kills the shell and the IMP but leaves the sleep and doesn't terminate.

I did observe:

  • manually sending SIGINT to the shell (script) also has no effect
  • manually sending SIGTERM to the shell (script) causes it to terminate but orphans the sleep command, which in turn prevented the subprocess server from finalizing the exec stream with ENODATA because the stdout/stderr streams were still open.

When I run the same script directly with flux exec (no IMP), those signals cause everything to wrap up as expected.

Still pondering this one - just wanted to post an update!

@garlick
Copy link
Member Author

garlick commented Nov 5, 2024

Possibly the IMP's internal fwd_signal() needs to behave differently when the target is not a shell. Currently it treats SIGUSR1 (surrogate for SIGKILL) specially and assumes it (the IMP) is responsible for cleaning up its child and grandchildren. Other signals are delegated to the direct child for distribution, which sounds right for flux-shell but perhaps not for a shell script. Maybe it needs a flag so when it is called from imp run it assumes it is cleaning up everything for any signal.

Another observation is that flux_subprocess_kill() by default uses killpg() while the IMP uses kill() when signaling the direct child.

@garlick
Copy link
Member Author

garlick commented Nov 5, 2024

After chatting with @grondo, I opened flux-framework/flux-security#194 to change the signal forwarding behavior of imp run.

@garlick garlick changed the title update multi-user signaling to track flux-security 0.12.0 IMP changes update multi-user signaling to track flux-security 0.13.0 IMP changes Nov 6, 2024
@garlick
Copy link
Member Author

garlick commented Nov 6, 2024

Updated to require flux-security 0.13, which should address the test failure. 🤞

@garlick
Copy link
Member Author

garlick commented Nov 6, 2024

I'll set MWP on this one too.

Problem: flux core now requires the IMP signal forwarding features
of flux-security 0.13.0, but configure only checks for >= 0.9.0.

Modify configure to require >= 0.13.0.
Problem: flux core now requires the IMP signal forwarding features
of flux-security 0.13.0, but CI specifies requires 0.11.0.

Modify docker-run-checks.sh to require the newer version.
Problem: job-exec uses 'flux imp kill' to deliver signals to multi-user
jobs but that command is deprecated.

Don't call bulk_exec_set_imp_path().
Problem: RFC 15 states that the IMP handles SIGUSR1 by
sending SIGKILL to the entire cgroup.

For multi-user, send the IMP SIGUSR1 rather than SIGKILL after
shell signaling mechanisms have failed to clean up.

Update test faux imp shell script used in test.
Problem: housekeeping and perilog use 'flux imp kill' to send
signals to housekeeping and prolog/epilog processes, but the
IMP will now forward signals and 'flux imp kill' is deprecated.

Don't call bulk_exec_set_imp_path() in housekeeping and perilog.

Fixes flux-framework#6409
Problem: flux-exec uses 'flux imp kill' to send signals to remote
process if the IMP started them, but the IMP will now forward signals
and 'flux imp kill' is deprecated.

Don't call 'flux imp kill'.

When the IMP starts a process, translate SIGKILL to SIGUSR1 per RFC 15.

Drop note about 'flux imp kill' from the flux-exec(1) man page.
Problem: bulk_exec_set_imp_path() and associated code no longer has
any users.

Remove it.

Update unit test.
@mergify mergify bot merged commit 0afe5a3 into flux-framework:master Nov 6, 2024
32 checks passed
Copy link

codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.

Project coverage is 83.62%. Comparing base (9411280) to head (f6cf51d).
Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
src/cmd/flux-exec.c 71.42% 2 Missing ⚠️
src/modules/job-exec/job-exec.c 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6408       +/-   ##
===========================================
+ Coverage   53.96%   83.62%   +29.65%     
===========================================
  Files         476      524       +48     
  Lines       80288    87603     +7315     
===========================================
+ Hits        43328    73254    +29926     
+ Misses      36960    14349    -22611     
Files with missing lines Coverage Δ
src/common/libsubprocess/bulk-exec.c 83.42% <100.00%> (+22.29%) ⬆️
src/modules/job-exec/exec.c 81.84% <ø> (+34.22%) ⬆️
src/modules/job-manager/housekeeping.c 82.67% <ø> (+68.39%) ⬆️
src/modules/job-manager/plugins/perilog.c 82.72% <ø> (+40.90%) ⬆️
src/modules/job-exec/job-exec.c 77.01% <80.00%> (+36.25%) ⬆️
src/cmd/flux-exec.c 81.69% <71.42%> (+19.49%) ⬆️

... and 441 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants