Skip to content

Commit

Permalink
job-exec: send SIGUSR1 to the IMP, not SIGKILL
Browse files Browse the repository at this point in the history
Problem: RFC 15 states that the IMP handles SIGUSR1 by
sending SIGKILL to the entire cgroup.

For multi-user, send the IMP SIGUSR1 rather than SIGKILL after
shell signaling mechanisms have failed to clean up.

Update test faux imp shell script used in test.
  • Loading branch information
garlick committed Nov 6, 2024
1 parent 1d636f9 commit 001097a
Show file tree
Hide file tree
Showing 3 changed files with 13 additions and 16 deletions.
14 changes: 11 additions & 3 deletions src/modules/job-exec/job-exec.c
Original file line number Diff line number Diff line change
Expand Up @@ -438,13 +438,21 @@ static void kill_shell_timer_cb (flux_reactor_t *r,
{
struct jobinfo *job = arg;
struct idset *active_ranks;
int actual_kill_signal = kill_signal;

/* RFC 15 states that the IMP handles SIGUSR1 by sending SIGKILL to
* the entire cgroup. Sending SIGKILL to the IMP is not productive.
*/
if (job->multiuser)
actual_kill_signal = SIGUSR1;

flux_log (job->h,
LOG_DEBUG,
"Sending %s to job shell for job %s",
sigutil_signame (kill_signal),
"Sending %s to %s for job %s",
sigutil_signame (actual_kill_signal),
job->multiuser ? "IMP" : "job shell",
idf58 (job->id));
(*job->impl->kill) (job, kill_signal);
(*job->impl->kill) (job, actual_kill_signal);
job->kill_shell_count++;

/* Since we've transitioned to killing the shell directly, stop the
Expand Down
10 changes: 0 additions & 10 deletions t/job-exec/imp-fail.sh
Original file line number Diff line number Diff line change
Expand Up @@ -14,16 +14,6 @@ case "$cmd" in
printf "test-imp: Going to fail on rank 1\n" >&2
if test $(flux getattr rank) = 1; then exit 0; fi
exec "$@" ;;
kill)
# Note: kill must be implemented in test since job-exec
# module will run `flux-imp kill PID`.
#
signal=$2;
pid=$3;
printf "test-imp: kill -$signal $pid\n" >&2
shift 3;
printf "test-imp: Kill pid $pid signal $signal\n" >&2
kill -$signal $pid ;;
*)
printf "test-imp: Fatal: Unknown cmd=$cmd\n" >&2; exit 1 ;;
esac
Expand Down
5 changes: 2 additions & 3 deletions t/t2404-job-exec-multiuser.t
Original file line number Diff line number Diff line change
Expand Up @@ -81,7 +81,7 @@ test_expect_success 'job-exec: reconfig and reload module' '
flux config reload &&
flux module reload -f job-exec
'
test_expect_success NO_ASAN 'job-exec: kill multiuser job uses the IMP' '
test_expect_success NO_ASAN 'job-exec: kill multiuser job works' '
FAKE_USERID=42 &&
flux run --dry-run -n2 -N2 sleep 1000 | \
flux python ${SIGN_AS} ${FAKE_USERID} > sleep-job.signed &&
Expand All @@ -91,8 +91,7 @@ test_expect_success NO_ASAN 'job-exec: kill multiuser job uses the IMP' '
jq -e ".userid == 42" < ${id}.json &&
flux job wait-event -p exec -vt 30 ${id} shell.start &&
flux cancel ${id} &&
test_expect_code 143 run_timeout 30 flux job status -v ${id} &&
flux dmesg | grep "test-imp: Kill .*signal 15"
test_expect_code 143 run_timeout 30 flux job status -v ${id}
'

# Configure failing IMP
Expand Down

0 comments on commit 001097a

Please sign in to comment.