Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signal handling in main process changed after forking a new worker process #33

Closed
prajnoha opened this issue Dec 17, 2019 · 31 comments
Closed
Labels
bug Something isn't working

Comments

@prajnoha
Copy link
Member

prajnoha commented Dec 17, 2019

Normally, SID daemon reacts to SIGTERM and it shuts itself down gracefully upon reception of this signal. However, if SID daemon forks at least one worker process (and that one can finish completely), the signal handler is not working correctly further - there's no reaction, the signal handler is not called at all.

This seems to be related to the code in worker_control_get_new_worker where a new worker process is forked from daemon. After forking, in the worker process, we separate several resources we also need in the worker and we attach all of these inherited resources to top-level "worker" resource we've just created. Then we destroy the rest of unneeded resources in the worker process, including the "sid" resource which had SIGTERM and SIGINT signals registered (and which was previously the top-level resource).

Above-mentioned sequence is executed in worker-control code right here:

https://github.com/sid-project/sid-mvp/blob/0cd4698aa6d94d4b59cdbb892c689632d8572759/src/resource/worker-control.c#L137-L142

All works as expected (SIGTERM signal handling is OK) after commenting out this line: https://github.com/sid-project/sid-mvp/blob/0cd4698aa6d94d4b59cdbb892c689632d8572759/src/resource/worker-control.c#L142

It looks that registering and then deregistering signal handlers (even though for various event loops) get messed up somehow...

Obviously, this is an issue when we're trying to stop SID daemon (...if that's systemctl stop sid.service, then the systemctl call halts because it is unable to stop the service - that will never stop as the signal is not processed).

@prajnoha
Copy link
Member Author

prajnoha commented Dec 17, 2019

Also, it's interesting that we register SIGINT exactly the same way as SIGTERM (both in sid resource in parent daemon process and worker resource in the worker process). But it's only SIGTERM handling that has issues.

@prajnoha prajnoha added the bug Something isn't working label Dec 17, 2019
@prajnoha
Copy link
Member Author

One more important note: the SIGTERM handling is not correct in the main daemon process - that's where the problem is. Thing is, the problem appears even after the worker process is already gone/exited completely and we're left only with the main daemon process and now workers running!

Instantiating the worker process only initiates the problem, but further worker existence is not needed for this problem to occur.

@trgill
Copy link
Contributor

trgill commented Jan 7, 2021

I think a systemd process should be blocking signals and using sd_event_add_signal() to get notifications (?):

https://man7.org/linux/man-pages/man3/sd_event_signal_handler_t.3.html

We should do something like:

if (sigemptyset(&sig_set) < 0) {

But update this:

signal(SIGTERM, &_shutdown_signal_handler);

to use sd_event_add_signal().

I'll make the updates and test it out.

@prajnoha
Copy link
Member Author

prajnoha commented Jan 11, 2021

I think a systemd process should be blocking signals and using sd_event_add_signal() to get notifications (?):

https://man7.org/linux/man-pages/man3/sd_event_signal_handler_t.3.html

Yes, exactly, systemd requires the signals to be blocked first before registering them with systemd.

But update this:

signal(SIGTERM, &_shutdown_signal_handler);

to use sd_event_add_signal().

Ah, sure, this should be cleaned up! However, my intention was to keep all systemd-related code in one place (one of SID's own libraries where it can wrap it). Currently, the calls to systemd lib fns is kept in resource.c (libsidresource) and log-target-journal.c (libsidlog) - just for us to be able to switch into using a different library in the future if there's an alternative for other environments (e.g. #32), making switching easier.

We can possible try to move the _become_daemon functionality inside resource code and provide that as an option somehow when creating resources.

Then, if we called:

if (!(sid_res = sid_resource_ref(sid_resource_create(SID_RESOURCE_NO_PARENT,

...we'd passed that within one of sid_resource_create args (maybe with as the sid_resource_flags_t flags one). Then making sure we become daemon only once, so any further resource creations with the "become daemon" request would fail if we're already a daemon. Just an idea, but we can clean this up in a different way... simply, the intention is to keep the systemd-related code in one place and wrapped so we can switch into an alternative if needed.

@trgill
Copy link
Contributor

trgill commented Jan 12, 2021

The forking-and-exiting only makes sense if we start the daemon manually from a terminal and not with systemd, right? I see the sid.service file starts it in the foreground.

Currently the parent hits the select() and gets the SIGTERM from the child and exits without handling the signal. This leaves a defunct process until the child exits.

For the child, we'd like it to field the SIGTERM and exit the event loop, right? I think we need to make some updates for that to work.

This is probably a low priority issue, right? I picked it somewhat randomly. I'll look through the other issues to see where I might be able to help. Please let me know if you have any suggestions.

@prajnoha
Copy link
Member Author

The forking-and-exiting only makes sense if we start the daemon manually from a terminal and not with systemd, right? I see the sid.service file starts it in the foreground.

Yes, indeed, if systemd is used, it's not necessary to daemonize the process, systemd will take care of that for us...

Currently the parent hits the select() and gets the SIGTERM from the child and exits without handling the signal. This leaves a defunct process until the child exits.

For the child, we'd like it to field the SIGTERM and exit the event loop, right? I think we need to make some updates for that to work.

For now, yes, it was supposed to exit the event loop immediately. I'd like to enhance this eventually so that if SID receives SIGTERM, it will stop listening to any new requests and let all existing workers to finish before SID exits completely.

There's one important thing to note when considering signal handlers through systemd lib. If we don't specify the handler for sid_resource_create_signal_event_source (and so the underlying sd_event_add_signal call), the default operation executed by systemd lib is to exit its event loop. If the custom handler is defined, systemd lib lets us handle everything, including possible exit of the event loop. In this case, we need to call sid_resource_exit_event_loop (and so the underlying sd_event_exit) to do so.

This is probably a low priority issue, right? I picked it somewhat randomly. I'll look through the other issues to see where I might be able to help. Please let me know if you have any suggestions.

Relatively low, but it was very annoying so I'm happy it's being resolved now :) Sorry, I promised to open issues here on github for all the things I've hit and have written down in my notes. Somehow I didn't get to that, but will try to do that asap.

@trgill
Copy link
Contributor

trgill commented Jan 15, 2021

Is this ok to close now?

It is fixed by : #50

@prajnoha
Copy link
Member Author

Sure! Closing...

@prajnoha
Copy link
Member Author

Hmm, interesting, after resolving #52, this still seems to be an issue even after we got rid of the direct signal registration in commit a07f331. When I try to send SIGTERM or SIGINT to main SID process after at least one worker is processed, the signal is ignored.

So It looks like this call in the worker still influences the parent process somehow:

/* destroy the unneeded and inherited "sid" resource from parent */
(void) sid_resource_unref(sid_resource_search(worker_control_res, SID_RESOURCE_SEARCH_TOP, NULL, NULL));

@prajnoha prajnoha reopened this Jan 18, 2021
@trgill
Copy link
Contributor

trgill commented Jan 20, 2021

Yeah, after the following in the log:

150701:150792]:resource.c:111:_destroy_event_source    <worker/0bc7df7b-6259-43fa-89e5-cb84043c6185> Event source removed: sigterm.
[150701:150792]:resource.c:111:_destroy_event_source    <worker/0bc7df7b-6259-43fa-89e5-cb84043c6185> Event source removed: sigint.
[150701:150792]:resource.c:111:_destroy_event_source    <worker/0bc7df7b-6259-43fa-89e5-cb84043c6185> Event source removed: main.

The parent process no longer responds to the signals.

I can't find documentation that confirms that is how systemd is designed to work. I'll track down a mailing list or IRC channel to confirm what we are seeing is working as intended.

@prajnoha
Copy link
Member Author

I see that systemd doesn't want users to use the event loop after forking, this is a check they call on each sd-event API call:

https://github.com/systemd/systemd/blob/23afa884d4f3bcd97160a893816f9ba170f62ad4/src/libsystemd/sd-event/sd-event.c#L430-L437

We should be still OK though as we're not using the event loop from the parent in any children anymore (we create a new one in the worker).

That's also the reason I'm using unref in sid_resource_destroy:

sid/src/resource/resource.c

Lines 309 to 310 in 99b185d

if (res->sd_event_loop)
res->sd_event_loop = sd_event_unref(res->sd_event_loop);

The unref doesn't do the check. So this one could still be used for doing the cleanup in the child at least. (If we called sd_event_exit instead, that one would still try to finish current event loop iteration I think which would be wrong because this is not the original even loop anymore, it's a memory clone in the child...)

@trgill
Copy link
Contributor

trgill commented Jan 21, 2021

FYI - I added 👍

	if ((r = sd_event_source_set_destroy_callback(sd_es, _on_sid_resource_destroyed)) < 0)
			goto fail;

It does not get called in the parent.

I tried to reproduce this issue with a smaller program:

https://github.com/trgill/signal_test

The does not reproduce the problem we are seeing. I'm in the process of making it closer to what sid does.

@trgill
Copy link
Contributor

trgill commented Jan 21, 2021

With these changes:

master...trgill:signal_test

The signal is delivered to _sd_signal_event_handler() after the child processes have completed. The data variable isn't set right because I commented out the _create_event_source(). I think this is a clue about what is going wrong.

I'm looking through _on_worker_proxy_child_event() which is called in the parent process. I'm wondering if there is a problem in the worker proxy. It isn't clear to me how the worker proxy lifetime is designed to work yet.

@prajnoha
Copy link
Member Author

prajnoha commented Jan 22, 2021

With these changes:

master...trgill:signal_test

The signal is delivered to _sd_signal_event_handler() after the child processes have completed. The data variable isn't set right because I commented out the _create_event_source(). I think this is a clue about what is going wrong.

Odd... I'm just looking at the code in the _create_event_source, but I don't see anything suspicious at first sight - that code is just wrapping the sd_event_source *sd_es, setting data and description, but nothing unusual... OK, will look around that code too and I'll see if I can spot something...

I'm looking through _on_worker_proxy_child_event() which is called in the parent process. I'm wondering if there is a problem in the worker proxy. It isn't clear to me how the worker proxy lifetime is yet.

The worker_proxy exists as long as the worker exists - the proxy is just a representation on parent side so we can track workers there, their state and have a communication endpoint represented as a resource... maybe collecting some statistic in the future if we used workers for longer-running commands and/or have a way/start recycling workers from a pool.

Anyway, once the worker is finished, the worker_proxy should be destroyed too - once the proxy receives signal from the child that has just exited, it destroys itself here:

(void) sid_resource_destroy(worker_proxy_res);

@trgill
Copy link
Contributor

trgill commented Jan 22, 2021

In _create_event_source() - we want to add the new_es->list or new_es?

-	list_add(&res->event_sources, &new_es->list);
+	list_add(&res->event_sources, &new_es);

@prajnoha
Copy link
Member Author

prajnoha commented Jan 22, 2021

In _create_event_source() - we want to add the new_es->list or new_es?

The list_add(&res->event_sources, &new_es->list); is OK, but as you pointed out on irc, we have list_init missing for the list. Sorry, got confused a bit, this is a list header not a list itself so the list_init is not used there, the list_add call in that function will initialize this structure (adding the list header to a list).

@trgill
Copy link
Contributor

trgill commented Jan 22, 2021

But we need to initialize the next and previous pointers, right?

	new_es->list.n = new_es;
	new_es->list.p = new_es;

@trgill
Copy link
Contributor

trgill commented Jan 22, 2021

Nevermind - I see.

@prajnoha
Copy link
Member Author

Well, it really looks like that the child process is fiddling with the parent's signal registrations!

When I try to run this under gdb, then tracing the worker process calling this line:

/* destroy the unneeded and inherited "sid" resource from parent */
(void) sid_resource_unref(sid_resource_search(worker_control_res, SID_RESOURCE_SEARCH_TOP, NULL, NULL));

...which in turn calls sid_resource_destroy for the inherited sid resource that had the SIGTERM and SIGINT (and other) signal handlers registered in parent, I get to this backtrace (including systemd's sd-event library):

event_unmask_signal_data (e=0x4059e0, d=0x406000, sig=15) at	../src/libsystemd/sd-event/sd-event.c:713
#1  0x00007ffff7d7d462 in event_gc_signal_data (e=0x4059e0, priority=0x405f00, sig=15) at ../src/libsystemd/sd-event/sd-event.c:748
#2  0x00007ffff7d7da4e in source_disconnect (s=0x405ed0) at ../src/libsystemd/sd-event/sd-event.c:849
#3  0x00007ffff7d7e033 in source_free (s=0x405ed0) at ../src/libsystemd/sd-event/sd-event.c:945
#4  0x00007ffff7d82737 in event_source_free (s=0x405ed0) at ../src/libsystemd/sd-event/sd-event.c:2072
#5  0x00007ffff7d82834 in sd_event_source_unref	(p=0x405ed0) at	../src/libsystemd/sd-event/sd-event.c:2077
#6  0x00007ffff7fa98a9 in _destroy_event_source	(es=0x406100) at resource.c:115
#7  0x00007ffff7fa95ed in sid_resource_destroy (res=0x405900) at resource.c:306
#8  0x00007ffff7fa998e in sid_resource_unref (res=0x405900) at resource.c:352
#9  0x00007ffff7fb9c2d in worker_control_get_new_worker	(worker_control_res=0x406b50, params=0x7fffffffdce0) at	worker-control.c:710
#10 0x00007ffff7fb54bd in _on_ubridge_interface_event (es=0x440680, fd=11, revents=1, data=0x4068a0) at	ubridge.c:3927
#11 0x00007ffff7fa9d6a in _sd_io_event_handler (sd_es=0x425220,	fd=11, revents=1, data=0x440680) at resource.c:422
#12 0x00007ffff7d889a2 in source_dispatch (s=0x425220) at ../src/libsystemd/sd-event/sd-event.c:3490
#13 0x00007ffff7d8a569 in sd_event_dispatch (e=0x4059e0) at ../src/libsystemd/sd-event/sd-event.c:3942
#14 0x00007ffff7d8aa55 in sd_event_run (e=0x4059e0, timeout=18446744073709551615) at ../src/libsystemd/sd-event/sd-event.c:4003
#15 0x00007ffff7d8ac3a in sd_event_loop	(e=0x4059e0) at	../src/libsystemd/sd-event/sd-event.c:4024
#16 0x00007ffff7fab2ae in sid_resource_run_event_loop (res=0x405900) at	resource.c:880
#17 0x000000000040147c in main (argc=3,	argv=0x7fffffffe1b8) at	sid.c:186

So we're at event_unmask_signal_data in sd-event. If I look at that code exactly:

https://github.com/systemd/systemd/blob/77c93e0a4eddd66a8a4be02d80ffca365b19b767/src/libsystemd/sd-event/sd-event.c#L713

There, sd-event code calls the signalfd to modify the signal handling and here, the d->fd it uses is 9.

Now, right at this point, if I look at sid main process and worker process and their FDs, I see (PID 79268 is sid main process, PID 79295 is its worker process):

/proc/79268/fd 
❯ ls -l
total 0
lrwx------. 1 root root 64 Jan 25 22:21 0 -> /dev/pts/5
lrwx------. 1 root root 64 Jan 25 22:21 1 -> /dev/pts/5
lrwx------. 1 root root 64 Jan 25 22:21 10 -> /dev/mapper/control
lrwx------. 1 root root 64 Jan 25 22:21 11 -> 'socket:[113033]'
lrwx------. 1 root root 64 Jan 25 22:21 12 -> 'socket:[113034]'
lrwx------. 1 root root 64 Jan 25 22:21 13 -> 'socket:[113036]'
lrwx------. 1 root root 64 Jan 25 22:21 14 -> 'anon_inode:[pidfd]'
lrwx------. 1 root root 64 Jan 25 22:21 2 -> /dev/pts/5
l-wx------. 1 root root 64 Jan 25 22:21 3 -> /root/.cgdb/logs/cgdb_log1.txt
l-wx------. 1 root root 64 Jan 25 22:21 4 -> /root/.cgdb/logs/cgdb_gdb_io_log1.txt
lrwx------. 1 root root 64 Jan 25 22:21 5 -> /dev/ptmx
lrwx------. 1 root root 64 Jan 25 22:21 6 -> /dev/pts/3
lr-x------. 1 root root 64 Jan 25 22:21 7 -> /dev/pts/4
lrwx------. 1 root root 64 Jan 25 22:21 8 -> 'anon_inode:[eventpoll]'
lrwx------. 1 root root 64 Jan 25 22:21 9 -> 'anon_inode:[signalfd]'
/proc/79295/fd 
❯ ls -l
total 0
lrwx------. 1 root root 64 Jan 25 22:21 0 -> /dev/pts/5
lrwx------. 1 root root 64 Jan 25 22:40 1 -> /dev/pts/5
lrwx------. 1 root root 64 Jan 25 22:40 10 -> /dev/mapper/control
lrwx------. 1 root root 64 Jan 25 22:40 13 -> 'anon_inode:[eventpoll]'
lrwx------. 1 root root 64 Jan 25 22:40 14 -> 'socket:[113037]'
lrwx------. 1 root root 64 Jan 25 22:40 15 -> 'anon_inode:[signalfd]'
lrwx------. 1 root root 64 Jan 25 22:40 2 -> /dev/pts/5
l-wx------. 1 root root 64 Jan 25 22:40 3 -> /root/.cgdb/logs/cgdb_log1.txt
l-wx------. 1 root root 64 Jan 25 22:40 4 -> /root/.cgdb/logs/cgdb_gdb_io_log1.txt
lrwx------. 1 root root 64 Jan 25 22:40 5 -> /dev/ptmx
lrwx------. 1 root root 64 Jan 25 22:40 6 -> /dev/pts/3
lr-x------. 1 root root 64 Jan 25 22:40 7 -> /dev/pts/4
lrwx------. 1 root root 64 Jan 25 22:40 8 -> 'anon_inode:[eventpoll]'
lrwx------. 1 root root 64 Jan 25 22:40 9 -> 'anon_inode:[signalfd]'

In the worker, there are 2 signalfds - one (presumably FD 9) is the inherited one, the other with FD 15 is a new one created in the worker for the worker's event loop (I suppose systemd creates one signalfd per event loop if signal events are registered).

So I think we need to avoid this interference somehow... not sure yet how right at this moment, but we need to figure something out here if we want to do the proper cleanup in the worker and get rid of unused inherited stuff.

@trgill
Copy link
Contributor

trgill commented Jan 27, 2021

Thanks for looking at this - I appreciate the help.

I haven't been able to reproduce the behaviour with a simple test program; https://github.com/trgill/signal_test - I'm still looking to see why it is different.

Please let me know what you think about avoiding the problem. Have you considered a thread pool rather than using fork()?

@prajnoha
Copy link
Member Author

I haven't been able to reproduce the behaviour with a simple test program; https://github.com/trgill/signal_test - I'm still looking to see why it is different.

...yes, that adds a bit of mystery to it :) I'll also try to look once again whether there's anything we're missing...

Please let me know what you think about avoiding the problem. Have you considered a thread pool rather than using fork()?

Honestly, I don't have a nice cure for this right at this moment - removing event sources completely before fork just for the child to not inherit what it's not supposed to and then after fork reinstate the event sources is something I'd really like to avoid. But there's also sd_event_source_get_enabled to temporarily enable/disable the event source - I'll try to check if that could help us a bit or not.

I'd like to keep the forking in there because that keeps the separate process for each run where module code is also executed and then there's the COW memory feature after forking we use - so we can use a snapshot of the in-memory database easily. With threads, this would require a completely different scheme, including more locking....

@trgill
Copy link
Contributor

trgill commented Feb 3, 2021

Sorry if I'm missing something obvious... just have a question.

When I set a breakpoint on:

r = sid_resource_run_event_loop(res);

I get a call stack of:

libsidresource.so.0!worker_control_get_new_worker(sid_resource_t * worker_control_res, struct worker_params * params) (/home/todd/workspace/sid/src/resource/worker-control.c:792)
libsidresource.so.0!_on_ubridge_interface_event(sid_resource_event_source_t * es, int fd, uint32_t revents, void * data) (/home/todd/workspace/sid/src/resource/ubridge.c:3927)
libsidresource.so.0!_sd_io_event_handler(sd_event_source * sd_es, int fd, uint32_t revents, void * data) (/home/todd/workspace/sid/src/resource/resource.c:424)
libsystemd.so.0!source_dispatch(sd_event_source * s) (/home/todd/workspace/systemd/src/libsystemd/sd-event/sd-event.c:3498)
libsystemd.so.0!sd_event_dispatch(sd_event * e) (/home/todd/workspace/systemd/src/libsystemd/sd-event/sd-event.c:3950)
libsystemd.so.0!sd_event_run(sd_event * e, uint64_t timeout) (/home/todd/workspace/systemd/src/libsystemd/sd-event/sd-event.c:4011)
libsystemd.so.0!sd_event_loop(sd_event * e) (/home/todd/workspace/systemd/src/libsystemd/sd-event/sd-event.c:4032)
libsidresource.so.0!sid_resource_run_event_loop(sid_resource_t * res) (/home/todd/workspace/sid/src/resource/resource.c:882)
main(int argc, char ** argv) (/home/todd/workspace/sid/src/daemon/sid.c:186)

The new fork()'d (worker proxy) copy of the main process starts another main loop in the event dispatch of the top level event loop?

The there is 1 worker proxy per worker process? The worker proxy collects the results from the worker and report back to the main process/database?

@prajnoha
Copy link
Member Author

prajnoha commented Feb 4, 2021

The new fork()'d (worker proxy) copy of the main process starts another main loop in the event dispatch of the top level event loop?

The worker_proxy is on parent side, the worker is on child process side. The new event loop is started only in the worker resource, that is on child process side - the old event loop that got inherited from parent should be destroyed.

Now, if you're asking whether we call fork() in the event dispatch of the main event loop of the parent process - YES. And that's something we can probably discuss further whether it's OK to do it this way or not. This is indeed something I wanted to review and it is a painful spot.

Some notes I have regarding this topic:

  • If we call fork() buried somewhere deep inside the event dispatch call stack, then that call stack gets inherited in the child process. That's what we do right now.

  • If we wanted to start the fork() completely without any existing call stack (out of event loop dispatch), we'd need a different process that the main process with the main event loop contacts and the only purpose of this second process would be to act as a template to create new workers. It's extra complexity though.

  • Another option could be to completely exit the main event loop on each request that needs to create a worker process. Then we'd create the new process and after that, we'd start the main event loop again. But I guess it would get complicated to avoid race conditions where the worker could send results before we restart the main event loop in main process etc.

The there is 1 worker proxy per worker process? The worker proxy collects the results from the worker and report back to the main process/database?

Exactly. The worker_proxy is the worker representative on parent side, while the worker resource is directly the top-level resource used in the worker process. There's 1:1 mapping between the two.

The main process actually consists of these resources (where necessary to avoid ambiguity, I inserted identifier in parenthesis):


sid
  |_ubridge
          |_aggregate(internal)
                              |_kv_store
                              |
                              |_worker_control
                              |              |_worker_proxy(id1)
                              |              |_worker_proxy(id2)
                              |              |_...
                              |
                              |_aggregate(modules)
                                                 |_module_registry(type)
                                                 |                     |_module(id1)
                                                 |                     |_module(id2)
                                                 |                     ...
                                                 |
                                                 |_module_registry(block)
                                                                        |_module(id1)
                                                                        |_module(id2)
                                                                        ...                 

The sid is top level resource that represents the daemon itself. The ubridge resource is responsible for keeping the main database (kv_store resource), controlling and monitoring worker processes (worker_control resource), preloading modules (the aggregate(modules) resource with two module_registry subresources - one for type-based modules (executed only for certain device types) and block-based modules (executed for all device types).

When a new worker is created, it has this resource tree:

worker
     |_kv_store
     |
     |_aggregate(modules)
                        |_module_registry(type)
                        |                     |_module(id1)
                        |                     |_module(id2)
                        |                     ...
                        | 
                        |_module_registry(block)
                                               |_module(id1)
                                               |_module(id2)
                                               ...

That means, we reuse the inherited kv_store and aggregate(modules) resource with their subtrees. This is isolated and attached to worker resource in _worker_init_fn function that is coded in ubridge.c file.

Then the rest is supposed to be destroyed - that is the sid, ubridge resource (with its remaining internal and worker_control subresources).

The problem we have is with destroying signalfd that is used by sd-event library to get notified about signal events (the signals we registered through sd-event in sid resource).

I was playing with this a bit more and finally I managed to reproduced, I'll attach my reproducers in a while...

@prajnoha
Copy link
Member Author

prajnoha commented Feb 4, 2021

I can reproduce the problem with the code attached below - it's pure epoll with signalfd where SIGINT, SIGTERM, SIGUSR1 is registered (epoll is used by sd-event library inside). The INT and TERM terminate the program, the USR1 causes the process to call fork.

In the forked process, I remove the INT and USR1 signals and only keep the TERM signal by calling its own signalfd - but this one sets the parent :) That means, that after the child has called the signalfd, the parent doesn't react to INT and USR1 signals, only to TERM signal.

The man 2 signalfd says only this about fork():

 fork(2) semantics
       After a fork(2), the child inherits a copy of the signalfd file descriptor.  A read(2) from the file descriptor in the child will return information about signals queued to the child.

However, it doesn't say a word about the write behavior! Which is, as I found out, that the child will change the parent's signal notifications.

This is the code for the reproducer (I've also tried your code with sd-event, though I changed a bit and it's reproducible with that one too):

#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#define handle_error(msg) \
   do { perror(msg); exit(EXIT_FAILURE); } while (0)

#define MAX_EVENTS 1

int main(int argc, char *argv[])
{
   sigset_t mask;
   int sfd;
   struct signalfd_siginfo fdsi;
   ssize_t s;
   pid_t pid;

   int epoll_fd;
   struct epoll_event event;
   struct epoll_event events[MAX_EVENTS];
   int event_count;

   sigemptyset(&mask);
   sigaddset(&mask, SIGINT);
   sigaddset(&mask, SIGTERM);
   sigaddset(&mask, SIGUSR1);

   if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
       handle_error("sigprocmask");

   sfd = signalfd(-1, &mask, 0);
   if (sfd == -1)
       handle_error("signalfd");

   epoll_fd = epoll_create1(0);
   if (epoll_fd == -1)
           handle_error("epoll_create");

   event.events = EPOLLIN;
   event.data.fd = sfd;

   if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sfd, &event))
           handle_error("epoll_ctl");

   while (1) {
           printf("running epoll_wait in parent with PID %d...\n", getpid());
           event_count = epoll_wait(epoll_fd, events, MAX_EVENTS, 3000);
           printf("epoll_wait returned %d event count\n", event_count);

           for (int i = 0; i < event_count; i++) {
                   if (events[i].data.fd == sfd) {
                       s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));

                       if (s != sizeof(struct signalfd_siginfo))
                           handle_error("read");

                       if (fdsi.ssi_signo == SIGINT) {
                           printf("Got SIGINT, exiting...\n");
                           exit(EXIT_SUCCESS);
                       }

                       else if (fdsi.ssi_signo == SIGTERM) {
                           printf("Got SIGTERM, exiting...\n");
                           exit(EXIT_SUCCESS);
                       }

                       else if (fdsi.ssi_signo == SIGUSR1) {
                           printf("Got SIGUSR1, calling fork...\n");

                           pid = fork();
                           if (pid < 0)
                                   handle_error("fork");

                           if (pid == 0) {
                                   /* 
                                    * Use empty mask to clear all signal
                                    * handling inherited from parent, just keep the SIGTERM.
                                    */
                                   sigemptyset(&mask);
                                   sigaddset(&mask, SIGTERM);
                                   
                                   if (signalfd(sfd, &mask, 0) < 0)
                                           handle_error("signalfd(child)");

                                   printf("exiting child with pid %d...", getpid());
                                   exit(EXIT_SUCCESS);
                           }
                       }
                       
                       else {
                           printf("Read unexpected signal\n");
                       }
                   }
           }
   }
}

@prajnoha
Copy link
Member Author

prajnoha commented Feb 4, 2021

This means sd-event library is not quite usable for our needs (creating worker processes) due to the behavior of the signalfd it uses. For us to do proper cleanup, we'd need access to the FD as returned from signalfd in parent process and close it in the child process(? ...I havent tried yet what happens if I close the signalfd in the child) But sd-event abstracts this away and it doesn't have such API hooks to manipulate the signalfd FD directly or do any kind of special cleanup which would be suitable for child processes.

@bmarzins
Copy link
Member

bmarzins commented Feb 5, 2021

Looking at the kernel code, it is clear that this is how signalfd is meant to work. Just like with a normal fd, the signalfd is copied on fork, but the underlying file is the same, which includes the data that's connected to that file with the signal mask. When the file is read, the kernel checks the pid, and returns the pending signals for that process, but there is still only one set of signal data associated with the file, and changing it from any process changes it for every process.

I don't see any good way of using systemd's signal event source, since as you mentioned, there doesn't appear to be any way to clean it up without changing the signal mask of the parent. systemd is designed to not change the signal mask when you clear all the signals from the signalfd mask, but since there are still other signals being handled by signalfd when you start to remove the signal event sources, it changes the mask. Fortunately systemd has special code for handling cleaning up SIGCHLD signal, so that it doesn't change the mask for SIGCHLD in the parent when the child destroys a child event source.

So as far as I can tell, the two easiest ways to fix this are:

  1. The "Self-Pipe Trick" (you can google it). Basically, you make your signal handlers write to a non-blocking pipe, and then add the read end of the pipe as an IO event source. This does mean that signals will now cause functions to fail with EINTR, but the code should probably already be able to deal with that.
  2. Simply create our own signalfd to handle the singals, and add it as an IO event source. We can then close this while cleaning up the sid resource in the worker, without updating the signal mask.

The child will need to reset up the signal IO source, whichever we used. In general, any resource with it's own event loop probably needs a file descriptor for signals, either a signalfd, or an end of a pipe.

Peter, does that make sense? Do you have any thoughts about how you'd like this solved?

@prajnoha
Copy link
Member Author

prajnoha commented Feb 5, 2021

Looking at the kernel code, it is clear that this is how signalfd is meant to work. Just like with a normal fd, the signalfd is copied on fork, but the underlying file is the same, which includes the data that's connected to that file with the signal mask. When the file is read, the kernel checks the pid, and returns the pending signals for that process, but there is still only one set of signal data associated with the file, and changing it from any process changes it for every process.

Yeah, looking at this from the perspective of FD, it makes sense... it's not special-cased...

So as far as I can tell, the two easiest ways to fix this are:

  1. Simply create our own signalfd to handle the singals, and add it as an IO event source. We can then close this while cleaning up the sid resource in the worker, without updating the signal mask.

Good point! :) This option looks neat and straightforward. I'd go with this one...

The child will need to reset up the signal IO source, whichever we used. In general, any resource with it's own event loop probably needs a file descriptor for signals, either a signalfd, or an end of a pipe.

Yup!

I'd patch the sid_resource_create_signal_event_source here for it to use its own signalfd instead of calling out to sd-event's sd_event_add_signal. This way the rest of the code calling the sid_resource_create/destroy_signal_event_source could stay as it is without any changes.

I think we can put the signalfd FD together with the sd_event_loop into a single substructure within struct sid_resource so it's clear that those two are closely related, like this:

typedef struct sid_resource {
  ...
  struct {
     sd_event *sd_event_loop;
     int signalfd;
  } event_loop;
  ...
} sid_resource_t

There's also pid_created recorded inside struct sid_resource, so we can probably use this one to detect that we're now in the child process, not the original parent process anymore and so we can make the sid_resource_add/destroy_signal_event_source to behave based on that to not mess up parent signals.

Then, though, there's a question related to this - if the parent didn't have the signalfd used (or it was closed) and we create a new one in the child process, then the pid_created is not quite useful for that check then... So this needs some fine tuning still, maybe adding some restrictions in a way that we'd require the event loop and signalfd to be set afresh in child processes(?)... but I think we're on a good way here for a solution.

If we couldn't find a way to do the proper cleaning automatically and let the resource code decide on its own we could simply add an API function to resource.h to explicitly call the cleanup in the child. But would be really great if we could tackle this automatically and keep this inside resource code and we wouldn't need the rest of the code to bother about any extra calls just for child processes...

@prajnoha
Copy link
Member Author

prajnoha commented Feb 5, 2021

(Hmm, just one question that comes to my mind is what happens when I create more than one singalfd FD with different masks? Signals are global per process, so I'm curious what happens here. I haven't tried that yet...)

@prajnoha prajnoha changed the title SIGTERM signal not handled correctly in daemon after forking a worker process Signal handling in main process changed after forking a new worker process Feb 5, 2021
@prajnoha
Copy link
Member Author

prajnoha commented Feb 5, 2021

I don't see any good way of using systemd's signal event source, since as you mentioned, there doesn't appear to be any way to clean it up without changing the signal mask of the parent. systemd is designed to not change the signal mask when you clear all the signals from the signalfd mask, but since there are still other signals being handled by signalfd when you start to remove the signal event sources, it changes the mask.

....well, yes, right, so it's a bit like "destroy all or nothing" for signals event sources then - looks we can't (or would be harder) to still keep the individual sid_resource_destroy_signal_event_source calls, we'd need to shut down the whole signalfd in the child in one go. So looks like that instead of looping through the event sources and destroying them one by one here:

sid/src/resource/resource.c

Lines 305 to 306 in 42a8157

list_iterate_items_safe(es, tmp_es, &res->event_sources)
_destroy_event_source(es);

...we need to destroy the signalfd in one go for all the signal event sources.

Anyway, sd-event loop API doesn't allow to add/remove/do anything besides unref calls in any child process where the event loop got inherited from parent, so we might as well simply disallow that too and allow only the sid_resource_destroy and a few other functions that do not change any state of the event loop (various get functions if there are any).

We may support "resource refresh" for child processes which would reinstate the event loop (== destroy old one and create a new one), but that's probably not needed right now. So patching that sid_resource_destroy to properly cleanup everything is just enough for now I think (+not allowing addition/removal of event sources after fork).

@trgill
Copy link
Contributor

trgill commented Feb 13, 2021

FYI - I put this PR together. I'm not sure it is ready, but it works.

#61

I need to re-read the thread above to make sure I got it right.

@prajnoha
Copy link
Member Author

#61 merged, we opened #63 for one more enhancement in this area, but that is not essential to solve the issue described here. Thanks Ben and Todd for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

No branches or pull requests

3 participants