Event loop's epoll FD inherited from main process not closed in worker #85

prajnoha · 2021-02-22T12:20:59Z

Recently, we have resolved #33 which dealt with signal handling cleanup in worker after fork - that works fine now. But there's one more thing - even though we call the cleanup in the worker right at its initialization:

sid/src/resource/ubridge.c

Lines 3950 to 3951 in 8a5f96a

    
           /* destroy the rest */ 
        
           (void) sid_resource_unref(sid_resource_search(ubridge_internal_res, SID_RESOURCE_SEARCH_TOP, NULL, NULL));

...everything is cleaned up except the inherited epoll FD backing up the event loop in main process. Quick debug shows that we really call sid_resource_destroy with sd_event_unref inside which is called:

sid/src/resource/resource.c

Lines 325 to 326 in 8a5f96a

    
           if (res->event_loop.sd_event_loop) 
        
           	res->event_loop.sd_event_loop = sd_event_unref(res->event_loop.sd_event_loop);

But checking the /proc/<worker_process_pid>/fd, it shows the epoll FD is still there...

The text was updated successfully, but these errors were encountered:

prajnoha · 2021-03-01T20:19:31Z

sd_event_unref is trivial unref function. As such, it contains a check for p->n_ref > 0 and if true, exits without calling the free_func that is event_free which is supposed to close the epoll FD:

https://github.com/systemd/systemd/blob/cd18afec16bf179205a1d4fb371af8546d3e14f3/src/basic/macro.h#L435-L446

Looking at SID run, there are still 4 references left right before we call sd_event_unref in sid_resource_destroy, so this is the reason the epoll FD is not closed:

(gdb) bt
#0  sid_resource_destroy (res=0x405900) at resource.c:316
#1  0x00007ffff7fa990e in sid_resource_unref (res=0x405900) at resource.c:363
#2  0x00007ffff7fb4cb4 in _worker_init_fn (worker_res=0x437e20, arg=0x405f70) at ubridge.c:3951
#3  0x00007ffff7fb9665 in worker_control_get_new_worker (worker_control_res=0x406220, params=0x7fffffffdd00) at worker-control.c:749
#4  0x00007ffff7fb50f5 in _on_ubridge_interface_event (es=0x43f770, fd=11, revents=1, data=0x405f70) at ubridge.c:3984
#5  0x00007ffff7fa9cda in _sd_io_event_handler (sd_es=0x4242e0, fd=11, revents=1, data=0x43f770) at resource.c:432
#6  0x00007ffff7d858c9 in source_dispatch (s=0x4242e0) at ../src/libsystemd/sd-event/sd-event.c:3490
#7  0x00007ffff7d875a4 in sd_event_dispatch (e=0x4059e0) at ../src/libsystemd/sd-event/sd-event.c:4003
#8  0x00007ffff7d87a69 in sd_event_run (e=0x4059e0, timeout=18446744073709551615) at ../src/libsystemd/sd-event/sd-event.c:4064
#9  0x00007ffff7d87c46 in sd_event_loop (e=0x4059e0) at ../src/libsystemd/sd-event/sd-event.c:4085
#10 0x00007ffff7fab43b in sid_resource_run_event_loop (res=0x405900) at resource.c:981
#11 0x0000000000401468 in main (argc=3, argv=0x7fffffffe1c8) at sid.c:178

(gdb) p res->event_loop.sd_event_loop->n_ref
$7 = 4

prajnoha · 2021-03-01T20:55:35Z

The extra references are internal references held by various sd-event parts (e.g. the source_dispatch as seen in the backtrace above). The issue here is that we're creating a new worker out of a dispatch function while these internal references are still held...

trgill · 2021-06-29T14:11:05Z

related to issue #108

bmarzins · 2022-03-08T19:25:19Z

@trgill, @prajnoha, here's my current plan for cleaning this up.

in worker_control_get_new_worker():

save prams->id, worker_proxy_channels and worker_channels to the struct worker_control before the fork
after the fork for proxies and external workers: same as currently, plus cleaning up the saved values in struct worker_control
after the fork for internal workers: set up for parent death notifications as currently and then call sd_event_exit() which will exit the event loop in sid's main() function

in sid.c:main():

check the sid_resource_run_event_loop() return for -ECHILD, since this is the return code if the loop was started by a different pid, and call a new function for children.

In new funtion.

Find the worker-control resource and grab the worker_channels and id.
create the worker resource.
call the worker-control initialization callback, which destroys all the old resources, except the ones we still need.
call sid_resource_run_event_loop()

See any issues with this?

prajnoha · 2022-03-09T10:23:41Z

@trgill, @prajnoha, here's my current plan for cleaning this up.

in worker_control_get_new_worker():

save prams->id, worker_proxy_channels and worker_channels to the struct worker_control before the fork

after the fork for proxies and external workers: same as currently, plus cleaning up the saved values in struct worker_control

after the fork for internal workers: set up for parent death notifications as currently and then call sd_event_exit() which will exit the event loop in sid's main() function

Unfortunately, the sd_event_exit call will fail with -ECHILD in the child process:

https://github.com/systemd/systemd/blob/d15e1a29e3aab04ee79d5e3ec8e1e65fca78e165/src/libsystemd/sd-event/sd-event.c#L4302-L4312

It looks that the sd-event library wasn't really designed for (at least) cleanup in new process after fork and designed for threaded environment instead. Would be great if we could do the cleanup at least - right now, I don't see a technical obstacle in doing that (it would probably just close the FDs for the epoll inside etc? But I don't have the knowledge about the implementation details of it, of course...). So I'm wondering to what degree this limitation in sd-event lib is just artificial.

prajnoha · 2022-03-09T10:44:57Z

The epoll man page does mention some notes about fork, but they all seem to be non-problematic. The fork man page doesn't mention epoll at all, so I assume the FD representing the epoll handle itself would be duplicated as usual and then it could be closed as usual in the child process. But maybe I'm missing further details why sd-event doesn't allow the cleanup in the child process.

bmarzins · 2022-03-09T16:19:03Z

Unfortunately, the sd_event_exit call will fail with -ECHILD in the child process:

https://github.com/systemd/systemd/blob/d15e1a29e3aab04ee79d5e3ec8e1e65fca78e165/src/libsystemd/sd-event/sd-event.c#L4302-L4312

It looks that the sd-event library wasn't really designed for (at least) cleanup in new process after fork and designed for threaded environment instead. Would be great if we could do the cleanup at least - right now, I don't see a technical obstacle in doing that (it would probably just close the FDs for the epoll inside etc? But I don't have the knowledge about the implementation details of it, of course...). So I'm wondering to what degree this limitation in sd-event lib is just artificial.

Oh. Thanks for the heads-up. I'm going to play around to make sure, but it just looks like we can skip calling sd_event_exit(), and have everything else work like I described above

Here's the code from sd_event_loop(), which repeatedly calls sd_event_run() until it returns an error.

https://github.com/systemd/systemd/blob/d15e1a29e3aab04ee79d5e3ec8e1e65fca78e165/src/libsystemd/sd-event/sd-event.c#L4264-L4268

And here's the code from sd_event_run(), where it returns an error when called by a different process

https://github.com/systemd/systemd/blob/d15e1a29e3aab04ee79d5e3ec8e1e65fca78e165/src/libsystemd/sd-event/sd-event.c#L4206-L4213

So, the next time sd_event_loop() checks for an event, it will exit in the child process.

bmarzins · 2022-03-09T16:34:21Z

@prajnoha, I have another question. Do we really want to support external workers? I could see supporting external modules, where the worker process executed a binary. We could have some defined format for how the binary wrote it's changes, and the worker process could take those and send them back to udev and the database. But I don't see the benefit of having a separate binary to take on all the tasks of a worker. Did you have something specific in mind for external workers?

prajnoha · 2022-03-10T08:06:19Z

I'm going to play around to make sure, but it just looks like we can skip calling sd_event_exit(), and have everything else work like I described above

...yes, we can give it a try, looks promising. Maybe the only issue could be that if we try to destroy a resource in the child (the resource cleanup we do in _worker_init_fn callback), this may cause certain parts to fail, like trying to stop/destroy registered event sources which are part of some resources - so maybe we'll need to check for -ECHILD in more places.

Now, in the worker/child, if we know which FDs belong to the channels we need for worker <--> main communication, we can probably try to close all the remaining FDs there, which would also include the inherited/duplicated FD for the sd-event loop's epoll handle and other related FDs which got duplicated.

prajnoha · 2022-03-10T08:18:02Z

@prajnoha, I have another question. Do we really want to support external workers? I could see supporting external modules, where the worker process executed a binary. We could have some defined format for how the binary wrote it's changes, and the worker process could take those and send them back to udev and the database. But I don't see the benefit of having a separate binary to take on all the tasks of a worker. Did you have something specific in mind for external workers?

I meant to generalize the "worker" to include execution of both internal and external (external binaries) code. Then on the main "control" side, the tracking has the same interface:

communication channel for external "workers" is about redirecting the output from external binary to the control side, but the interface remains the same as we'd do with the internal workers;
we have a summary of how many external workers are currently running, tracking their pids and progression the same way as we do with the internal workers;
and providing this same interface for modules to use - when they need to execute an external binary so they just create their own worker-control instance and define the channels to grab the output (or send input) and then they call out to worker-control instance to execute the binary.

In the future, I'd like to add timeouts for both internal and external workers with actions on what should be done when the timeout is reached. Also, adding follow-up actions - e.g. when execution of one worker stops, we can make another worker to start in sequence (or start an associated action). So having the same interface for controlling both internal and external workers simply makes the interface uniform so it's easier to add common functionality later (like those timeouts, follow-up actions etc.).

bmarzins mentioned this issue Mar 16, 2022

Worker cleanup #124

Merged

prajnoha closed this as completed in #124 Mar 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Event loop's epoll FD inherited from main process not closed in worker #85

Event loop's epoll FD inherited from main process not closed in worker #85

prajnoha commented Feb 22, 2021

prajnoha commented Mar 1, 2021

prajnoha commented Mar 1, 2021

trgill commented Jun 29, 2021

bmarzins commented Mar 8, 2022

prajnoha commented Mar 9, 2022

prajnoha commented Mar 9, 2022

bmarzins commented Mar 9, 2022 •

edited

Loading

bmarzins commented Mar 9, 2022

prajnoha commented Mar 10, 2022

prajnoha commented Mar 10, 2022 •

edited

Loading

Event loop's epoll FD inherited from main process not closed in worker #85

Event loop's epoll FD inherited from main process not closed in worker #85

Comments

prajnoha commented Feb 22, 2021

prajnoha commented Mar 1, 2021

prajnoha commented Mar 1, 2021

trgill commented Jun 29, 2021

bmarzins commented Mar 8, 2022

prajnoha commented Mar 9, 2022

prajnoha commented Mar 9, 2022

bmarzins commented Mar 9, 2022 • edited Loading

bmarzins commented Mar 9, 2022

prajnoha commented Mar 10, 2022

prajnoha commented Mar 10, 2022 • edited Loading

bmarzins commented Mar 9, 2022 •

edited

Loading

prajnoha commented Mar 10, 2022 •

edited

Loading