Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zombie process leaks #4554

Closed
goonzoid opened this issue Jan 4, 2025 · 2 comments · Fixed by #5353
Closed

Zombie process leaks #4554

goonzoid opened this issue Jan 4, 2025 · 2 comments · Fixed by #5353
Labels
Milestone

Comments

@goonzoid
Copy link
Member

goonzoid commented Jan 4, 2025

When surfaces close, we sometimes fail to call wait/waitpid on the associated process, which results in it hanging around as a zombie for the lifetime of the Ghostty application. There is a related discussion which I had missed until making this issue at #3596.

I can reproduce this reliably on MacOS. Strangely, while we also fail to call waitpid on the subprocess on linux, it still gets reaped. I did some pretty deep digging trying to figure out what is causing this, but came up empty handed. Happy to discuss this aspect further and share what I tried if it becomes relevant.

Reproduction Steps

I've been able to reproduce this in both the GLFW and SwiftUI builds on MacOS.

  1. Launch Ghostty
  2. Create a new tab (so now you have two)
  3. Close the tab using Ctrl-D at the prompt, or by running exit (closing the tab directly does not trigger this)
  4. Take a look at the process tree. You should see a zombie process that is a direct child of the core ghostty process. On MacOS, this is the login(1) wrapper process. Depending on what you use to inspect the process tree, it may appear as <defunct>.

What's happening?

There are some longer explanations (largely correct, but perhaps a bit confused) in the edit history of this issue. Here's the short version after more debugging:

When the surface's subprocess exits independent of Ghostty, we null out the Command before calling stop on it, which means we never call waitpid. Simple as that.

The fix ended up being a bit more complicated, due to a quirk in the implementation of waitpid we were using in the Zig standard library, and because of the mysterious unknown force that's somehow causing the subprocess to be reaped already on linux. I'll discuss the details of the fix further in my PR to make reviewing easier.

@basile-henry
Copy link
Collaborator

So you don't think the process hanging around after Ctrl-D on Linux is related to this bug on MacOS?
There's more detail about the Linux version in this other discussion: #3495 (it's a shame they don't automatically cross link like issues do)

@mitchellh mitchellh added this to the 1.0.2 milestone Jan 4, 2025
@goonzoid
Copy link
Member Author

goonzoid commented Jan 5, 2025

So you don't think the process hanging around after Ctrl-D on Linux is related to this bug on MacOS? There's more detail about the Linux version in this other discussion: #3495 (it's a shame they don't automatically cross link like issues do)

It's hard to say for sure. Unfortunately I don't have a working Linux machine at the moment to test with. The symptoms seem similar, but some details about the logs mentioned in the discussion you link to makes me uncertain that it's the same thing.

You could try building from my fork and seeing that fixes it. Let me know what happens if so!

mitchellh added a commit that referenced this issue Jan 24, 2025
Fixes #4554

When a process exited on its own (we didn't kill it), we were not
calling waitpid. This commit adds a waitpid call in the event loop
process exit notification.

This happened because when an external exit happened we could call
`subprocess.externalExit` which tells our subprocess manager to NOT kill
the process (since its already dead). Unfortunately in this case it
means we also didn't call waitpid.
mitchellh added a commit that referenced this issue Jan 24, 2025
Fixes #4554

When a process exited on its own (we didn't kill it), we were not
calling waitpid. This commit adds a waitpid call in the event loop
process exit notification.

This happened because when an external exit happened we could call
`subprocess.externalExit` which tells our subprocess manager to NOT kill
the process (since its already dead). Unfortunately in this case it
means we also didn't call waitpid.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants