-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add --listen-backlog option in munged #139
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a sensible option, and the code looks good. There's just a few minor changes I'd like to see. I'll run my tests tomorrow, but I don't expect I'll have any more changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks!
How are you triggering the problem, and what happens when it triggers? At what scale is this happening? I want to make sure I understand the problem you're solving with this because I observed some unexpected behavior in my testing. During my tests, I was able to overrun the socket queue by setting a very small listen backlog and encoding with a large number of threads using But what I was expecting to happen was for the while-loop in I've pushed a commit to this branch that now retries the |
Hi @dun thanks a lot for your testing. The issue what I observed is the application Another thing is I double-checked the
So I'm wondering the reason why the retry didn't work is that the UNIX domain socket does not return Please let me know if anything is unclear. Thank you again for the testing. |
Yes, the reason the retry logic didn't work here was because When that retry code was originally written, the maximum backlog on Linux was at most 128. But I have some more portability testing to do yet, but I should be merging this in the next few days. And I'm hoping to get a release out with these fixes soon afterwards. |
Thank you @dun ! I'm glad we addressed this issue. |
Add the "--listen-backlog" command-line option to specify the socket's listen backlog limit. On a very busy cluster, slurmctld failed to connect to munged to decode a credential. The returned error message was "Resource temporarily unavailable" (EAGAIN). Increasing the munged listen backlog prevented this error from reoccurring. PR dun#139 Signed-off-by: Chris Dunlap <[email protected]>
Change the behavior of the "--listen-backlog" command-line option in order for a value of 0 to use the software default (MUNGE_SOCKET_BACKLOG), and a value of -1 to specify SOMAXCONN. Note that SOMAXCONN is now 4096 on current Linux systems; it was 128 before Linux 5.4 which meant that the munged socket had been using the maximum backlog up until that point. Reference: - https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html Tested: - AlmaLinux 9.3, 8.9 - Arch Linux - CentOS Linux Stream 9, Stream 8, 7.9.2009, 6.10 - Debian sid, 12.5, 11.9, 10.13, 9.13, 8.11, 7.11, 6.0.10, 5.0.10, 4.0 - Fedora 39, 38, 37 - FreeBSD 13.2 - NetBSD 9.3 - OpenBSD 7.4, 7.3 - openSUSE 15.5, 15.4 - Ubuntu 23.10, 22.04.4, 20.04.6, 18.04.6, 16.04.7, 14.04.6 PR dun#139
3b0e5c3
to
0689a1f
Compare
libmunge retries transient errors when connecting to munged. This should handle errors arising from the listening socket's queue being full. However, PR #139 uncovered a bug that did not handle EAGAIN which is returned on Linux when a nonblocking UNIX domain socket connection cannot be completed immediately. This commit fixes the while-loop in _m_msg_client_connect() that retries connect() so both EAGAIN (Linux) and ECONNREFUSED (BSD) are handled as transient errors that should be retried. This was tested by setting "--listen-backlog=1", running munged with the default 2 work threads, and running remunge with 64 threads. First, the while-loop was altered so connect() errors would not be retried. remunge could reproduce EAGAIN on Linux, and this behavior was dramatically more reproducible if vcpu > 1. remunge could reproduce ECONNREFUSED on NetBSD 9.3 with vcpu=1, and on FreeBSD 14.0 with vcpu=2. Adding back the retry logic for EAGAIN and ECONNREFUSED made this connect() failure difficult to reproduce even with "--listen-backlog=1". Tested: - AlmaLinux 9.3, 8.9 - Arch Linux - CentOS Linux Stream 9, Stream 8, 7.9.2009, 6.10 - Debian sid, 12.5, 11.9, 10.13, 9.13, 8.11, 7.11, 6.0.10, 5.0.10, 4.0 - Fedora 39, 38, 37 - FreeBSD 14.0, 13.3, 13.2 - NetBSD 9.3 - OpenBSD 7.4, 7.3 - openSUSE 15.5, 15.4 - Ubuntu 23.10, 22.04.4, 20.04.6, 18.04.6, 16.04.7, 14.04.6
@meow-watermelon Thanks again for your PR! |
Thanks @dun for the improvements and thorough testings on this change! |
This small PR adds an option
--listen-backlog
for themunged
service if the user wants to override the default domain socket backlog value.If no option is passed, then
munged
would use the default value thatMUNGE_SOCKET_BACKLOG
is defined.Thanks.