[on hold, waiting for gunicorn release] Review gunicorn/OS load-balancing and --reuse-port option #1937

Open
tcompa opened this issue Oct 17, 2024 · 3 comments

@mfranzon
Collaborator

Takeaways from a deeper review:

We tested a single trivial endpoint (/api/alive/) that returns the PID of the process handling the request, which corresponds to a gunicorn worker.

As illustrated in the issues linked in the previous message, gunicorn does NOT perform any load balancing itself; it delegates that responsibility to the operating system.

Testing 5000 calls with 12 workers on a local PC (Ubuntu 22), we observed that:

  • the distribution of calls across workers is not homogeneous: some workers handle a much greater load than others
PID Statistics:
PID: 44740, Count: 1058
PID: 44736, Count: 616
PID: 44739, Count: 870
PID: 44737, Count: 521
PID: 44732, Count: 337
PID: 44726, Count: 154
PID: 44741, Count: 412
PID: 44725, Count: 12
PID: 44734, Count: 9
PID: 44733, Count: 6
PID: 44724, Count: 3
PID: 44735, Count: 2
  • using the fix-branch-reuse-port branch, which improves the use of the SO_REUSEPORT socket option, leads to a much more even distribution of calls across workers
PID Statistics:
PID: 168883, Count: 401
PID: 168881, Count: 447
PID: 168888, Count: 421
PID: 168889, Count: 426
PID: 168887, Count: 426
PID: 168882, Count: 421
PID: 168886, Count: 399
PID: 168884, Count: 431
PID: 168880, Count: 448
PID: 168885, Count: 394
PID: 168890, Count: 398
PID: 168927, Count: 388
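
For reference, a minimal sketch of how such a tally can be collected. This is not the script used for the numbers above; `tally_pids`, `format_stats` and `fetch_pid_http` are hypothetical names, and the response shape (a JSON object with a `pid` key) and the URL are assumptions.

```python
import json
import urllib.request
from collections import Counter
from typing import Callable


def tally_pids(fetch_pid: Callable[[], int], n_calls: int) -> Counter:
    """Call `fetch_pid` n_calls times and count how often each PID answers."""
    return Counter(fetch_pid() for _ in range(n_calls))


def format_stats(counts: Counter) -> str:
    """Render the counts in the same layout as the statistics above."""
    lines = ["PID Statistics:"]
    lines += [f"PID: {pid}, Count: {count}" for pid, count in counts.items()]
    return "\n".join(lines)


def fetch_pid_http(url: str = "http://localhost:8000/api/alive/") -> int:
    """Hypothetical fetcher: assumes the endpoint returns JSON like {"pid": 1234}."""
    # urlopen opens a new connection per call, so each request goes
    # through a fresh accept() on the server side.
    with urllib.request.urlopen(url) as response:
        return int(json.load(response)["pid"])
```

With a server running, `print(format_stats(tally_pids(fetch_pid_http, 5000)))` would produce a listing like the ones above.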

Further considerations:

  • the scheduling algorithms can vary between different hosts and different operating systems
  • for small workloads (<100 calls), we see no appreciable difference in load distribution between vanilla gunicorn and gunicorn with the fix

No Fix

PID Statistics:
PID: 124309, Count: 3
PID: 124305, Count: 9
PID: 124306, Count: 14
PID: 124304, Count: 9
PID: 124311, Count: 9
PID: 124314, Count: 16
PID: 124312, Count: 8
PID: 124308, Count: 8
PID: 124310, Count: 7
PID: 124313, Count: 10
PID: 124307, Count: 5
PID: 124303, Count: 2

With Fix

PID Statistics:
PID: 168880, Count: 14
PID: 168887, Count: 12
PID: 168882, Count: 5
PID: 168885, Count: 8
PID: 168888, Count: 10
PID: 168927, Count: 8
PID: 168890, Count: 6
PID: 168884, Count: 11
PID: 168889, Count: 6
PID: 168881, Count: 8
PID: 168883, Count: 6
PID: 168886, Count: 6
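
One simple way to put a number on "how uneven" each distribution is: the ratio between the busiest worker's count and the mean count (1.0 means perfectly even). The metric is our choice for illustration, not something gunicorn reports; the counts are copied from the statistics above.

```python
def max_to_mean(counts: list[int]) -> float:
    """Ratio of the busiest worker's count to the mean count (1.0 = perfectly even)."""
    return max(counts) / (sum(counts) / len(counts))


# Counts copied from the PID statistics in this comment.
large_no_fix = [1058, 616, 870, 521, 337, 154, 412, 12, 9, 6, 3, 2]
large_with_fix = [401, 447, 421, 426, 426, 421, 399, 431, 448, 394, 398, 388]
small_no_fix = [3, 9, 14, 9, 9, 16, 8, 8, 7, 10, 5, 2]
small_with_fix = [14, 12, 5, 8, 10, 8, 6, 11, 6, 8, 6, 6]

for label, counts in [
    ("large, no fix", large_no_fix),
    ("large, with fix", large_with_fix),
    ("small, no fix", small_no_fix),
    ("small, with fix", small_with_fix),
]:
    print(f"{label}: {max_to_mean(counts):.2f}")
```

For the large runs this gives roughly 3.2 without the fix and 1.1 with it; for the small runs the two values are comparable (roughly 1.9 and 1.7).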

@mfranzon
Collaborator

More on this (@tcompa):

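In the current state (no patch), all workers share the single listening socket created by the master (note the identical DEVICE value in the first lsof output below). In this situation, the OS hands incoming connections to the different workers with non-homogeneous frequencies. With the gunicorn patch (see previous comment) and the SO_REUSEPORT option, each of the N workers has its own listening socket, all bound to the same port (note the distinct DEVICE values in the second output). In this situation, the OS distributes connections across the sockets in a seemingly random, and therefore roughly homogeneous, manner.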

Example current state:

$ lsof -i | grep 8000
gunicorn  76496 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)   # master gunicorn
gunicorn  76500 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)
gunicorn  76501 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)
gunicorn  76502 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)
gunicorn  76503 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)

Example with patch and --reuse-port:

$ lsof -i | grep 8000
gunicorn  75823 tommaso    6u  IPv4 463247      0t0  TCP localhost:8000 (LISTEN)
gunicorn  75824 tommaso    5u  IPv4 459620      0t0  TCP localhost:8000 (LISTEN)
gunicorn  75825 tommaso    5u  IPv4 464099      0t0  TCP localhost:8000 (LISTEN)
gunicorn  75827 tommaso    5u  IPv4 457392      0t0  TCP localhost:8000 (LISTEN)
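
A minimal sketch of the socket-level behaviour behind the second output. It only illustrates the SO_REUSEPORT option itself and is not gunicorn's actual code; `make_listener` is a hypothetical helper, and the example assumes a platform where `socket.SO_REUSEPORT` is available (e.g. Linux).

```python
import socket


def make_listener(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Create a separate listening socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket sharing the port must set the option before binding.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


# Port 0 lets the OS pick a free port; the second socket then binds to the same one.
first = make_listener(0)
port = first.getsockname()[1]
second = make_listener(port)

print(f"two distinct listening sockets on port {port}")
```

Without the option, the second `bind()` would fail because the address is already in use. With it, the kernel decides which of the listening sockets receives each incoming connection.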

@tcompa
Collaborator Author

tcompa commented Oct 21, 2024

Current TLDR:

  • We traced the worker-selection logic down to the point where it is handed over to the OS, and we stopped there.
  • We confirm that using --reuse-port together with the gunicorn patch from benoitc/gunicorn#2938 ("Fix reuse-port to balance requests across Gunicorn workers") leads to a more even distribution of requests across workers, even for a small number of requests.
  • For the typical use case, the distribution of requests across workers is relevant to optimize resource usage (e.g. CPU)
  • For our use case, the actual reason why we want to distribute requests better is to mitigate a general issue with our background operations (namely that all background operations stemming from the same worker will share some single-use resources, notably the SSH-connection object).
  • We are not planning to rely on an unreleased gunicorn patch, but we'll keep an eye on the PR - hoping it moves forward and gets merged upstream.

@tcompa tcompa changed the title Review gunicorn/OS load-balancing [on hold, waiting for gunicorn release] Review gunicorn/OS load-balancing Nov 19, 2024
@tcompa tcompa changed the title [on hold, waiting for gunicorn release] Review gunicorn/OS load-balancing [on hold, waiting for gunicorn release] Review gunicorn/OS load-balancing and --reuse-port option Nov 28, 2024