[on hold, waiting for gunicorn release] Review gunicorn/OS load-balancing and `--reuse-port` option #1937

tcompa · 2024-10-17T09:49:39Z

For the moment this is just a placeholder with relevant links:

mfranzon · 2024-10-18T12:09:41Z

Takeaways from a deeper review:

We tested a single trivial endpoint (/api/alive/) that returns the PID of the process handling the request, i.e. the PID of a gunicorn worker.

As illustrated in the issues linked in the previous message, gunicorn does NOT perform any load balancing itself: it leaves the distribution of incoming connections across workers to the operating system.

Testing 5000 calls with 12 workers on a local PC (Ubuntu 22), we observed that:

the distribution of calls is not homogeneous: some workers receive a much greater load than others

PID Statistics:
PID: 44740, Count: 1058
PID: 44736, Count: 616
PID: 44739, Count: 870
PID: 44737, Count: 521
PID: 44732, Count: 337
PID: 44726, Count: 154
PID: 44741, Count: 412
PID: 44725, Count: 12
PID: 44734, Count: 9
PID: 44733, Count: 6
PID: 44724, Count: 3
PID: 44735, Count: 2

using fix-branch-reuse-port, which improves the use of the SO_REUSEPORT socket option, leads to a much more even distribution of calls across the workers

PID Statistics:
PID: 168883, Count: 401
PID: 168881, Count: 447
PID: 168888, Count: 421
PID: 168889, Count: 426
PID: 168887, Count: 426
PID: 168882, Count: 421
PID: 168886, Count: 399
PID: 168884, Count: 431
PID: 168880, Count: 448
PID: 168885, Count: 394
PID: 168890, Count: 398
PID: 168927, Count: 388
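
For reference, a measurement like the one above could be scripted roughly as follows. This is a minimal sketch, not the script that produced these numbers: `fetch_pid_http`, `tally_pids` and `format_stats` are hypothetical names, the URL is an assumption, and the response body is assumed to identify the worker PID (the exact response format of /api/alive/ is not shown in this thread). Each `urlopen` call opens a new connection, which matters because the OS distributes connections, not individual requests on a kept-alive connection.

```python
from collections import Counter
from typing import Callable
from urllib.request import urlopen


def fetch_pid_http(url: str = "http://localhost:8000/api/alive/") -> str:
    """Return the response body of one call (assumed to identify the worker PID)."""
    # Assumption: the body can be used as-is as a per-worker key.
    with urlopen(url) as response:
        return response.read().decode().strip()


def tally_pids(fetch: Callable[[], str], n_calls: int) -> Counter:
    """Call `fetch` n_calls times and count how often each PID answers."""
    return Counter(fetch() for _ in range(n_calls))


def format_stats(counts: Counter) -> str:
    """Render the tally in the same layout as the statistics in this thread."""
    lines = ["PID Statistics:"]
    lines += [f"PID: {pid}, Count: {count}" for pid, count in counts.items()]
    return "\n".join(lines)
```

With a server running, `print(format_stats(tally_pids(fetch_pid_http, 5000)))` would print one line per worker. Passing the fetcher as an argument keeps the counting logic testable without a live server.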

Further considerations must be made:

the scheduling algorithms can vary between different hosts and different operating systems
for small workloads (<100 calls), no appreciable difference in load distribution is observed between vanilla gunicorn and gunicorn with the fix

No Fix

PID Statistics:
PID: 124309, Count: 3
PID: 124305, Count: 9
PID: 124306, Count: 14
PID: 124304, Count: 9
PID: 124311, Count: 9
PID: 124314, Count: 16
PID: 124312, Count: 8
PID: 124308, Count: 8
PID: 124310, Count: 7
PID: 124313, Count: 10
PID: 124307, Count: 5
PID: 124303, Count: 2

With Fix

PID Statistics:
PID: 168880, Count: 14
PID: 168887, Count: 12
PID: 168882, Count: 5
PID: 168885, Count: 8
PID: 168888, Count: 10
PID: 168927, Count: 8
PID: 168890, Count: 6
PID: 168884, Count: 11
PID: 168889, Count: 6
PID: 168881, Count: 8
PID: 168883, Count: 6
PID: 168886, Count: 6
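
One way to put a number on "homogeneous" is the coefficient of variation of the per-worker counts (standard deviation divided by mean). The sketch below applies it to the four sets of counts reported above. The choice of metric and the name `coefficient_of_variation` are ours, not part of the original analysis.

```python
from statistics import mean, pstdev

# Request counts per worker, copied from the statistics in this comment.
LARGE_NO_FIX = [1058, 616, 870, 521, 337, 154, 412, 12, 9, 6, 3, 2]
LARGE_WITH_FIX = [401, 447, 421, 426, 426, 421, 399, 431, 448, 394, 398, 388]
SMALL_NO_FIX = [3, 9, 14, 9, 9, 16, 8, 8, 7, 10, 5, 2]
SMALL_WITH_FIX = [14, 12, 5, 8, 10, 8, 6, 11, 6, 8, 6, 6]


def coefficient_of_variation(counts):
    """Population standard deviation over the mean; 0 means perfectly even."""
    return pstdev(counts) / mean(counts)


for label, counts in [
    ("large, no fix", LARGE_NO_FIX),
    ("large, with fix", LARGE_WITH_FIX),
    ("small, no fix", SMALL_NO_FIX),
    ("small, with fix", SMALL_WITH_FIX),
]:
    print(f"{label}: {coefficient_of_variation(counts):.2f}")
```

On these numbers the coefficient is above 1 for the unpatched large run and below 0.1 for the patched one, while the two small runs are much closer to each other.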

mfranzon · 2024-10-18T14:06:14Z

More on this (@tcompa):

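In the current state (no patch), all workers listen on one and the same socket, inherited from the master process (the same socket identifier, 466962, appears on every line of the first lsof output below). In this situation, the OS hands connections to the different workers with non-homogeneous frequencies. With the gunicorn patch (see previous comment) and the --reuse-port option, the N workers have N different sockets, all bound to the same port (different socket identifiers, same localhost:8000). In this situation, the OS distributes connections across the sockets in a seemingly random, and therefore roughly homogeneous, manner.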

Example current state:

$ lsof -i | grep 8000
gunicorn  76496 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)   # master gunicorn
gunicorn  76500 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)
gunicorn  76501 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)
gunicorn  76502 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)
gunicorn  76503 tommaso    5u  IPv4 466962      0t0  TCP localhost:8000 (LISTEN)

Example with patch and --reuse-port

$ lsof -i | grep 8000
gunicorn  75823 tommaso    6u  IPv4 463247      0t0  TCP localhost:8000 (LISTEN)
gunicorn  75824 tommaso    5u  IPv4 459620      0t0  TCP localhost:8000 (LISTEN)
gunicorn  75825 tommaso    5u  IPv4 464099      0t0  TCP localhost:8000 (LISTEN)
gunicorn  75827 tommaso    5u  IPv4 457392      0t0  TCP localhost:8000 (LISTEN)
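
To illustrate the second situation outside gunicorn, the sketch below opens two independent listening sockets on the same port using SO_REUSEPORT. It is not gunicorn's code: `open_reuseport_listener` is a hypothetical helper, and a platform that exposes `socket.SO_REUSEPORT` (e.g. Linux) is assumed.

```python
import socket


def open_reuseport_listener(host: str, port: int) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set on every socket sharing the port, before bind().
    # Assumption: the platform defines socket.SO_REUSEPORT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


if __name__ == "__main__":
    first = open_reuseport_listener("127.0.0.1", 0)  # port 0: let the OS pick
    port = first.getsockname()[1]
    second = open_reuseport_listener("127.0.0.1", port)  # same port, new socket
    print(first.getsockname(), second.getsockname())
    first.close()
    second.close()
```

Without the option, the second `bind()` would normally fail with "Address already in use". With it, each socket has its own accept queue, and the kernel chooses which one receives each new connection.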

tcompa · 2024-10-21T07:18:17Z

Current TLDR:

We tracked the worker-selection logic all the way down to the point where it is handled by the OS, and we stopped there.
We confirm that using --reuse-port together with the gunicorn patch from benoitc/gunicorn#2938 ("Fix reuse-port to balance requests across Gunicorn workers") leads to a more even distribution of requests across workers, even for a small number of requests.
For the typical use case, the distribution of requests across workers is relevant to optimize resource usage (e.g. CPU).
For our use case, the actual reason why we want to distribute requests better is to mitigate a general issue with our background operations (namely that all background operations stemming from the same worker will share some single-use resources, notably the SSH-connection object).
We are not planning to rely on an unreleased gunicorn patch, but we'll keep an eye on the PR - hoping it moves forward and gets merged upstream.

jluethi added this to Fractal Project Management Oct 17, 2024

github-project-automation bot moved this to TODO in Fractal Project Management Oct 17, 2024

tcompa assigned mfranzon Oct 18, 2024

tcompa changed the title ~~Review gunicorn/OS load-balancing~~ [on hold, waiting for gunicorn release] Review gunicorn/OS load-balancing Nov 19, 2024

tcompa changed the title ~~[on hold, waiting for gunicorn release] Review gunicorn/OS load-balancing~~ [on hold, waiting for gunicorn release] Review gunicorn/OS load-balancing and --reuse-port option Nov 28, 2024
