Nomad port collision issue with alloc from different groups #24904
Comments
Hello @valodzka!
This is a recurring but infrequent issue. The bug triggers when two ports randomly selected from Nomad's 12000 ports coincide. For a single port, the chance is 1/12000. With multiple ports per allocation, the probability increases significantly: with 1000 servers, 2 allocations per server, and 10 ports per allocation, it reaches ~50% per deployment (our deployment is smaller, though). The job I'm using is quite sensitive, so I can't share it, but this should be reproducible with any job having:
Hi @valodzka!
Yes, it happens during job deployment. Do I understand correctly that this is expected Nomad behavior:
As far as I understand, this is expected behaviour if the port is statically configured. Nomad prioritizes service availability, so the new allocation starts without stopping the old one, which ensures that at least one allocation is serving at all times. If the port is not static, randomly getting the same port twice should be a very uncommon scenario, so there is no mechanism in place to handle it.
While this is true for one port, the probability of a collision grows very rapidly as the number of ports increases (see the birthday problem for a similar case with unintuitive probability), and it also grows with the number of servers. However, if this is the "Nomad way" of handling it, I can think of a workaround, such as adding a startup pre-check script that verifies the port is free and delays the start if it isn't.

For anyone looking for a workaround: if you have a bash startup script, you can add something like this to it:

```bash
ensure_port_free() {
    local -r PORT="$1"
    local -r WAIT=3       # seconds to wait between checks
    local -r ATTEMPTS=6   # give up after this many checks
    local CHECK_OUT i
    for ((i=1; i<=ATTEMPTS; i++)); do
        # List listening TCP/UDP sockets bound to $PORT; empty output means the port is free.
        CHECK_OUT=$(ss --listening --tcp --udp --numeric --no-header "( sport = :$PORT )")
        if [[ ! "$CHECK_OUT" =~ ^[[:space:]]*$ ]]; then
            echo "Check $i/$ATTEMPTS: Port $PORT is in use. Waiting $WAIT seconds..."
            sleep "$WAIT"
        else
            return 0
        fi
    done
    echo "Port $PORT is still in use after $ATTEMPTS checks, aborting startup"
    exit 1
}
ensure_port_free "$NOMAD_PORT_port1"
ensure_port_free "$NOMAD_PORT_port2"
ensure_port_free "$NOMAD_PORT_port3"
...
```
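If a task has many port labels, the repeated `ensure_port_free` calls above could probably be replaced by a loop over the environment. The following is only a sketch: `check_all_nomad_ports` is a hypothetical helper name, and it assumes that every `NOMAD_PORT_*` variable visible to the task is a port the task itself binds, which may not hold for every network mode or job layout.

```bash
#!/usr/bin/env bash
# Sketch: run a port check for every NOMAD_PORT_<label> variable
# instead of listing each label by hand.
# check_all_nomad_ports is a hypothetical name; the checker defaults to
# the ensure_port_free function from the workaround above.
check_all_nomad_ports() {
    local -r checker="${1:-ensure_port_free}"
    local var
    # "${!NOMAD_PORT_@}" expands to the names of all shell variables
    # whose names start with NOMAD_PORT_ (nothing if there are none).
    for var in "${!NOMAD_PORT_@}"; do
        # ${!var} is indirect expansion: the value of the variable named by $var.
        "$checker" "${!var}" || return 1
    done
}
```

In the startup script this would be called as `check_all_nomad_ports` after `ensure_port_free` has been defined. Passing a different function name as the first argument swaps in another check, which also makes the loop easy to try out without touching real sockets.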
Nomad version
Nomad v1.8.4 (Build: 22ab32e, Date: 2024-09-17T20:18:34Z)
Environment
Debian GNU/Linux 12
Issue
Port collision occurring during deployment when Nomad allocates the same port (30943) to a new allocation while the previous allocation is still shutting down.
Reproduction steps
Detailed logs
nomad logs:
new alloc app logs:
old alloc app logs:
Expected Result
Nomad should prevent port collisions by ensuring that the previous allocation has fully released the port before allowing it to be reused.