BUG/MEDIUM: queue: always dequeue the backend when redistributing the last server

An interesting bug was revealed by commit 5541d49 ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()"). When shutting
down a server to redistribute its connections, no check is made on the
backend's queue. If we're turning off the last server while the backend
has pending connections, these will wait there until the queue timeout
expires. But worse, since the commit above, we can enter an endless
loop in the following situation:

  - streams are present in the backend's queue
  - streams are purged on the last server via srv_shutdown_streams()
  - that one calls pendconn_redistribute(srv), which does not purge
    the backend's pendconns
  - a stream performs some load balancing and enters
    assign_server_and_queue()
  - assign_server() is called in turn
  - the LB algo is non-deterministic and there are entries in the
    backend's queue; the function notices it and returns
    SRV_STATUS_FULL
  - assign_server_and_queue() calls pendconn_add() to add the
    connection to the backend's queue
  - on return, pendconn_must_try_again() is called; it figures that no
    stream is served anymore on the server nor the proxy, so it
    removes the pendconn from the queue and returns 1
  - assign_server_and_queue() loops back to the beginning to try
    again, while the conditions have not changed, resulting in an
    endless loop (a minimal sketch of this loop is appended at the
    end of this message)

Ideally a change count should be used in the queues so that it's
possible to detect that some dequeuing happened and/or that a last
stream has left. But that wouldn't completely solve the problem, which
is that we must never ever add to a queue when there are no server
streams left to dequeue the new entries.

The current solution consists in making pendconn_redistribute() take
care of the proxy after the server, in case no server remains
available on the proxy (a simplified sketch of this is also appended
below). It at least ensures that no pending streams are left in the
backend's queue when shutting streams down or when the last server
goes down. The try_again loop remains necessary to deal with
inevitable races during pendconn additions. It could be limited to a
few rounds, though it should never trigger when the conditions are
sufficient to let it converge.

One way to reproduce the issue is to run a config with a single server
with "maxconn 1" and plenty of threads, then run in loops series of:

  "disable server px/s; shutdown sessions server px/s;
   wait 100ms server-removable px/s; show servers conn px;
   enable server px/s"

on the CLI at ~10/s while injecting around 40 concurrent connections
at 40-100k RPS. In this case the crash can appear within 10s to 1mn,
with a backtrace like this one for at least one thread:

  #0  pendconn_add (strm=strm@entry=0x17f2ce0) at src/queue.c:487
  #1  0x000000000064797d in assign_server_and_queue (s=s@entry=0x17f2ce0) at src/backend.c:1064
  #2  0x000000000064a928 in srv_redispatch_connect (s=s@entry=0x17f2ce0) at src/backend.c:1962
  #3  0x000000000064ac54 in back_handle_st_req (s=s@entry=0x17f2ce0) at src/backend.c:2287
  #4  0x00000000005ae1d5 in process_stream (t=t@entry=0x17f4ab0, context=0x17f2ce0, state=<optimized out>) at src/stream.c:2336

It's worth noting that other threads may often appear waiting after
the poller, and one in server_atomic_sync() waiting for isolation,
because the event processed when shutting the server down is consumed
under isolation. Having fewer threads available to dequeue remaining
requests increases the probability of triggering the problem, though
it is not at all necessary (some less common traces never show them).
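
For illustration, below is a minimal, self-contained sketch of the
endless loop described above. Every type and function here is a
simplified stand-in, not haproxy's actual code or API:

  #include <stdbool.h>
  #include <stdio.h>

  enum srv_status { SRV_STATUS_OK, SRV_STATUS_FULL };

  struct state {
      int backend_queued;   /* entries in the backend's queue */
      int srv_served;       /* streams served by the server */
      int px_served;        /* streams served proxy-wide */
  };

  /* stand-in for assign_server(): a non-deterministic LB algo refuses
   * to pick a server while entries sit in the backend's queue */
  static enum srv_status assign_server(const struct state *st)
  {
      return st->backend_queued ? SRV_STATUS_FULL : SRV_STATUS_OK;
  }

  /* stand-in for pendconn_must_try_again(): no stream is served on
   * the server nor the proxy anymore, so it always dequeues the new
   * pendconn and asks the caller to retry */
  static bool pendconn_must_try_again(const struct state *st)
  {
      return !st->srv_served && !st->px_served;
  }

  int main(void)
  {
      struct state st = { .backend_queued = 1 };
      int rounds = 0;

      /* shape of assign_server_and_queue()'s retry loop: in this
       * state it would never terminate, so it is capped here purely
       * for demonstration */
      while (++rounds < 1000000) {
          if (assign_server(&st) != SRV_STATUS_FULL)
              break;
          st.backend_queued++;            /* pendconn_add() */
          if (pendconn_must_try_again(&st)) {
              st.backend_queued--;        /* pendconn removed again */
              continue;                   /* retry: nothing changed */
          }
          break;
      }
      printf("gave up after %d identical rounds\n", rounds);
      return 0;
  }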
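
The change-count idea mentioned above (not implemented by this commit)
could look roughly like this, with purely hypothetical names: an
atomic counter bumped on each dequeue lets the retry loop check
whether anything actually changed between two attempts instead of
spinning blindly:

  #include <stdatomic.h>
  #include <stdbool.h>

  struct queue_ctr {
      atomic_uint changes;   /* bumped on each dequeue event */
  };

  /* to be called wherever an entry leaves the queue */
  static void queue_note_dequeue(struct queue_ctr *q)
  {
      atomic_fetch_add_explicit(&q->changes, 1, memory_order_relaxed);
  }

  /* a retry is only worth it if some dequeuing happened since the
   * snapshot <snap> was taken */
  static bool queue_changed_since(struct queue_ctr *q, unsigned snap)
  {
      return atomic_load_explicit(&q->changes,
                                  memory_order_relaxed) != snap;
  }

  int main(void)
  {
      struct queue_ctr q = { 0 };
      unsigned snap = atomic_load(&q.changes);

      queue_note_dequeue(&q);

      /* only retry when a dequeue event was observed in between */
      return queue_changed_since(&q, snap) ? 0 : 1;
  }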
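
And a rough sketch of the shape of the fix itself, again using made-up
minimal types rather than haproxy's real structures: after the
server's own queue is redistributed, the backend's queue is purged as
well when no usable server remains on the proxy:

  #include <stdio.h>

  struct queue  { int length; };
  struct proxy  { struct queue queue; int usable_servers; };
  struct server { struct queue queue; struct proxy *px; };

  /* stand-in: wake every pending stream so it gets redispatched or
   * aborted instead of waiting for the queue timeout */
  static void purge_queue(struct queue *q)
  {
      q->length = 0;
  }

  /* simplified pendconn_redistribute(): step 2 is the new part */
  static void pendconn_redistribute(struct server *srv)
  {
      /* 1. empty the server's own queue, as before */
      purge_queue(&srv->queue);

      /* 2. new: no usable server is left on the proxy, so also empty
       *    the backend's queue, otherwise its entries could never be
       *    dequeued and would feed the endless loop described above */
      if (!srv->px->usable_servers)
          purge_queue(&srv->px->queue);
  }

  int main(void)
  {
      struct proxy px = { .queue = { 3 }, .usable_servers = 0 };
      struct server s = { .queue = { 1 }, .px = &px };

      pendconn_redistribute(&s);
      printf("server queue: %d, backend queue: %d\n",
             s.queue.length, px.queue.length);
      return 0;
  }
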
This should carefully be backported wherever the commit above was backported.