Load balancer health checks problematic hosts (#1709)
Motivation:

The current implementation of `RoundRobinLoadBalancer` cycles through all addresses that the `ServiceDiscoverer` provides and opens connections regardless of the behavior of the individual hosts behind those addresses. No passive health checking is performed, and no feedback from connection establishment is provided to the RRLB, so it cannot make any smart decisions about how hosts are chosen for connections or for directing requests. The purpose of RRLB is to do exactly one thing: cycle through hosts and direct traffic, attempting to distribute the load fairly under this assumption.
However, there are occasions when a particular `ServiceDiscoverer` (e.g. DNS-based) doesn't provide up-to-date health information about hosts. Meanwhile, some addresses might not be responding but are still considered active from the perspective of the discovery mechanism. Such addresses lead to unsuccessful connection establishment attempts and introduce unnecessary latency in the request path.
In this PR, a mechanism for detecting such failures is introduced. Hosts that the RRLB consecutively fails to establish connections with are taken out of the selection process until a connection is established. A background task tries, at specified intervals, to connect to the given host. Upon success, the connection can be used for routing traffic and the host comes back to the pool and takes part in the selection. The mechanism described here is a specific type of health checking and can possibly be improved in the future to be more tunable. Currently, the user controls the interval at which the health checks are performed, the consecutive failures count for a host to be considered unhealthy, and the background `io.servicetalk.concurrent.api.Executor` for running the checks.

Modifications:

- Consecutive failed connection attempts to ACTIVE hosts are counted in the RRLB's internal Host state,
- After a threshold is met, a background task is scheduled which will attempt a connection at a specified interval,
- Meanwhile, the particular address is not considered for directing traffic and opening connections,
- Whenever the background task successfully establishes a connection, that connection is used for directing requests and the host returns to the list of hosts eligible for selection in the request path,
- `RoundRobinLoadBalancerFactory.Builder` was enhanced to incorporate this mechanism.
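
The steps above can be illustrated with a minimal, self-contained sketch. It is not the ServiceTalk implementation: the class `HealthCheckingRoundRobin`, its `Host` state, and the `connector` predicate (standing in for a connection attempt) are all hypothetical names chosen for illustration, and the sketch assumes that a failed attempt in the request path simply moves on to the next host.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of failure counting plus background health checks (not ServiceTalk code). */
final class HealthCheckingRoundRobin {
    /** Per-host state: consecutive failed connection attempts and the unhealthy flag. */
    private static final class Host {
        final String address;
        int consecutiveFailures;
        boolean unhealthy;

        Host(String address) {
            this.address = address;
        }
    }

    private final List<Host> hosts = new ArrayList<>();
    private final int failureThreshold;
    // Assumption: returns true when a connection to the address can be established.
    private final Predicate<String> connector;
    private int index;

    HealthCheckingRoundRobin(List<String> addresses, int failureThreshold, Predicate<String> connector) {
        for (String address : addresses) {
            hosts.add(new Host(address));
        }
        this.failureThreshold = failureThreshold;
        this.connector = connector;
    }

    /** Request path: returns the address of the next host that accepts a connection, or null. */
    synchronized String selectAndConnect() {
        for (int i = 0; i < hosts.size(); i++) {
            Host host = hosts.get(index);
            index = (index + 1) % hosts.size();
            if (host.unhealthy) {
                continue; // excluded from selection until a health check succeeds
            }
            if (connector.test(host.address)) {
                host.consecutiveFailures = 0; // any success resets the counter
                return host.address;
            }
            if (++host.consecutiveFailures >= failureThreshold) {
                host.unhealthy = true; // threshold met: leave it to the background check
            }
        }
        return null;
    }

    /** Background path: one round of health checks, intended to be run at a fixed interval. */
    synchronized void runHealthChecks() {
        for (Host host : hosts) {
            if (host.unhealthy && connector.test(host.address)) {
                host.unhealthy = false; // host returns to the selection pool
                host.consecutiveFailures = 0;
            }
        }
    }

    synchronized boolean isUnhealthy(String address) {
        for (Host host : hosts) {
            if (host.address.equals(address)) {
                return host.unhealthy;
            }
        }
        return false;
    }
}
```

In such a sketch, `runHealthChecks` would typically be scheduled on a background executor, for example with `ScheduledExecutorService.scheduleWithFixedDelay`, so that the request path never pays for connection attempts to hosts already marked unhealthy.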

Result:

Problematic hosts are not used in the request path and are actively health checked in the background until they are reachable again. The overall latency should decrease for DNS `ServiceDiscoverer` users who run into a situation where some addresses returned from the DNS queries are unreachable.
Dariusz Jedrzejczyk authored Aug 17, 2021
1 parent 2c27755 commit f455e6f
Showing 5 changed files with 624 additions and 80 deletions.