Don't clear connectTimeout until ConnectResponse is received #57
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We should not clear the connectTimeout (in ConnectionManager) until after we receive ConnectResponse (currently we clear it as soon as we get the TCP 'connect' event). Otherwise we leave a window between when the TCP connection emits 'connect' and we send the ConnectRequest, and when the ConnectResponse comes back, where we have no timeout applying at all and will make no attempt to talk to the socket again.
If we get netsplit from ZK during such a window (and stay netsplit for long enough to both go past the ZK session timeout, and then miss any of its OS' FIN retransmits, which may only be a few minutes), we may hang in the CONNECTING state indefinitely, never reconnecting.
Since the OS on the ZK side has given up on the socket and forgotten about it (its FINs/RSTs never got through during the netsplit), the only way our local OS could find out about the issue is if we tried to send data (then the remote side would send RST back), which we will not do (since we haven't got the ConnectResponse yet or set up the ping timeout). So we will never receive any kind of 'error' event from this socket, and no 'timeout' (since we haven't set one up yet, we only do that after the ConnectResponse).
This window of time may sound small, but sometimes a heavily loaded ZK (such as when it's in the middle of dealing with, say, a netsplit in the cluster) can take quite a long time between when its host OS ACKs the ConnectRequest we sent, and when it then tries to send a ConnectResponse. We've seen windows in prod here of up to almost 1 sec.
We've hit this bug in prod repeatedly during widespread network events (e.g. switch fabric reconfiguration). I can also reproduce it reliably in the lab by making the ZK connect sequence artificially slow (adding a sleep) and starting a netsplit at the right time.