Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't clear connectTimeout until ConnectResponse is received #57

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

arekinath
Copy link

@arekinath arekinath commented Feb 14, 2017

We should not clear the connectTimeout (in ConnectionManager) until after we receive ConnectResponse (currently we clear it as soon as we get the TCP 'connect' event). Otherwise we leave a window between when the TCP connection emits 'connect' and we send the ConnectRequest, and when the ConnectResponse comes back, where we have no timeout applying at all and will make no attempt to talk to the socket again.

If we get netsplit from ZK during such a window (and stay netsplit for long enough to both go past the ZK session timeout, and then miss any of its OS' FIN retransmits, which may only be a few minutes), we may hang in the CONNECTING state indefinitely, never reconnecting.

Since the OS on the ZK side has given up on the socket and forgotten about it (its FINs/RSTs never got through during the netsplit), the only way our local OS could find out about the issue is if we tried to send data (then the remote side would send RST back), which we will not do (since we haven't got the ConnectResponse yet or set up the ping timeout). So we will never receive any kind of 'error' event from this socket, and no 'timeout' (since we haven't set one up yet, we only do that after the ConnectResponse).

This window of time may sound small, but sometimes a heavily loaded ZK (such as when it's in the middle of dealing with, say, a netsplit in the cluster) can take quite a long time between when its host OS ACKs the ConnectRequest we sent, and when it then tries to send a ConnectResponse. We've seen windows in prod here of up to almost 1 sec.

We've hit this bug in prod repeatedly during widespread network events (e.g. switch fabric reconfiguration). I can also reproduce it reliably in the lab by making the ZK connect sequence artificially slow (adding a sleep) and starting a netsplit at the right time.

Otherwise we leave a window between when the TCP connection emits
'connect' and we've sent the ConnectRequest, and when the
ConnectResponse comes back, when we have no timeout applying at
all and no attempt to talk to the socket again.

If we get netsplit from ZK during such a window, we may hang in
the CONNECTING state indefinitely.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant