[lighthouse] detect unhealthy participants via heartbeats #64
529233a to 0f5fa8b
Nice! lgtm. tests can be in a follow up PR so not blocking :)
src/lighthouse.rs
Outdated
    state: &RoomState,
    opt: &LighthouseOpt,
) -> (bool, String) {
    let mut first_joined = now;
Not a request for change, just a clarification. This isn't actually needed, right? first_joined will be min(join time of participants waiting in the room).
Or is it used when there are no participants in the room? I would think the min replica check should be the one catching that.
that's basically what this is computing -- this is just the initial value before we calculate the minimum
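To illustrate the reply above, here is a minimal sketch of a running minimum seeded with `now`. The `Joined` struct and the `first_joined` function are hypothetical stand-ins, not the code in this PR; the sketch only shows why the initial value matters when the room is empty.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a participant's join record (not the PR's type).
struct Joined {
    joined: Instant,
}

/// Returns the earliest join time, or `now` when nobody is waiting.
fn first_joined(now: Instant, participants: &[Joined]) -> Instant {
    // `now` is only the initial value; any earlier join time replaces it.
    let mut first = now;
    for p in participants {
        if p.joined < first {
            first = p.joined;
        }
    }
    first
}

/// Small helper used to build example timestamps relative to a base instant.
fn at(base: Instant, secs: u64) -> Instant {
    base + Duration::from_secs(secs)
}
```

With an empty room the function returns `now`, so the elapsed wait is zero and the min replica check remains the one that blocks quorum.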
.filter(|(replica_id, _details)| {
    let last_heartbeat = heartbeats.get(replica_id);
    if last_heartbeat.is_none() {
        return false;
Add an optional message?
We print the "healthy" replica count in the status message below, so the dashboard should make it pretty obvious if no replicas have heartbeated.
I just realized that this controls when quorum is reached but still includes those unhealthy workers in the finalized quorum. I need to fix that and update the tests.
0f5fa8b to 62c0f4d
Also caught a bug where we wouldn't increase the world size if a fast quorum was found -- now, with fast quorum, we gracefully increase the world size if there are available workers.
    state: &RoomState,
    opt: &LighthouseOpt,
) -> (Option<Vec<QuorumMember>>, String) {
    let healthy_participants: HashMap<String, QuorumMemberDetails> = state
Perhaps it would be worth refactoring the healthy-participant calculation into a separate helper? Then the quorum logic could be a bit simpler:
health checker component -- maintains the state of all healthy participants
quorum:
- uses the health checker component to see which participants are alive and their states
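A rough sketch of what such a helper might look like, assuming heartbeats are tracked as a map from replica ID to the `Instant` of the last heartbeat. The function names, the plain `String` IDs, and the `timeout` parameter are illustrative assumptions, not the types used in this PR.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical helper: returns the IDs of participants whose last heartbeat
/// is more recent than `timeout` before `now`. Participants that have never
/// sent a heartbeat are treated as unhealthy.
fn healthy_participants(
    participants: &[String],
    heartbeats: &HashMap<String, Instant>,
    now: Instant,
    timeout: Duration,
) -> Vec<String> {
    participants
        .iter()
        .filter(|id| match heartbeats.get(*id) {
            // saturating_duration_since avoids a panic if `last` is after `now`.
            Some(last) => now.saturating_duration_since(*last) < timeout,
            None => false,
        })
        .cloned()
        .collect()
}

/// The quorum side then only needs the healthy set (assumed signature).
fn has_min_replicas(healthy: &[String], min_replicas: u64) -> bool {
    healthy.len() >= min_replicas as usize
}
```

Keeping the health check separate means the quorum function never has to look at raw heartbeat timestamps.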
        }
    }

-   if state.participants.len() < opt.min_replicas as usize {
+   if healthy_participants.len() < opt.min_replicas as usize {
Do we maintain a sequence number for each participant? For example, we could have a race where:
- min members == 5
- currently 4 have heartbeated
- the fifth is just about to heartbeat, but then this code runs first and the fifth gets excluded
- repeat over and over
If we have sequence numbers for each checkin, then if the fifth one is on a lower sequence number (but still healthy), we know we should keep waiting until the fifth one bumps the sequence number up to the expected sequence number.
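The sequence-number idea above could be sketched roughly as follows. This is purely hypothetical: the PR does not show any per-participant sequence number, and `should_keep_waiting` and the map of check-in sequence numbers are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical check: given the latest check-in sequence number of each
/// healthy participant, returns true if some participant is still behind the
/// highest sequence number seen, meaning the quorum should keep waiting.
fn should_keep_waiting(healthy_seq: &HashMap<String, u64>) -> bool {
    // The expected sequence number is taken to be the maximum seen so far.
    let expected = match healthy_seq.values().max() {
        Some(max) => *max,
        None => return false, // nobody is healthy, nothing to wait for here
    };
    healthy_seq.values().any(|seq| *seq < expected)
}
```

Under this scheme a healthy participant that is one check-in behind delays the quorum instead of being silently excluded.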
This makes the quorum only consider participants which have healthy heartbeats.
This also exposes heartbeat settings in both the lighthouse (how long a replica can go without a heartbeat before we consider it unhealthy) and the manager (how frequently we should heartbeat).
I do feel like the quorum code is getting a bit messy -- though, with the joiners changes that will likely happen soon I'll be refactoring/removing the room state which should clean this up significantly.
Test plan:
pyre
We don't currently have an e2e test where the heartbeats time out.
I'll also run a manual test to ensure we get the expected behavior.