Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lighthouse] detect unhealthy participants via heartbeats #64

Merged
merged 1 commit into from
Jan 11, 2025

Conversation

d4l3k
Copy link
Member

@d4l3k d4l3k commented Jan 10, 2025

This makes the quorum only consider participants which have heathly heartbeats.

This also exposes heartbeat settings to both the lighthouse (for how long we should consider a replica unhealthy) and in the manager for how frequently we should heartbeat.

I do feel like the quorum code is getting a bit messy -- though, with the joiners changes that will likely happen soon I'll be refactoring/removing the room state which should clean this up significantly.

Test plan:

pyre

pytest
cargo test

We don't currently have an e2e test where the heartbeats timeout

I'll also run a manual test to ensure we get the expected behavior

@d4l3k d4l3k requested review from kennyyu and Jackmin801 January 10, 2025 07:10
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 10, 2025
@d4l3k d4l3k force-pushed the d4l3k/quorum_heartbeats branch from 529233a to 0f5fa8b Compare January 10, 2025 17:58
Copy link
Collaborator

@Jackmin801 Jackmin801 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! lgtm. tests can be in a follow up PR so not blocking :)

state: &RoomState,
opt: &LighthouseOpt,
) -> (bool, String) {
let mut first_joined = now;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a request for change but a clarification. This actually isnt needed right? first joined will be min(join time of participants waiting in room).

Or is it used for if theres no participants in the room? I would think min replica should be the one catching this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's basically what this is computing -- this is just the initial value before we calculate the minimum

.filter(|(replica_id, _details)| {
let last_heartbeat = heartbeats.get(replica_id);
if last_heartbeat.is_none() {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an optional message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We print the "heathly" replica count in the status message below, the dashboard should make this pretty obvious if no replicas have heartbeated

@d4l3k
Copy link
Member Author

d4l3k commented Jan 10, 2025

I just realized that this controls quorum but still includes those workers in the finalized quorum, need to fix that and update the tests

@d4l3k d4l3k force-pushed the d4l3k/quorum_heartbeats branch from 0f5fa8b to 62c0f4d Compare January 11, 2025 01:04
@d4l3k
Copy link
Member Author

d4l3k commented Jan 11, 2025

Also caught a bug where we won't increase the world size if a fast quorum is found -- now with fast quorum we will gracefully increase if there are available workers

@d4l3k d4l3k merged commit 2f97660 into main Jan 11, 2025
6 checks passed
@d4l3k d4l3k deleted the d4l3k/quorum_heartbeats branch January 11, 2025 01:19
state: &RoomState,
opt: &LighthouseOpt,
) -> (Option<Vec<QuorumMember>>, String) {
let healthy_participants: HashMap<String, QuorumMemberDetails> = state
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps it might be worth refactoring calculating healthy participants to a separate helper? Then the logic for quorum could be a bit simpler:

health checker component -- maintains the state of all healthy participants

quorum:

  • uses the health checker component to see which participants are alive and their states

}
}

if state.participants.len() < opt.min_replicas as usize {
if healthy_participants.len() < opt.min_replicas as usize {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we maintain a sequence number for each participant? For example, we could have a race where:

  1. min members == 5
  2. currently 4 have heartbeated
  3. the fifth is just about to heartbeat, but then this code runs first and the fifth gets excluded
  4. repeat over and over

If we have sequence numbers for each checkin, then if the fifth one is on a lower sequence number (but still healthy), we know we should keep waiting until the fifth one bumps the sequence number up to the expected sequence number.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants