Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clustering stability #879

Merged
merged 8 commits into from
Jan 1, 2025
Merged

Clustering stability #879

merged 8 commits into from
Jan 1, 2025

Conversation

carlhoerberg
Copy link
Member

WHAT is this pull request doing?

Improving clustering

HOW can this pull request be tested?

@carlhoerberg carlhoerberg marked this pull request as ready for review December 17, 2024 00:55
@carlhoerberg carlhoerberg requested a review from a team as a code owner December 17, 2024 00:55
@viktorerlingsson
Copy link
Member

Ran some tests on this and sometimes (like 3 out of 10 times) I'm getting this error when shutting down the leader, and the follower will not take over as leader. Haven't looked further into why.

2024-12-18T09:56:32.813755Z  INFO lmq.data_dir_lock Data directory locked by 'PID 243709 @ viktor-lenovo'
2024-12-18T09:56:32.813812Z  INFO lmq.data_dir_lock Waiting for file lock to be released
2024-12-18T09:56:32.813941Z ERROR lmq.etcd Lost leadership
Lease 7587883508156972415 expired (LavinMQ::Etcd::Error)
  from src/lavinmq/etcd.cr:71:7 in 'lease_ttl'
  from src/lavinmq/etcd.cr:121:15 in 'keepalive_loop'
  from src/lavinmq/etcd.cr:101:9 in '->'
  from /usr/share/crystal/src/fiber.cr:143:11 in 'run'
  from /usr/share/crystal/src/fiber.cr:95:34 in '->'
  from ???

@carlhoerberg
Copy link
Member Author

carlhoerberg commented Dec 24, 2024

Ran some tests on this and sometimes (like 3 out of 10 times) I'm getting this error when shutting down the leader, and the follower will not take over as leader. Haven't looked further into why.

Thanks, fixed in 57b583d

Catch and explicity reraise IO::Errors in etcd, otherwise
when an Etcd method yielded, and that inner call raised
IO::Error that was interpreted as a Etcd error.

Extra logging related to Following

Start etcd lease keepalive after won election
Apprently there's no need to update lease TTL until the election is won

Refactor Leadership lease keepalive

dont log Lost leadership if manually revoked

etcd error are sometimes json, sometimes not

don't let Launcher know about clustering/leases

Let it be a concern for Clustering Controller

No need to poll the data dir lock, because it's only required for NFS
disks.
we want to timeout when waiting for acks, if the follower is
unresponsive
can't see the need
Config.instance was used heavily in Server
If the Launcher receives an Etcd, Launcher creates,
and later closes, the ClusteringServer instance.
@carlhoerberg
Copy link
Member Author

All commits are independent and does different things, only the first is really related to this PR.

@carlhoerberg carlhoerberg merged commit 3c0813c into main Jan 1, 2025
23 of 25 checks passed
@carlhoerberg carlhoerberg deleted the clustering-stability branch January 1, 2025 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants