Clustering stability #879

carlhoerberg · 2024-12-13T11:51:13Z

WHAT is this pull request doing?

Improving clustering

HOW can this pull request be tested?

viktorerlingsson · 2024-12-18T10:00:10Z

Ran some tests on this and sometimes (like 3 out of 10 times) I'm getting this error when shutting down the leader, and the follower will not take over as leader. Haven't looked further into why.

2024-12-18T09:56:32.813755Z  INFO lmq.data_dir_lock Data directory locked by 'PID 243709 @ viktor-lenovo'
2024-12-18T09:56:32.813812Z  INFO lmq.data_dir_lock Waiting for file lock to be released
2024-12-18T09:56:32.813941Z ERROR lmq.etcd Lost leadership
Lease 7587883508156972415 expired (LavinMQ::Etcd::Error)
  from src/lavinmq/etcd.cr:71:7 in 'lease_ttl'
  from src/lavinmq/etcd.cr:121:15 in 'keepalive_loop'
  from src/lavinmq/etcd.cr:101:9 in '->'
  from /usr/share/crystal/src/fiber.cr:143:11 in 'run'
  from /usr/share/crystal/src/fiber.cr:95:34 in '->'
  from ???

carlhoerberg · 2024-12-24T22:43:04Z

> Ran some tests on this and sometimes (like 3 out of 10 times) I'm getting this error when shutting down the leader, and the follower will not take over as leader. Haven't looked further into why.

Thanks, fixed in 57b583d

Catch and explicitly re-raise IO::Errors in etcd. Otherwise, when an Etcd method yielded and the inner call raised IO::Error, it was interpreted as an Etcd error.

Extra logging related to Following

Start etcd lease keepalive after the election is won. Apparently there's no need to update the lease TTL until the election is won.

Refactor Leadership lease keepalive

Don't log "Lost leadership" if the lease was manually revoked

etcd errors are sometimes JSON, sometimes not

Don't let Launcher know about clustering/leases. Let it be a concern for the Clustering Controller.

No need to poll the data dir lock, because it's only required for NFS disks.
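Editorial sketch of the first point, in Python rather than LavinMQ's Crystal: when a method runs caller-supplied code inside its own error handling, a broad handler can relabel the caller's I/O failure as an etcd failure. Every name here (`EtcdError`, `watch_before`, `watch_after`, `failing_callback`, `classify`) is hypothetical and only illustrates the pattern; it is not the code changed in 57b583d.

```python
class EtcdError(Exception):
    """Hypothetical stand-in for LavinMQ::Etcd::Error."""


def watch_before(on_event):
    """Broad handler: anything raised by the caller's callback is reported as an etcd error."""
    try:
        on_event("leader-changed")  # the caller's code runs inside our try block
    except Exception as exc:
        raise EtcdError(f"etcd request failed: {exc}") from exc


def watch_after(on_event):
    """Same method, but I/O errors are caught first and re-raised unchanged."""
    try:
        on_event("leader-changed")
    except OSError:
        raise  # the caller's own I/O failure passes through untouched
    except Exception as exc:
        raise EtcdError(f"etcd request failed: {exc}") from exc


def failing_callback(event):
    # Simulates the caller's own socket failing while it handles the event.
    raise ConnectionResetError("follower socket closed")


def classify(watch):
    """Report how the failure looks from the outside."""
    try:
        watch(failing_callback)
    except EtcdError:
        return "etcd error"
    except OSError:
        return "io error"
    return "ok"
```

With the broad handler, `classify(watch_before)` reports an etcd error even though etcd was never involved; with the explicit re-raise, `classify(watch_after)` reports the I/O error the callback actually raised.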

carlhoerberg · 2024-12-31T11:50:21Z

All commits are independent and do different things; only the first is really related to this PR.

carlhoerberg marked this pull request as ready for review December 17, 2024 00:55

carlhoerberg requested a review from a team as a code owner December 17, 2024 00:55

carlhoerberg force-pushed the clustering-stability branch from 0667e3e to a336fa3 December 27, 2024 20:52

carlhoerberg added 4 commits December 30, 2024 22:57

always have a read timeout on Follower sockets

16ec616

we want to timeout when waiting for acks, if the follower is unresponsive

make it possible to run a local etcd while running specs

0ce7cf3

use custom ports for the specs

don't fsync in DataDirLock

ef92ed1

can't see the need
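Editorial sketch of the idea behind 16ec616 (a read timeout on Follower sockets), in Python rather than LavinMQ's Crystal: with a read timeout set, waiting for an ack from an unresponsive peer returns after a bounded time instead of blocking forever. The helper names `wait_for_ack` and `demo`, the 8-byte read, and the timeout values are assumptions made for illustration, not values taken from the PR.

```python
import socket


def wait_for_ack(sock, timeout_s):
    """Wait up to timeout_s seconds for an ack; return the bytes read, or None on timeout."""
    sock.settimeout(timeout_s)  # without this, recv() would block indefinitely
    try:
        return sock.recv(8)
    except socket.timeout:
        return None  # the peer did not answer in time


def demo():
    # A connected pair of local sockets stands in for the leader/follower connection.
    leader_side, follower_side = socket.socketpair()
    try:
        silent = wait_for_ack(leader_side, 0.05)   # follower says nothing
        follower_side.sendall(b"ack")
        answered = wait_for_ack(leader_side, 1.0)  # follower has answered
        return silent, answered
    finally:
        leader_side.close()
        follower_side.close()
```

On a timeout the caller can decide what to do with the unresponsive follower, for example drop the connection, rather than stalling while it waits.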

carlhoerberg force-pushed the clustering-stability branch from 1fbb459 to f599d2c December 30, 2024 22:01

carlhoerberg added 4 commits December 31, 2024 12:46

make federation upstream spec more stable in slow CI

f93c2ac

Pass Config to Server, not just data_dir

33f5d48

Config.instance was used heavily in Server

Move responsibility of Clustering to Launcher

0a7ca56

If the Launcher receives an Etcd, Launcher creates, and later closes, the ClusteringServer instance.

In Actions, get filename via instance variable

02969a3

No need to use the getter

carlhoerberg force-pushed the clustering-stability branch from f599d2c to 02969a3 December 31, 2024 11:46

carlhoerberg merged commit 3c0813c into main Jan 1, 2025
23 of 25 checks passed

carlhoerberg deleted the clustering-stability branch January 1, 2025 21:59
