Checkpoints #415

the-mikedavis · 2024-01-29T16:08:39Z

As described in #141, we add a new effect that machines may emit:

{checkpoint, CheckpointIndex, MachineState}

This suggests that Ra should add a checkpoint. Checkpoints are essentially the same as snapshots except that they don't trigger log truncation. They reuse the ra_snapshot behavior, so they are exactly the same as snapshots on disk. They can later be promoted to actual snapshots by emitting another new effect:

{release_cursor, ReleaseCursorIndex}

which suggests that Ra should find the checkpoint with the highest index lower than or equal to ReleaseCursorIndex and rename the checkpoint file so that it moves to snapshots/ from checkpoints/. Any checkpoints lower than that index are then deleted, and log can be truncated up to that index. There is a configurable maximum number of checkpoints allowed. Adding more checkpoints after that maximum will trigger a "thinning out" where checkpoints are deleted randomly from the middle of the checkpoint list.

rabbit_fifo currently has an ad-hoc checkpointing system where it queues snapshot effects ({release_cursor, RaftIndex, DehydratedState}) regularly and emits those effects when it moves up the release cursor. The advantage of building this into Ra is that we can store the checkpoints on disk, so we can use them for machine recovery and reduce memory consumption (albeit at the cost of disk / IO).

This is useful for machines like rabbit_fifo that need to keep the log around on disk for a potentially long time (for use with the {log, Idxs, Fun, Opts} effect). If we take checkpoints regularly then recovery becomes constant-time rather than linear on the number of messages in the queue. Testing this locally on my machine against the server with a QQ containing 5 million messages¹ gives a recovery time of 10ms while main takes around 24s.

Closes #141

use the md-ra-checkpoints branch, fill the queue with perf-test -x 1 -y 0 -qq -u qq -c 3000 -C 5000000, restart the broker and grep the log file for "recovery of state machine version" ↩

src/ra_snapshot.erl

kjnilsson

This is a very high quality PR - I did managed to find one thing will need addressing (which to be fair isn't particularly intuitive) as well as a couple of nitpicks.

src/ra_server.erl

src/ra_machine.erl

src/ra_log.erl

src/ra_snapshot.erl

src/ra_log.erl

These definitions mirror the snapshot counter definitions.

Snapshot state already holds nearly everything necessary for checkpoints. We add a parameter and a field to the record for the checkpoint directory and adjust some of the `ra_snapshot` API to differentiate between snapshots and checkpoints. An alternative approach to this would be to have a separate `#ra_snapshot{}` state for snapshots and checkpoints. However we would need to change `#ra_snapshot{}` somewhat substantially to support multiple snapshots, even though there is only ever one useful snapshot. It also makes it slightly harder to ensure that there is only one checkpoint or snapshot being written at a given time, whereas this strategy can continue to use the `pending` field.

We delete any checkpoints older than the snapshot index. These will never be useful: it would be faster to recover from a more up-to-date snapshot. So we can delete the directories and omit them from the checkpoint list.

This is the same as 'ra_snapshot:current/1' but for checkpoints rather than snapshots. We will use this to recover from the most recent checkpoint if it exists and is newer than the snapshot.

This updates the snapshotting codepath to accept a new parameter SnapKind :: snapshot | checkpoint otherwise the snapshotting codepath can be almost entirely reused for checkpointing. We don't yet properly handle the `snapshot_written` event and properly complete either the snapshot or the checkpoint depending on which was written - that will be covered in the child commit.

`snapshot_written` with a `ra_snapshot:kind()` of `snapshot` retains the same behavior as before checkpoints. We add a clause to handle `checkpoint`s though which adds the checkpoint to the `#ra_snapshot{}` state and thins out any surplus checkpoints. When we write a regular `snapshot` kind, we also remove any checkpoints older than the snapshot's index. These older checkpoints have no use once there's a snapshot newer than them: recovery would only be slower and there's no point in promoting a checkpoint for a state older than the current snapshot - it would be a no-op.

This effect allows you to turn a checkpoint into a snapshot. This is a useful operation for Ra machine's like `rabbit_fifo` (quorum queues) which read information out of the log. `rabbit_fifo` may only snapshot up to an index where all enqueue/requeue commands before that index have been dequeued because snapshotting deletes the enqueue/requeue commands from the log. So `rabbit_fifo` needs a way to express that it can snapshot up to a certain index which is not necessarily the current index when it emits the `release_cursor` effect. Checkpoints contain states at older indices which are perfect for this - they just need to be moved from `DataDir/checkpoints/` to `DataDir/snapshots/`. Using a file rename minimizes the I/O cost of promotion.

This should greatly improve recovery time in degenerate cases for machines like `rabbit_fifo`. `rabbit_fifo` only snapshots up to indices where all prior enqueued messages have been consumed. So recovery time is proportional to the number of outstanding enqueued messages unless we can recover from a checkpoint. With this change and switching to checkpoints in `rabbit_fifo`, recovery should be roughly constant-time. The only part of the code necessary to change for this is `ra_snapshot:recover/1`. That function would previously recover from the snapshot if it existed but now it can check to see if there is a more recent checkpoint. The rest of the machinery in `ra_log` transparently takes care of recovering from the checkpoint's index.

Initially I have this set at 4x the snapshot interval since you might spam the `checkpoint` effect more than you would a `release_cursor` effect.

Snapshots must be fsync'd so that we can safely truncate the log. There's no need to fsync every checkpoint though: we only care about a checkpoint surviving its write to disk when we go to promote it. So we can move the call to `file:sync/1` into the promotion task. By promotion time it's possible that the file will already be synced.

This is the same helper as I added for `ra_checkpoint_SUITE`. Having the helper makes it a little easier to add new parameters to `ra_snapshot:init/X` (see the child commit).

the-mikedavis added the enhancement label Jan 29, 2024

the-mikedavis requested a review from kjnilsson January 29, 2024 16:08

the-mikedavis changed the title ~~Add effects for checkpointing~~ Checkpoints Jan 29, 2024

the-mikedavis commented Jan 29, 2024

View reviewed changes

src/ra_snapshot.erl Outdated Show resolved Hide resolved

the-mikedavis force-pushed the md-checkpoints branch from 19c648a to 7e5ae32 Compare January 29, 2024 16:45

the-mikedavis mentioned this pull request Feb 5, 2024

4.0: use Ra checkpoints in rabbit_fifo for sub-linear time recovery of QQs on boot rabbitmq/rabbitmq-server#10487

Closed

the-mikedavis force-pushed the md-checkpoints branch 2 times, most recently from 2e3f112 to 7d808e4 Compare February 5, 2024 20:16

kjnilsson requested changes Feb 12, 2024

View reviewed changes

the-mikedavis added 15 commits February 12, 2024 10:50

Add counters for checkpoints

5983854

These definitions mirror the snapshot counter definitions.

Stub an effect for creating a checkpoint

180ab36

Remove outdated checkpoints on ra_snapshot:init/5

39921b8

We delete any checkpoints older than the snapshot index. These will never be useful: it would be faster to recover from a more up-to-date snapshot. So we can delete the directories and omit them from the checkpoint list.

Add `ra_snapshot:latest_checkpoint/1'

48d30e7

This is the same as 'ra_snapshot:current/1' but for checkpoints rather than snapshots. We will use this to recover from the most recent checkpoint if it exists and is newer than the snapshot.

Add ra_log config for the checkpoint interval

f8f51e6

Initially I have this set at 4x the snapshot interval since you might spam the `checkpoint` effect more than you would a `release_cursor` effect.

Document checkpointing

2ecaf2b

Add a test suite for checkpointing

523260d

ra_snapshot_SUITE: Add a helper around ra_snapshot:init/5

acd5618

This is the same helper as I added for `ra_checkpoint_SUITE`. Having the helper makes it a little easier to add new parameters to `ra_snapshot:init/X` (see the child commit).

Allow configuring the maximum number of checkpoints

47e0733

the-mikedavis force-pushed the md-checkpoints branch from 7d808e4 to 47e0733 Compare February 12, 2024 15:51

kjnilsson added this to the 2.10.0 milestone Feb 12, 2024

the-mikedavis added 2 commits February 13, 2024 09:30

Add latest_checkpoint_index to ra_log overview map

336bded

Add an integration test for checkpoints and promotion

d9890cc

the-mikedavis marked this pull request as ready for review February 13, 2024 14:31

the-mikedavis requested a review from kjnilsson February 13, 2024 14:31

kjnilsson approved these changes Feb 13, 2024

View reviewed changes

kjnilsson merged commit a9faaab into main Feb 13, 2024
7 checks passed

the-mikedavis deleted the md-checkpoints branch February 13, 2024 14:55

the-mikedavis mentioned this pull request Feb 13, 2024

Rename ra_log snapshot_interval parameter to min_snapshot_interval #422

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Checkpoints #415

Checkpoints #415

the-mikedavis commented Jan 29, 2024 •

edited

Loading

kjnilsson left a comment

Checkpoints #415

Checkpoints #415

Conversation

the-mikedavis commented Jan 29, 2024 • edited Loading

Footnotes

kjnilsson left a comment

Choose a reason for hiding this comment

the-mikedavis commented Jan 29, 2024 •

edited

Loading