Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpoints #415

Merged
merged 17 commits into from
Feb 13, 2024
Merged

Checkpoints #415

merged 17 commits into from
Feb 13, 2024

Conversation

the-mikedavis
Copy link
Member

@the-mikedavis the-mikedavis commented Jan 29, 2024

As described in #141, we add a new effect that machines may emit:

{checkpoint, CheckpointIndex, MachineState}

This suggests that Ra should add a checkpoint. Checkpoints are essentially the same as snapshots except that they don't trigger log truncation. They reuse the ra_snapshot behavior, so they are exactly the same as snapshots on disk. They can later be promoted to actual snapshots by emitting another new effect:

{release_cursor, ReleaseCursorIndex}

which suggests that Ra should find the checkpoint with the highest index lower than or equal to ReleaseCursorIndex and rename the checkpoint file so that it moves to snapshots/ from checkpoints/. Any checkpoints lower than that index are then deleted, and log can be truncated up to that index. There is a configurable maximum number of checkpoints allowed. Adding more checkpoints after that maximum will trigger a "thinning out" where checkpoints are deleted randomly from the middle of the checkpoint list.

rabbit_fifo currently has an ad-hoc checkpointing system where it queues snapshot effects ({release_cursor, RaftIndex, DehydratedState}) regularly and emits those effects when it moves up the release cursor. The advantage of building this into Ra is that we can store the checkpoints on disk, so we can use them for machine recovery and reduce memory consumption (albeit at the cost of disk / IO).

This is useful for machines like rabbit_fifo that need to keep the log around on disk for a potentially long time (for use with the {log, Idxs, Fun, Opts} effect). If we take checkpoints regularly then recovery becomes constant-time rather than linear on the number of messages in the queue. Testing this locally on my machine against the server with a QQ containing 5 million messages1 gives a recovery time of 10ms while main takes around 24s.

Closes #141

Footnotes

  1. use the md-ra-checkpoints branch, fill the queue with perf-test -x 1 -y 0 -qq -u qq -c 3000 -C 5000000, restart the broker and grep the log file for "recovery of state machine version"

@the-mikedavis the-mikedavis changed the title Add effects for checkpointing Checkpoints Jan 29, 2024
src/ra_snapshot.erl Outdated Show resolved Hide resolved
Copy link
Contributor

@kjnilsson kjnilsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very high quality PR - I did managed to find one thing will need addressing (which to be fair isn't particularly intuitive) as well as a couple of nitpicks.

src/ra_server.erl Outdated Show resolved Hide resolved
src/ra_server.erl Outdated Show resolved Hide resolved
src/ra_machine.erl Show resolved Hide resolved
src/ra_log.erl Outdated Show resolved Hide resolved
src/ra_snapshot.erl Outdated Show resolved Hide resolved
src/ra_log.erl Outdated Show resolved Hide resolved
These definitions mirror the snapshot counter definitions.
Snapshot state already holds nearly everything necessary for checkpoints.
We add a parameter and a field to the record for the checkpoint
directory and adjust some of the `ra_snapshot` API to differentiate
between snapshots and checkpoints.

An alternative approach to this would be to have a separate
`#ra_snapshot{}` state for snapshots and checkpoints. However we would
need to change `#ra_snapshot{}` somewhat substantially to support
multiple snapshots, even though there is only ever one useful snapshot.
It also makes it slightly harder to ensure that there is only one
checkpoint or snapshot being written at a given time, whereas this
strategy can continue to use the `pending` field.
We delete any checkpoints older than the snapshot index. These will
never be useful: it would be faster to recover from a more up-to-date
snapshot. So we can delete the directories and omit them from the
checkpoint list.
This is the same as 'ra_snapshot:current/1' but for checkpoints rather
than snapshots. We will use this to recover from the most recent
checkpoint if it exists and is newer than the snapshot.
This updates the snapshotting codepath to accept a new parameter

    SnapKind :: snapshot | checkpoint

otherwise the snapshotting codepath can be almost entirely reused for
checkpointing.

We don't yet properly handle the `snapshot_written` event and properly
complete either the snapshot or the checkpoint depending on which was
written - that will be covered in the child commit.
`snapshot_written` with a `ra_snapshot:kind()` of `snapshot` retains
the same behavior as before checkpoints. We add a clause to handle
`checkpoint`s though which adds the checkpoint to the `#ra_snapshot{}`
state and thins out any surplus checkpoints.

When we write a regular `snapshot` kind, we also remove any checkpoints
older than the snapshot's index. These older checkpoints have no use
once there's a snapshot newer than them: recovery would only be slower
and there's no point in promoting a checkpoint for a state older than
the current snapshot - it would be a no-op.
This effect allows you to turn a checkpoint into a snapshot. This is a
useful operation for Ra machine's like `rabbit_fifo` (quorum queues)
which read information out of the log. `rabbit_fifo` may only snapshot
up to an index where all enqueue/requeue commands before that index
have been dequeued because snapshotting deletes the enqueue/requeue
commands from the log. So `rabbit_fifo` needs a way to express that it
can snapshot up to a certain index which is not necessarily the current
index when it emits the `release_cursor` effect. Checkpoints contain
states at older indices which are perfect for this - they just need to
be moved from `DataDir/checkpoints/` to `DataDir/snapshots/`. Using a
file rename minimizes the I/O cost of promotion.
This should greatly improve recovery time in degenerate cases for
machines like `rabbit_fifo`. `rabbit_fifo` only snapshots up to indices
where all prior enqueued messages have been consumed. So recovery time
is proportional to the number of outstanding enqueued messages unless
we can recover from a checkpoint. With this change and switching to
checkpoints in `rabbit_fifo`, recovery should be roughly constant-time.

The only part of the code necessary to change for this is
`ra_snapshot:recover/1`. That function would previously recover from
the snapshot if it existed but now it can check to see if there is a
more recent checkpoint. The rest of the machinery in `ra_log`
transparently takes care of recovering from the checkpoint's index.
Initially I have this set at 4x the snapshot interval since you might
spam the `checkpoint` effect more than you would a `release_cursor`
effect.
Snapshots must be fsync'd so that we can safely truncate the log.
There's no need to fsync every checkpoint though: we only care about
a checkpoint surviving its write to disk when we go to promote it.
So we can move the call to `file:sync/1` into the promotion task. By
promotion time it's possible that the file will already be synced.
This is the same helper as I added for `ra_checkpoint_SUITE`. Having
the helper makes it a little easier to add new parameters to
`ra_snapshot:init/X` (see the child commit).
@kjnilsson kjnilsson added this to the 2.10.0 milestone Feb 12, 2024
@the-mikedavis the-mikedavis marked this pull request as ready for review February 13, 2024 14:31
@kjnilsson kjnilsson merged commit a9faaab into main Feb 13, 2024
7 checks passed
@the-mikedavis the-mikedavis deleted the md-checkpoints branch February 13, 2024 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ra checkpoints
2 participants