Checkpoints #415
Merged
Conversation
the-mikedavis commented on Jan 29, 2024
the-mikedavis force-pushed the md-checkpoints branch from 19c648a to 7e5ae32 on January 29, 2024 at 16:45
the-mikedavis force-pushed the md-checkpoints branch 2 times, most recently from 2e3f112 to 7d808e4 on February 5, 2024 at 20:16
kjnilsson requested changes on Feb 12, 2024
This is a very high quality PR - I did manage to find one thing that will need addressing (which, to be fair, isn't particularly intuitive), as well as a couple of nitpicks.
These definitions mirror the snapshot counter definitions.
Snapshot state already holds nearly everything necessary for checkpoints. We add a parameter and a record field for the checkpoint directory and adjust some of the `ra_snapshot` API to differentiate between snapshots and checkpoints. An alternative approach would be to have a separate `#ra_snapshot{}` state for snapshots and checkpoints. However, we would need to change `#ra_snapshot{}` fairly substantially to support multiple snapshots, even though there is only ever one useful snapshot. It would also make it slightly harder to ensure that only one checkpoint or snapshot is being written at a given time, whereas this strategy can continue to use the `pending` field.
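As a rough illustration of this design, a single state record can carry both kinds of bookkeeping. The field names below are assumptions for the sketch, not the actual `#ra_snapshot{}` record in Ra:

```erlang
%% Illustrative only: a guess at how one #ra_snapshot{} state can track
%% both snapshots and checkpoints. Field names are assumptions, not the
%% actual Ra record definition.
-record(ra_snapshot, {snapshot_dir,        % directory for the one real snapshot
                      checkpoint_dir,      % new: directory holding checkpoints
                      pending,             % at most one snapshot OR checkpoint in flight
                      current,             % {Index, Term} of the current snapshot
                      checkpoints = []}).  % [{Index, Term}] of on-disk checkpoints
```

Because `pending` is a single slot, "only one snapshot or checkpoint being written at a time" falls out of the data structure for free.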
We delete any checkpoints older than the snapshot index. These will never be useful: it would be faster to recover from a more up-to-date snapshot. So we can delete the directories and omit them from the checkpoint list.
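That cleanup can be sketched as a pure function. This is a simplification, not the real Ra code; checkpoints are assumed to be `{Index, Term}` pairs:

```erlang
-module(checkpoint_gc).
-export([partition_stale/2]).

%% Sketch only: split checkpoints into those still worth keeping (newer
%% than the snapshot at SnapIdx) and those made obsolete by it. The
%% obsolete ones can have their directories deleted.
partition_stale(SnapIdx, Checkpoints) ->
    lists:partition(fun({Idx, _Term}) -> Idx > SnapIdx end, Checkpoints).
```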
This is the same as `ra_snapshot:current/1` but for checkpoints rather than snapshots. We will use this to recover from the most recent checkpoint if it exists and is newer than the snapshot.
This updates the snapshotting codepath to accept a new parameter, `SnapKind :: snapshot | checkpoint`; otherwise the snapshotting codepath can be almost entirely reused for checkpointing. We don't yet properly handle the `snapshot_written` event and complete either the snapshot or the checkpoint depending on which was written - that will be covered in the child commit.
`snapshot_written` with a `ra_snapshot:kind()` of `snapshot` retains the same behavior as before checkpoints. We add a clause to handle `checkpoint`s, though, which adds the checkpoint to the `#ra_snapshot{}` state and thins out any surplus checkpoints. When we write a regular `snapshot` kind, we also remove any checkpoints older than the snapshot's index. These older checkpoints have no use once there's a newer snapshot: recovery would only be slower, and there's no point in promoting a checkpoint for a state older than the current snapshot - it would be a no-op.
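One possible thinning strategy can be sketched as follows. The real policy may differ (the PR description mentions deleting randomly from the middle); this deterministic variant preserves the newest and oldest checkpoints and drops from the middle until the list fits under the maximum:

```erlang
-module(checkpoint_thin).
-export([thin/2]).

%% Sketch only, assuming Max >= 1 and a newest-first list: when more than
%% Max checkpoints exist, repeatedly drop an entry from roughly the middle,
%% keeping the newest and oldest checkpoints.
thin(Max, Checkpoints) when length(Checkpoints) =< Max ->
    Checkpoints;
thin(Max, [Newest | Rest]) ->
    Mid = length(Rest) div 2,
    {Front, [_Dropped | Back]} = lists:split(Mid, Rest),
    thin(Max, [Newest | Front ++ Back]).
```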
This effect allows you to turn a checkpoint into a snapshot. This is a useful operation for Ra machines like `rabbit_fifo` (quorum queues) which read information out of the log. `rabbit_fifo` may only snapshot up to an index where all enqueue/requeue commands before that index have been dequeued, because snapshotting deletes those commands from the log. So `rabbit_fifo` needs a way to express that it can snapshot up to a certain index which is not necessarily the current index when it emits the `release_cursor` effect. Checkpoints contain states at older indices, which are perfect for this - they just need to be moved from `DataDir/checkpoints/` to `DataDir/snapshots/`. Using a file rename minimizes the I/O cost of promotion.
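Conceptually, promotion is just a directory rename. The sketch below uses an illustrative path layout, not the exact on-disk layout Ra uses:

```erlang
-module(checkpoint_promote).
-export([promote/2]).

%% Sketch only: promote a checkpoint to a snapshot by renaming its
%% directory from checkpoints/ to snapshots/. On the same filesystem
%% this is a metadata-only operation, so no data is copied.
promote(DataDir, Name) ->
    From = filename:join([DataDir, "checkpoints", Name]),
    To = filename:join([DataDir, "snapshots", Name]),
    file:rename(From, To).
```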
This should greatly improve recovery time in degenerate cases for machines like `rabbit_fifo`. `rabbit_fifo` only snapshots up to indices where all prior enqueued messages have been consumed, so recovery time is proportional to the number of outstanding enqueued messages unless we can recover from a checkpoint. With this change and switching to checkpoints in `rabbit_fifo`, recovery should be roughly constant-time. The only part of the code that needs to change for this is `ra_snapshot:recover/1`. That function previously recovered from the snapshot if it existed; now it can also check whether there is a more recent checkpoint. The rest of the machinery in `ra_log` transparently takes care of recovering from the checkpoint's index.
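The "pick the newer of the two" decision can be sketched as a small pure function. This is an illustration of the idea, not the actual `ra_snapshot:recover/1`; each argument is assumed to be `undefined` or an `{Index, Term}` pair:

```erlang
-module(recover_pick).
-export([pick/2]).

%% Sketch only: choose the recovery source with the highest index,
%% preferring the snapshot on a tie.
pick(undefined, undefined) ->
    undefined;
pick(Snap, undefined) ->
    {snapshot, Snap};
pick(undefined, Check) ->
    {checkpoint, Check};
pick({SnapIdx, _} = Snap, {CheckIdx, _}) when SnapIdx >= CheckIdx ->
    {snapshot, Snap};
pick(_Snap, Check) ->
    {checkpoint, Check}.
```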
Initially I have this set at 4x the snapshot interval since you might spam the `checkpoint` effect more than you would a `release_cursor` effect.
Snapshots must be fsync'd so that we can safely truncate the log. There's no need to fsync every checkpoint though: we only care about a checkpoint surviving its write to disk when we go to promote it. So we can move the call to `file:sync/1` into the promotion task. By promotion time it's possible that the file will already be synced.
This is the same helper as I added for `ra_checkpoint_SUITE`. Having the helper makes it a little easier to add new parameters to `ra_snapshot:init/X` (see the child commit).
the-mikedavis force-pushed the md-checkpoints branch from 7d808e4 to 47e0733 on February 12, 2024 at 15:51
kjnilsson approved these changes on Feb 13, 2024
As described in #141, we add a new effect that machines may emit. This effect suggests that Ra should add a checkpoint. Checkpoints are essentially the same as snapshots except that they don't trigger log truncation. They reuse the `ra_snapshot` behavior, so they are exactly the same as snapshots on disk. They can later be promoted to actual snapshots by emitting another new effect, which suggests that Ra should find the checkpoint with the highest index lower than or equal to `ReleaseCursorIndex` and rename the checkpoint file so that it moves to `snapshots/` from `checkpoints/`. Any checkpoints lower than that index are then deleted, and the log can be truncated up to that index. There is a configurable maximum number of checkpoints allowed. Adding more checkpoints after that maximum will trigger a "thinning out" where checkpoints are deleted randomly from the middle of the checkpoint list.

`rabbit_fifo` currently has an ad-hoc checkpointing system where it queues snapshot effects (`{release_cursor, RaftIndex, DehydratedState}`) regularly and emits those effects when it moves up the release cursor. The advantage of building this into Ra is that we can store the checkpoints on disk, so we can use them for machine recovery and reduce memory consumption (albeit at the cost of disk I/O).

This is useful for machines like `rabbit_fifo` that need to keep the log around on disk for a potentially long time (for use with the `{log, Idxs, Fun, Opts}` effect). If we take checkpoints regularly then recovery becomes constant-time rather than linear in the number of messages in the queue. Testing this locally on my machine against the server with a QQ containing 5 million messages [1] gives a recovery time of 10ms while `main` takes around 24s.

Closes #141
Footnotes

1. Use the `md-ra-checkpoints` branch, fill the queue with `perf-test -x 1 -y 0 -qq -u qq -c 3000 -C 5000000`, restart the broker, and grep the log file for "recovery of state machine version". ↩