
Recovery design document.#13

Merged
tonyastolfi merged 1 commit into main from tastolfi/recovery-design-doc
Mar 19, 2026

Conversation

@tonyastolfi
Collaborator

tonyastolfi commented Feb 18, 2026

@gabrielbornstein
Collaborator

"The active range marks the contiguous set of Blocks which are in active use, within a configured number of bytes (default=32MiB; the active range accuracy ∆"
Is the active range just the size of a memtable?

Is the EditOffset stored persistently for the database? Do we need to update anything to make sure this happens?

Is it important we allow the checkpoint to be variable sizes between restarts? Pros/cons? Any complications? Can we enable checkpoint resizing while the database is running?

Could we save memory by storing the order of edits as a logical clock instead of their EditOffsets? Why are EditOffsets and deltas necessary?

How many MemTables is it possible for us to recover? I thought we only needed to recover one? MemTable size can vary, but max checkpoint size == max mem_table size

Do we persistently store checkpoint upper bound?

"such that any two Blocks bi, bj are in the same Cluster if their EditOffset intervals overlap."
Is it ok if all of our blocks have some overlap? How do we determine how to partition clusters in this case?

Are clusters necessary? Couldn't we just have a priority queue of all the blocks prioritized by the next slots lower bound, and detect when there's a gap in between edits?

@tonyastolfi
Collaborator Author

"The active range marks the contiguous set of Blocks which are in active use, within a configured number of bytes (default=32MiB; the active range accuracy ∆" Is the active range just the size of a memtable?

It coincides with the recovered MemTable(s), but there may be more than one. There are several reasons for this:

  • The checkpoint distance may have been reduced from its pre-recovery value
  • We intentionally allow for three MemTables to coexist concurrently during Normal Operation: one active (read/write) and up to two finalized (read-only); this is so that we can parallelize three stages of the checkpoint update pipeline:
    1. collection of the next MemTable
    2. updating the checkpoint in-memory
    3. writing the finalized checkpoint to disk
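The three pipeline stages above can be sketched as a small simulation. This is a hypothetical illustration, not the actual implementation: the class and method names are invented, the stages would really run on separate threads, and the MemTable cap of three (one active, two finalized) is taken from the description above.

```python
from collections import deque

class CheckpointPipeline:
    """Sketch of the three-stage checkpoint update pipeline (illustrative only)."""

    MAX_FINALIZED = 2  # one active + up to two finalized == three MemTables

    def __init__(self):
        self.active = {}          # stage 1: collecting the next MemTable
        self.finalized = deque()  # read-only MemTables awaiting checkpointing
        self.checkpoint = {}      # stage 2 state: the in-memory checkpoint
        self.on_disk = []         # stage 3 output: checkpoints written to disk

    def put(self, key, value):
        """Apply an edit to the active (read/write) MemTable."""
        self.active[key] = value

    def roll_over(self):
        """Finalize the active MemTable; at most two finalized may coexist."""
        assert len(self.finalized) < self.MAX_FINALIZED, "pipeline full"
        self.finalized.append(self.active)
        self.active = {}

    def apply_next_checkpoint(self):
        """Stage 2: fold the oldest finalized MemTable into the checkpoint."""
        self.checkpoint.update(self.finalized.popleft())

    def flush(self):
        """Stage 3: write the updated checkpoint to disk (simulated)."""
        self.on_disk.append(dict(self.checkpoint))
```

Because the finalized queue holds at most two entries, stage 1 can keep collecting a new MemTable while stage 2 updates the checkpoint and stage 3 writes the previous one out.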

Is the EditOffset stored persistently for the database? Do we need to update anything to make sure this happens?

Which EditOffset? The current design specifies that the EditOffset of every update is stored: as the lower bound EditOffset in each block header, together with a 32-bit offset for each slot. The effective upper bound EditOffset is one of the implicit outputs of recovery... I don't think it's worth it to track it and write it out explicitly, e.g. in the Meta-Block.
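The encoding described above (one full lower-bound EditOffset per block header, plus a 32-bit per-slot offset relative to it) can be sketched as follows. The struct layout and names here are assumptions for illustration, not the actual on-disk format.

```python
from dataclasses import dataclass

@dataclass
class BlockHeader:
    # Hypothetical layout: the block stores one full 64-bit lower-bound
    # EditOffset; each slot then stores only a 32-bit offset relative to it.
    lower_bound_edit_offset: int   # u64
    slot_offsets: list             # one u32 per slot, relative to the lower bound

def slot_edit_offset(header: BlockHeader, slot_index: int) -> int:
    """Reconstruct a slot's absolute EditOffset from the compact encoding."""
    rel = header.slot_offsets[slot_index]
    assert 0 <= rel < 2**32, "relative offset must fit in 32 bits"
    return header.lower_bound_edit_offset + rel
```

This is why the upper bound never needs to be written explicitly: replaying the slots of the last block yields it as a byproduct.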

Is it important we allow the checkpoint to be variable sizes between restarts? Pros/cons? Any complications? Can we enable checkpoint resizing while the database is running?

Checkpoint distance, you mean? Yes it is very important that it can vary both between restarts and as the system is running, because this is how we dynamically tune the system to a given workload.

Could we save memory by storing the order of edits as a logical clock instead of their EditOffsets? Why are EditOffsets and deltas necessary?

Technically yes, but practically not really. I think it's reasonable to assume that whatever we use for a logical timestamp (EditOffset or some monotonic counter), we should store it in one of the native machine word integer types. The question then is: by omitting the bits we need to capture the slot sizes, can we reduce the required integer type from 64 to 32 bits? Assuming the Block sizes are good, we can shave maybe 11 bits off; 32 + 11 == 43 bits == ~8.8 trillion, which is bigger than we would want the WAL to ever get (in terms of number of edits, not bytes), but I don't really think it's big enough to serve as a kind of global timestamp, like for when we want to save snapshots.
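The back-of-the-envelope arithmetic above checks out; the one assumption (mine, for illustration) is that the 11 recoverable bits correspond to an average slot size on the order of 2 KiB (log2(2048) == 11).

```python
# Bit budget for replacing a 64-bit EditOffset with a 32-bit logical counter.
SLOT_SIZE_BITS = 11                    # bits saved by omitting slot sizes
COUNTER_BITS = 32 + SLOT_SIZE_BITS     # effective range of a 32-bit counter

# 2**43 == 8,796,093,022,208: roughly 8.8 trillion edits of addressable
# range -- more WAL than we ever expect, but not enough headroom for a
# never-wrapping global timestamp.
max_edits = 2 ** COUNTER_BITS
```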

How many MemTables is it possible for us to recover? I thought we only needed to recover one? MemTable size can vary, but max checkpoint size == max mem_table size

It depends on the checkpoint distance at recovery time. Steady-state, we want to have at most 3 MemTables at once: 2 finalized, 1 active. But right after recovery we might have more than that. We could either do nothing about that (just hope the checkpoint update pipeline catches up), throttle updates until we get down to the desired number, or have recovery not return until we have written enough checkpoints to catch up. I'm not sure off the top of my head what the trade-offs are in terms of code complexity; we can look at that when we get there.

Do we persistently store checkpoint upper bound?

Yes, each checkpoint (in the checkpoint LLFS volume) stores its upper bound. Right now it is a "batch upper bound"; we should probably change this to be an EditOffset.

"such that any two Blocks bi, bj are in the same Cluster if their EditOffset intervals overlap." Is it ok if all of our blocks have some overlap? How do we determine how to partition clusters in this case?

Section 3.1.2 covers this; we need to periodically update the "minimum lower bound" (min_lb) that is passed each time we append to the log, so that the clusters don't get too big.
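The overlap-based Cluster partition quoted above can be sketched as a standard interval-merge pass. This is illustrative only: it assumes half-open [lower, upper) EditOffset intervals, and it does not model the periodic min_lb updates from Section 3.1.2 that keep clusters from growing too big.

```python
def partition_clusters(blocks):
    """Partition blocks into Clusters by overlapping EditOffset intervals.

    `blocks` is a list of half-open (lower, upper) EditOffset intervals.
    Blocks whose intervals overlap, directly or transitively, land in the
    same Cluster. Sketch only, not the actual recovery code.
    """
    clusters = []  # each entry: (merged_lo, merged_hi, member_intervals)
    for lo, hi in sorted(blocks):
        if clusters and lo < clusters[-1][1]:  # overlaps the open cluster
            c_lo, c_hi, members = clusters[-1]
            members.append((lo, hi))
            clusters[-1] = (c_lo, max(c_hi, hi), members)
        else:
            clusters.append((lo, hi, [(lo, hi)]))
    return [members for _, _, members in clusters]
```

Note the transitive case: if every block overlaps its neighbor, everything merges into one Cluster, which is exactly the degenerate case the min_lb updates are there to prevent.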

Are clusters necessary? Couldn't we just have a priority queue of all the blocks prioritized by the next slots lower bound, and detect when there's a gap in between edits?

I was hoping to avoid having to do this for all but the final cluster. Blocks can have on the order of hundreds of slots, so it's potentially a lot more work to sort everything than just the last cluster.
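For the final cluster, the priority-queue replay suggested above might look like the following sketch. The slot encoding here is hypothetical: each slot is modeled as (edit_offset, next_offset), where next_offset is where the following edit should begin, so a mismatch between the smallest available offset and the expected one signals a gap.

```python
import heapq

def replay_final_cluster(blocks):
    """Merge per-block slot streams in EditOffset order; stop at the first gap.

    `blocks` is a list of slot lists, one list per block; each slot is a
    (edit_offset, next_offset) pair. Illustrative sketch only.
    """
    heap = []
    for b, slots in enumerate(blocks):
        if slots:
            heapq.heappush(heap, (slots[0][0], b, 0))

    replayed = []
    expected = heap[0][0] if heap else None
    while heap:
        offset, b, i = heapq.heappop(heap)
        if offset != expected:
            break  # gap detected: recovery of this cluster stops here
        replayed.append(offset)
        expected = blocks[b][i][1]
        if i + 1 < len(blocks[b]):
            heapq.heappush(heap, (blocks[b][i + 1][0], b, i + 1))
    return replayed
```

Running this over every cluster would mean heap operations for every slot of every block, which is the per-slot sorting cost the comment above wants to confine to the final cluster.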

@tonyastolfi tonyastolfi merged commit 96ee40d into main Mar 19, 2026
1 check passed
