Conversation
"The active range marks the contiguous set of Is the EditOffset stored persistently for the database? Do we need to update anything to make sure this happens? Is it important we allow the checkpoint to be variable sizes between restarts? Pros/cons? Any complications? Can we enable checkpoint resizing while the database is running? Could we save memory by storing the order of edits as a logical clock instead of their EditOffsets? Why are EditOffsets and deltas necessary? How many MemTables is it possible for us to recover? I thought we only needed to recover one? MemTable size can vary, but max checkpoint size == max mem_table size Do we persistently store checkpoint upper bound? "such that any two Blocks bi, bj are in the same Cluster if their EditOffset intervals overlap." Are clusters necessary? Couldn't we just have a priority queue of all the blocks prioritized by the next slots lower bound, and detect when there's a gap in between edits? |
Answers:

It coincides with the recovered MemTable(s), but there may be more than one; there are several reasons for this.
Which EditOffset? The current design specifies that the EditOffset of every update is stored, as the lower bound EditOffset in each Block header together with the 32-bit offset for each slot. The effective upper bound EditOffset is one of the implicit outputs of recovery; I don't think it's worth it to track it and write it out explicitly, e.g. in the Meta-Block.
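For illustration, here is a minimal sketch of that scheme; the names (`BlockHeader`, `slot_offsets`, the accessor functions) are hypothetical, not the actual on-disk layout:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using EditOffset = std::uint64_t;

// Hypothetical in-memory shape of a Block header: one full-width lower bound
// EditOffset, plus a 32-bit offset (delta from the lower bound) per slot.
struct BlockHeader {
  EditOffset lower_bound;
  std::vector<std::uint32_t> slot_offsets;
};

// The EditOffset of slot `i` is reconstructed from the header.
EditOffset slot_edit_offset(const BlockHeader& h, std::size_t i) {
  return h.lower_bound + h.slot_offsets[i];
}

// The effective upper bound falls out of scanning the slots during recovery;
// per the answer above, it is not stored explicitly (e.g., in the Meta-Block).
EditOffset block_upper_bound(const BlockHeader& h) {
  EditOffset hi = h.lower_bound;
  for (std::uint32_t d : h.slot_offsets) {
    hi = std::max<EditOffset>(hi, h.lower_bound + d);
  }
  return hi;
}
```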
Checkpoint distance, you mean? Yes, it is very important that it can vary, both between restarts and while the system is running, because this is how we dynamically tune the system to a given workload.
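One way runtime tuning could look in code (purely a sketch; none of these names come from the design): a single atomic parameter that the checkpoint pipeline re-reads for every checkpoint it builds, so the distance can change both across restarts and while the database runs.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical tuning knob; the initial value is illustrative only.
std::atomic<std::uint64_t> checkpoint_distance{1 << 20};

void set_checkpoint_distance(std::uint64_t d) {
  checkpoint_distance.store(d, std::memory_order_relaxed);
}

// The checkpoint pipeline calls this once per checkpoint, so a new value
// takes effect on the very next checkpoint.
std::uint64_t next_checkpoint_distance() {
  return checkpoint_distance.load(std::memory_order_relaxed);
}
```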
Technically yes, but practically not really. I think it's reasonable to assume that whatever we use for a logical timestamp (EditOffset or some monotonic counter), we should store it in one of the native machine word integer types. The question then is: by omitting the bits we need to capture the slot sizes, can we reduce the required integer type from 64 to 32 bits? Assuming the Block sizes are good, we can shave maybe 11 bits off; 32 + 11 == 43 bits, and 2^43 is roughly 8 trillion. That is bigger than we would ever want the WAL to get (in terms of number of edits, not bytes), but I don't think it's big enough to serve as a kind of global timestamp, e.g. for when we want to save snapshots.
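To make the arithmetic explicit (assuming the 11-bit saving stated above):

```cpp
#include <cstdint>

// Dropping ~11 low-order bits (captured by the slot sizes) means a stored
// 32-bit value effectively spans 32 + 11 = 43 bits of EditOffset range.
constexpr unsigned kStoredBits = 32;
constexpr unsigned kSavedBits = 11;  // assumption stated above
constexpr std::uint64_t kEffectiveRange =
    std::uint64_t{1} << (kStoredBits + kSavedBits);

static_assert(kEffectiveRange == 8'796'093'022'208ULL);  // 2^43 ~= 8.8 trillion
```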
It depends on the checkpoint distance at recovery time. In steady state, we want to have at most 3 MemTables at once: 2 finalized, 1 active. But right after recovery we might have more than that. We could either do nothing about it (just hope the checkpoint update pipeline catches up), throttle updates until we get down to the desired number, or have recover not return until we have written enough checkpoints to catch up. I'm not sure off the top of my head what the trade-offs are in terms of code complexity; we can look at that when we get there.
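For reference, the three options enumerated as a hypothetical policy type (the enum and names are not part of the current design):

```cpp
// Post-recovery catch-up strategies; trade-offs (mostly code complexity)
// are deferred, as noted above.
enum class CatchUpPolicy {
  kNone,      // do nothing; hope the checkpoint update pipeline catches up
  kThrottle,  // throttle updates until we are back to at most 3 MemTables
  kBlock,     // recover does not return until enough checkpoints are written
};
```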
Yes, each checkpoint (in the checkpoint LLFS volume) stores its upper bound. Right now it is a "batch upper bound"; we should probably change this to be an EditOffset.
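A sketch of the proposed change, with hypothetical names (not the actual metadata format):

```cpp
#include <cstdint>

using EditOffset = std::uint64_t;

// Hypothetical shape of per-checkpoint metadata in the checkpoint LLFS volume.
struct CheckpointMeta {
  // std::uint64_t batch_upper_bound;  // current: a "batch upper bound"
  EditOffset upper_bound;  // proposed: store the upper bound as an EditOffset
};
```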
Section 3.1.2 covers this; we need to periodically update the "minimum lower bound".
I was hoping to avoid having to do this for all but the final cluster. Blocks can have on the order of hundreds of slots, so it's (potentially) a lot more work to sort everything than to sort just the last cluster.
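For concreteness, here is a sketch of the priority-queue alternative raised in the question (all types and names hypothetical): a k-way merge over every Block's slots, keyed by each Block's next slot lower bound, stopping at the first gap between edits. Note that it visits every slot, which is exactly the cost the Cluster design confines to the final cluster.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using EditOffset = std::uint64_t;

// One recovered slot: the half-open EditOffset interval [lower, upper).
struct Slot {
  EditOffset lower;
  EditOffset upper;
};

// Cursor over one Block's slots, which are already in ascending order.
struct BlockCursor {
  std::vector<Slot> slots;
  std::size_t next = 0;
};

// K-way merge of all Blocks' slots, keyed by the next slot's lower bound.
// Returns the upper bound of the contiguous run starting at `start`,
// stopping at the first gap between edits.
EditOffset find_first_gap(std::vector<BlockCursor>& blocks, EditOffset start) {
  using Entry = std::pair<EditOffset, std::size_t>;  // (next lower bound, block)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (std::size_t i = 0; i < blocks.size(); ++i) {
    if (blocks[i].next < blocks[i].slots.size()) {
      heap.emplace(blocks[i].slots[blocks[i].next].lower, i);
    }
  }
  EditOffset end = start;  // contiguous prefix covered so far
  while (!heap.empty()) {
    auto [lower, i] = heap.top();
    heap.pop();
    if (lower > end) {
      break;  // gap: the next edit does not touch the contiguous prefix
    }
    const Slot& s = blocks[i].slots[blocks[i].next++];
    end = std::max(end, s.upper);
    if (blocks[i].next < blocks[i].slots.size()) {
      heap.emplace(blocks[i].slots[blocks[i].next].lower, i);
    }
  }
  return end;  // visits every slot: O(total slots * log(num blocks))
}
```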
PDF render available here: https://storage.googleapis.com/pdf-renders/recovery.pdf