storage: refactor upsert source to use simplified control flow #35407
DAlperin wants to merge 3 commits into MaterializeInc:main
Conversation
Instead of doing partial timestamp emission we now rely on a RocksDB cache to consolidate as we go, only emitting full data. The I/O load should be about the same, since we had to go to RocksDB anyway for each key during partial emission. We could potentially optimize further by having an in-memory stash if data is small and only spilling to RocksDB after some threshold, but in the absence of evidence I'm avoiding the complication. Further, we could conceivably update our RocksDB wrapper to support multiple column families, but again, I don't think it's super worth it for now.
bugbot run
PR Summary (Medium Risk)
Overview: Adds a RocksDB-backed on-disk Cleans up related configuration/metrics/test plumbing by removing the
Written by Cursor Bugbot for commit a2a9cf0.
src/storage/src/upsert/stash.rs
Outdated
```rust
    self.rocksdb.multi_update(batch).await?;
}

Ok(())
```
Stash bookkeeping diverges from RocksDB on write failure
Medium Severity
In stage_batch, the in-memory time_info map (count and prefix) is updated eagerly inside the for loop before the fallible self.rocksdb.multi_update(batch) call. If the RocksDB write fails and returns Err, the ? propagates the error but time_info already reflects the staged entries. This causes is_empty(), len(), and min_time() to return incorrect results. In the caller (upsert_continual_feedback), this leads to holding a stash_cap capability for phantom entries that don't exist in RocksDB, which can temporarily block frontier advancement. The inconsistency self-heals on the next drain (the prefix scan finds nothing and removes the phantom time entry), but the window of incorrect state could delay processing.
keep it in memory.
3ef9d6a to 6792b26


Instead of doing partial timestamp emission we now rely on a rocksdb
cache to consolidate as we go, only emitting full data. The I/O load
should be about the same since we had to go to rocksdb anyway for each
key during partial emission.
We could potentially optimize further by having an in-memory stash if
data is small and only spilling to rocks after some threshold, but in
the absence of evidence I'm avoiding the complication.
Further we could conceivably update our Rocks wrapper to support
multiple CF, but again, I don't think it's super worth it for now.