fix: deduplicate stale BTree entries during optimize with stable row IDs #6041
wkalt wants to merge 2 commits into lance-format:main from
Conversation
When stable row IDs are used, the BTree index stores stable row IDs (not physical addresses) in its _rowid column. During optimize, the old entries for updated rows would survive fragment-based filtering because stable row IDs don't encode fragment IDs in their upper 32 bits. Both the old (stale) and new entries for the same row ID ended up in the merged index, causing FlatIndex::try_new to fail with "RowAddrTreeMap::from_sorted_iter called with non-sorted input" due to duplicate IDs.

Fix: before merging old and new data in combine_old_new(), collect the new data's row IDs and filter out old entries with matching IDs. This removes stale entries for rows that have been updated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
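The dedup step described above can be sketched as follows. This is a simplified synchronous model: plain u64 row IDs and Vecs stand in for Lance's async RecordBatch streams, the Entry type is illustrative, and only the function name combine_old_new comes from the PR.

```rust
use std::collections::BTreeSet;

/// Illustrative stand-in for one (key, row_id) index entry.
#[derive(Debug, PartialEq)]
struct Entry {
    key: i64,
    row_id: u64,
}

/// Sketch of the dedup in combine_old_new: collect the row IDs present in
/// the new data, drop any old entry whose row ID was rewritten, then merge,
/// so exactly one entry per row ID survives.
fn combine_old_new(old: Vec<Entry>, new: Vec<Entry>) -> Vec<Entry> {
    // The PR's second commit uses a RoaringTreemap here; a std BTreeSet is
    // used as a stand-in with the same set semantics.
    let new_ids: BTreeSet<u64> = new.iter().map(|e| e.row_id).collect();
    let mut merged: Vec<Entry> = old
        .into_iter()
        .filter(|e| !new_ids.contains(&e.row_id))
        .collect();
    merged.extend(new);
    merged
}

fn main() {
    // Row 7 was updated: a stale old entry and a fresh new entry both exist.
    let old = vec![Entry { key: 1, row_id: 7 }, Entry { key: 2, row_id: 8 }];
    let new = vec![Entry { key: 9, row_id: 7 }];
    let merged = combine_old_new(old, new);
    // The stale (key=1, row_id=7) entry is gone; row IDs are now unique.
    assert_eq!(merged.len(), 2);
    assert_eq!(merged.iter().filter(|e| e.row_id == 7).count(), 1);
    println!("{:?}", merged);
}
```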
This PR was generated automatically from the output of some generative testing
Review

Solid correctness fix for a real bug. The approach (collect new row IDs, then filter stale entries from the old stream) is straightforward and correct.

P1: Consider
Replace HashSet<u64> with RoaringTreemap for row ID deduplication per repo coding standards (prefer roaring structures over hash sets for row ID collections). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
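The failure mode behind both commits is the sorted-input invariant: the "RowAddrTreeMap::from_sorted_iter called with non-sorted input" error fires because a duplicate row ID is never strictly greater than its predecessor. The sketch below models that constructor with a hypothetical std-only function (the real RowAddrTreeMap and RoaringTreemap are Lance/roaring types, not shown here), and uses a std BTreeSet as a stand-in for RoaringTreemap: both dedupe and iterate in sorted order.

```rust
use std::collections::BTreeSet;

/// Model of a from_sorted_iter-style constructor: accepts row IDs only if
/// they arrive strictly increasing, which duplicate IDs violate. This is an
/// illustrative stand-in, not Lance's actual RowAddrTreeMap API.
fn from_sorted_iter(ids: impl IntoIterator<Item = u64>) -> Result<Vec<u64>, String> {
    let mut out: Vec<u64> = Vec::new();
    for id in ids {
        if let Some(&last) = out.last() {
            if id <= last {
                return Err("called with non-sorted input".to_string());
            }
        }
        out.push(id);
    }
    Ok(out)
}

fn main() {
    // A duplicate row ID (stale + fresh entry for row 7) breaks the invariant.
    assert!(from_sorted_iter(vec![5, 7, 7, 9]).is_err());

    // A set structure (BTreeSet here, RoaringTreemap in the PR) both
    // deduplicates and yields sorted iteration, so the invariant holds.
    let ids: BTreeSet<u64> = vec![9, 7, 5, 7].into_iter().collect();
    assert_eq!(from_sorted_iter(ids).unwrap(), vec![5, 7, 9]);
}
```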
// new fragment. Without this dedup step, both the old (stale) and new
// entries would survive, causing duplicate row IDs in the merged index.
let new_schema = new_data.schema();
let new_batches: Vec<RecordBatch> = new_data.try_collect().await?;
This will collect all our new data, seems bad.
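The concern here is that try_collect buffers every new batch in memory just to read its row IDs. One alternative is to record each batch's IDs as it flows by and forward the batch immediately, so only the ID set stays resident. The sketch below models that with plain iteration; Batch, tap_and_forward, and forward are all illustrative stand-ins for Lance's async stream types (in the real code this would correspond to something like a StreamExt::map adapter that records IDs as batches pass through).

```rust
use std::collections::BTreeSet;

/// Illustrative stand-in for a record batch: just its row IDs.
struct Batch {
    row_ids: Vec<u64>,
}

/// Stand-in for handing a batch on to the merge stage; no-op in this sketch.
fn forward(_batch: Batch) {}

/// Record each batch's row IDs into a set, then pass the batch along, so
/// the full batch contents never accumulate in memory at once.
fn tap_and_forward(batches: Vec<Batch>) -> BTreeSet<u64> {
    let mut new_ids = BTreeSet::new();
    for batch in batches {
        new_ids.extend(batch.row_ids.iter().copied());
        forward(batch);
    }
    new_ids
}

fn main() {
    let new_batches = vec![
        Batch { row_ids: vec![7, 8] },
        Batch { row_ids: vec![12] },
    ];
    let new_ids = tap_and_forward(new_batches);
    assert_eq!(new_ids.len(), 3);
    println!("saw {} distinct new row IDs", new_ids.len());
}
```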