
fix: deduplicate stale BTree entries during optimize with stable row IDs#6041

Open
wkalt wants to merge 2 commits intolance-format:mainfrom
wkalt:fix/btree-optimize-sort-order

Conversation


@wkalt wkalt commented Feb 27, 2026

When stable row IDs are used, the BTree index stores stable row IDs (not physical addresses) in its _rowid column. During optimize, the old entries for updated rows would survive fragment-based filtering because stable row IDs don't encode fragment IDs in their upper 32 bits. Both the old (stale) and new entries for the same row ID ended up in the merged index, causing FlatIndex::try_new to fail with "RowAddrTreeMap::from_sorted_iter called with non-sorted input" due to duplicate IDs.

Fix: before merging old and new data in combine_old_new(), collect the new data's row IDs and filter old entries that have matching IDs. This removes stale entries for rows that have been updated.
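The dedup step described above can be sketched in a few lines. This is a minimal std-only illustration, not the actual implementation: the real combine_old_new() operates on Arrow RecordBatch streams, while this sketch uses plain (row_id, value) tuples and a HashSet to show the shape of the fix.

```rust
use std::collections::HashSet;

/// Hypothetical row: (stable row ID, indexed value). The real index
/// stores Arrow batches; tuples keep this sketch self-contained.
type Row = (u64, i64);

/// Drop old entries whose stable row ID reappears in the new data,
/// then append the new entries, mirroring the dedup step above.
fn combine_old_new(old: Vec<Row>, new: Vec<Row>) -> Vec<Row> {
    // Collect the new data's row IDs once.
    let new_ids: HashSet<u64> = new.iter().map(|&(id, _)| id).collect();
    // Keep only old entries that were not updated.
    let mut merged: Vec<Row> = old
        .into_iter()
        .filter(|&(id, _)| !new_ids.contains(&id))
        .collect();
    merged.extend(new);
    merged
}

fn main() {
    let old = vec![(1, 10), (2, 20), (3, 30)];
    let new = vec![(2, 21)]; // row 2 was updated
    let merged = combine_old_new(old, new);
    // The stale entry (2, 20) is gone; (2, 21) survives.
    assert_eq!(merged, vec![(1, 10), (3, 30), (2, 21)]);
    println!("{:?}", merged);
}
```

Without the filter, both (2, 20) and (2, 21) would reach the merged index, reproducing the duplicate-ID failure.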


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the bug Something isn't working label Feb 27, 2026

wkalt commented Feb 27, 2026

This PR was generated automatically from the output of some generative testing.

@github-actions

Review

Solid correctness fix for a real bug. The approach — collect new row IDs, then filter stale entries from the old stream — is straightforward and correct.

P1: Consider RoaringTreemap instead of HashSet<u64>

The codebase convention (per CLAUDE.md) is to prefer RoaringBitmap over HashSet<u32> for memory efficiency. For u64 keys, the equivalent is RoaringTreemap, which is already used in 13+ files across the repo. During a large optimize where many rows were updated, the HashSet<u64> could use significantly more memory than a RoaringTreemap. Since roaring is already a dependency here, the change would be minimal:

use roaring::RoaringTreemap;

let new_row_ids: RoaringTreemap = new_batches
    .iter()
    .flat_map(|batch| { ... })
    .collect();

No other issues found — the test coverage is good and the fix is well-targeted.


codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 92.50000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines               | Patch % | Lines
rust/lance-index/src/scalar/btree.rs   | 92.50%  | 0 Missing, 3 partials ⚠️


Replace HashSet<u64> with RoaringTreemap for row ID deduplication
per repo coding standards (prefer roaring structures over hash sets
for row ID collections).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
// new fragment. Without this dedup step, both the old (stale) and new
// entries would survive, causing duplicate row IDs in the merged index.
let new_schema = new_data.schema();
let new_batches: Vec<RecordBatch> = new_data.try_collect().await?;
Collaborator
This will collect all our new data into memory, which seems bad.
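One way to address this concern is to buffer only the new row IDs (one u64 per updated row) and filter the old entries lazily as batches flow through the merge, rather than collecting every new batch up front. A std-only sketch of that shape, using Vec<u64> in place of Arrow batches and a hypothetical filter_old_batches helper (assuming the new data's _rowid column can be read in a cheap first pass):

```rust
use std::collections::HashSet;

/// Hypothetical streaming sketch: given the set of new row IDs, filter
/// old batches one at a time instead of materializing everything.
fn filter_old_batches<I>(
    old_batches: I,
    new_ids: HashSet<u64>,
) -> impl Iterator<Item = Vec<u64>>
where
    I: Iterator<Item = Vec<u64>>,
{
    old_batches.map(move |batch| {
        // Each batch is filtered independently, so peak memory is one
        // batch plus the ID set, not the whole new-data stream.
        batch
            .into_iter()
            .filter(|id| !new_ids.contains(id))
            .collect()
    })
}

fn main() {
    let new_ids: HashSet<u64> = [2, 5].into_iter().collect();
    let old = vec![vec![1, 2, 3], vec![4, 5, 6]].into_iter();
    let filtered: Vec<Vec<u64>> = filter_old_batches(old, new_ids).collect();
    assert_eq!(filtered, vec![vec![1, 3], vec![4, 6]]);
    println!("{:?}", filtered);
}
```

In the real code this would mean an extra pass over the new stream (or a stream that can be re-opened) to build the ID set, trading a second read for bounded memory.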
