feat(compaction): preallocate fragment IDs to reduce reserve-commit amplification and conflicts #6004
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
I think it would be useful to explore post-allocation of fragment IDs as an alternative to this pre-allocation solution:
Pre-allocation: We estimate the number of resulting fragments and do a single call to allocate fragment IDs. Since the fragment IDs are known during compaction, we can distribute transpose row_id operations (if enabled).
- Pros: If we estimate correctly, this is great!
- Cons: Incorrect estimates lead to overhead: (1) under-estimation causes additional `reserve_fragment_ids` invocations per task (so, the same as the current approach), or (2) over-estimation results in unused fragment IDs (probably not a large concern, but merits calling out).
Post-allocation: We wait until all fragments have been written and then allocate fragment IDs. This will require a post-processing step for transposing row_id.
- Pros: We do not have to estimate fragment counts, because these are known, so in every case there is a single `reserve_fragment_ids` call.
- Cons: We cannot distribute the `row_id` transpose operations; for large datasets (e.g., 1B+ rows) this can be on the order of seconds, I think.

IMO the call on whether to pre-allocate or post-allocate is a tradeoff between (1) our confidence in the estimation accuracy and (2) the cost of transposing `row_id`s.
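As a rough sketch of the pre-allocation estimate described above (hypothetical names, not the actual Lance API), the per-task output fragment count could be derived from the live row count and a target rows-per-fragment limit:

```rust
/// Hypothetical helper: estimate how many output fragments a compaction
/// task will produce, given its live rows and a rows-per-fragment target.
/// Illustrative only; a real estimator would also consider byte sizes.
fn estimate_output_fragment_num(live_rows: usize, max_rows_per_fragment: usize) -> usize {
    // Ceiling division: any remainder spills into one extra fragment.
    (live_rows + max_rows_per_fragment - 1) / max_rows_per_fragment
}

fn main() {
    // 1000 live rows at 400 rows per fragment -> 3 output fragments.
    println!("{}", estimate_output_fragment_num(1000, 400));
}
```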
    estimated_output_fragment_count: None,
    reserved_start: None,
    reserved_len: None,
This is just to make compilation succeed? Should we expose this in python / Java SDKs as first-class? Otherwise we can take it on as a follow up!
    /// default false
    pub prealloc_fragment_ids: bool,
    /// Expansion factor applied to preallocated fragment IDs for a compaction plan.
    /// default 1.05
Not sure if the `prealloc_fragment_factor` is useful. Basically, the algorithm assigns the estimated fragment count to each `TaskData` and then the last one takes the rest. So in practice, if there are 100 `TaskData` instances each with 2 fragments and the factor is 1.05 (210 reserved), 99 `TaskData` will each receive 2 assigned IDs, and the last `TaskData` will get the remaining 12 reserved fragment IDs. So I would challenge whether this configuration actually lets us tune between unused over-allocation and additional reservation overhead.
Nice catch! In fact, we can apply this magnification factor at the granularity of task allocation. Taking your example, that would be ceil(2 * 1.05) => 3 per task, and finally pre-allocate fragment IDs according to the total number of fragments allocated across all tasks.
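The per-task rounding suggested above can be sketched as follows (illustrative names, not the PR's actual code): apply the factor to each task's estimate with a ceiling, then reserve the sum for the whole plan.

```rust
/// Hypothetical sketch: apply the expansion factor per task, rounding up,
/// so every task gets at least one spare ID when the factor is > 1.
fn padded_task_alloc(per_task_estimate: usize, factor: f64) -> usize {
    ((per_task_estimate as f64) * factor).ceil() as usize
}

/// Sum the padded per-task allocations into one plan-level reservation.
fn total_reserved(task_estimates: &[usize], factor: f64) -> usize {
    task_estimates.iter().map(|&e| padded_task_alloc(e, factor)).sum()
}

fn main() {
    // 100 tasks of 2 fragments each at factor 1.05:
    // ceil(2 * 1.05) = 3 per task, so 300 IDs reserved in total
    // (versus 210 when the factor is applied to the plan total).
    let estimates = vec![2usize; 100];
    println!("{}", total_reserved(&estimates, 1.05));
}
```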
    })
    }

    fn sum_file_size(fragment: &Fragment) -> usize {
I think this is only used once, does it make sense to define a separate function for it?
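For context, a self-contained sketch of what such a helper looks like, with illustrative stand-in types (the real `Fragment` lives in the Lance format crate and has a different shape):

```rust
/// Illustrative stand-ins for the real types.
struct DataFile {
    size_bytes: usize,
}

struct Fragment {
    files: Vec<DataFile>,
}

/// Sum the on-disk size of all data files in a fragment.
fn sum_file_size(fragment: &Fragment) -> usize {
    fragment.files.iter().map(|f| f.size_bytes).sum()
}

fn main() {
    let frag = Fragment {
        files: vec![DataFile { size_bytes: 100 }, DataFile { size_bytes: 250 }],
    };
    println!("{}", sum_file_size(&frag));
}
```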
Would also like to comment here praising the potential perf improvements of this work for large compactions. I ran a test compacting a 100k-fragment dataset (16 rows per fragment) down to 144 fragments with 24 compactions in parallel. Without this it took 1h53m, and with pre-allocating fragment IDs it took 21m45s. So very significant!
Hi @hamersaw, thanks a lot for your attention to this issue.
The hot path is probably related to the following code, which needs to access every row and fragment involved in the compaction.
From the pre-allocation perspective, compatibility is better, but it indeed cannot guarantee contiguous fragment IDs. I have had a brief discussion with @jackye1995 about this :) Hi @wjones127, @jackye1995 and @Xuanwo, could you please share some thoughts? Thank you!
westonpace
left a comment
Here's my two cents:
- I am not concerned about the cost of the transpose. I do not think this should be significant when compared with all the other work that needs to be done. If needed we can parallelize across threads but even then I wouldn't bother.
- I am not worried about a few lost fragment ids from an over-estimate. We can have fragment gaps already due to failed compaction and I know we encounter this often in production.
- I am not worried about breaking changes to `RewriteResult`. These are fairly temporary, and this would only affect a compaction that starts on one version and commits on another, which seems rare and unlikely.
- In general, I think our estimates will probably be pretty good.
- I am slightly worried about out-of-order fragment ids due to an under-estimate. I just don't know if we happen to make any implicit assumptions that fragment ids are always ascending and bugs from this can be subtle.
- I do think the post-allocation is slightly less complex.
Given this, I lean slightly towards post-allocation.
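A minimal sketch of the post-allocation flow being leaned towards here (hypothetical helper; the real path would issue one `ReserveFragments` transaction and then run the `row_id` transpose step):

```rust
/// After all output fragments are written, reserve a contiguous ID range in
/// a single call and assign IDs in write order. Hypothetical signature; the
/// actual implementation would follow this with a row_id transpose.
fn assign_post_reserved_ids(fragment_count: u64, reserved_start: u64) -> Vec<u64> {
    (reserved_start..reserved_start + fragment_count).collect()
}

fn main() {
    // 3 fragments written, range reserved starting at ID 10 -> [10, 11, 12].
    println!("{:?}", assign_post_reserved_ids(3, 10));
}
```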
    let live_rows = bin.row_counts.iter().copied().sum::<usize>();
    let total_input_file_size =
        bin.fragments.iter().map(sum_file_size).sum::<usize>();
    let estimated_output_fragment_count = estimate_output_fragment_num(
Why is this an estimate? Is it because we might hit the size (bytes) limit when writing and end up with more fragments than expected?
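One plausible reason, sketched below with hypothetical names: the estimator has to respect both a row-count limit and a byte-size limit, and the byte side is based on *input* file sizes, so the actual written output can differ from the estimate.

```rust
/// Hypothetical estimator combining row and byte limits. Because the byte
/// term uses input sizes rather than the (unknown) output sizes, this is a
/// best-effort estimate, not an exact count.
fn estimate_fragments(
    live_rows: usize,
    input_bytes: usize,
    max_rows: usize,
    max_bytes: usize,
) -> usize {
    let by_rows = (live_rows + max_rows - 1) / max_rows;
    let by_bytes = (input_bytes + max_bytes - 1) / max_bytes;
    by_rows.max(by_bytes)
}

fn main() {
    // The row limit alone would give 2 fragments, but the byte limit forces 4.
    println!("{}", estimate_fragments(2000, 4_000_000, 1000, 1_000_000));
}
```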
    pub binary_copy_read_batch_bytes: Option<usize>,
    /// Whether to preallocate fragment IDs for a compaction plan.
    /// default false
    pub prealloc_fragment_ids: bool,
Why would I ever want this set to false?
Xuanwo
left a comment
Thank you @zhangyue19921010 for your work on this! I believe this PR actually combines two aspects: fragment ID allocation and the distribution of remap tasks.
My current idea is that we could first add post-reserve support. Then we can adapt this PR to implement the remap task distribution on top of that approach.
What are your thoughts? @zhangyue19921010 @hamersaw
First of all, thanks for all of your discussion and attention. After careful consideration, post-processing is indeed a more concise solution, so I will close this PR. Thanks for all your effort here @hamersaw @westonpace @Xuanwo 👍
Thank you @zhangyue19921010 again!
Background
At the moment, compaction calls `reserve_fragment_ids` in each task, and each call is effectively a commit (a `ReserveFragments` transaction). With large datasets (for example, compaction plans with up to 10K tasks), especially under concurrent compaction, this creates a large number of tiny commits.
As a result, we repeatedly trigger transaction conflict checks, which can significantly hurt performance and may even cause task starvation (some tasks take a very long time to finish).
What this PR introduces
This PR adds `prealloc_fragment_ids` for compaction. When enabled, the number of output fragments is estimated for the whole plan and the fragment IDs are reserved in a single call, with each task drawing from that reserved range.
This reduces commit amplification from per-task reservation to mostly plan-level reservation.
Compatibility and safety guarantees
- A new option (`prealloc_fragment_ids`) controls this behavior.
- An expansion factor (`fragment_id_prealloc_factor`) allows over-reservation to provide a safety margin.
- If the reserved IDs run out, a task falls back to an additional `reserve_fragment_ids` call to continue safely.
For example, here is an overview of the number of commits for compacting 1000 fragments: previously, roughly 100 commits in total.
After the optimization, only two commits are needed (the pre-allocate fragment IDs commit plus the compaction commit).
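Back-of-envelope view of those commit counts (assuming one reserve commit per task before this change, and a single plan-level reserve plus the final compaction commit after; illustrative only):

```rust
/// Rough commit counts before and after pre-allocation (illustrative).
fn commits_before(num_tasks: usize) -> usize {
    // One ReserveFragments commit per task, plus the final compaction commit.
    num_tasks + 1
}

fn commits_after() -> usize {
    // One plan-level ReserveFragments commit, plus the final compaction commit.
    2
}

fn main() {
    // A 100-task plan: ~101 commits before, 2 after.
    println!("{} -> {}", commits_before(100), commits_after());
}
```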