
feat(compaction): preallocate fragment IDs to reduce reserve-commit amplification and conflicts #6004

Closed
zhangyue19921010 wants to merge 3 commits into lance-format:main from zhangyue19921010:pre-alloc-frags-compaction-v2

Conversation

@zhangyue19921010
Contributor

@zhangyue19921010 zhangyue19921010 commented Feb 25, 2026

Background

Currently, compaction calls reserve_fragment_ids in each task, and each call is effectively a full commit (a ReserveFragments transaction).

With large datasets (for example, compaction plans with up to 10K tasks), especially under concurrent compaction, this creates a large number of tiny commits.
As a result, we repeatedly trigger transaction conflict checks, which can significantly hurt performance and may even cause task starvation (some tasks take a very long time to finish).

What this PR introduces

This PR adds prealloc_fragment_ids for compaction.

When enabled:

  1. During compaction planning, we estimate the total number of output fragments.
  2. We reserve fragment IDs upfront (in bulk) before task execution.
  3. Reserved IDs are distributed to compaction tasks and consumed as needed.

This reduces commit amplification from per-task reservation to mostly plan-level reservation.
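The distribution step (3) above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the actual Lance API: the names `ReservedIds` and `distribute` are invented here. Each task receives its estimated share of one bulk reservation, and the last task absorbs any remainder.

```rust
/// Hypothetical sketch: a contiguous range of reserved fragment IDs
/// handed to one compaction task.
#[derive(Debug, PartialEq)]
struct ReservedIds {
    start: u64,
    len: u64,
}

/// Split a single bulk reservation `[start, start + total)` across tasks,
/// giving each task its estimated share; the last task takes the rest.
fn distribute(start: u64, total: u64, per_task: &[u64]) -> Vec<ReservedIds> {
    let mut next = start;
    let mut out = Vec::with_capacity(per_task.len());
    for (i, &want) in per_task.iter().enumerate() {
        let len = if i + 1 == per_task.len() {
            start + total - next // remainder goes to the last task
        } else {
            want.min(start + total - next)
        };
        out.push(ReservedIds { start: next, len });
        next += len;
    }
    out
}

fn main() {
    // 3 tasks, each estimated at 2 output fragments; 7 IDs reserved in bulk.
    let ranges = distribute(100, 7, &[2, 2, 2]);
    assert_eq!(ranges[0], ReservedIds { start: 100, len: 2 });
    assert_eq!(ranges[2], ReservedIds { start: 104, len: 3 }); // 2 + remainder
    println!("{:?}", ranges);
}
```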

Compatibility and safety guarantees

  1. A new switch (prealloc_fragment_ids) controls this behavior.
    • Default behavior remains unchanged when the switch is off.
  2. A new parameter (fragment_id_prealloc_factor) allows over-reservation to provide a safety margin.
  3. If estimation is not accurate and reserved IDs are exhausted, tasks detect it and perform a fallback reserve_fragment_ids call to continue safely.
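The exhaustion fallback in point 3 can be modeled like this. This is a toy sketch with invented names (`IdPool`, the batch size of 8); in the real code the fallback would be an actual reserve_fragment_ids call (a ReserveFragments commit), not a counter bump.

```rust
/// Hypothetical sketch of a task-local pool of preallocated fragment IDs
/// with a fallback path when the pool runs dry.
struct IdPool {
    next: u64,
    end: u64, // exclusive end of the preallocated range
    fallback_calls: u32,
}

impl IdPool {
    fn new(start: u64, len: u64) -> Self {
        Self { next: start, end: start + len, fallback_calls: 0 }
    }

    /// Take the next fragment ID; fall back to a fresh (simulated)
    /// reservation only once the preallocated range is exhausted.
    fn take(&mut self) -> u64 {
        if self.next >= self.end {
            // In real code: a fallback reserve_fragment_ids commit.
            self.fallback_calls += 1;
            self.end += 8; // illustrative batch size
        }
        let id = self.next;
        self.next += 1;
        id
    }
}

fn main() {
    // Estimate was 2 fragments, but the task produces 3.
    let mut pool = IdPool::new(100, 2);
    let ids: Vec<u64> = (0..3).map(|_| pool.take()).collect();
    assert_eq!(ids, vec![100, 101, 102]);
    assert_eq!(pool.fallback_calls, 1); // one fallback commit, not three
    println!("ids = {:?}, fallbacks = {}", ids, pool.fallback_calls);
}
```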

Expected impact

  • Far fewer tiny commits during large compaction jobs.
  • Lower transaction conflict pressure under concurrency.

For example, the following directory listing shows the commits produced when compacting 1000 fragments without preallocation: about 100 commits in total.

drwxr-xr-x@ 104 1   staff    3328  2 25 15:48 ./
drwxr-xr-x@   5 1   staff     160  2 25 15:47 ../
-rw-r--r--@   1 1   staff  118307  2 25 15:48 18446744073709551513.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551514.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551515.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551516.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551517.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551518.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551519.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551520.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551521.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551522.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551523.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551524.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551525.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551526.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551527.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551528.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551529.manifest


.....



-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551609.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551610.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551611.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551612.manifest
-rw-r--r--@   1 1   staff   98407  2 25 15:48 18446744073709551613.manifest
-rw-r--r--@   1 1   staff  193779  2 25 15:47 18446744073709551614.manifest

After the optimization, only two commits are needed (one commit to preallocate fragment IDs plus the compaction commit itself):

drwxr-xr-x@ 5 1   staff     160  2 25 15:50 ./
drwxr-xr-x@ 5 1   staff     160  2 25 15:50 ../
-rw-r--r--@ 1 1   staff  118307  2 25 15:50 18446744073709551612.manifest
-rw-r--r--@ 1 1   staff   98407  2 25 15:50 18446744073709551613.manifest
-rw-r--r--@ 1 1   staff  193779  2 25 15:50 18446744073709551614.manifest

@github-actions github-actions bot added enhancement New feature or request java labels Feb 25, 2026
@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@zhangyue19921010 zhangyue19921010 changed the title feat(compaction): preallocate fragment IDs to reduce reserve-commit amplification and conflict checking feat(compaction): preallocate fragment IDs to reduce reserve-commit amplification and conflicts Feb 25, 2026
@wjones127 wjones127 self-assigned this Feb 25, 2026
Contributor

@hamersaw hamersaw left a comment


I think it would be useful to explore post-allocation of fragment IDs as an alternative to this pre-allocation solution:

Pre-allocation: We estimate the number of resulting fragments and make a single call to allocate fragment IDs. Since the fragment IDs are known during compaction, we can distribute the row_id transpose operations (if enabled).

  • Pros: If we estimate correctly, this is great!
  • Cons: Incorrect estimates lead to overhead: (1) under-estimation causes additional reserve_fragment_ids invocations per task (the same as the current approach), while (2) over-estimation leaves unused fragment IDs (probably not a large concern, but worth calling out).

Post-allocation: We wait until all fragments have been written and then allocate fragment IDs. This will require a post-processing step for transposing row_id.

  • Pros: We do not have to estimate fragment counts, because they are known, so in every case there is a single reserve_fragment_ids call.
  • Cons: We cannot distribute the row_id transpose operations; for large datasets (e.g., 1B+ rows) this can be on the order of seconds, I think.

IMO the call on whether to pre-allocate or post-allocate is a tradeoff between (1) our confidence level in the estimation accuracy and (2) the cost of transposing row_ids.
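To make the transpose step concrete, here is a much-simplified sketch of what remapping row addresses to final fragment IDs could look like once those IDs are known. The names (`RowAddr`, `transpose`, `frag_id_map`) are invented for illustration and are not the Lance API; the real transpose also deals with row offsets across merged fragments.

```rust
use std::collections::HashMap;

// A row address conceptually packs (fragment_id, local_offset);
// modeled here as a plain tuple for illustration.
type RowAddr = (u64, u64);

/// Rewrite old row addresses into new ones given a mapping from old
/// fragment IDs to final fragment IDs. In post-allocation this would run
/// once, serially, after all fragments are written.
fn transpose(old_rows: &[RowAddr], frag_id_map: &HashMap<u64, u64>) -> Vec<RowAddr> {
    old_rows
        .iter()
        .map(|&(frag, off)| (frag_id_map[&frag], off))
        .collect()
}

fn main() {
    let map: HashMap<u64, u64> = [(7, 100), (9, 101)].into_iter().collect();
    let out = transpose(&[(7, 0), (7, 1), (9, 0)], &map);
    assert_eq!(out, vec![(100, 0), (100, 1), (101, 0)]);
    println!("{:?}", out);
}
```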

Comment on lines +413 to +415
estimated_output_fragment_count: None,
reserved_start: None,
reserved_len: None,
Contributor

Is this just to make compilation succeed? Should we expose this in the Python / Java SDKs as first-class? Otherwise we can take it on as a follow-up!

/// default false
pub prealloc_fragment_ids: bool,
/// Expansion factor applied to preallocated fragment IDs for a compaction plan.
/// default 1.05
Contributor

Not sure the prealloc_fragment_factor is useful. Basically, the algorithm assigns the estimated fragment count to each TaskData and then the last one takes the rest. So in practice, if there are 100 TaskData instances each with 2 fragments and the factor is 1.05 (210 reserved), 99 TaskData will each receive 2 assigned IDs, and the last TaskData will get the remaining 12 reserved fragment IDs. So I would question whether this configuration really gives us the ability to tune the tradeoff between unused over-allocation and additional reservation overhead.

Contributor Author

Nice catch! In fact, we can apply this expansion factor at the granularity of per-task allocation instead. Taking your example, each task would get ceil(2 * 1.05) = 3 IDs, and the bulk preallocation would then reserve the sum of the per-task allocations.
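The arithmetic difference between the two placements of the factor can be illustrated like this (function names are invented for the example, not code from this PR):

```rust
/// Factor applied to the plan-level total: headroom pools in one place
/// (and, per the current algorithm, lands entirely on the last task).
fn reserved_total_plan_level(tasks: usize, per_task: usize, factor: f64) -> usize {
    ((tasks * per_task) as f64 * factor).ceil() as usize
}

/// Factor applied per task before summing: every task gets its own headroom,
/// at the cost of reserving more IDs overall.
fn reserved_total_per_task(tasks: usize, per_task: usize, factor: f64) -> usize {
    tasks * ((per_task as f64 * factor).ceil() as usize)
}

fn main() {
    // 100 tasks, 2 fragments each, factor 1.05:
    // plan-level: ceil(200 * 1.05) = 210, extra 10 IDs all on the last task;
    // per-task:   100 * ceil(2 * 1.05) = 100 * 3 = 300, 1 spare ID per task.
    assert_eq!(reserved_total_plan_level(100, 2, 1.05), 210);
    assert_eq!(reserved_total_per_task(100, 2, 1.05), 300);
}
```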

})
}

fn sum_file_size(fragment: &Fragment) -> usize {
Contributor

I think this is only used once; does it make sense to define a separate function for it?

@hamersaw
Contributor

hamersaw commented Mar 2, 2026

Would also like to comment here praising potential perf improvements of this work for large compactions. I ran a test on compacting a 100k fragment dataset (16 rows per fragment) down to 144 fragments with 24 compactions in parallel. Without this it took 1h53m and with pre-allocating fragment IDs it took 21m45s. So very significant!

@zhangyue19921010
Contributor Author

zhangyue19921010 commented Mar 3, 2026

Hi @hamersaw, thanks a lot for your attention to this issue.
In general, I previously went back and forth on whether to use pre-allocation or post-allocation. My main concerns about post-allocation are:

  • The computation of row_id_map degrades from parallel to serial. With large amounts of data this may become another bottleneck, so for datasets with many rows it might still be necessary to keep the processing parallel.

The hot path is probably the following code, which needs to touch every row and fragment involved in the compaction:

pub fn transpose_row_addrs(

  • It changes the data structure of RewriteResult so that it no longer contains the row_id_map information, which may affect existing users.

Pre-allocation has better compatibility, but it indeed cannot guarantee that fragment IDs are contiguous. I have had a brief discussion with @jackye1995 about this :)

Hi @wjones127, @jackye1995 and @Xuanwo Could you please share some thoughts? Thank you !

Member

@westonpace westonpace left a comment


Here's my two cents:

  • I am not concerned about the cost of the transpose. I do not think this should be significant when compared with all the other work that needs to be done. If needed we can parallelize across threads but even then I wouldn't bother.
  • I am not worried about a few lost fragment ids from an over-estimate. We can have fragment gaps already due to failed compaction and I know we encounter this often in production.
  • I am not worried about breaking changes to RewriteResult. These are fairly temporary and this would only affect a compaction that starts on one version and commits on another which seems rare and unlikely.
  • In general, I think our estimates will probably be pretty good
  • I am slightly worried about out-of-order fragment IDs due to an under-estimate. I just don't know whether we make any implicit assumptions that fragment IDs are always ascending, and bugs from this can be subtle.
  • I do think the post-allocation is slightly less complex.

Given this, I lean slightly towards post-allocation.

let live_rows = bin.row_counts.iter().copied().sum::<usize>();
let total_input_file_size =
bin.fragments.iter().map(sum_file_size).sum::<usize>();
let estimated_output_fragment_count = estimate_output_fragment_num(
Member

Why is this an estimate? Is it because we might hit the size (bytes) limit when writing and end up with more fragments than expected?

pub binary_copy_read_batch_bytes: Option<usize>,
/// Whether to preallocate fragment IDs for a compaction plan.
/// default false
pub prealloc_fragment_ids: bool,
Member

Why would I ever want this set to false?

Collaborator

@Xuanwo Xuanwo left a comment


Thank you @zhangyue19921010 for your work on this! I believe this PR actually combines two aspects: fragment ID allocation and the distribution of remap tasks.

My current idea is that we could first add post-reservation. Then we can adapt this PR to implement the remap-task distribution on top of that approach.

What are your thoughts? @zhangyue19921010 @hamersaw

@zhangyue19921010
Contributor Author

First of all, thanks for all the discussion and attention.

After careful consideration, post-processing is indeed the more concise solution, so I will close this PR. Thanks for all your effort here @hamersaw @westonpace @Xuanwo 👍

@codecov

codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 88.26979% with 40 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/optimize.rs 88.26% 33 Missing and 7 partials ⚠️


@Xuanwo
Collaborator

Xuanwo commented Mar 4, 2026

Thank you @zhangyue19921010 again!
