feat(kmeans): add on-disk training data option to reduce memory usage by eddyxu · Pull Request #6017 · lance-format/lance

eddyxu · 2026-02-25T21:03:30Z

Add on_disk option to KMeansParams that spills the training sample to a memory-mapped temp file instead of keeping it in RAM. This reduces memory pressure during large-scale IVF index training while keeping the core algorithm unchanged.

Changes

Add memmap2 workspace dependency
Add pub on_disk: bool field to KMeansParams (default false, no breaking change) with with_on_disk(bool) builder method
Introduce internal KMeansDataBuffer<T> enum with InMemory and OnDisk variants; both expose training data as &[T] so the core algorithm (membership assignment, centroid update) is unaffected
Write training sample to a NamedTempFile and mmap it when on_disk = true; OS page cache keeps hot pages in memory
Add bench_train_disk_vs_memory benchmark (128-dim, 512 clusters)

Benchmark results (128-dim, 512 clusters, 10 samples)

Variant	Time (median)	Range
in_memory	907.77 ms	889–926 ms
on_disk	920.81 ms	903–938 ms

Overhead: ~1.4% — well within the 20% target. The mmap approach provides near-zero overhead because the OS page cache keeps the training sample in memory across iterations.

Disk space estimate for production workloads

disk_GB ≈ (256 × k × dim × 4) / 1e9

k=65K, dim=1024, float32: ~68 GB temp space
k=262K, dim=1024, float32: ~274 GB temp space

The work is done by GH copilot.

Add `on_disk` option to `KMeansParams` that spills the training sample to a memory-mapped temp file instead of keeping it in RAM. This reduces memory pressure during large-scale IVF index training while keeping the core algorithm unchanged. ## Changes - Add `memmap2` workspace dependency - Add `pub on_disk: bool` field to `KMeansParams` (default `false`, no breaking change) with `with_on_disk(bool)` builder method - Introduce internal `KMeansDataBuffer<T>` enum with `InMemory` and `OnDisk` variants; both expose training data as `&[T]` so the core algorithm (membership assignment, centroid update) is unaffected - Write training sample to a `NamedTempFile` and `mmap` it when `on_disk = true`; OS page cache keeps hot pages in memory - Add `bench_train_disk_vs_memory` benchmark (128-dim, 512 clusters) ## Benchmark results (128-dim, 512 clusters, 10 samples) | Variant | Time (median) | Range | |------------|---------------|------------------| | in_memory | 907.77 ms | 889–926 ms | | on_disk | 920.81 ms | 903–938 ms | Overhead: ~1.4% — well within the 20% target. The mmap approach provides near-zero overhead because the OS page cache keeps the training sample in memory across iterations. ## Disk space estimate for production workloads disk_GB ≈ (256 × k × dim × 4) / 1e9 - k=65K, dim=1024, float32: ~68 GB temp space - k=262K, dim=1024, float32: ~274 GB temp space Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-02-25T21:04:47Z

Code Review

P1: Missing unit tests

The project guidelines state: "We do not merge code without tests." This PR adds benchmarks but no unit tests for the new on_disk functionality.

Please add at least:

A test verifying on_disk: true produces the same centroids as on_disk: false
Consider adding a test for KMeansDataBuffer::on_disk() edge cases (e.g., error handling)

Example test pattern from the existing tests:

#[tokio::test]
async fn test_on_disk_produces_same_results() {
    const DIM: usize = 8;
    const K: usize = 4;
    const NUM_VALUES: usize = 256 * K;
    
    let values = generate_random_array(NUM_VALUES * DIM);
    let fsl = FixedSizeListArray::try_new_from_values(values, DIM as i32).unwrap();
    
    let params_memory = KMeansParams::default().with_on_disk(false);
    let params_disk = KMeansParams::default().with_on_disk(true);
    
    // Use same seed for deterministic comparison
    let kmeans_memory = KMeans::new_with_params(&fsl, K, &params_memory).unwrap();
    let kmeans_disk = KMeans::new_with_params(&fsl, K, &params_disk).unwrap();
    
    // Both should produce valid centroids
    assert_eq!(kmeans_memory.dimension, kmeans_disk.dimension);
    assert_eq!(kmeans_memory.centroids.len(), kmeans_disk.centroids.len());
}

P1: Consider simpler design for `KMeansDataBuffer::InMemory`

The InMemory variant uses a raw *const T pointer with manual unsafe impl Send/Sync, but the buffer is created and consumed within the same function scope in train_kmeans. A simpler approach would be to just keep the reference directly:

enum KMeansDataBuffer<'a, T> {
    InMemory(&'a [T]),
    OnDisk { _file: NamedTempFile, mmap: Mmap, phantom: PhantomData<T> },
}

This avoids raw pointers entirely and lets the compiler verify safety. If lifetime propagation is difficult, an alternative is to use Arc<[T]> for the in-memory case.

codecov · 2026-02-25T21:43:52Z

Codecov Report

❌ Patch coverage is 37.25490% with 32 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/vector/kmeans.rs	37.25%	32 Missing ⚠️

📢 Thoughts on this report? Let us know!

github-actions bot added the enhancement New feature or request label Feb 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kmeans): add on-disk training data option to reduce memory usage#6017

feat(kmeans): add on-disk training data option to reduce memory usage#6017
eddyxu wants to merge 1 commit intomainfrom
feat/kmeans-on-disk-training

eddyxu commented Feb 25, 2026

Uh oh!

github-actions bot commented Feb 25, 2026

Uh oh!

codecov bot commented Feb 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eddyxu commented Feb 25, 2026

Changes

Benchmark results (128-dim, 512 clusters, 10 samples)

Disk space estimate for production workloads

Uh oh!

github-actions bot commented Feb 25, 2026

Code Review

P1: Missing unit tests

P1: Consider simpler design for KMeansDataBuffer::InMemory

Uh oh!

codecov bot commented Feb 25, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

P1: Consider simpler design for `KMeansDataBuffer::InMemory`