Skip to content

feat(kmeans): add on-disk training data option to reduce memory usage#6017

Draft
eddyxu wants to merge 1 commit intomainfrom
feat/kmeans-on-disk-training
Draft

feat(kmeans): add on-disk training data option to reduce memory usage#6017
eddyxu wants to merge 1 commit intomainfrom
feat/kmeans-on-disk-training

Conversation

@eddyxu
Copy link
Member

@eddyxu eddyxu commented Feb 25, 2026

Add on_disk option to KMeansParams that spills the training sample to a memory-mapped temp file instead of keeping it in RAM. This reduces memory pressure during large-scale IVF index training while keeping the core algorithm unchanged.

Changes

  • Add memmap2 workspace dependency
  • Add pub on_disk: bool field to KMeansParams (default false, no breaking change) with with_on_disk(bool) builder method
  • Introduce internal KMeansDataBuffer<T> enum with InMemory and OnDisk variants; both expose training data as &[T] so the core algorithm (membership assignment, centroid update) is unaffected
  • Write training sample to a NamedTempFile and mmap it when on_disk = true; OS page cache keeps hot pages in memory
  • Add bench_train_disk_vs_memory benchmark (128-dim, 512 clusters)

Benchmark results (128-dim, 512 clusters, 10 samples)

Variant Time (median) Range
in_memory 907.77 ms 889–926 ms
on_disk 920.81 ms 903–938 ms

Overhead: ~1.4% — well within the 20% target. The mmap approach provides near-zero overhead because the OS page cache keeps the training sample in memory across iterations.

Disk space estimate for production workloads

disk_GB ≈ (256 × k × dim × 4) / 1e9

  • k=65K, dim=1024, float32: ~68 GB temp space
  • k=262K, dim=1024, float32: ~274 GB temp space

The work is done by GH copilot.

Add `on_disk` option to `KMeansParams` that spills the training sample
to a memory-mapped temp file instead of keeping it in RAM. This reduces
memory pressure during large-scale IVF index training while keeping the
core algorithm unchanged.

## Changes

- Add `memmap2` workspace dependency
- Add `pub on_disk: bool` field to `KMeansParams` (default `false`,
  no breaking change) with `with_on_disk(bool)` builder method
- Introduce internal `KMeansDataBuffer<T>` enum with `InMemory` and
  `OnDisk` variants; both expose training data as `&[T]` so the core
  algorithm (membership assignment, centroid update) is unaffected
- Write training sample to a `NamedTempFile` and `mmap` it when
  `on_disk = true`; OS page cache keeps hot pages in memory
- Add `bench_train_disk_vs_memory` benchmark (128-dim, 512 clusters)

## Benchmark results (128-dim, 512 clusters, 10 samples)

| Variant    | Time (median) | Range            |
|------------|---------------|------------------|
| in_memory  | 907.77 ms     | 889–926 ms       |
| on_disk    | 920.81 ms     | 903–938 ms       |

Overhead: ~1.4% — well within the 20% target. The mmap approach
provides near-zero overhead because the OS page cache keeps the
training sample in memory across iterations.

## Disk space estimate for production workloads

  disk_GB ≈ (256 × k × dim × 4) / 1e9

- k=65K, dim=1024, float32: ~68 GB temp space
- k=262K, dim=1024, float32: ~274 GB temp space

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the enhancement New feature or request label Feb 25, 2026
@github-actions
Copy link
Contributor

Code Review

P1: Missing unit tests

The project guidelines state: "We do not merge code without tests." This PR adds benchmarks but no unit tests for the new on_disk functionality.

Please add at least:

  • A test verifying on_disk: true produces the same centroids as on_disk: false
  • Consider adding a test for KMeansDataBuffer::on_disk() edge cases (e.g., error handling)

Example test pattern from the existing tests:

#[tokio::test]
async fn test_on_disk_produces_same_results() {
    const DIM: usize = 8;
    const K: usize = 4;
    const NUM_VALUES: usize = 256 * K;
    
    let values = generate_random_array(NUM_VALUES * DIM);
    let fsl = FixedSizeListArray::try_new_from_values(values, DIM as i32).unwrap();
    
    let params_memory = KMeansParams::default().with_on_disk(false);
    let params_disk = KMeansParams::default().with_on_disk(true);
    
    // Use same seed for deterministic comparison
    let kmeans_memory = KMeans::new_with_params(&fsl, K, &params_memory).unwrap();
    let kmeans_disk = KMeans::new_with_params(&fsl, K, &params_disk).unwrap();
    
    // Both should produce valid centroids
    assert_eq!(kmeans_memory.dimension, kmeans_disk.dimension);
    assert_eq!(kmeans_memory.centroids.len(), kmeans_disk.centroids.len());
}

P1: Consider simpler design for KMeansDataBuffer::InMemory

The InMemory variant uses a raw *const T pointer with manual unsafe impl Send/Sync, but the buffer is created and consumed within the same function scope in train_kmeans. A simpler approach would be to just keep the reference directly:

enum KMeansDataBuffer<'a, T> {
    InMemory(&'a [T]),
    OnDisk { _file: NamedTempFile, mmap: Mmap, phantom: PhantomData<T> },
}

This avoids raw pointers entirely and lets the compiler verify safety. If lifetime propagation is difficult, an alternative is to use Arc<[T]> for the in-memory case.

@codecov
Copy link

codecov bot commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 37.25490% with 32 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-index/src/vector/kmeans.rs 37.25% 32 Missing ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant