# feat(kmeans): add on-disk training data option to reduce memory usage #6017
Draft
Add `on_disk` option to `KMeansParams` that spills the training sample to a memory-mapped temp file instead of keeping it in RAM. This reduces memory pressure during large-scale IVF index training while keeping the core algorithm unchanged.

## Changes

- Add `memmap2` workspace dependency
- Add `pub on_disk: bool` field to `KMeansParams` (default `false`, no breaking change) with a `with_on_disk(bool)` builder method
- Introduce internal `KMeansDataBuffer<T>` enum with `InMemory` and `OnDisk` variants; both expose training data as `&[T]`, so the core algorithm (membership assignment, centroid update) is unaffected
- Write the training sample to a `NamedTempFile` and `mmap` it when `on_disk = true`; the OS page cache keeps hot pages in memory
- Add `bench_train_disk_vs_memory` benchmark (128-dim, 512 clusters)

## Benchmark results (128-dim, 512 clusters, 10 samples)

| Variant   | Time (median) | Range      |
|-----------|---------------|------------|
| in_memory | 907.77 ms     | 889–926 ms |
| on_disk   | 920.81 ms     | 903–938 ms |

Overhead: ~1.4%, well within the 20% target. The mmap approach adds near-zero overhead because the OS page cache keeps the training sample resident across iterations.

## Disk space estimate for production workloads

disk_GB ≈ (256 × k × dim × 4) / 1e9

- k = 65K, dim = 1024, float32: ~68 GB temp space
- k = 262K, dim = 1024, float32: ~274 GB temp space

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
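The buffer abstraction described above can be sketched roughly as follows. This is a simplified, std-only sketch: the actual PR wraps a `memmap2::Mmap` over a `NamedTempFile` so pages are loaded lazily by the OS page cache, whereas here the `OnDisk` variant simply reads the spill file back into a `Vec<f32>` so the example compiles without external crates. Only `KMeansDataBuffer`, `InMemory`, and `OnDisk` are names from the PR; everything else is illustrative.

```rust
use std::fs::File;
use std::io::{Read, Write};
use std::path::PathBuf;

/// Training-sample buffer: data lives either in RAM or in a spill file.
/// (The PR's OnDisk variant holds a memory map instead of a Vec.)
enum KMeansDataBuffer {
    InMemory(Vec<f32>),
    OnDisk { _path: PathBuf, data: Vec<f32> },
}

impl KMeansDataBuffer {
    /// Spill `sample` to `path` as little-endian f32 bytes, then reopen it.
    fn on_disk(sample: &[f32], path: PathBuf) -> std::io::Result<Self> {
        let mut f = File::create(&path)?;
        for x in sample {
            f.write_all(&x.to_le_bytes())?;
        }
        // Read the spill file back (the real code would mmap it instead,
        // avoiding this second in-memory copy).
        let mut bytes = Vec::new();
        File::open(&path)?.read_to_end(&mut bytes)?;
        let data = bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Ok(KMeansDataBuffer::OnDisk { _path: path, data })
    }

    /// Both variants expose the sample as `&[f32]`, so the core k-means
    /// loop (membership assignment, centroid update) never sees the
    /// difference between the two storage modes.
    fn as_slice(&self) -> &[f32] {
        match self {
            KMeansDataBuffer::InMemory(v) => v.as_slice(),
            KMeansDataBuffer::OnDisk { data, .. } => data.as_slice(),
        }
    }
}
```

Exposing a single `&[T]` view is what keeps the PR's diff small: the training loop is written once against a slice and stays unaware of where the bytes live.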
**Code Review**

**P1: Missing unit tests.** The project guidelines state: "We do not merge code without tests." This PR adds benchmarks but no unit tests for the new code path. Please add at least one such test.
Example test pattern from the existing tests:

```rust
#[tokio::test]
async fn test_on_disk_produces_same_results() {
    const DIM: usize = 8;
    const K: usize = 4;
    const NUM_VALUES: usize = 256 * K;
    let values = generate_random_array(NUM_VALUES * DIM);
    let fsl = FixedSizeListArray::try_new_from_values(values, DIM as i32).unwrap();
    let params_memory = KMeansParams::default().with_on_disk(false);
    let params_disk = KMeansParams::default().with_on_disk(true);
    // Use the same seed for a deterministic comparison
    let kmeans_memory = KMeans::new_with_params(&fsl, K, &params_memory).unwrap();
    let kmeans_disk = KMeans::new_with_params(&fsl, K, &params_disk).unwrap();
    // Both should produce valid centroids
    assert_eq!(kmeans_memory.dimension, kmeans_disk.dimension);
    assert_eq!(kmeans_memory.centroids.len(), kmeans_disk.centroids.len());
}
```

**P1:** Consider simpler design for
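For context, the builder API the PR description and the test above rely on (a `pub on_disk: bool` field on `KMeansParams` defaulting to `false`, plus a `with_on_disk(bool)` setter) would look roughly like this. The `max_iters` field is a placeholder, not taken from the PR:

```rust
/// Illustrative subset of KMeansParams; `on_disk` is the field added
/// by this PR, the other field is a placeholder.
#[derive(Debug, Clone)]
pub struct KMeansParams {
    /// Maximum number of training iterations (placeholder field).
    pub max_iters: u32,
    /// When true, spill the training sample to a memory-mapped temp file.
    pub on_disk: bool,
}

impl Default for KMeansParams {
    fn default() -> Self {
        // on_disk defaults to false, so existing callers are unaffected.
        Self { max_iters: 50, on_disk: false }
    }
}

impl KMeansParams {
    /// Builder-style setter, consuming and returning self for chaining.
    pub fn with_on_disk(mut self, on_disk: bool) -> Self {
        self.on_disk = on_disk;
        self
    }
}
```

A consuming builder method keeps call sites to a single chained expression, e.g. `KMeansParams::default().with_on_disk(true)`.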
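As a quick sanity check of the disk-space estimate from the PR description (a training sample of 256 float32 vectors per centroid, 4 bytes per element), the formula can be written as a small function:

```rust
/// Spill-file size in GB for `k` centroids of dimension `dim`, assuming
/// 256 float32 training vectors are sampled per centroid (4 bytes each).
fn disk_gb(k: u64, dim: u64) -> f64 {
    (256 * k * dim * 4) as f64 / 1e9
}

// k = 65_536,  dim = 1024 -> ~68.7 GB (PR quotes ~68 GB)
// k = 262_144, dim = 1024 -> ~274.9 GB (PR quotes ~274 GB)
```

The quoted "65K"/"262K" figures match the powers of two 65,536 and 262,144 commonly used for IVF partition counts.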
This work was done with GitHub Copilot.