fix: HIGH severity measurement bugs in KV cache benchmark by FileSystemGuy · Pull Request #398 · mlcommons/storage

FileSystemGuy · 2026-05-27T02:54:26Z

Summary

This PR fixes 6 HIGH severity bugs identified in a broad codebase scan (BROAD_SCAN.md) in the KV cache benchmark subsystem. All bugs directly corrupt reported storage metrics — inflating bandwidth, misattributing latency, or dropping data from reports. Changes are targeted fixes with no refactoring.

KVC-1 — Decode path reads full KV block N× per request; inflates bandwidth by N×

File: kv_cache_benchmark/kv_cache/benchmark.py

The batched decode loop called cache.access_cache() num_batched_reads times (= ceil(generate_tokens / decode_batch_size), default batch_size=32). Each call read the entire KV block from storage. In a real LLM decode phase the block is fetched once into GPU SRAM; subsequent steps update it in place — they do not re-read the full block from storage per batch. For 512 generate tokens with the default batch size this read the full block 16 times, inflating total_read_bytes, reported read bandwidth, read IOPS, and storage latency by 16×.

Fix: Replace the loop with a single access_cache() call.

KVC-2 — GPU simulation sleep included in end-to-end latency metric

Files: kv_cache_benchmark/kv_cache/benchmark.py, kv_cache_benchmark/kv_cache/models.py

time.sleep(generation_latency) executed between the last I/O call and request.complete_time. As a result, end_to_end_latencies (derived from total_latency_ms = complete_time - submit_time) conflated true storage I/O latency with synthetic GPU simulation time. E2E latency percentiles were meaningless as a storage metric.

Fix:

Added storage_complete_time: float = 0 field to InferenceRequest, captured immediately before the sleep.
Added storage_e2e_latency_ms property: (storage_complete_time - submit_time) * 1000.
end_to_end_latencies now records storage_e2e_latency_ms (I/O only).
Added system_e2e_latencies to retain the GPU-inclusive view via total_latency_ms.

KVC-3 — Cross-clock skew between benchmark window and request timestamps

File: kv_cache_benchmark/kv_cache/benchmark.py

benchmark_start used time.time() (wall clock) while request.complete_time and request.start_time use time.perf_counter() (monotonic). On long runs an NTP step adjustment could shift time.time() relative to perf_counter, causing actual_duration to diverge from the true elapsed window and producing incorrect throughput calculations.

Fix: Switch benchmark_start and actual_duration in run_benchmark() to time.perf_counter(), matching the clock used by all request timestamps.

KVC-4 — np.load() includes numpy deserialization in reported disk I/O time

File: kv_cache_benchmark/kv_cache/backends.py

np.load(path, allow_pickle=False) performs both disk I/O and numpy array deserialization in a single inseparable call. The measured device_time therefore included numpy header parsing and array reconstruction in addition to the actual read from disk. The subsequent np.array(data) defensive copy was attributed to host_time but is pure CPU overhead with no storage meaning.

Fix: Switch the NVMe backend from .npy to raw binary:

Write: tensor.numpy().tofile(str(path)) — raw bytes, no header overhead.
Read: open(path, 'rb').read() for pure disk I/O (device_time), then np.frombuffer(...).reshape(...) for CPU-only array reconstruction (host_time). Shape and dtype are stored in self.metadata[key].
File extension changed from .npy to .bin throughout.

KVC-5 — storage_tokens_processed undercounts by tensor_parallel factor

File: kv_cache_benchmark/kv_cache/cache.py

num_tokens = entry_size / self.model_config.kv_cache_size_per_token — entry_size is the TP-sharded per-rank size (1/tensor_parallel of the full entry), but kv_cache_size_per_token is the full unsharded value. Dividing a sharded byte count by an unsharded per-token size undercounts storage_tokens_processed by a factor of tensor_parallel for every NVMe read when tensor_parallel > 1.

Fix:

sharded_bytes_per_token = kv_cache_size_per_token / max(1, self.tensor_parallel)
num_tokens = entry_size / sharded_bytes_per_token

KVC-6 — start_time captured at dequeue but never used; no service latency reported

Files: kv_cache_benchmark/kv_cache/benchmark.py, kv_cache_benchmark/kv_cache/models.py

request.start_time was set when a worker dequeued a request (after queue wait time) but was never referenced in any latency metric. There was no way to distinguish I/O service time from queue wait time in the output. Tail latency driven by queue buildup versus actual slow storage was indistinguishable.

Fix:

Added service_latency_ms property to InferenceRequest: (complete_time - start_time) * 1000.
Added 'service_latencies': [] to the results initialization.
Records service_latency_ms in the results-recording block so P50/P99 service latency (dequeue-to-complete, excluding queue wait) appears in final stats.

Test plan

Run existing unit tests: pytest tests/unit -v
Run mlpstorage kvcache run on a small dataset and verify reported bandwidth is not inflated (KVC-1: previously 16× over for 512 token requests with default batch_size)
Verify end_to_end_latencies in results no longer includes generation sleep time (KVC-2)
Verify service_latencies section appears in KV cache benchmark output (KVC-6)
Verify NVMe backend reads/writes .bin files and results match before/after (KVC-4)

…sk I/O time np.load() on a .npy file performs both disk I/O and numpy array deserialization in a single inseparable call, so the measured device_time included numpy header parsing and array reconstruction as well as the actual read from disk. The subsequent np.array(data) defensive copy was attributed to host_time but is pure CPU overhead with no storage meaning. Neither component cleanly isolated true disk I/O latency. Fixed by storing and loading raw binary data instead of .npy files: Write path: replaced np.save(path, tensor) with tensor.numpy().tofile(str(path)). Shape and dtype are stored in self.metadata[key] (already maintained for .npy, now also used for reconstruction). Read path: replaced np.load(path) with: with open(path, 'rb') as f: raw = f.read() # pure disk I/O — device_time data = np.frombuffer(raw, dtype=...).reshape(...) # CPU only — host_time device_time now measures only the kernel read syscall(s), not numpy deserialization. host_time measures only array reconstruction. The defensive np.array(data) copy is removed since np.frombuffer already returns a new array. File extension changed from .npy to .bin throughout (path helper, __init__ glob pattern for startup cleanup).

…ding In access_cache(), storage_tokens_processed was incremented using: num_tokens = entry_size / self.model_config.kv_cache_size_per_token entry_size is the TP-sharded per-rank size (1/tensor_parallel of the full entry), while kv_cache_size_per_token is the full unsharded per-token size. Dividing a sharded byte count by an unsharded per-token size undercounts storage_tokens_processed by a factor of tensor_parallel for every NVMe read when tensor_parallel > 1. Fixed by computing the sharded bytes-per-token before dividing: sharded_bytes_per_token = kv_cache_size_per_token / max(1, self.tensor_parallel) num_tokens = entry_size / sharded_bytes_per_token self.tensor_parallel is already an instance attribute set in __init__. max(1, ...) guards against a zero or missing tensor_parallel value.

…o InferenceRequest KVC-2 — GPU simulation sleep conflated with storage latency in E2E metric complete_time was set after time.sleep(generation_latency), so total_latency_ms (complete_time - submit_time) included simulated GPU time alongside true storage I/O time. The end_to_end_latencies list therefore mixed I/O latency with synthetic GPU delay, making E2E percentiles meaningless as a storage metric. Added storage_complete_time: float = 0 field, captured in benchmark.py immediately before the generation sleep. Added: @Property storage_e2e_latency_ms -> (storage_complete_time - submit_time) * 1000 end_to_end_latencies now uses storage_e2e_latency_ms (pure I/O E2E). system_e2e_latencies (added in benchmark.py) retains total_latency_ms for the GPU-inclusive view. KVC-6 — start_time captured but never used; no service latency metric available start_time was set when a worker dequeued a request (after queue wait) but was never referenced in any latency calculation. There was no way to distinguish I/O service time from queue wait time in the output. Added: @Property service_latency_ms -> (complete_time - start_time) * 1000 service_latency_ms represents the time from dequeue to completion, excluding queue wait. benchmark.py records this into service_latencies (added alongside end_to_end_latencies) so P50/P99 service latency appears in final stats, giving operators visibility into whether tail latency is driven by queue buildup or actual I/O service time.

…he benchmark.py KVC-1 — Decode path reads full KV block N times; inflates bandwidth by N× The batched decode loop called cache.access_cache() num_batched_reads times (= ceil(generate_tokens / decode_batch_size), default batch_size=32), reading the entire KV block from storage on each iteration. In a real LLM decode phase each step appends one token; the block is read once to bring it into GPU SRAM, not re-read per batch. For 512 generate tokens with the default batch_size=32 this read the full block 16 times, inflating total_read_bytes, read bandwidth, read IOPS, and storage latency by 16×. Fixed by replacing the loop with a single access_cache() call. KVC-2 — GPU simulation sleep included in end_to_end_latencies time.sleep(generation_latency) executed between the last I/O call and request.complete_time, so end_to_end_latencies contained storage I/O latency plus simulated GPU time. Captured request.storage_complete_time immediately before the sleep. end_to_end_latencies now uses request.storage_e2e_latency_ms (I/O only). Added system_e2e_latencies initialized alongside end_to_end_latencies to retain the full GPU-inclusive view via request.total_latency_ms. KVC-3 — Cross-clock skew between benchmark window and request timestamps benchmark_start used time.time() while request.complete_time and request.start_time use time.perf_counter(). On long runs an NTP step adjustment could shift time.time() relative to perf_counter, causing actual_duration to diverge from the true elapsed window and producing inconsistent throughput calculations. Fixed by switching benchmark_start and actual_duration to time.perf_counter() throughout run_benchmark(), matching the clock used by all request timestamps. KVC-6 — start_time recorded but no service latency reported request.start_time was set at dequeue time but never used in any metric. There was no way to distinguish I/O service time from queue wait time in the output. Added 'service_latencies': [] to the results initialization block and appended request.service_latency_ms in the results-recording section so P50/P99 service latency (dequeue-to- complete, excluding queue wait) appears in final stats.

github-actions · 2026-05-27T02:54:34Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

dslik · 2026-05-27T15:38:24Z

I do not think the logic behind KVC-1 is correct.

Metrics from KVC-2, KVC-5, KVC-6 are not used in MLPerf 3.0.

KVC-4 needs more analysis.

These code changes need a more detailed analysis (ideally by the code author) and also requires additional testing.

hazemawadalla · 2026-05-27T17:00:26Z

hey Curtis, please give me a chance to review these closely

FileSystemGuy added 4 commits May 26, 2026 19:52

FileSystemGuy requested a review from a team May 27, 2026 02:54

hazemawadalla self-assigned this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: HIGH severity measurement bugs in KV cache benchmark#398

fix: HIGH severity measurement bugs in KV cache benchmark#398
FileSystemGuy wants to merge 4 commits into
mainfrom
FileSystemGuy-KVC-high

FileSystemGuy commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

dslik commented May 27, 2026

Uh oh!

hazemawadalla commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

FileSystemGuy commented May 27, 2026

Summary

KVC-1 — Decode path reads full KV block N× per request; inflates bandwidth by N×

KVC-2 — GPU simulation sleep included in end-to-end latency metric

KVC-3 — Cross-clock skew between benchmark window and request timestamps

KVC-4 — np.load() includes numpy deserialization in reported disk I/O time

KVC-5 — storage_tokens_processed undercounts by tensor_parallel factor

KVC-6 — start_time captured at dequeue but never used; no service latency reported

Test plan

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

dslik commented May 27, 2026

Uh oh!

hazemawadalla commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants