Skip to content

fix: HIGH severity measurement bugs in KV cache benchmark#398

Open
FileSystemGuy wants to merge 4 commits into
mainfrom
FileSystemGuy-KVC-high
Open

fix: HIGH severity measurement bugs in KV cache benchmark#398
FileSystemGuy wants to merge 4 commits into
mainfrom
FileSystemGuy-KVC-high

Conversation

@FileSystemGuy
Copy link
Copy Markdown
Contributor

Summary

This PR fixes 6 HIGH severity bugs identified in a broad codebase scan (BROAD_SCAN.md) in the KV cache benchmark subsystem. All bugs directly corrupt reported storage metrics — inflating bandwidth, misattributing latency, or dropping data from reports. Changes are targeted fixes with no refactoring.


KVC-1 — Decode path reads full KV block N× per request; inflates bandwidth by N×

File: kv_cache_benchmark/kv_cache/benchmark.py

The batched decode loop called cache.access_cache() num_batched_reads times (= ceil(generate_tokens / decode_batch_size), default batch_size=32). Each call read the entire KV block from storage. In a real LLM decode phase the block is fetched once into GPU SRAM; subsequent steps update it in place — they do not re-read the full block from storage per batch. For 512 generate tokens with the default batch size this read the full block 16 times, inflating total_read_bytes, reported read bandwidth, read IOPS, and storage latency by 16×.

Fix: Replace the loop with a single access_cache() call.


KVC-2 — GPU simulation sleep included in end-to-end latency metric

Files: kv_cache_benchmark/kv_cache/benchmark.py, kv_cache_benchmark/kv_cache/models.py

time.sleep(generation_latency) executed between the last I/O call and request.complete_time. As a result, end_to_end_latencies (derived from total_latency_ms = complete_time - submit_time) conflated true storage I/O latency with synthetic GPU simulation time. E2E latency percentiles were meaningless as a storage metric.

Fix:

  • Added storage_complete_time: float = 0 field to InferenceRequest, captured immediately before the sleep.
  • Added storage_e2e_latency_ms property: (storage_complete_time - submit_time) * 1000.
  • end_to_end_latencies now records storage_e2e_latency_ms (I/O only).
  • Added system_e2e_latencies to retain the GPU-inclusive view via total_latency_ms.

KVC-3 — Cross-clock skew between benchmark window and request timestamps

File: kv_cache_benchmark/kv_cache/benchmark.py

benchmark_start used time.time() (wall clock) while request.complete_time and request.start_time use time.perf_counter() (monotonic). On long runs an NTP step adjustment could shift time.time() relative to perf_counter, causing actual_duration to diverge from the true elapsed window and producing incorrect throughput calculations.

Fix: Switch benchmark_start and actual_duration in run_benchmark() to time.perf_counter(), matching the clock used by all request timestamps.


KVC-4 — np.load() includes numpy deserialization in reported disk I/O time

File: kv_cache_benchmark/kv_cache/backends.py

np.load(path, allow_pickle=False) performs both disk I/O and numpy array deserialization in a single inseparable call. The measured device_time therefore included numpy header parsing and array reconstruction in addition to the actual read from disk. The subsequent np.array(data) defensive copy was attributed to host_time but is pure CPU overhead with no storage meaning.

Fix: Switch the NVMe backend from .npy to raw binary:

  • Write: tensor.numpy().tofile(str(path)) — raw bytes, no header overhead.
  • Read: open(path, 'rb').read() for pure disk I/O (device_time), then np.frombuffer(...).reshape(...) for CPU-only array reconstruction (host_time). Shape and dtype are stored in self.metadata[key].
  • File extension changed from .npy to .bin throughout.

KVC-5 — storage_tokens_processed undercounts by tensor_parallel factor

File: kv_cache_benchmark/kv_cache/cache.py

num_tokens = entry_size / self.model_config.kv_cache_size_per_tokenentry_size is the TP-sharded per-rank size (1/tensor_parallel of the full entry), but kv_cache_size_per_token is the full unsharded value. Dividing a sharded byte count by an unsharded per-token size undercounts storage_tokens_processed by a factor of tensor_parallel for every NVMe read when tensor_parallel > 1.

Fix:

sharded_bytes_per_token = kv_cache_size_per_token / max(1, self.tensor_parallel)
num_tokens = entry_size / sharded_bytes_per_token

KVC-6 — start_time captured at dequeue but never used; no service latency reported

Files: kv_cache_benchmark/kv_cache/benchmark.py, kv_cache_benchmark/kv_cache/models.py

request.start_time was set when a worker dequeued a request (after queue wait time) but was never referenced in any latency metric. There was no way to distinguish I/O service time from queue wait time in the output. Tail latency driven by queue buildup versus actual slow storage was indistinguishable.

Fix:

  • Added service_latency_ms property to InferenceRequest: (complete_time - start_time) * 1000.
  • Added 'service_latencies': [] to the results initialization.
  • Records service_latency_ms in the results-recording block so P50/P99 service latency (dequeue-to-complete, excluding queue wait) appears in final stats.

Test plan

  • Run existing unit tests: pytest tests/unit -v
  • Run mlpstorage kvcache run on a small dataset and verify reported bandwidth is not inflated (KVC-1: previously 16× over for 512 token requests with default batch_size)
  • Verify end_to_end_latencies in results no longer includes generation sleep time (KVC-2)
  • Verify service_latencies section appears in KV cache benchmark output (KVC-6)
  • Verify NVMe backend reads/writes .bin files and results match before/after (KVC-4)

…sk I/O time

np.load() on a .npy file performs both disk I/O and numpy array
deserialization in a single inseparable call, so the measured device_time
included numpy header parsing and array reconstruction as well as the
actual read from disk. The subsequent np.array(data) defensive copy was
attributed to host_time but is pure CPU overhead with no storage meaning.
Neither component cleanly isolated true disk I/O latency.

Fixed by storing and loading raw binary data instead of .npy files:

Write path: replaced np.save(path, tensor) with tensor.numpy().tofile(str(path)).
Shape and dtype are stored in self.metadata[key] (already maintained
for .npy, now also used for reconstruction).

Read path: replaced np.load(path) with:
  with open(path, 'rb') as f:
      raw = f.read()           # pure disk I/O — device_time
  data = np.frombuffer(raw, dtype=...).reshape(...)   # CPU only — host_time

device_time now measures only the kernel read syscall(s), not numpy
deserialization. host_time measures only array reconstruction. The
defensive np.array(data) copy is removed since np.frombuffer already
returns a new array.

File extension changed from .npy to .bin throughout (path helper,
__init__ glob pattern for startup cleanup).
…ding

In access_cache(), storage_tokens_processed was incremented using:
  num_tokens = entry_size / self.model_config.kv_cache_size_per_token

entry_size is the TP-sharded per-rank size (1/tensor_parallel of the
full entry), while kv_cache_size_per_token is the full unsharded
per-token size. Dividing a sharded byte count by an unsharded per-token
size undercounts storage_tokens_processed by a factor of tensor_parallel
for every NVMe read when tensor_parallel > 1.

Fixed by computing the sharded bytes-per-token before dividing:
  sharded_bytes_per_token = kv_cache_size_per_token / max(1, self.tensor_parallel)
  num_tokens = entry_size / sharded_bytes_per_token

self.tensor_parallel is already an instance attribute set in __init__.
max(1, ...) guards against a zero or missing tensor_parallel value.
…o InferenceRequest

KVC-2 — GPU simulation sleep conflated with storage latency in E2E metric
complete_time was set after time.sleep(generation_latency), so
total_latency_ms (complete_time - submit_time) included simulated GPU
time alongside true storage I/O time. The end_to_end_latencies list
therefore mixed I/O latency with synthetic GPU delay, making E2E
percentiles meaningless as a storage metric.

Added storage_complete_time: float = 0 field, captured in
benchmark.py immediately before the generation sleep. Added:
  @Property storage_e2e_latency_ms -> (storage_complete_time - submit_time) * 1000
end_to_end_latencies now uses storage_e2e_latency_ms (pure I/O E2E).
system_e2e_latencies (added in benchmark.py) retains total_latency_ms
for the GPU-inclusive view.

KVC-6 — start_time captured but never used; no service latency metric available
start_time was set when a worker dequeued a request (after queue wait)
but was never referenced in any latency calculation. There was no way
to distinguish I/O service time from queue wait time in the output.

Added:
  @Property service_latency_ms -> (complete_time - start_time) * 1000
service_latency_ms represents the time from dequeue to completion,
excluding queue wait. benchmark.py records this into service_latencies
(added alongside end_to_end_latencies) so P50/P99 service latency
appears in final stats, giving operators visibility into whether tail
latency is driven by queue buildup or actual I/O service time.
…he benchmark.py

KVC-1 — Decode path reads full KV block N times; inflates bandwidth by N×
The batched decode loop called cache.access_cache() num_batched_reads
times (= ceil(generate_tokens / decode_batch_size), default batch_size=32),
reading the entire KV block from storage on each iteration. In a real
LLM decode phase each step appends one token; the block is read once to
bring it into GPU SRAM, not re-read per batch. For 512 generate tokens
with the default batch_size=32 this read the full block 16 times,
inflating total_read_bytes, read bandwidth, read IOPS, and storage
latency by 16×. Fixed by replacing the loop with a single access_cache()
call.

KVC-2 — GPU simulation sleep included in end_to_end_latencies
time.sleep(generation_latency) executed between the last I/O call and
request.complete_time, so end_to_end_latencies contained storage I/O
latency plus simulated GPU time. Captured request.storage_complete_time
immediately before the sleep. end_to_end_latencies now uses
request.storage_e2e_latency_ms (I/O only). Added system_e2e_latencies
initialized alongside end_to_end_latencies to retain the full
GPU-inclusive view via request.total_latency_ms.

KVC-3 — Cross-clock skew between benchmark window and request timestamps
benchmark_start used time.time() while request.complete_time and
request.start_time use time.perf_counter(). On long runs an NTP step
adjustment could shift time.time() relative to perf_counter, causing
actual_duration to diverge from the true elapsed window and producing
inconsistent throughput calculations. Fixed by switching
benchmark_start and actual_duration to time.perf_counter() throughout
run_benchmark(), matching the clock used by all request timestamps.

KVC-6 — start_time recorded but no service latency reported
request.start_time was set at dequeue time but never used in any
metric. There was no way to distinguish I/O service time from queue
wait time in the output. Added 'service_latencies': [] to the results
initialization block and appended request.service_latency_ms in the
results-recording section so P50/P99 service latency (dequeue-to-
complete, excluding queue wait) appears in final stats.
@FileSystemGuy FileSystemGuy requested a review from a team May 27, 2026 02:54
@github-actions
Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@dslik
Copy link
Copy Markdown
Contributor

dslik commented May 27, 2026

I do not think the logic behind KVC-1 is correct.

Metrics from KVC-2, KVC-5, KVC-6 are not used in MLPerf 3.0.

KVC-4 needs more analysis.

These code changes need a more detailed analysis (ideally by the code author) and also requires additional testing.

@hazemawadalla
Copy link
Copy Markdown
Contributor

hey Curtis, please give me a chance to review these closely

@hazemawadalla hazemawadalla self-assigned this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants