fix: HIGH severity measurement bugs in KV cache benchmark#398
Open
FileSystemGuy wants to merge 4 commits into
Open
fix: HIGH severity measurement bugs in KV cache benchmark#398FileSystemGuy wants to merge 4 commits into
FileSystemGuy wants to merge 4 commits into
Conversation
…sk I/O time
np.load() on a .npy file performs both disk I/O and numpy array
deserialization in a single inseparable call, so the measured device_time
included numpy header parsing and array reconstruction as well as the
actual read from disk. The subsequent np.array(data) defensive copy was
attributed to host_time but is pure CPU overhead with no storage meaning.
Neither component cleanly isolated true disk I/O latency.
Fixed by storing and loading raw binary data instead of .npy files:
Write path: replaced np.save(path, tensor) with tensor.numpy().tofile(str(path)).
Shape and dtype are stored in self.metadata[key] (already maintained
for .npy, now also used for reconstruction).
Read path: replaced np.load(path) with:
with open(path, 'rb') as f:
raw = f.read() # pure disk I/O — device_time
data = np.frombuffer(raw, dtype=...).reshape(...) # CPU only — host_time
device_time now measures only the kernel read syscall(s), not numpy
deserialization. host_time measures only array reconstruction. The
defensive np.array(data) copy is removed since np.frombuffer already
returns a new array.
File extension changed from .npy to .bin throughout (path helper,
__init__ glob pattern for startup cleanup).
…ding In access_cache(), storage_tokens_processed was incremented using: num_tokens = entry_size / self.model_config.kv_cache_size_per_token entry_size is the TP-sharded per-rank size (1/tensor_parallel of the full entry), while kv_cache_size_per_token is the full unsharded per-token size. Dividing a sharded byte count by an unsharded per-token size undercounts storage_tokens_processed by a factor of tensor_parallel for every NVMe read when tensor_parallel > 1. Fixed by computing the sharded bytes-per-token before dividing: sharded_bytes_per_token = kv_cache_size_per_token / max(1, self.tensor_parallel) num_tokens = entry_size / sharded_bytes_per_token self.tensor_parallel is already an instance attribute set in __init__. max(1, ...) guards against a zero or missing tensor_parallel value.
…o InferenceRequest KVC-2 — GPU simulation sleep conflated with storage latency in E2E metric complete_time was set after time.sleep(generation_latency), so total_latency_ms (complete_time - submit_time) included simulated GPU time alongside true storage I/O time. The end_to_end_latencies list therefore mixed I/O latency with synthetic GPU delay, making E2E percentiles meaningless as a storage metric. Added storage_complete_time: float = 0 field, captured in benchmark.py immediately before the generation sleep. Added: @Property storage_e2e_latency_ms -> (storage_complete_time - submit_time) * 1000 end_to_end_latencies now uses storage_e2e_latency_ms (pure I/O E2E). system_e2e_latencies (added in benchmark.py) retains total_latency_ms for the GPU-inclusive view. KVC-6 — start_time captured but never used; no service latency metric available start_time was set when a worker dequeued a request (after queue wait) but was never referenced in any latency calculation. There was no way to distinguish I/O service time from queue wait time in the output. Added: @Property service_latency_ms -> (complete_time - start_time) * 1000 service_latency_ms represents the time from dequeue to completion, excluding queue wait. benchmark.py records this into service_latencies (added alongside end_to_end_latencies) so P50/P99 service latency appears in final stats, giving operators visibility into whether tail latency is driven by queue buildup or actual I/O service time.
…he benchmark.py KVC-1 — Decode path reads full KV block N times; inflates bandwidth by N× The batched decode loop called cache.access_cache() num_batched_reads times (= ceil(generate_tokens / decode_batch_size), default batch_size=32), reading the entire KV block from storage on each iteration. In a real LLM decode phase each step appends one token; the block is read once to bring it into GPU SRAM, not re-read per batch. For 512 generate tokens with the default batch_size=32 this read the full block 16 times, inflating total_read_bytes, read bandwidth, read IOPS, and storage latency by 16×. Fixed by replacing the loop with a single access_cache() call. KVC-2 — GPU simulation sleep included in end_to_end_latencies time.sleep(generation_latency) executed between the last I/O call and request.complete_time, so end_to_end_latencies contained storage I/O latency plus simulated GPU time. Captured request.storage_complete_time immediately before the sleep. end_to_end_latencies now uses request.storage_e2e_latency_ms (I/O only). Added system_e2e_latencies initialized alongside end_to_end_latencies to retain the full GPU-inclusive view via request.total_latency_ms. KVC-3 — Cross-clock skew between benchmark window and request timestamps benchmark_start used time.time() while request.complete_time and request.start_time use time.perf_counter(). On long runs an NTP step adjustment could shift time.time() relative to perf_counter, causing actual_duration to diverge from the true elapsed window and producing inconsistent throughput calculations. Fixed by switching benchmark_start and actual_duration to time.perf_counter() throughout run_benchmark(), matching the clock used by all request timestamps. KVC-6 — start_time recorded but no service latency reported request.start_time was set at dequeue time but never used in any metric. There was no way to distinguish I/O service time from queue wait time in the output. Added 'service_latencies': [] to the results initialization block and appended request.service_latency_ms in the results-recording section so P50/P99 service latency (dequeue-to- complete, excluding queue wait) appears in final stats.
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Contributor
|
I do not think the logic behind KVC-1 is correct. Metrics from KVC-2, KVC-5, KVC-6 are not used in MLPerf 3.0. KVC-4 needs more analysis. These code changes need a more detailed analysis (ideally by the code author) and also requires additional testing. |
Contributor
|
hey Curtis, please give me a chance to review these closely |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes 6 HIGH severity bugs identified in a broad codebase scan (BROAD_SCAN.md) in the KV cache benchmark subsystem. All bugs directly corrupt reported storage metrics — inflating bandwidth, misattributing latency, or dropping data from reports. Changes are targeted fixes with no refactoring.
KVC-1 — Decode path reads full KV block N× per request; inflates bandwidth by N×
File:
kv_cache_benchmark/kv_cache/benchmark.pyThe batched decode loop called
cache.access_cache()num_batched_readstimes (=ceil(generate_tokens / decode_batch_size), defaultbatch_size=32). Each call read the entire KV block from storage. In a real LLM decode phase the block is fetched once into GPU SRAM; subsequent steps update it in place — they do not re-read the full block from storage per batch. For 512 generate tokens with the default batch size this read the full block 16 times, inflatingtotal_read_bytes, reported read bandwidth, read IOPS, and storage latency by 16×.Fix: Replace the loop with a single
access_cache()call.KVC-2 — GPU simulation sleep included in end-to-end latency metric
Files:
kv_cache_benchmark/kv_cache/benchmark.py,kv_cache_benchmark/kv_cache/models.pytime.sleep(generation_latency)executed between the last I/O call andrequest.complete_time. As a result,end_to_end_latencies(derived fromtotal_latency_ms = complete_time - submit_time) conflated true storage I/O latency with synthetic GPU simulation time. E2E latency percentiles were meaningless as a storage metric.Fix:
storage_complete_time: float = 0field toInferenceRequest, captured immediately before the sleep.storage_e2e_latency_msproperty:(storage_complete_time - submit_time) * 1000.end_to_end_latenciesnow recordsstorage_e2e_latency_ms(I/O only).system_e2e_latenciesto retain the GPU-inclusive view viatotal_latency_ms.KVC-3 — Cross-clock skew between benchmark window and request timestamps
File:
kv_cache_benchmark/kv_cache/benchmark.pybenchmark_startusedtime.time()(wall clock) whilerequest.complete_timeandrequest.start_timeusetime.perf_counter()(monotonic). On long runs an NTP step adjustment could shifttime.time()relative toperf_counter, causingactual_durationto diverge from the true elapsed window and producing incorrect throughput calculations.Fix: Switch
benchmark_startandactual_durationinrun_benchmark()totime.perf_counter(), matching the clock used by all request timestamps.KVC-4 — np.load() includes numpy deserialization in reported disk I/O time
File:
kv_cache_benchmark/kv_cache/backends.pynp.load(path, allow_pickle=False)performs both disk I/O and numpy array deserialization in a single inseparable call. The measureddevice_timetherefore included numpy header parsing and array reconstruction in addition to the actual read from disk. The subsequentnp.array(data)defensive copy was attributed tohost_timebut is pure CPU overhead with no storage meaning.Fix: Switch the NVMe backend from
.npyto raw binary:tensor.numpy().tofile(str(path))— raw bytes, no header overhead.open(path, 'rb').read()for pure disk I/O (device_time), thennp.frombuffer(...).reshape(...)for CPU-only array reconstruction (host_time). Shape and dtype are stored inself.metadata[key]..npyto.binthroughout.KVC-5 — storage_tokens_processed undercounts by tensor_parallel factor
File:
kv_cache_benchmark/kv_cache/cache.pynum_tokens = entry_size / self.model_config.kv_cache_size_per_token—entry_sizeis the TP-sharded per-rank size (1/tensor_parallelof the full entry), butkv_cache_size_per_tokenis the full unsharded value. Dividing a sharded byte count by an unsharded per-token size undercountsstorage_tokens_processedby a factor oftensor_parallelfor every NVMe read whentensor_parallel > 1.Fix:
KVC-6 — start_time captured at dequeue but never used; no service latency reported
Files:
kv_cache_benchmark/kv_cache/benchmark.py,kv_cache_benchmark/kv_cache/models.pyrequest.start_timewas set when a worker dequeued a request (after queue wait time) but was never referenced in any latency metric. There was no way to distinguish I/O service time from queue wait time in the output. Tail latency driven by queue buildup versus actual slow storage was indistinguishable.Fix:
service_latency_msproperty toInferenceRequest:(complete_time - start_time) * 1000.'service_latencies': []to the results initialization.service_latency_msin the results-recording block so P50/P99 service latency (dequeue-to-complete, excluding queue wait) appears in final stats.Test plan
pytest tests/unit -vmlpstorage kvcache runon a small dataset and verify reported bandwidth is not inflated (KVC-1: previously 16× over for 512 token requests with default batch_size)end_to_end_latenciesin results no longer includes generation sleep time (KVC-2)service_latenciessection appears in KV cache benchmark output (KVC-6).binfiles and results match before/after (KVC-4)