fix: HIGH severity measurement and correctness bugs in vector DB benchmark by FileSystemGuy · Pull Request #399 · mlcommons/storage

FileSystemGuy · 2026-05-27T02:56:58Z

Summary

This PR fixes 4 HIGH severity bugs identified in a broad codebase scan (BROAD_SCAN.md) in the vector database benchmark subsystem. The bugs produce fabricated latency distributions, silently skip correctness checks, perturb storage state during timed runs, and invalidate cross-backend QPS comparisons.

VDB-1 — P50/P90/P99 latency percentiles fabricated when batch_size > 1

File: vdb_benchmark/vdbbench/benchmark/search_runner.py

Wall-clock time for an entire batch was divided by batch_n to produce per_query_ms, and that same constant was appended once per query. All queries in a batch received an identical latency value, collapsing the distribution: P50 = P99 = average batch time / batch_size. True tail latency was invisible. The default search_batch_size is 1 so this was inactive at defaults, but the enhanced_bench.py multi-process path uses batched queries.

Fix:

Renamed all_latencies → batch_latencies; records one elapsed_ms per batch instead of N identical averages.
P50/P90/P99 are now labelled as batch latency percentiles.
Mean per-query latency (elapsed_ms / batch_n) is tracked as a separate scalar.
interval_query_count tracks actual query count separately so QPS calculations remain correct.

VDB-2 — flat_index ground-truth rebuild path was permanently dead code

File: vdb_benchmark/vdbbench/benchmark/orchestrator.py

The guard if self.cfg.truth_mode == "flat_index" and self.truth_table is None: could never be True: self.truth_table was unconditionally assigned from the .npz file three lines earlier. The flat_index path — intended to rebuild ground truth live from the actual index state as a correctness check against stale precomputed files — never executed. Users in flat_index mode silently received recall computed against potentially stale or incorrect precomputed ground truth.

Fix: Restructured to trigger the rebuild unconditionally when truth_mode == "flat_index" before any truth_table assignment. The .npz load is skipped entirely in this mode. The file-existence guard is updated to only require gtpath when truth_mode != "flat_index".

VDB-3 — row_count() triggered full Milvus flush on every call

File: vdb_benchmark/vdbbench/benchmark/backends/milvus/backend.py

row_count() called col.flush() — a synchronous network + disk operation that seals all in-memory segments and persists them to object storage — on every invocation. row_count() is called from monitoring loops and admin introspection paths during benchmarking, so any such call triggered an unintended flush mid-run, perturbing storage state and potentially inflating or deflating subsequent I/O latency measurements.

Fix: Removed col.flush() from row_count(). Added explicit flush_collection(name) method for intentional pre-benchmark flushing. Callers that relied on the implicit flush side effect should call flush_collection() explicitly before row_count().

VDB-4 — Elasticsearch uses serial HTTP loop; cross-backend QPS comparisons invalid

File: vdb_benchmark/vdbbench/benchmark/backends/elasticsearch/backend.py

The Elasticsearch search() method sent one HTTP request per query vector in a serial for loop. Milvus accepts batched queries natively via a single RPC. SearchRunner measures wall-clock time for the entire backend.search() call and divides by batch_size for QPS. For Elasticsearch that wall time included a full HTTP round-trip (DNS + TCP + TLS + server + response) per query, making Elasticsearch appear far slower than Milvus for any batch_size > 1. Cross-backend QPS comparisons were invalid.

Fix: Rewrote search() to use the Elasticsearch _msearch (multi-search) API when len(query_vectors) > 1:

Single-vector queries use the existing fast path.
Multi-vector queries build an _msearch request body (one header + knn body per query vector) and send it in a single HTTP call, unpacking per-query hit lists from response["responses"].

Test plan

Run existing unit tests: pytest tests/unit -v
Run a batched VDB search (search_batch_size > 1) and verify P50 ≠ P99 in latency output (VDB-1)
Configure truth_mode = flat_index and verify ground truth is rebuilt from the live index rather than the precomputed .npz (VDB-2)
Verify row_count() calls during a Milvus benchmark run do not trigger unexpected flush operations (VDB-3)
Run an Elasticsearch search with batch_size > 1 and verify a single _msearch request is issued rather than N serial requests (VDB-4)

…ery distribution When batch_size > 1, the wall-clock time for an entire batch was divided by batch_n to produce per_query_ms, and that same constant was appended once per query in the batch. All queries in a batch received an identical latency value, so the latency distribution collapsed: P50 = P99 = average batch time / batch_size. True tail latency within a batch was invisible. The default search_batch_size is 1 so this was inactive at defaults, but the enhanced_bench.py multi-process path uses batched queries. Fixed by recording one elapsed_ms entry per batch instead of N identical averages: - Renamed all_latencies to batch_latencies throughout run_timed_search. - Replaced the per-query append loop with a single batch_latencies.append(elapsed_ms). - interval_latencies likewise records the full batch elapsed time. - P50/P90/P99 in stats output are now labelled as batch latency percentiles. - Mean per-query latency (elapsed_ms / batch_n) is tracked separately as a scalar so it remains reportable without implying a false distribution. - interval_query_count tracks actual query count separately from len(batch_latencies) so QPS calculations remain correct. This eliminates fabricated precision: operators see honest batch-level percentiles and a scalar mean per-query estimate rather than a statistically invalid uniform distribution.

…ntly dead code The condition guarding the flat_index rebuild was: if self.cfg.truth_mode == "flat_index" and self.truth_table is None: self.truth_table was unconditionally assigned from the .npz file three lines earlier, so self.truth_table is None could never be True at that point. The flat_index path — intended as a live correctness check by rebuilding ground truth from the actual index state rather than trusting a precomputed file — was permanently dead code. Users in flat_index mode silently received recall computed against potentially stale or incorrect precomputed ground truth with no rebuild ever attempted. Fixed by restructuring the block so flat_index mode triggers the rebuild unconditionally before any truth_table assignment: if self.cfg.truth_mode == "flat_index": self.truth_table = build_truth_from_flat(...) # always rebuild else: gt = np.load(gtpath) self.truth_table = gt["truth_table"] # use precomputed The preceding file-existence guard is updated to only require gtpath when truth_mode != "flat_index", avoiding a spurious FileNotFoundError when operating in flat_index mode without a precomputed .npz present.

…h_collection() row_count() called col.flush() on every invocation. flush() is a synchronous network + disk operation that forces Milvus to seal all in-memory segments and persist them to object storage. row_count() is called from get_collection_stats() and get_collection_info(), both of which can be invoked during monitoring loops or admin introspection at any point during a benchmarking session. Any such call triggered an unintended Milvus flush, perturbing storage state mid-benchmark and potentially inflating or deflating subsequent I/O latency measurements depending on the flush timing relative to active queries. Fixed by removing col.flush() from row_count() so it reads col.num_entities directly without side effects. Added an explicit flush_collection(name) method that callers can invoke intentionally (e.g. after bulk insert, before starting a timed benchmark run). Callers that previously relied on the implicit flush side effect of row_count() should be updated to call flush_collection() explicitly before calling row_count().

…nd QPS validity The Elasticsearch search() method sent one HTTP request per query vector in a serial for loop. Milvus accepts batched queries natively via a single RPC. SearchRunner measures wall-clock time for the entire backend.search() call and divides by batch_size to compute QPS. For Elasticsearch, that wall time included a full HTTP round-trip per query (DNS + TCP + TLS resumption + server processing + response), making Elasticsearch appear orders of magnitude slower than Milvus for any batch_size > 1. Cross- backend QPS comparisons using this framework were therefore invalid. Fixed by rewriting search() to use the Elasticsearch _msearch (multi- search) API when len(query_vectors) > 1: - Single-vector queries use the existing fast path (regular search()). - Multi-vector queries build an msearch request body with one header + knn body per query vector, send it in a single HTTP call, then unpack the per-query hit lists from response["responses"]. This makes the per-query HTTP overhead for Elasticsearch comparable to Milvus and produces valid cross-backend QPS comparisons.

github-actions · 2026-05-27T02:57:08Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

FileSystemGuy added 4 commits May 26, 2026 19:55

FileSystemGuy requested a review from a team May 27, 2026 02:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: HIGH severity measurement and correctness bugs in vector DB benchmark#399

fix: HIGH severity measurement and correctness bugs in vector DB benchmark#399
FileSystemGuy wants to merge 4 commits into
mainfrom
FileSystemGuy-VDB-high

FileSystemGuy commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

FileSystemGuy commented May 27, 2026

Summary

VDB-1 — P50/P90/P99 latency percentiles fabricated when batch_size > 1

VDB-2 — flat_index ground-truth rebuild path was permanently dead code

VDB-3 — row_count() triggered full Milvus flush on every call

VDB-4 — Elasticsearch uses serial HTTP loop; cross-backend QPS comparisons invalid

Test plan

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant