Skip to content

fix: HIGH severity measurement and correctness bugs in vector DB benchmark#399

Open
FileSystemGuy wants to merge 4 commits into
mainfrom
FileSystemGuy-VDB-high
Open

fix: HIGH severity measurement and correctness bugs in vector DB benchmark#399
FileSystemGuy wants to merge 4 commits into
mainfrom
FileSystemGuy-VDB-high

Conversation

@FileSystemGuy
Copy link
Copy Markdown
Contributor

Summary

This PR fixes 4 HIGH severity bugs identified in a broad codebase scan (BROAD_SCAN.md) in the vector database benchmark subsystem. The bugs produce fabricated latency distributions, silently skip correctness checks, perturb storage state during timed runs, and invalidate cross-backend QPS comparisons.


VDB-1 — P50/P90/P99 latency percentiles fabricated when batch_size > 1

File: vdb_benchmark/vdbbench/benchmark/search_runner.py

Wall-clock time for an entire batch was divided by batch_n to produce per_query_ms, and that same constant was appended once per query. All queries in a batch received an identical latency value, collapsing the distribution: P50 = P99 = average batch time / batch_size. True tail latency was invisible. The default search_batch_size is 1 so this was inactive at defaults, but the enhanced_bench.py multi-process path uses batched queries.

Fix:

  • Renamed all_latenciesbatch_latencies; records one elapsed_ms per batch instead of N identical averages.
  • P50/P90/P99 are now labelled as batch latency percentiles.
  • Mean per-query latency (elapsed_ms / batch_n) is tracked as a separate scalar.
  • interval_query_count tracks actual query count separately so QPS calculations remain correct.

VDB-2 — flat_index ground-truth rebuild path was permanently dead code

File: vdb_benchmark/vdbbench/benchmark/orchestrator.py

The guard if self.cfg.truth_mode == "flat_index" and self.truth_table is None: could never be True: self.truth_table was unconditionally assigned from the .npz file three lines earlier. The flat_index path — intended to rebuild ground truth live from the actual index state as a correctness check against stale precomputed files — never executed. Users in flat_index mode silently received recall computed against potentially stale or incorrect precomputed ground truth.

Fix: Restructured to trigger the rebuild unconditionally when truth_mode == "flat_index" before any truth_table assignment. The .npz load is skipped entirely in this mode. The file-existence guard is updated to only require gtpath when truth_mode != "flat_index".


VDB-3 — row_count() triggered full Milvus flush on every call

File: vdb_benchmark/vdbbench/benchmark/backends/milvus/backend.py

row_count() called col.flush() — a synchronous network + disk operation that seals all in-memory segments and persists them to object storage — on every invocation. row_count() is called from monitoring loops and admin introspection paths during benchmarking, so any such call triggered an unintended flush mid-run, perturbing storage state and potentially inflating or deflating subsequent I/O latency measurements.

Fix: Removed col.flush() from row_count(). Added explicit flush_collection(name) method for intentional pre-benchmark flushing. Callers that relied on the implicit flush side effect should call flush_collection() explicitly before row_count().


VDB-4 — Elasticsearch uses serial HTTP loop; cross-backend QPS comparisons invalid

File: vdb_benchmark/vdbbench/benchmark/backends/elasticsearch/backend.py

The Elasticsearch search() method sent one HTTP request per query vector in a serial for loop. Milvus accepts batched queries natively via a single RPC. SearchRunner measures wall-clock time for the entire backend.search() call and divides by batch_size for QPS. For Elasticsearch that wall time included a full HTTP round-trip (DNS + TCP + TLS + server + response) per query, making Elasticsearch appear far slower than Milvus for any batch_size > 1. Cross-backend QPS comparisons were invalid.

Fix: Rewrote search() to use the Elasticsearch _msearch (multi-search) API when len(query_vectors) > 1:

  • Single-vector queries use the existing fast path.
  • Multi-vector queries build an _msearch request body (one header + knn body per query vector) and send it in a single HTTP call, unpacking per-query hit lists from response["responses"].

Test plan

  • Run existing unit tests: pytest tests/unit -v
  • Run a batched VDB search (search_batch_size > 1) and verify P50 ≠ P99 in latency output (VDB-1)
  • Configure truth_mode = flat_index and verify ground truth is rebuilt from the live index rather than the precomputed .npz (VDB-2)
  • Verify row_count() calls during a Milvus benchmark run do not trigger unexpected flush operations (VDB-3)
  • Run an Elasticsearch search with batch_size > 1 and verify a single _msearch request is issued rather than N serial requests (VDB-4)

…ery distribution

When batch_size > 1, the wall-clock time for an entire batch was divided
by batch_n to produce per_query_ms, and that same constant was appended
once per query in the batch. All queries in a batch received an identical
latency value, so the latency distribution collapsed: P50 = P99 = average
batch time / batch_size. True tail latency within a batch was invisible.
The default search_batch_size is 1 so this was inactive at defaults, but
the enhanced_bench.py multi-process path uses batched queries.

Fixed by recording one elapsed_ms entry per batch instead of N identical
averages:
- Renamed all_latencies to batch_latencies throughout run_timed_search.
- Replaced the per-query append loop with a single batch_latencies.append(elapsed_ms).
- interval_latencies likewise records the full batch elapsed time.
- P50/P90/P99 in stats output are now labelled as batch latency percentiles.
- Mean per-query latency (elapsed_ms / batch_n) is tracked separately as
  a scalar so it remains reportable without implying a false distribution.
- interval_query_count tracks actual query count separately from
  len(batch_latencies) so QPS calculations remain correct.

This eliminates fabricated precision: operators see honest batch-level
percentiles and a scalar mean per-query estimate rather than a
statistically invalid uniform distribution.
…ntly dead code

The condition guarding the flat_index rebuild was:
  if self.cfg.truth_mode == "flat_index" and self.truth_table is None:

self.truth_table was unconditionally assigned from the .npz file three
lines earlier, so self.truth_table is None could never be True at that
point. The flat_index path — intended as a live correctness check by
rebuilding ground truth from the actual index state rather than trusting
a precomputed file — was permanently dead code. Users in flat_index mode
silently received recall computed against potentially stale or incorrect
precomputed ground truth with no rebuild ever attempted.

Fixed by restructuring the block so flat_index mode triggers the rebuild
unconditionally before any truth_table assignment:

  if self.cfg.truth_mode == "flat_index":
      self.truth_table = build_truth_from_flat(...)   # always rebuild
  else:
      gt = np.load(gtpath)
      self.truth_table = gt["truth_table"]            # use precomputed

The preceding file-existence guard is updated to only require gtpath
when truth_mode != "flat_index", avoiding a spurious FileNotFoundError
when operating in flat_index mode without a precomputed .npz present.
…h_collection()

row_count() called col.flush() on every invocation. flush() is a
synchronous network + disk operation that forces Milvus to seal all
in-memory segments and persist them to object storage. row_count() is
called from get_collection_stats() and get_collection_info(), both of
which can be invoked during monitoring loops or admin introspection at
any point during a benchmarking session. Any such call triggered an
unintended Milvus flush, perturbing storage state mid-benchmark and
potentially inflating or deflating subsequent I/O latency measurements
depending on the flush timing relative to active queries.

Fixed by removing col.flush() from row_count() so it reads
col.num_entities directly without side effects. Added an explicit
flush_collection(name) method that callers can invoke intentionally
(e.g. after bulk insert, before starting a timed benchmark run).
Callers that previously relied on the implicit flush side effect of
row_count() should be updated to call flush_collection() explicitly
before calling row_count().
…nd QPS validity

The Elasticsearch search() method sent one HTTP request per query vector
in a serial for loop. Milvus accepts batched queries natively via a single
RPC. SearchRunner measures wall-clock time for the entire backend.search()
call and divides by batch_size to compute QPS. For Elasticsearch, that
wall time included a full HTTP round-trip per query (DNS + TCP + TLS
resumption + server processing + response), making Elasticsearch appear
orders of magnitude slower than Milvus for any batch_size > 1. Cross-
backend QPS comparisons using this framework were therefore invalid.

Fixed by rewriting search() to use the Elasticsearch _msearch (multi-
search) API when len(query_vectors) > 1:
- Single-vector queries use the existing fast path (regular search()).
- Multi-vector queries build an msearch request body with one header +
  knn body per query vector, send it in a single HTTP call, then unpack
  the per-query hit lists from response["responses"].

This makes the per-query HTTP overhead for Elasticsearch comparable to
Milvus and produces valid cross-backend QPS comparisons.
@FileSystemGuy FileSystemGuy requested a review from a team May 27, 2026 02:56
@github-actions
Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant