fix: HIGH severity measurement and correctness bugs in vector DB benchmark#399
Open
FileSystemGuy wants to merge 4 commits into
Open
fix: HIGH severity measurement and correctness bugs in vector DB benchmark#399FileSystemGuy wants to merge 4 commits into
FileSystemGuy wants to merge 4 commits into
Conversation
…ery distribution When batch_size > 1, the wall-clock time for an entire batch was divided by batch_n to produce per_query_ms, and that same constant was appended once per query in the batch. All queries in a batch received an identical latency value, so the latency distribution collapsed: P50 = P99 = average batch time / batch_size. True tail latency within a batch was invisible. The default search_batch_size is 1 so this was inactive at defaults, but the enhanced_bench.py multi-process path uses batched queries. Fixed by recording one elapsed_ms entry per batch instead of N identical averages: - Renamed all_latencies to batch_latencies throughout run_timed_search. - Replaced the per-query append loop with a single batch_latencies.append(elapsed_ms). - interval_latencies likewise records the full batch elapsed time. - P50/P90/P99 in stats output are now labelled as batch latency percentiles. - Mean per-query latency (elapsed_ms / batch_n) is tracked separately as a scalar so it remains reportable without implying a false distribution. - interval_query_count tracks actual query count separately from len(batch_latencies) so QPS calculations remain correct. This eliminates fabricated precision: operators see honest batch-level percentiles and a scalar mean per-query estimate rather than a statistically invalid uniform distribution.
…ntly dead code
The condition guarding the flat_index rebuild was:
if self.cfg.truth_mode == "flat_index" and self.truth_table is None:
self.truth_table was unconditionally assigned from the .npz file three
lines earlier, so self.truth_table is None could never be True at that
point. The flat_index path — intended as a live correctness check by
rebuilding ground truth from the actual index state rather than trusting
a precomputed file — was permanently dead code. Users in flat_index mode
silently received recall computed against potentially stale or incorrect
precomputed ground truth with no rebuild ever attempted.
Fixed by restructuring the block so flat_index mode triggers the rebuild
unconditionally before any truth_table assignment:
if self.cfg.truth_mode == "flat_index":
self.truth_table = build_truth_from_flat(...) # always rebuild
else:
gt = np.load(gtpath)
self.truth_table = gt["truth_table"] # use precomputed
The preceding file-existence guard is updated to only require gtpath
when truth_mode != "flat_index", avoiding a spurious FileNotFoundError
when operating in flat_index mode without a precomputed .npz present.
…h_collection() row_count() called col.flush() on every invocation. flush() is a synchronous network + disk operation that forces Milvus to seal all in-memory segments and persist them to object storage. row_count() is called from get_collection_stats() and get_collection_info(), both of which can be invoked during monitoring loops or admin introspection at any point during a benchmarking session. Any such call triggered an unintended Milvus flush, perturbing storage state mid-benchmark and potentially inflating or deflating subsequent I/O latency measurements depending on the flush timing relative to active queries. Fixed by removing col.flush() from row_count() so it reads col.num_entities directly without side effects. Added an explicit flush_collection(name) method that callers can invoke intentionally (e.g. after bulk insert, before starting a timed benchmark run). Callers that previously relied on the implicit flush side effect of row_count() should be updated to call flush_collection() explicitly before calling row_count().
…nd QPS validity The Elasticsearch search() method sent one HTTP request per query vector in a serial for loop. Milvus accepts batched queries natively via a single RPC. SearchRunner measures wall-clock time for the entire backend.search() call and divides by batch_size to compute QPS. For Elasticsearch, that wall time included a full HTTP round-trip per query (DNS + TCP + TLS resumption + server processing + response), making Elasticsearch appear orders of magnitude slower than Milvus for any batch_size > 1. Cross- backend QPS comparisons using this framework were therefore invalid. Fixed by rewriting search() to use the Elasticsearch _msearch (multi- search) API when len(query_vectors) > 1: - Single-vector queries use the existing fast path (regular search()). - Multi-vector queries build an msearch request body with one header + knn body per query vector, send it in a single HTTP call, then unpack the per-query hit lists from response["responses"]. This makes the per-query HTTP overhead for Elasticsearch comparable to Milvus and produces valid cross-backend QPS comparisons.
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes 4 HIGH severity bugs identified in a broad codebase scan (BROAD_SCAN.md) in the vector database benchmark subsystem. The bugs produce fabricated latency distributions, silently skip correctness checks, perturb storage state during timed runs, and invalidate cross-backend QPS comparisons.
VDB-1 — P50/P90/P99 latency percentiles fabricated when batch_size > 1
File:
vdb_benchmark/vdbbench/benchmark/search_runner.pyWall-clock time for an entire batch was divided by
batch_nto produceper_query_ms, and that same constant was appended once per query. All queries in a batch received an identical latency value, collapsing the distribution: P50 = P99 = average batch time / batch_size. True tail latency was invisible. The defaultsearch_batch_sizeis 1 so this was inactive at defaults, but theenhanced_bench.pymulti-process path uses batched queries.Fix:
all_latencies→batch_latencies; records oneelapsed_msper batch instead of N identical averages.elapsed_ms / batch_n) is tracked as a separate scalar.interval_query_counttracks actual query count separately so QPS calculations remain correct.VDB-2 — flat_index ground-truth rebuild path was permanently dead code
File:
vdb_benchmark/vdbbench/benchmark/orchestrator.pyThe guard
if self.cfg.truth_mode == "flat_index" and self.truth_table is None:could never be True:self.truth_tablewas unconditionally assigned from the.npzfile three lines earlier. The flat_index path — intended to rebuild ground truth live from the actual index state as a correctness check against stale precomputed files — never executed. Users inflat_indexmode silently received recall computed against potentially stale or incorrect precomputed ground truth.Fix: Restructured to trigger the rebuild unconditionally when
truth_mode == "flat_index"before anytruth_tableassignment. The.npzload is skipped entirely in this mode. The file-existence guard is updated to only requiregtpathwhentruth_mode != "flat_index".VDB-3 — row_count() triggered full Milvus flush on every call
File:
vdb_benchmark/vdbbench/benchmark/backends/milvus/backend.pyrow_count()calledcol.flush()— a synchronous network + disk operation that seals all in-memory segments and persists them to object storage — on every invocation.row_count()is called from monitoring loops and admin introspection paths during benchmarking, so any such call triggered an unintended flush mid-run, perturbing storage state and potentially inflating or deflating subsequent I/O latency measurements.Fix: Removed
col.flush()fromrow_count(). Added explicitflush_collection(name)method for intentional pre-benchmark flushing. Callers that relied on the implicit flush side effect should callflush_collection()explicitly beforerow_count().VDB-4 — Elasticsearch uses serial HTTP loop; cross-backend QPS comparisons invalid
File:
vdb_benchmark/vdbbench/benchmark/backends/elasticsearch/backend.pyThe Elasticsearch
search()method sent one HTTP request per query vector in a serialforloop. Milvus accepts batched queries natively via a single RPC.SearchRunnermeasures wall-clock time for the entirebackend.search()call and divides bybatch_sizefor QPS. For Elasticsearch that wall time included a full HTTP round-trip (DNS + TCP + TLS + server + response) per query, making Elasticsearch appear far slower than Milvus for anybatch_size > 1. Cross-backend QPS comparisons were invalid.Fix: Rewrote
search()to use the Elasticsearch_msearch(multi-search) API whenlen(query_vectors) > 1:_msearchrequest body (one header +knnbody per query vector) and send it in a single HTTP call, unpacking per-query hit lists fromresponse["responses"].Test plan
pytest tests/unit -vsearch_batch_size > 1) and verify P50 ≠ P99 in latency output (VDB-1)truth_mode = flat_indexand verify ground truth is rebuilt from the live index rather than the precomputed.npz(VDB-2)row_count()calls during a Milvus benchmark run do not trigger unexpected flush operations (VDB-3)batch_size > 1and verify a single_msearchrequest is issued rather than N serial requests (VDB-4)