[CI] Validate aggregate benchmark results before upload#1881
Open
edwingao28 wants to merge 2 commits into
Open
[CI] Validate aggregate benchmark results before upload#1881edwingao28 wants to merge 2 commits into
edwingao28 wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3acbc1c126
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
3acbc1c to
b470e3c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This adds a CPU-only validation gate for aggregate benchmark JSONs between result processing and upload, and fixes a missing
--metric-percentilesflag that leftsummarize.py's P75/P95 columns rendering0.0000.Why
The benchmark workflows run
process_result.py/process_agentic_result.pyand then upload the resulting aggregate JSONs without a correctness check in between. That means missing or inconsistent fields can reach downstream summaries silently.One concrete example:
benchmark_lib.shpassed--percentile-metricsbut not--metric-percentiles, sobenchmark_serving.pyused its default90,99,99.9. Meanwhilesummarize.pyreads P75/P95 fields, so missing fields were rendered as0.0000instead of failing before publication.A similar failure mode showed up in #300, where a derived interactivity field was emitted even though it did not represent a meaningful consumed metric. Both cases point to the same gap: aggregate results need a cheap contract check before upload.
What this adds
1. The missing-percentile fix.
benchmark_lib.shnow passes--metric-percentiles '75,90,95,99,99.9'so the P75/P95 columns are actually produced (preserving90,99,99.9forcompare_results.py), and a contract test pins that requested set so it can't silently regress. The original failure was uniform absence (every family missing p75/p95), which a data-derived validator can't flag without hardcoding the consumer's required ranks — and hardcoding the sweep's percentile set is exactly what this design avoids.2. The validation gate.
utils/validate_agg_result.pyruns between processing and upload (single-node, multinode, agentic). It is data-derived which adapts to the metric families and percentile ranks a run emits and never hardcodes which percentiles a sweep must collect. It catches a broader class of bad aggregates the pipeline previously had no guard against:Checks include:
hw,framework,precision,model,conc, and single-node vs multinode topology fields).1000 / tpot) and non-decreasing for agentic, where it is measured directly.This also adds
--metric-percentiles '75,90,95,99,99.9'so P75/P95 columns are actually produced, while preserving90,99,99.9for existing consumers such ascompare_results.py.Verification
cd utils && pytest test_process_result.py test_validate_agg_result.py— passing on Python 3.12, covering both fixed-seq and agentic schemas plus the CLI entry point. No GPU required.Note
Low Risk
Changes are limited to CI gates, benchmark CLI flags, and validation tooling; no runtime inference or auth paths are modified.
Overview
Adds
utils/validate_agg_result.pyand runs it after aggregation so broken aggregate JSON fails CI before upload, for single-node, multinode, and agentic paths.The validator infers which percentile ranks and metric families a run emitted and checks identity/topology, finite numerics, positive throughput, aligned percentile families (including intvty mirroring tpot), and monotonicity—with intvty direction differing for fixed-seq vs agentic.
benchmark_lib.shnow passes--metric-percentiles '75,90,95,99,99.9'so P75/P95 are actually collected for downstream summaries. Tests cover the validator and assert that percentile list inbenchmark_lib.sh.Reviewed by Cursor Bugbot for commit 3acbc1c. Bugbot is set up for automated code reviews on this repo. Configure here.