Skip to content

[CI] Validate aggregate benchmark results before upload#1881

Open
edwingao28 wants to merge 2 commits into
SemiAnalysisAI:mainfrom
edwingao28:feat/validate-agg-result
Open

[CI] Validate aggregate benchmark results before upload#1881
edwingao28 wants to merge 2 commits into
SemiAnalysisAI:mainfrom
edwingao28:feat/validate-agg-result

Conversation

@edwingao28

@edwingao28 edwingao28 commented Jun 21, 2026

Copy link
Copy Markdown

This adds a CPU-only validation gate for aggregate benchmark JSONs between result processing and upload, and fixes a missing --metric-percentiles flag that left summarize.py's P75/P95 columns rendering 0.0000.

Why

The benchmark workflows run process_result.py / process_agentic_result.py and then upload the resulting aggregate JSONs without a correctness check in between. That means missing or inconsistent fields can reach downstream summaries silently.

One concrete example: benchmark_lib.sh passed --percentile-metrics but not --metric-percentiles, so benchmark_serving.py used its default 90,99,99.9. Meanwhile summarize.py reads P75/P95 fields, so missing fields were rendered as 0.0000 instead of failing before publication.

A similar failure mode showed up in #300, where a derived interactivity field was emitted even though it did not represent a meaningful consumed metric. Both cases point to the same gap: aggregate results need a cheap contract check before upload.

What this adds

1. The missing-percentile fix. benchmark_lib.sh now passes --metric-percentiles '75,90,95,99,99.9' so the P75/P95 columns are actually produced (preserving 90,99,99.9 for compare_results.py), and a contract test pins that requested set so it can't silently regress. The original failure was uniform absence (every family missing p75/p95), which a data-derived validator can't flag without hardcoding the consumer's required ranks — and hardcoding the sweep's percentile set is exactly what this design avoids.

2. The validation gate. utils/validate_agg_result.py runs between processing and upload (single-node, multinode, agentic). It is data-derived which adapts to the metric families and percentile ranks a run emits and never hardcodes which percentiles a sweep must collect. It catches a broader class of bad aggregates the pipeline previously had no guard against:

Checks include:

  • Required throughput fields are present, finite, and positive.
  • Numeric fields are finite; percentile values are non-negative.
  • Identity/topology fields are well-formed (hw, framework, precision, model, conc, and single-node vs multinode topology fields).
  • Percentile families present in the aggregate report matching ranks, so fields consumed together are produced together.
  • Percentile monotonicity: latency is non-decreasing in percentile rank; interactivity is non-increasing for fixed-seq (1000 / tpot) and non-decreasing for agentic, where it is measured directly.
  • Malformed percentile ranks are rejected.

This also adds --metric-percentiles '75,90,95,99,99.9' so P75/P95 columns are actually produced, while preserving 90,99,99.9 for existing consumers such as compare_results.py.

Verification

cd utils && pytest test_process_result.py test_validate_agg_result.py — passing on Python 3.12, covering both fixed-seq and agentic schemas plus the CLI entry point. No GPU required.


Note

Low Risk
Changes are limited to CI gates, benchmark CLI flags, and validation tooling; no runtime inference or auth paths are modified.

Overview
Adds utils/validate_agg_result.py and runs it after aggregation so broken aggregate JSON fails CI before upload, for single-node, multinode, and agentic paths.

The validator infers which percentile ranks and metric families a run emitted and checks identity/topology, finite numerics, positive throughput, aligned percentile families (including intvty mirroring tpot), and monotonicity—with intvty direction differing for fixed-seq vs agentic.

benchmark_lib.sh now passes --metric-percentiles '75,90,95,99,99.9' so P75/P95 are actually collected for downstream summaries. Tests cover the validator and assert that percentile list in benchmark_lib.sh.

Reviewed by Cursor Bugbot for commit 3acbc1c. Bugbot is set up for automated code reviews on this repo. Configure here.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3acbc1c126

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread benchmarks/benchmark_lib.sh Outdated
@edwingao28 edwingao28 force-pushed the feat/validate-agg-result branch from 3acbc1c to b470e3c Compare June 22, 2026 04:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant