[CI] Validate aggregate benchmark results before upload by edwingao28 · Pull Request #1881 · SemiAnalysisAI/InferenceX

edwingao28 · 2026-06-21T23:26:25Z

This adds a CPU-only validation gate for aggregate benchmark JSONs between result processing and upload, and fixes a missing --metric-percentiles flag that left summarize.py's P75/P95 columns rendering 0.0000.

Why

The benchmark workflows run process_result.py / process_agentic_result.py and then upload the resulting aggregate JSONs without a correctness check in between. That means missing or inconsistent fields can reach downstream summaries silently.

One concrete example: benchmark_lib.sh passed --percentile-metrics but not --metric-percentiles, so benchmark_serving.py used its default 90,99,99.9. Meanwhile summarize.py reads P75/P95 fields, so missing fields were rendered as 0.0000 instead of failing before publication.

A similar failure mode showed up in #300, where a derived interactivity field was emitted even though it did not represent a meaningful consumed metric. Both cases point to the same gap: aggregate results need a cheap contract check before upload.

What this adds

1. The missing-percentile fix. benchmark_lib.sh now passes --metric-percentiles '75,90,95,99,99.9' so the P75/P95 columns are actually produced (preserving 90,99,99.9 for compare_results.py), and a contract test pins that requested set so it can't silently regress. The original failure was uniform absence (every family missing p75/p95), which a data-derived validator can't flag without hardcoding the consumer's required ranks — and hardcoding the sweep's percentile set is exactly what this design avoids.

2. The validation gate. utils/validate_agg_result.py runs between processing and upload (single-node, multinode, agentic). It is data-derived which adapts to the metric families and percentile ranks a run emits and never hardcodes which percentiles a sweep must collect. It catches a broader class of bad aggregates the pipeline previously had no guard against:

Checks include:

Required throughput fields are present, finite, and positive.
Numeric fields are finite; percentile values are non-negative.
Identity/topology fields are well-formed (hw, framework, precision, model, conc, and single-node vs multinode topology fields).
Percentile families present in the aggregate report matching ranks, so fields consumed together are produced together.
Percentile monotonicity: latency is non-decreasing in percentile rank; interactivity is non-increasing for fixed-seq (1000 / tpot) and non-decreasing for agentic, where it is measured directly.
Malformed percentile ranks are rejected.

This also adds --metric-percentiles '75,90,95,99,99.9' so P75/P95 columns are actually produced, while preserving 90,99,99.9 for existing consumers such as compare_results.py.

Verification

cd utils && pytest test_process_result.py test_validate_agg_result.py — passing on Python 3.12, covering both fixed-seq and agentic schemas plus the CLI entry point. No GPU required.

Note

Low Risk
Changes are limited to CI gates, benchmark CLI flags, and validation tooling; no runtime inference or auth paths are modified.

Overview
Adds utils/validate_agg_result.py and runs it after aggregation so broken aggregate JSON fails CI before upload, for single-node, multinode, and agentic paths.

The validator infers which percentile ranks and metric families a run emitted and checks identity/topology, finite numerics, positive throughput, aligned percentile families (including intvty mirroring tpot), and monotonicity—with intvty direction differing for fixed-seq vs agentic.

benchmark_lib.sh now passes --metric-percentiles '75,90,95,99,99.9' so P75/P95 are actually collected for downstream summaries. Tests cover the validator and assert that percentile list in benchmark_lib.sh.

^{Reviewed by Cursor Bugbot for commit 3acbc1c. Bugbot is set up for automated code reviews on this repo. Configure here.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3acbc1c126

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

… JSONs

…quest p75/p95 percentiles

edwingao28 requested a review from a team June 21, 2026 23:26

github-project-automation Bot added this to InferenceMAX Board Jun 21, 2026

chatgpt-codex-connector Bot reviewed Jun 21, 2026

View reviewed changes

Comment thread benchmarks/benchmark_lib.sh Outdated

edwingao28 added 2 commits June 21, 2026 21:44

feat: add data-derived validate_agg_result.py invariant guard for agg…

2574e4f

… JSONs

ci: validate agg results pre-upload (fixed-seq/multinode/agentic); re…

b470e3c

…quest p75/p95 percentiles

edwingao28 force-pushed the feat/validate-agg-result branch from 3acbc1c to b470e3c Compare June 22, 2026 04:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Validate aggregate benchmark results before upload#1881

[CI] Validate aggregate benchmark results before upload#1881
edwingao28 wants to merge 2 commits into
SemiAnalysisAI:mainfrom
edwingao28:feat/validate-agg-result

edwingao28 commented Jun 21, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

edwingao28 commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What this adds

Verification

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

edwingao28 commented Jun 21, 2026 •

edited

Loading