[codex] Verify all eval concurrencies, label scores, 8k1k-only all-evals, quieter sglang logs by Oseltamivir · Pull Request #1879 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-21T11:43:32Z

Description

This PR hardens multi-concurrency eval verification and trims two adjacent rough edges that surface on the same AMD multi-node eval path. It started as the eval-only fix split from #1878 and has since grown to also make failures attributable, stop --all-evals from scheduling 1k1k evals, and quiet sglang multi-node log spam.

Scope note: this PR now includes a serving-script logging change and a sweep-matrix change, so it is no longer "eval-only / no logging / no config". See Scope & changelog below.

The original failure was exposed by run 27896778812, job 82549778335: a job configured for ten concurrencies (1 2 4 8 16 32 64 128 256 512) finished in ~14 minutes because lm-eval ran only once at num_concurrent=512, and the existing Verify eval scores step passed even though the other nine concurrencies were never evaluated.

Changes

1. Verify every requested concurrency (the core fix)

Preserve EVAL_CONC through the AMD Docker launch. benchmarks/multi_node/amd_utils/job.slurm passed the value with the bare inheritance form -e EVAL_CONC, which reads from the Docker client environment. Behind the Slurm / nested-shell / sudo docker boundary that environment is stripped, so the container fell back to the maximum benchmark concurrency and ran a single eval. The launcher now passes an explicit, quoted assignment so the space-separated list survives as one Docker argument:

-e "EVAL_CONC=$EVAL_CONC"

Make the workflow request authoritative. The multi-node Verify eval scores step now derives the expected concurrency list independently of eval-produced metadata — EVAL_CONC when supplied, otherwise the max of CONC_LIST (matching the runtime's single-concurrency fallback) — and passes it to:

python3 utils/evals/validate_scores.py --expected-concs "${expected_concs}"

Fail closed on incomplete eval output. validate_scores.py parses --expected-concs as unique positive integers and compares the workflow request against both eval metadata and result files. Verification now fails when the input is empty/invalid/duplicated; when meta_env.json is missing/unreadable while a list was supplied; when batched metadata is absent but multiple concurrencies were requested; when metadata, completion, or result-file coverage does not exactly match the request; when any concurrency is listed as failed; or when any checked score is below its threshold. Every requested concurrency must therefore produce a result that reaches the threshold loop.

2. Make eval failures attributable by concurrency

validate_scores.py now prefixes each result line with the concurrency taken from the results*_concN.json filename, so a failing concurrency is obvious from the log:

PASS: [conc=1]   gsm8k exact_match,strict-match = 0.9591 (>= 0.91 from models.dsv4)
FAIL: [conc=512] gsm8k exact_match,strict-match = 0.0356 (< 0.91 from models.dsv4)

It also line-buffers stdout/stderr so the Loaded thresholds… / Model prefix… header prints in emission order instead of being pushed below the unbuffered stderr FAIL lines when CI merges the two streams.

3. Restrict `--all-evals` to 8k1k (drop 1k1k evals)

The default eval policy (mark_eval_entries) only evaluates 8k1k, but the --all-evals expansion (mark_all_eval_entries) had no sequence-length filter and marked every fixed-seq entry — including 1k1k. It now applies the same 8k1k guard: 1k1k entries pass through unmarked and are dropped by the existing evals-only filter, so --all-evals no longer schedules 1k1k evals. Blast radius is limited to the --all-evals path.

4. Reduce sglang multi-node log spam

python3 -m sglang.launch_server (prefill and decode) was launched with no --log-level, so sglang defaulted to INFO and logged every decode batch for the whole run. The launches now pass --log-level ${SGLANG_SERVER_LOG_LEVEL:-warning}, and MORI_APP_LOG_LEVEL defaults to WARNING. Both are overridable for debugging, and server readiness is detected via health endpoints (/readiness, sync.py barrier --wait-for-all-health), not log scraping, so lowering the level does not affect startup detection.

Expected behavior (verification)

For a workflow request of 1 4 8:

Runtime output	Result
Metadata and results for `1 4 8`, all above threshold	Pass
Metadata and result only for `8`	Fail: metadata/completion/result-coverage mismatch
Metadata for `1 4 8`, but no result for `4`	Fail: missing result file
Result for every concurrency, but one score below threshold	Fail: below-threshold score (now labelled with the conc)
Missing or unreadable metadata	Fail
A single requested concurrency with matching legacy metadata and one result	Pass

Files changed

benchmarks/multi_node/amd_utils/job.slurm — explicit -e "EVAL_CONC=$EVAL_CONC"
.github/workflows/benchmark-multinode-tmpl.yml — workflow-authoritative --expected-concs
utils/evals/validate_scores.py — --expected-concs, per-conc labels, line-buffered streams
utils/evals/test_batched_eval.py — regression coverage incl. conc labelling
utils/matrix_logic/generate_sweep_configs.py — 8k1k guard in mark_all_eval_entries
utils/matrix_logic/test_generate_sweep_configs.py — all-evals 8k1k coverage
benchmarks/multi_node/amd_utils/server_sglang.sh — --log-level on prefill/decode launches
benchmarks/multi_node/amd_utils/env.sh — MORI_APP_LOG_LEVEL default WARNING

Scope & changelog

This PR does now change a serving script (server_sglang.sh, env.sh) and the sweep matrix generator, unlike the original eval-only split.
The logging change lowers verbosity only; it does not change server arguments that affect throughput or accuracy.
The --all-evals change alters which evals are scheduled (8k1k only), not benchmark results.
No container image, benchmark seq-len/parallelism config, or perf-changelog.yaml entry changed. utils/matrix_logic/** and the touched eval files are in the test-changelog-gate trigger paths, so that workflow runs; please confirm the gate does not require a perf-changelog.yaml entry for this set.
No changes under deprecated/; no DSV4 configuration changes.

Validation

pytest utils/evals/test_batched_eval.py — 7 passed
pytest utils/matrix_logic/test_generate_sweep_configs.py — 107 passed
pytest utils/ --deselect test_process_agentic_result.py::test_processor_loads_traces_jsonl_for_theoretical_cache — 374 passed, 1 deselected (the deselected agentic theoretical-cache test also fails on unmodified main and is unrelated)
ruff check clean on changed Python; bash -n clean on job.slurm, server_sglang.sh, env.sh
Simulated the nested-shell expansion: EVAL_CONC=1 2 4 8 reaches Docker as one -e argument.
Reproduced the per-conc labelling and stream ordering end-to-end through a merged pipe, including against the real eval artifact from run 27896968169 (which fails on a genuine conc-4 score collapse).

Type of Change

Note

Medium Risk
Changes affect CI gating for eval-only jobs and which sweeps get scheduled under --all-evals; runtime behavior is mostly verification and logging defaults, but incorrect expected_concs logic could fail good runs or miss incomplete evals.

Overview
Multi-concurrency eval verification is tightened so CI cannot pass when only a subset of requested concurrencies ran. The AMD Slurm/Docker path now forwards EVAL_CONC as an explicit quoted -e "EVAL_CONC=$EVAL_CONC" so space-separated lists survive nested shells. The multinode workflow Verify eval scores step passes workflow-authoritative --expected-concs (EVAL_CONC or max of CONC_LIST), and validate_scores.py fails closed on manifest/metadata/result mismatches, labels each PASS/FAIL with [conc=N], and line-buffers stdout/stderr for readable CI logs.

Sweep scheduling: --all-evals (mark_all_eval_entries) now only expands 8k1k entries for eval; other sequence lengths (e.g. 1k1k) stay unmarked and drop out under --evals-only.

Operational noise: SGLang prefill/decode launches default to --log-level warning (overridable via SGLANG_SERVER_LOG_LEVEL), and MORI_APP_LOG_LEVEL defaults to WARNING in env.sh.

^{Reviewed by Cursor Bugbot for commit 35dc57c. Bugbot is set up for automated code reviews on this repo. Configure here.}

Keep the workflow's requested concurrency list independent from eval-produced metadata, fail when any requested result is missing, and preserve space-separated EVAL_CONC through the AMD Docker launch boundary. (cherry picked from commit 5e0f310)

github-actions · 2026-06-21T11:43:40Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

感谢你的贡献！对于 vLLM 与 SGLang，请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致，请先创建一个 PR，之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准，使整个 ML 社区都能从你的辛勤工作中受益！谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动（flake），重新运行失败的任务即可解决。如果选择重新运行失败的任务，PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档：https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言，PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准，然后再请求核心维护者审阅。

如需更多帮助，PR 作者可通过 Slack 联系核心维护者。

claude · 2026-06-21T11:51:58Z

Claude finished @Oseltamivir's task in 2m 2s —— View job

Reviewing PR #1879

Read changed files and gather context
Review validate_scores.py changes
Review test_batched_eval.py changes
Review workflow template changes
Review job.slurm changes
Post review findings

LGTM - no blocking issues found.

The fix correctly addresses both root causes: the EVAL_CONC env var lost at the Docker boundary (-e EVAL_CONC → -e "EVAL_CONC=$EVAL_CONC"), and the validator now takes the workflow-requested concurrency list as its source of truth via --expected-concs instead of trusting eval-produced metadata. Input validation fails closed on empty, non-positive, or duplicate values. The fallback to max(CONC_LIST) when EVAL_CONC is unset matches the runtime's single-concurrency default. Tests cover the new paths well (unreadable metadata with expectations, workflow/metadata mismatch, missing result files, below-threshold scores across a batch).

…iet sglang logs - validate_scores.py: prefix each PASS/FAIL with [conc=N] taken from the result filename so a failing concurrency is identifiable, and line-buffer stdout/stderr so the "Loaded thresholds" header prints in emission order instead of after the stderr FAIL lines when CI merges the streams. - generate_sweep_configs.py: mark_all_eval_entries now only expands evals for 8k1k entries (matching mark_eval_entries); 1k1k entries pass through unmarked and are dropped by the evals-only filter, so --all-evals no longer schedules 1k1k evals. - server_sglang.sh / env.sh: launch sglang prefill+decode with --log-level ${SGLANG_SERVER_LOG_LEVEL:-warning} and default MORI_APP_LOG_LEVEL to WARNING to cut multinode log spam (both overridable; server readiness is detected via health endpoints, not log scraping).

Address github-code-quality review: the except (AttributeError, ValueError) guard around stream reconfigure was an undocumented empty swallow. Add a comment noting it is best-effort (wrapped streams such as pytest's capture object do not support reconfigure). No functional change.

github-project-automation Bot added this to InferenceMAX Board Jun 21, 2026

Oseltamivir marked this pull request as ready for review June 21, 2026 11:51

Oseltamivir requested a review from a team June 21, 2026 11:51

Oseltamivir changed the title ~~[codex] Fail incomplete multi-concurrency evals~~ [codex] Verify all eval concurrencies, label scores, 8k1k-only all-evals, quieter sglang logs Jun 22, 2026

github-code-quality Bot found potential problems Jun 22, 2026

View reviewed changes

Comment thread utils/evals/validate_scores.py Fixed

Oseltamivir merged commit e791558 into main Jun 22, 2026
9 checks passed

Oseltamivir deleted the codex/strict-eval-validation-only branch June 22, 2026 01:38

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 22, 2026

Oseltamivir mentioned this pull request Jun 22, 2026

[AMD] dsv4 atom-disagg eval sweep — validate reduced ATOM logging #1882

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[codex] Verify all eval concurrencies, label scores, 8k1k-only all-evals, quieter sglang logs#1879

[codex] Verify all eval concurrencies, label scores, 8k1k-only all-evals, quieter sglang logs#1879
Oseltamivir merged 3 commits into
mainfrom
codex/strict-eval-validation-only

Oseltamivir commented Jun 21, 2026 •

edited by cursor Bot

Loading

Uh oh!

github-actions Bot commented Jun 21, 2026

Uh oh!

claude Bot commented Jun 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Oseltamivir commented Jun 21, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

1. Verify every requested concurrency (the core fix)

2. Make eval failures attributable by concurrency

3. Restrict --all-evals to 8k1k (drop 1k1k evals)

4. Reduce sglang multi-node log spam

Expected behavior (verification)

Files changed

Scope & changelog

Validation

Type of Change

Uh oh!

github-actions Bot commented Jun 21, 2026

Uh oh!

claude Bot commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewing PR #1879

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Oseltamivir commented Jun 21, 2026 •

edited by cursor Bot

Loading

3. Restrict `--all-evals` to 8k1k (drop 1k1k evals)

claude Bot commented Jun 21, 2026 •

edited

Loading