Skip to content

[codex] Verify all eval concurrencies, label scores, 8k1k-only all-evals, quieter sglang logs#1879

Merged
Oseltamivir merged 3 commits into
mainfrom
codex/strict-eval-validation-only
Jun 22, 2026
Merged

[codex] Verify all eval concurrencies, label scores, 8k1k-only all-evals, quieter sglang logs#1879
Oseltamivir merged 3 commits into
mainfrom
codex/strict-eval-validation-only

Conversation

@Oseltamivir

@Oseltamivir Oseltamivir commented Jun 21, 2026

Copy link
Copy Markdown
Collaborator

Description

This PR hardens multi-concurrency eval verification and trims two adjacent rough edges that surface on the same AMD multi-node eval path. It started as the eval-only fix split from #1878 and has since grown to also make failures attributable, stop --all-evals from scheduling 1k1k evals, and quiet sglang multi-node log spam.

Scope note: this PR now includes a serving-script logging change and a sweep-matrix change, so it is no longer "eval-only / no logging / no config". See Scope & changelog below.

The original failure was exposed by run 27896778812, job 82549778335: a job configured for ten concurrencies (1 2 4 8 16 32 64 128 256 512) finished in ~14 minutes because lm-eval ran only once at num_concurrent=512, and the existing Verify eval scores step passed even though the other nine concurrencies were never evaluated.

Changes

1. Verify every requested concurrency (the core fix)

Preserve EVAL_CONC through the AMD Docker launch. benchmarks/multi_node/amd_utils/job.slurm passed the value with the bare inheritance form -e EVAL_CONC, which reads from the Docker client environment. Behind the Slurm / nested-shell / sudo docker boundary that environment is stripped, so the container fell back to the maximum benchmark concurrency and ran a single eval. The launcher now passes an explicit, quoted assignment so the space-separated list survives as one Docker argument:

-e "EVAL_CONC=$EVAL_CONC"

Make the workflow request authoritative. The multi-node Verify eval scores step now derives the expected concurrency list independently of eval-produced metadata — EVAL_CONC when supplied, otherwise the max of CONC_LIST (matching the runtime's single-concurrency fallback) — and passes it to:

python3 utils/evals/validate_scores.py --expected-concs "${expected_concs}"

Fail closed on incomplete eval output. validate_scores.py parses --expected-concs as unique positive integers and compares the workflow request against both eval metadata and result files. Verification now fails when the input is empty/invalid/duplicated; when meta_env.json is missing/unreadable while a list was supplied; when batched metadata is absent but multiple concurrencies were requested; when metadata, completion, or result-file coverage does not exactly match the request; when any concurrency is listed as failed; or when any checked score is below its threshold. Every requested concurrency must therefore produce a result that reaches the threshold loop.

2. Make eval failures attributable by concurrency

validate_scores.py now prefixes each result line with the concurrency taken from the results*_concN.json filename, so a failing concurrency is obvious from the log:

PASS: [conc=1]   gsm8k exact_match,strict-match = 0.9591 (>= 0.91 from models.dsv4)
FAIL: [conc=512] gsm8k exact_match,strict-match = 0.0356 (< 0.91 from models.dsv4)

It also line-buffers stdout/stderr so the Loaded thresholds… / Model prefix… header prints in emission order instead of being pushed below the unbuffered stderr FAIL lines when CI merges the two streams.

3. Restrict --all-evals to 8k1k (drop 1k1k evals)

The default eval policy (mark_eval_entries) only evaluates 8k1k, but the --all-evals expansion (mark_all_eval_entries) had no sequence-length filter and marked every fixed-seq entry — including 1k1k. It now applies the same 8k1k guard: 1k1k entries pass through unmarked and are dropped by the existing evals-only filter, so --all-evals no longer schedules 1k1k evals. Blast radius is limited to the --all-evals path.

4. Reduce sglang multi-node log spam

python3 -m sglang.launch_server (prefill and decode) was launched with no --log-level, so sglang defaulted to INFO and logged every decode batch for the whole run. The launches now pass --log-level ${SGLANG_SERVER_LOG_LEVEL:-warning}, and MORI_APP_LOG_LEVEL defaults to WARNING. Both are overridable for debugging, and server readiness is detected via health endpoints (/readiness, sync.py barrier --wait-for-all-health), not log scraping, so lowering the level does not affect startup detection.

Expected behavior (verification)

For a workflow request of 1 4 8:

Runtime output Result
Metadata and results for 1 4 8, all above threshold Pass
Metadata and result only for 8 Fail: metadata/completion/result-coverage mismatch
Metadata for 1 4 8, but no result for 4 Fail: missing result file
Result for every concurrency, but one score below threshold Fail: below-threshold score (now labelled with the conc)
Missing or unreadable metadata Fail
A single requested concurrency with matching legacy metadata and one result Pass

Files changed

  • benchmarks/multi_node/amd_utils/job.slurm — explicit -e "EVAL_CONC=$EVAL_CONC"
  • .github/workflows/benchmark-multinode-tmpl.yml — workflow-authoritative --expected-concs
  • utils/evals/validate_scores.py--expected-concs, per-conc labels, line-buffered streams
  • utils/evals/test_batched_eval.py — regression coverage incl. conc labelling
  • utils/matrix_logic/generate_sweep_configs.py — 8k1k guard in mark_all_eval_entries
  • utils/matrix_logic/test_generate_sweep_configs.py — all-evals 8k1k coverage
  • benchmarks/multi_node/amd_utils/server_sglang.sh--log-level on prefill/decode launches
  • benchmarks/multi_node/amd_utils/env.shMORI_APP_LOG_LEVEL default WARNING

Scope & changelog

  • This PR does now change a serving script (server_sglang.sh, env.sh) and the sweep matrix generator, unlike the original eval-only split.
  • The logging change lowers verbosity only; it does not change server arguments that affect throughput or accuracy.
  • The --all-evals change alters which evals are scheduled (8k1k only), not benchmark results.
  • No container image, benchmark seq-len/parallelism config, or perf-changelog.yaml entry changed. utils/matrix_logic/** and the touched eval files are in the test-changelog-gate trigger paths, so that workflow runs; please confirm the gate does not require a perf-changelog.yaml entry for this set.
  • No changes under deprecated/; no DSV4 configuration changes.

Validation

  • pytest utils/evals/test_batched_eval.py7 passed
  • pytest utils/matrix_logic/test_generate_sweep_configs.py107 passed
  • pytest utils/ --deselect test_process_agentic_result.py::test_processor_loads_traces_jsonl_for_theoretical_cache374 passed, 1 deselected (the deselected agentic theoretical-cache test also fails on unmodified main and is unrelated)
  • ruff check clean on changed Python; bash -n clean on job.slurm, server_sglang.sh, env.sh
  • Simulated the nested-shell expansion: EVAL_CONC=1 2 4 8 reaches Docker as one -e argument.
  • Reproduced the per-conc labelling and stream ordering end-to-end through a merged pipe, including against the real eval artifact from run 27896968169 (which fails on a genuine conc-4 score collapse).

Type of Change

  • Bug fix
  • New feature
  • Configuration change (sweep matrix: --all-evals 8k1k-only)
  • Documentation update
  • Other (serving log-verbosity reduction)

Note

Medium Risk
Changes affect CI gating for eval-only jobs and which sweeps get scheduled under --all-evals; runtime behavior is mostly verification and logging defaults, but incorrect expected_concs logic could fail good runs or miss incomplete evals.

Overview
Multi-concurrency eval verification is tightened so CI cannot pass when only a subset of requested concurrencies ran. The AMD Slurm/Docker path now forwards EVAL_CONC as an explicit quoted -e "EVAL_CONC=$EVAL_CONC" so space-separated lists survive nested shells. The multinode workflow Verify eval scores step passes workflow-authoritative --expected-concs (EVAL_CONC or max of CONC_LIST), and validate_scores.py fails closed on manifest/metadata/result mismatches, labels each PASS/FAIL with [conc=N], and line-buffers stdout/stderr for readable CI logs.

Sweep scheduling: --all-evals (mark_all_eval_entries) now only expands 8k1k entries for eval; other sequence lengths (e.g. 1k1k) stay unmarked and drop out under --evals-only.

Operational noise: SGLang prefill/decode launches default to --log-level warning (overridable via SGLANG_SERVER_LOG_LEVEL), and MORI_APP_LOG_LEVEL defaults to WARNING in env.sh.

Reviewed by Cursor Bugbot for commit 35dc57c. Bugbot is set up for automated code reviews on this repo. Configure here.

Keep the workflow's requested concurrency list independent from eval-produced metadata, fail when any requested result is missing, and preserve space-separated EVAL_CONC through the AMD Docker launch boundary.

(cherry picked from commit 5e0f310)
@github-actions

Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。

如需更多帮助,PR 作者可通过 Slack 联系核心维护者。

@Oseltamivir Oseltamivir marked this pull request as ready for review June 21, 2026 11:51
@Oseltamivir Oseltamivir requested a review from a team June 21, 2026 11:51
@claude

claude Bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Claude finished @Oseltamivir's task in 2m 2s —— View job


Reviewing PR #1879

  • Read changed files and gather context
  • Review validate_scores.py changes
  • Review test_batched_eval.py changes
  • Review workflow template changes
  • Review job.slurm changes
  • Post review findings

LGTM - no blocking issues found.

The fix correctly addresses both root causes: the EVAL_CONC env var lost at the Docker boundary (-e EVAL_CONC-e "EVAL_CONC=$EVAL_CONC"), and the validator now takes the workflow-requested concurrency list as its source of truth via --expected-concs instead of trusting eval-produced metadata. Input validation fails closed on empty, non-positive, or duplicate values. The fallback to max(CONC_LIST) when EVAL_CONC is unset matches the runtime's single-concurrency default. Tests cover the new paths well (unreadable metadata with expectations, workflow/metadata mismatch, missing result files, below-threshold scores across a batch).

…iet sglang logs

- validate_scores.py: prefix each PASS/FAIL with [conc=N] taken from the
  result filename so a failing concurrency is identifiable, and line-buffer
  stdout/stderr so the "Loaded thresholds" header prints in emission order
  instead of after the stderr FAIL lines when CI merges the streams.
- generate_sweep_configs.py: mark_all_eval_entries now only expands evals for
  8k1k entries (matching mark_eval_entries); 1k1k entries pass through
  unmarked and are dropped by the evals-only filter, so --all-evals no longer
  schedules 1k1k evals.
- server_sglang.sh / env.sh: launch sglang prefill+decode with
  --log-level ${SGLANG_SERVER_LOG_LEVEL:-warning} and default
  MORI_APP_LOG_LEVEL to WARNING to cut multinode log spam (both overridable;
  server readiness is detected via health endpoints, not log scraping).
@Oseltamivir Oseltamivir changed the title [codex] Fail incomplete multi-concurrency evals [codex] Verify all eval concurrencies, label scores, 8k1k-only all-evals, quieter sglang logs Jun 22, 2026
Comment thread utils/evals/validate_scores.py Fixed
Address github-code-quality review: the except (AttributeError, ValueError)
guard around stream reconfigure was an undocumented empty swallow. Add a
comment noting it is best-effort (wrapped streams such as pytest's capture
object do not support reconfigure). No functional change.
@Oseltamivir Oseltamivir merged commit e791558 into main Jun 22, 2026
9 checks passed
@Oseltamivir Oseltamivir deleted the codex/strict-eval-validation-only branch June 22, 2026 01:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

1 participant