[codex] Verify all eval concurrencies, label scores, 8k1k-only all-evals, quieter sglang logs#1879
Conversation
Keep the workflow's requested concurrency list independent from eval-produced metadata, fail when any requested result is missing, and preserve space-separated EVAL_CONC through the AMD Docker launch boundary. (cherry picked from commit 5e0f310)
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. 感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致 如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢 PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow 一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。 如需更多帮助,PR 作者可通过 Slack 联系核心维护者。 |
|
Claude finished @Oseltamivir's task in 2m 2s —— View job Reviewing PR #1879
LGTM - no blocking issues found. The fix correctly addresses both root causes: the |
…iet sglang logs
- validate_scores.py: prefix each PASS/FAIL with [conc=N] taken from the
result filename so a failing concurrency is identifiable, and line-buffer
stdout/stderr so the "Loaded thresholds" header prints in emission order
instead of after the stderr FAIL lines when CI merges the streams.
- generate_sweep_configs.py: mark_all_eval_entries now only expands evals for
8k1k entries (matching mark_eval_entries); 1k1k entries pass through
unmarked and are dropped by the evals-only filter, so --all-evals no longer
schedules 1k1k evals.
- server_sglang.sh / env.sh: launch sglang prefill+decode with
--log-level ${SGLANG_SERVER_LOG_LEVEL:-warning} and default
MORI_APP_LOG_LEVEL to WARNING to cut multinode log spam (both overridable;
server readiness is detected via health endpoints, not log scraping).
Address github-code-quality review: the except (AttributeError, ValueError) guard around stream reconfigure was an undocumented empty swallow. Add a comment noting it is best-effort (wrapped streams such as pytest's capture object do not support reconfigure). No functional change.
Description
This PR hardens multi-concurrency eval verification and trims two adjacent rough edges that surface on the same AMD multi-node eval path. It started as the eval-only fix split from #1878 and has since grown to also make failures attributable, stop
--all-evalsfrom scheduling 1k1k evals, and quiet sglang multi-node log spam.The original failure was exposed by run 27896778812, job 82549778335: a job configured for ten concurrencies (
1 2 4 8 16 32 64 128 256 512) finished in ~14 minutes because lm-eval ran only once atnum_concurrent=512, and the existingVerify eval scoresstep passed even though the other nine concurrencies were never evaluated.Changes
1. Verify every requested concurrency (the core fix)
Preserve
EVAL_CONCthrough the AMD Docker launch.benchmarks/multi_node/amd_utils/job.slurmpassed the value with the bare inheritance form-e EVAL_CONC, which reads from the Docker client environment. Behind the Slurm / nested-shell /sudo dockerboundary that environment is stripped, so the container fell back to the maximum benchmark concurrency and ran a single eval. The launcher now passes an explicit, quoted assignment so the space-separated list survives as one Docker argument:-e "EVAL_CONC=$EVAL_CONC"Make the workflow request authoritative. The multi-node
Verify eval scoresstep now derives the expected concurrency list independently of eval-produced metadata —EVAL_CONCwhen supplied, otherwise the max ofCONC_LIST(matching the runtime's single-concurrency fallback) — and passes it to:python3 utils/evals/validate_scores.py --expected-concs "${expected_concs}"Fail closed on incomplete eval output.
validate_scores.pyparses--expected-concsas unique positive integers and compares the workflow request against both eval metadata and result files. Verification now fails when the input is empty/invalid/duplicated; whenmeta_env.jsonis missing/unreadable while a list was supplied; when batched metadata is absent but multiple concurrencies were requested; when metadata, completion, or result-file coverage does not exactly match the request; when any concurrency is listed as failed; or when any checked score is below its threshold. Every requested concurrency must therefore produce a result that reaches the threshold loop.2. Make eval failures attributable by concurrency
validate_scores.pynow prefixes each result line with the concurrency taken from theresults*_concN.jsonfilename, so a failing concurrency is obvious from the log:It also line-buffers stdout/stderr so the
Loaded thresholds…/Model prefix…header prints in emission order instead of being pushed below the unbuffered stderrFAILlines when CI merges the two streams.3. Restrict
--all-evalsto 8k1k (drop 1k1k evals)The default eval policy (
mark_eval_entries) only evaluates 8k1k, but the--all-evalsexpansion (mark_all_eval_entries) had no sequence-length filter and marked every fixed-seq entry — including 1k1k. It now applies the same 8k1k guard: 1k1k entries pass through unmarked and are dropped by the existing evals-only filter, so--all-evalsno longer schedules 1k1k evals. Blast radius is limited to the--all-evalspath.4. Reduce sglang multi-node log spam
python3 -m sglang.launch_server(prefill and decode) was launched with no--log-level, so sglang defaulted to INFO and logged every decode batch for the whole run. The launches now pass--log-level ${SGLANG_SERVER_LOG_LEVEL:-warning}, andMORI_APP_LOG_LEVELdefaults toWARNING. Both are overridable for debugging, and server readiness is detected via health endpoints (/readiness,sync.py barrier --wait-for-all-health), not log scraping, so lowering the level does not affect startup detection.Expected behavior (verification)
For a workflow request of
1 4 8:1 4 8, all above threshold81 4 8, but no result for4Files changed
benchmarks/multi_node/amd_utils/job.slurm— explicit-e "EVAL_CONC=$EVAL_CONC".github/workflows/benchmark-multinode-tmpl.yml— workflow-authoritative--expected-concsutils/evals/validate_scores.py—--expected-concs, per-conc labels, line-buffered streamsutils/evals/test_batched_eval.py— regression coverage incl. conc labellingutils/matrix_logic/generate_sweep_configs.py— 8k1k guard inmark_all_eval_entriesutils/matrix_logic/test_generate_sweep_configs.py— all-evals 8k1k coveragebenchmarks/multi_node/amd_utils/server_sglang.sh—--log-levelon prefill/decode launchesbenchmarks/multi_node/amd_utils/env.sh—MORI_APP_LOG_LEVELdefault WARNINGScope & changelog
server_sglang.sh,env.sh) and the sweep matrix generator, unlike the original eval-only split.--all-evalschange alters which evals are scheduled (8k1k only), not benchmark results.perf-changelog.yamlentry changed.utils/matrix_logic/**and the touched eval files are in thetest-changelog-gatetrigger paths, so that workflow runs; please confirm the gate does not require aperf-changelog.yamlentry for this set.deprecated/; no DSV4 configuration changes.Validation
pytest utils/evals/test_batched_eval.py— 7 passedpytest utils/matrix_logic/test_generate_sweep_configs.py— 107 passedpytest utils/ --deselect test_process_agentic_result.py::test_processor_loads_traces_jsonl_for_theoretical_cache— 374 passed, 1 deselected (the deselected agentic theoretical-cache test also fails on unmodifiedmainand is unrelated)ruff checkclean on changed Python;bash -nclean onjob.slurm,server_sglang.sh,env.shEVAL_CONC=1 2 4 8reaches Docker as one-eargument.Type of Change
--all-evals8k1k-only)Note
Medium Risk
Changes affect CI gating for eval-only jobs and which sweeps get scheduled under
--all-evals; runtime behavior is mostly verification and logging defaults, but incorrectexpected_concslogic could fail good runs or miss incomplete evals.Overview
Multi-concurrency eval verification is tightened so CI cannot pass when only a subset of requested concurrencies ran. The AMD Slurm/Docker path now forwards
EVAL_CONCas an explicit quoted-e "EVAL_CONC=$EVAL_CONC"so space-separated lists survive nested shells. The multinode workflow Verify eval scores step passes workflow-authoritative--expected-concs(EVAL_CONCor max ofCONC_LIST), andvalidate_scores.pyfails closed on manifest/metadata/result mismatches, labels each PASS/FAIL with[conc=N], and line-buffers stdout/stderr for readable CI logs.Sweep scheduling:
--all-evals(mark_all_eval_entries) now only expands 8k1k entries for eval; other sequence lengths (e.g. 1k1k) stay unmarked and drop out under--evals-only.Operational noise: SGLang prefill/decode launches default to
--log-level warning(overridable viaSGLANG_SERVER_LOG_LEVEL), andMORI_APP_LOG_LEVELdefaults toWARNINGinenv.sh.Reviewed by Cursor Bugbot for commit 35dc57c. Bugbot is set up for automated code reviews on this repo. Configure here.