[codex] Enforce complete eval validation and quiet ATOM logs#1878
[codex] Enforce complete eval validation and quiet ATOM logs#1878Oseltamivir wants to merge 2 commits into
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. 感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致 如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢 PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow 一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。 如需更多帮助,PR 作者可通过 Slack 联系核心维护者。 |
|
Claude finished @Oseltamivir's task in 5m 18s —— View job Reviewing PR #1878
LGTM - no blocking issues found. This is a thorough and well-motivated fix for two real problems: (1) eval jobs silently passing with incomplete concurrency coverage due to Key observations from the review:
No master config or perf-changelog changes were made, consistent with the PR's scope statement. |
Summary
This fixes two issues exposed by run 27896778812, job 82549778335:
The change makes the workflow-requested concurrency list an independent verification contract, hardens eval artifact staging/collection/reuse, audits every active launcher path that forwards eval state, and adds fail-closed ATOM logging controls without adding a filter file.
Incident analysis
The affected job requested:
It finished in about 14 minutes because only concurrency
512was evaluated. The AMD multi-node Docker launch used bare-e EVAL_CONCbehindsudo. With normalsudoenvironment filtering, the Docker client did not receive the host value, so the container did not receive the requested list and the eval path fell back to the matrix concurrency.The existing verifier then validated only the files that happened to exist. It had no independent copy of the workflow request, so one valid result could look like a complete batch.
The source log had 81,074 lines. Of those, 72,378 were ATOM logger lines, including 57,941 producer/consumer INFO lines; 2,649 were Uvicorn access lines; and 2,676 were atomesh HTTP INFO lines. The controls in this PR directly target 77,703 lines, or about 95.8% of that log, while preserving warnings and errors.
Eval execution and verification
Preserve the requested list
EVAL_CONCwith an explicit value through the AMD Slurm/Docker boundary instead of relying on Docker's bare environment inheritance throughsudo.Fail on incomplete batches
The workflow now invokes
validate_scores.py --expected-concs ...with the concurrency request derived independently from the matrix inputs. Verification fails when any of the following is true:Verification runs after artifact upload and uses
success() || failure()semantics, so failed and partial eval outputs remain available for diagnosis while theVerify eval scoresstep still turns the job red.Harden artifacts and reuse
results*.jsonartifact.eval_exit_code, expected concurrencies, completed concurrencies, and failed concurrencies inmeta_env.json.collect_eval_results.py.ATOM logging controls
I checked both current upstream ROCm/ATOM (
ea08015c51aeaab40bd39b89eef009df9c148dc3) and the recipe/image-linked revision (5d42d49f9e4292e5b61475917e92e7ec1b1dacb7). The following are InferenceX compatibility controls, not native upstream ATOM variables:ATOM_LOG_LEVEL=WARNINGATOM_UVICORN_LOG_LEVEL=warningATOM_UVICORN_ACCESS_LOG=0ATOMESH_LOG_LEVEL=warnOnly atomesh exposes a native
--log-leveloption. Upstream ATOM hardcodes its Python console handler to INFO and callsuvicorn.run()without logging controls. Therefore this PR uses the existingbenchmarks/multi_node/amd_utils/setup_deps.shto patch the installed package in place before serving. No new file is introduced.The patch is idempotent and applies to spawned worker processes because each process reads the inherited environment. It:
ATOM_LOG_LEVEL;uvicorn.run();ATOMESH_LOG_LEVELto atomesh's native flag.Dependency setup and environment sourcing are explicitly guarded, so the server exits before serving if either returns nonzero. The installed-package patch also fails closed if the package shape is unexpected or if behavior checks show that ATOM INFO remains visible, warnings are duplicated/lost, propagation remains enabled, or Uvicorn access logging remains on. Setting the levels back to
INFO/infoandATOM_UVICORN_ACCESS_LOG=1restores verbose troubleshooting output.Why existing tests missed this
The previous tests mostly asserted that launch commands contained
-e EVAL_CONC; that assertion encoded the unsafe bare form instead of testing the value after thesudoboundary. Shell mocks also ran in a shared test environment and did not modelsudoenvironment filtering.Separately, score validation treated the produced artifacts as the source of truth. No test supplied an independent expected concurrency list, so there was no way to distinguish “one requested and one produced” from “ten requested and one produced.” Collector and reuse tests likewise focused on parsable rows rather than exact batch completeness.
This PR adds regression coverage for explicit environment propagation, invalid/missing/duplicate/unexpected concurrencies, command and staging failures, malformed metadata/results, metric-set drift, non-finite and below-threshold scores, collector rejection, reuse rejection, and matrix schema constraints. The workflow test path filters now include the relevant benchmark templates, eval files, AMD utilities, single-node scripts, and launchers so these changes actually trigger the gate.
Validation performed
uv run --with pytest --with pydantic --with pyyaml --with tabulate pytest utils/ -q: 413 passedbash -nover all changed shell scripts: cleanshellcheck benchmarks/multi_node/amd_utils/setup_deps.sh: cleangit diff --check: cleanLive hardware validation
The source workload is run 27896739211: two MiniMax-M3 MXFP4 MI355X ATOM-disaggregated eval jobs (
1k1kand8k1k), each requesting1 2 4 8 16 32 64 128 256 512against a 1P/1D TP4 topology.That config and launcher are still on PR #1856 rather than
main. To keep this PR scoped, the exact replay will run from a temporary integration branch with PR #1856 layered on top of this commit. The exact replay was dispatched fromcodex/strict-eval-atom-logging-replay-1856at commit83a10e21ff87f109de302f6739b4f66417d18aed: run 27902697778. It is queued; this section will be updated with the terminal result and log-volume evidence.Scope
main.perf-changelog.yamlchanges.deprecated/.Note
Cursor Bugbot is generating a summary for commit 8a923fb. Configure here.