fix(ci): bound multinode pre-run Slurm cleanup drain loop (unblocks NVIDIA sweeps) #1820
arygupt wants to merge 2 commits into
The 'Slurm cleanup (pre-run)' step waits for jobs named after the runner with NO timeout. On the NVIDIA clusters, squeue/scancel hang (a zombie scancel can't reap, or slurmctld is unresponsive), so the while-condition's $(squeue ...) blocks and the step wedges for 15-20min+, failing EVERY dsr1 multinode leg on gb300-nv, gb200, and b200 (5 wedged sweep runs observed; CoreWeave gb300-cw is unaffected).

Wrap every scancel/squeue in 'timeout 30' so a hung call can't block the loop, and force-KILL + proceed after a 120s deadline instead of looping forever. The benchmark legs then reach launch (sbatch works on these clusters; glm5-gb300 succeeds), unblocking measured-power sweeps for everyone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
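For reference, a minimal sketch of the timeout behavior this commit relies on, assuming GNU coreutils `timeout`. The `sleep 5` command stands in for a hung squeue and the 1s limit is shortened for the demo; neither comes from the workflow.

```shell
#!/usr/bin/env bash
# Demo only: 'sleep 5' stands in for a hung squeue (assumption, not from the workflow).

# 1) A bounded call gives control back; GNU timeout exits 124 when the limit is hit.
status=0
timeout 1 sleep 5 || status=$?
echo "exit status: $status"

# 2) In a loop condition, the captured output of a timed-out call is empty,
#    so the test no longer blocks (the '|| true' keeps 'set -e' shells alive).
listing="$(timeout 1 sleep 5 2>/dev/null || true)"
if [ -z "$listing" ]; then
  echo "condition sees an empty listing"
fi
```

The second case shows the trade-off: a timed-out call and a truly empty queue look identical to the loop unless the exit status is checked separately.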
cquil11
left a comment
unsure if this is necessary. i've never experienced this before
Can you link to runs where this actually happened?
…og headroom)

Review feedback: 30s was too short. scancel triggers the node epilog, which can be a slow/complex script, so a 30s cap could kill a cleanup that was still legitimately working. Raise scancel to 300s and the overall drain deadline to 300s; squeue stays at 30s (a hung squeue should give up fast so we proceed). A real not-yet-cleared job now gets a full 5min to drain before the force-KILL.

Proven live 2026-06-22: gb300-nv_2 answered squeue in 37ms; 14min later the same runner's cleanup squeue hung for >6min, with gb300-nv_0 hanging concurrently. That points to an intermittent cluster-wide slurmctld/munge/network hang, not a stuck job. Unbounded, the drain loop froze dsr1 multinode legs for 15-20min+ (observed up to 8h).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
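The limits described in this commit could be wired together roughly as below. This is a hedged sketch, not the step's actual code: the function name `drain_runner_jobs`, the variable names, and the 5s poll interval are assumptions; only the 300s/30s/300s values and the `--name` / `--signal=KILL` usage come from the discussion above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bounded Slurm drain; not the PR's actual step.
# Defaults follow the commit message and can be overridden via the environment.
SCANCEL_TIMEOUT="${SCANCEL_TIMEOUT:-300}"   # headroom for a slow node epilog
SQUEUE_TIMEOUT="${SQUEUE_TIMEOUT:-30}"      # a hung squeue should give up fast
DRAIN_DEADLINE="${DRAIN_DEADLINE:-300}"     # overall drain budget, in seconds
POLL_INTERVAL="${POLL_INTERVAL:-5}"         # assumed poll interval

drain_runner_jobs() {
  local name="$1"
  local deadline=$(( SECONDS + DRAIN_DEADLINE ))

  # Ask Slurm to cancel; a hung scancel is cut off instead of blocking the step.
  timeout "$SCANCEL_TIMEOUT" scancel --name="$name" 2>/dev/null || true

  # -h drops the header, so empty output means "no jobs listed".
  while [ -n "$(timeout "$SQUEUE_TIMEOUT" squeue --name="$name" -h 2>/dev/null)" ]; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "drain deadline reached; sending KILL and proceeding"
      timeout "$SCANCEL_TIMEOUT" scancel --signal=KILL --name="$name" 2>/dev/null || true
      return 0
    fi
    sleep "$POLL_INTERVAL"
  done
  echo "no jobs left for $name"
}
```

A caller would run something like `drain_runner_jobs "$RUNNER_NAME"` before sbatch. Note that in this form a timed-out squeue also produces empty output and ends the loop without the KILL.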
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 00a040f.
    sleep 5
    break
    fi
    timeout 30 squeue --name="${{ runner.name }}" 2>/dev/null || true
Hung squeue skips force-KILL
Medium Severity
The drain loop treats a timed-out squeue in the while test the same as an empty queue, so it can exit without running the _drain_deadline force-KILL block. Even when jobs were seen earlier and the five-minute window may already have elapsed, a hung squeue still skips scancel --signal=KILL, leaving the named jobs in place and risking a collision with the next sbatch.
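One way to address this, sketched under assumptions: check the exit status of the bounded squeue so a timeout is reported as "unknown" and never mistaken for an empty queue. The helper names `poll_queue_state` and `drain_checked` and the environment variables are hypothetical and not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one bounded squeue poll.
# Prints "empty", "busy", or "unknown" (timed out or otherwise failed).
poll_queue_state() {
  local name="$1" out rc=0
  out="$(timeout "${SQUEUE_TIMEOUT:-30}" squeue --name="$name" -h 2>/dev/null)" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo unknown        # GNU timeout exits 124 when the limit is hit
  elif [ -n "$out" ]; then
    echo busy
  else
    echo empty
  fi
}

# Only a confirmed-empty queue ends the loop early; "busy" and "unknown"
# both keep waiting until the deadline, then force-KILL and proceed.
drain_checked() {
  local name="$1" deadline=$(( SECONDS + ${DRAIN_DEADLINE:-300} ))
  while [ "$(poll_queue_state "$name")" != empty ]; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      timeout "${SCANCEL_TIMEOUT:-300}" scancel --signal=KILL --name="$name" 2>/dev/null || true
      return 1          # proceeded without confirming the queue is empty
    fi
    sleep "${POLL_INTERVAL:-5}"
  done
}
```

Returning 1 lets the caller log that cleanup was not confirmed; a step running under `set -e` would call it as `drain_checked "$name" || true` to keep going.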
Root-cause correction from live, non-GPU probes on 2026-06-22:
The misleading Actions step contains both Slurm cleanup and a later privileged workspace cleanup. The latter came from unmerged measured-power PR #1574, is inherited by run-only PR #1791, and is absent from


Problem
benchmark-multinode-tmpl.yml's "Slurm cleanup (pre-run)" step drains jobs named after the runner with no timeout.
On the NVIDIA clusters, squeue/scancel hang (a zombie scancel can't reap, or an unresponsive slurmctld), so the while-condition's $(squeue ...) blocks and the step wedges 15–20 min+, failing every dsr1 multinode leg. Observed across 5 wedged sweep runs on gb300-nv, gb200, b200. CoreWeave (gb300-cw) is unaffected, so it's NVIDIA-slurm-specific. The reusable template resolves from main for pull_request, so a branch-level fix can't apply (must land here).
Fix
timeout 30-wrap every scancel/squeue so a hung call can't block the loop condition.
Force-KILL + proceed after a deadline instead of looping forever.
Legs then reach launch (sbatch works on these clusters; glm5-gb300-dynamo-trt succeeds), unblocking measured-power sweeps for everyone.
🤖 Generated with Claude Code
Note
Low Risk
CI-only workflow shell changes with no application, auth, or data-path impact; worst case is leaving a Slurm job uncleared after forced proceed.
Overview
The Slurm cleanup (pre-run) step in benchmark-multinode-tmpl.yml (same YAML anchor as post-run cleanup) no longer blocks indefinitely when NVIDIA-cluster squeue/scancel hang.
scancel is now wrapped in timeout 300 so slow node epilogs can finish without hanging forever. The drain while loop uses timeout 30 on every squeue (and periodic status squeue) so a stuck call returns empty instead of wedging the step. A 5-minute overall drain deadline triggers scancel --signal=KILL and exits the loop so benchmarks can proceed to sbatch even when Slurm control plane calls are unresponsive.