fix(ci): bound multinode pre-run Slurm cleanup drain loop (unblocks NVIDIA sweeps) by arygupt · Pull Request #1820 · SemiAnalysisAI/InferenceX

arygupt · 2026-06-18T02:02:43Z

Problem

benchmark-multinode-tmpl.yml's "Slurm cleanup (pre-run)" step drains jobs named after the runner with no timeout:

scancel --name=$runner || true
while [ -n "$(squeue --name=$runner ...)" ]; do squeue ...; sleep 5; done

On the NVIDIA clusters, squeue/scancel hang (a zombie scancel can't reap, or an unresponsive slurmctld), so the while-condition's $(squeue ...) blocks and the step wedges 15–20 min+, failing every dsr1 multinode leg. Observed across 5 wedged sweep runs on gb300-nv, gb200, b200. CoreWeave (gb300-cw) is unaffected, so it's NVIDIA-slurm-specific — and the reusable template resolves from main for pull_request, so a branch-level fix can't apply (must land here).

Fix

Wrap every scancel/squeue call in timeout 30 so a hung call can't block the loop condition.
Add a 120 s deadline: once it expires, force-KILL the jobs and proceed instead of looping forever.

Legs then reach launch (sbatch works on these clusters — glm5-gb300-dynamo-trt succeeds), unblocking measured-power sweeps for everyone.
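The two changes listed under Fix can be sketched roughly as below. This is not the workflow's actual step. The function name drain_jobs, its positional arguments, and the SQUEUE/SCANCEL overrides are hypothetical, added so the loop can be exercised without a cluster. GNU coreutils timeout is assumed.

```shell
# Sketch only: drain_jobs, its arguments, and the SQUEUE/SCANCEL overrides are
# hypothetical names; the 30 s and 120 s defaults mirror the values quoted above.
SQUEUE="${SQUEUE:-squeue}"
SCANCEL="${SCANCEL:-scancel}"

drain_jobs() {
  job_name="$1"
  call_timeout="${2:-30}"    # cap on each individual Slurm call, in seconds
  deadline_s="${3:-120}"     # overall budget for the drain loop, in seconds
  poll_s="${4:-5}"
  deadline=$(( $(date +%s) + deadline_s ))

  # Cancel leftovers from a previous run; a hung scancel is abandoned.
  timeout "$call_timeout" "$SCANCEL" --name="$job_name" || true

  # -h drops the header, so empty output means no matching jobs remain.
  while [ -n "$(timeout "$call_timeout" "$SQUEUE" -h --name="$job_name" 2>/dev/null)" ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "drain deadline reached; sending KILL and proceeding" >&2
      timeout "$call_timeout" "$SCANCEL" --signal=KILL --name="$job_name" || true
      return 0
    fi
    sleep "$poll_s"
  done
}
```

In a workflow step this might be called as drain_jobs "$runner". Note that a squeue call that times out also yields empty output, so this simple form ends the loop on a hung squeue without reaching the KILL branch.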

🤖 Generated with Claude Code

Note

Low Risk
CI-only workflow shell changes with no application, auth, or data-path impact; worst case is leaving a Slurm job uncleared after forced proceed.

Overview
The Slurm cleanup (pre-run) step in benchmark-multinode-tmpl.yml (same YAML anchor as post-run cleanup) no longer blocks indefinitely when NVIDIA-cluster squeue/scancel hang.

scancel is now wrapped in timeout 300 so slow node epilogs can finish without hanging forever. The drain while loop uses timeout 30 on every squeue (and periodic status squeue) so a stuck call returns empty instead of wedging the step. A 5-minute overall drain deadline triggers scancel --signal=KILL and exits the loop so benchmarks can proceed to sbatch even when Slurm control plane calls are unresponsive.
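To illustrate why a stuck call "returns empty", here is a small sketch that uses sleep as a stand-in for a hung squeue (an assumption made purely for demonstration). With GNU coreutils timeout, the command substitution captures nothing and the exit status is non-zero.

```shell
# sleep stands in for a hung squeue here; it is not a Slurm command.
rc=0
out="$(timeout 1 sleep 30)" || rc=$?

# The captured output is empty, so a test like [ -n "$out" ] is false.
# GNU timeout reports status 124 when it had to terminate the command.
echo "output='${out}' status=${rc}"
```

The status is the only thing that separates "the queue is empty" from "the call gave up", because the captured text is identical in both cases.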

Reviewed by Cursor Bugbot for commit 00a040f. Bugbot is set up for automated code reviews on this repo.

The 'Slurm cleanup (pre-run)' step waits for jobs named after the runner with no timeout. On the NVIDIA clusters squeue/scancel hang (a zombie scancel can't reap, or slurmctld is unresponsive), so the while-condition's $(squeue ...) blocks and the step wedges for 15-20 min or more, failing every dsr1 multinode leg. Five wedged sweep runs were observed on gb300-nv, gb200, and b200; CoreWeave gb300-cw is unaffected.

Wrap every scancel/squeue in 'timeout 30' so a hung call can't block the loop, and force-KILL and proceed after a 120 s deadline instead of looping forever. The benchmark legs then reach launch (sbatch works on these clusters; glm5-gb300 succeeds), unblocking measured-power sweeps for everyone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cquil11

unsure if this is necessary. i've never experienced this before
Can you link to runs where this actually happened?

…og headroom)

Review feedback: 30 s was too short. scancel triggers the node epilog, which can be a slow or complex script, so a 30 s cap could kill a cleanup that was still legitimately working. Raise the scancel timeout to 300 s and the overall drain deadline to 300 s; squeue stays at 30 s (a hung squeue should give up fast so we proceed). A real not-yet-cleared job now gets a full 5 min to drain before the force-KILL.

Proven live 2026-06-22: gb300-nv_2 answered squeue in 37 ms; 14 min later the same runner's cleanup squeue hung for more than 6 min, with gb300-nv_0 hanging concurrently. This points to an intermittent cluster-wide slurmctld/munge/network hang, not a stuck job. Unbounded, the drain loop froze dsr1 multinode legs for 15-20 min or more (observed up to 8 h).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-06-22T22:42:18Z

+                sleep 5
+                break
+              fi
+              timeout 30 squeue --name="${{ runner.name }}" 2>/dev/null || true


Hung squeue skips force-KILL

Medium Severity

The drain loop treats a timed-out squeue in the while test the same as an empty queue, so it can exit without running the _drain_deadline force-KILL block. Even when jobs have been seen and the five-minute window has elapsed, a hung squeue still skips scancel --signal=KILL, leaving the named jobs in place and risking a collision with the next sbatch.
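One possible way to address this finding, sketched under assumptions, is to check the exit status of the bounded squeue call and not just its output. The function name poll_queue, the three result words, and the SQUEUE override are hypothetical and do not appear in the PR.

```shell
# Sketch only: poll_queue and the SQUEUE override are hypothetical names.
SQUEUE="${SQUEUE:-squeue}"

# Classify one bounded squeue poll as "empty", "busy", or "unknown".
poll_queue() {
  job_name="$1"
  call_timeout="${2:-30}"
  rc=0
  out="$(timeout "$call_timeout" "$SQUEUE" -h --name="$job_name" 2>/dev/null)" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo unknown    # timeout gave up (124 with GNU timeout) or squeue itself failed
  elif [ -n "$out" ]; then
    echo busy       # at least one matching job is still listed
  else
    echo empty      # squeue answered and listed nothing
  fi
}
```

A drain loop built on this could keep iterating on both busy and unknown, so that the deadline branch and its scancel --signal=KILL still run when squeue is hung.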


arygupt · 2026-06-22T22:46:26Z

Root-cause correction from live, non-GPU probes on 2026-06-22:

Probe 27988714413: exact squeue query returned immediately, full queue was empty, and there were no CG/CF/CD jobs. The host process table instead contained dozens of stuck timeout 60 sudo rm -rf .../benchmark_logs / root sudo pairs for gharunner0/1/2, some >14 days old. sudo -n true hung; timeout --kill-after=3s 10s was required to kill it.
Probe 27988922419: sudo -V and normal user/group/host lookups were immediate, but policy-loading sudo -n -l hung. The host has sudoers: files sss and a running sssd_sudo responder, so the remaining infra fault is in the sudo/SSSD policy path.
Wedged run 27987975870 already contained timeout-wrapped Slurm calls. While it was live on gharunner2, the probe saw the exact stuck timeout 60 sudo rm .../benchmark_logs process whose elapsed time matched the Actions cleanup step.

The misleading Actions step contains both Slurm cleanup and a later privileged workspace cleanup. The latter came from unmerged measured-power PR #1574, is inherited by run-only PR #1791, and is absent from main; NVIDIA launchers do not create benchmark_logs. Therefore this PR does not unblock the observed incident. The immediate repo fix is to skip that AMD-only sudo cleanup on NVIDIA and use sudo -n plus timeout --kill-after on AMD. Slurm timeouts can remain separate defensive hardening, but the current incident rationale and claims should be removed.
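A minimal sketch of the suggested repo fix might look like the following. The helper names bounded and cleanup_logs, the vendor argument, and the directory argument are hypothetical; the real workflow's variables and the benchmark_logs path are not reproduced here. The 3 s kill-after and 10 s limit echo the values used in the probe above, and GNU coreutils timeout is assumed.

```shell
# Sketch only: bounded, cleanup_logs, and their arguments are hypothetical.
bounded() {
  # Send TERM after $1 seconds, then KILL 3 s later if the command ignores TERM.
  limit="$1"; shift
  timeout --kill-after=3s "${limit}s" "$@"
}

cleanup_logs() {
  vendor="$1"
  dir="$2"
  case "$vendor" in
    amd)
      # sudo -n fails instead of prompting when no password-free rule applies.
      bounded 10 sudo -n rm -rf -- "$dir" \
        || echo "privileged cleanup failed or timed out; continuing" >&2
      ;;
    *)
      echo "skipping privileged cleanup on ${vendor}"
      ;;
  esac
}
```

The KILL escalation matters here because the probes found that sudo did not exit on the default TERM signal.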

arygupt requested a review from a team June 18, 2026 02:02

github-project-automation Bot added this to InferenceMAX Board Jun 18, 2026

cquil11 requested changes Jun 22, 2026


cursor Bot reviewed Jun 22, 2026


arygupt wants to merge 2 commits into main from fix/multinode-cleanup-timeout


2 participants
