fix(ci): bound multinode pre-run Slurm cleanup drain loop (unblocks NVIDIA sweeps)#1820

Open
arygupt wants to merge 2 commits into main from fix/multinode-cleanup-timeout
@arygupt arygupt commented Jun 18, 2026

Collaborator

Problem

benchmark-multinode-tmpl.yml's "Slurm cleanup (pre-run)" step drains jobs named after the runner with no timeout:

scancel --name=$runner || true
while [ -n "$(squeue --name=$runner ...)" ]; do squeue ...; sleep 5; done

On the NVIDIA clusters, squeue/scancel hang (a zombie scancel that can't be reaped, or an unresponsive slurmctld), so the $(squeue ...) in the while condition blocks and the step wedges for 15–20 min or more, failing every dsr1 multinode leg. This was observed across 5 wedged sweep runs on gb300-nv, gb200, and b200. CoreWeave (gb300-cw) is unaffected, so the problem is specific to NVIDIA Slurm. The reusable template resolves from main for pull_request, so a branch-level fix can't apply; it must land here.

Fix

  • Wrap every scancel/squeue call in timeout 30 so a hung call can't block the loop condition.
  • Add a 120 s deadline, after which the step force-KILLs the jobs and proceeds instead of looping forever.

Legs then reach launch (sbatch works on these clusters — glm5-gb300-dynamo-trt succeeds), unblocking measured-power sweeps for everyone.
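
For illustration, a minimal sketch of the bounded drain described above. It is not the workflow's actual step: the function name `drain_jobs`, its positional parameters, and the `--noheader` flag are assumptions made for this example, and the defaults simply mirror the 120 s deadline and 30 s per-call timeout from this description.

```shell
# Sketch only: bounded drain of Slurm jobs that share a runner's name.
# drain_jobs NAME [DEADLINE_S] [POLL_S] [CALL_TIMEOUT_S]   (hypothetical helper)
drain_jobs() {
  local name="$1"
  local deadline_s="${2:-120}"
  local poll_s="${3:-5}"
  local call_timeout_s="${4:-30}"
  local start now
  start=$(date +%s)

  # Bounded cancel: a hung scancel is terminated after call_timeout_s.
  timeout "$call_timeout_s" scancel --name="$name" || true

  # Bounded poll: a hung squeue is terminated and yields empty output,
  # so the loop condition itself can no longer block.
  while [ -n "$(timeout "$call_timeout_s" squeue --name="$name" --noheader 2>/dev/null)" ]; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$deadline_s" ]; then
      echo "drain deadline (${deadline_s}s) reached; sending KILL and proceeding"
      timeout "$call_timeout_s" scancel --signal=KILL --name="$name" || true
      return 0
    fi
    sleep "$poll_s"
  done
  echo "queue drained for $name"
}
```

In a workflow step this might be called as `drain_jobs "$runner"`. Note that a timed-out squeue looks the same as an empty queue here, which is a deliberate "proceed anyway" trade-off in this sketch.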

🤖 Generated with Claude Code


Note

Low Risk
CI-only workflow shell changes with no application, auth, or data-path impact; worst case is leaving a Slurm job uncleared after forced proceed.

Overview
The Slurm cleanup (pre-run) step in benchmark-multinode-tmpl.yml (same YAML anchor as post-run cleanup) no longer blocks indefinitely when NVIDIA-cluster squeue/scancel hang.

scancel is now wrapped in timeout 300 so slow node epilogs can finish without hanging forever. The drain while loop uses timeout 30 on every squeue (and periodic status squeue) so a stuck call returns empty instead of wedging the step. A 5-minute overall drain deadline triggers scancel --signal=KILL and exits the loop so benchmarks can proceed to sbatch even when Slurm control plane calls are unresponsive.

Reviewed by Cursor Bugbot for commit 00a040f.

The 'Slurm cleanup (pre-run)' step waits for jobs named after the runner with
NO timeout. On the NVIDIA clusters squeue/scancel hang (a zombie scancel can't
reap, or unresponsive slurmctld), so the while-condition's $(squeue ...) blocks
and the step wedges 15-20min+, failing EVERY dsr1 multinode leg (gb300-nv,
gb200, b200; CoreWeave gb300-cw is unaffected — 5 wedged sweep runs observed).

Wrap every scancel/squeue in 'timeout 30' so a hung call can't block the loop,
and force-KILL + proceed after a 120s deadline instead of looping forever. The
benchmark legs then reach launch (sbatch works on these clusters — glm5-gb300
succeeds), unblocking measured-power sweeps for everyone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arygupt arygupt requested a review from a team June 18, 2026 02:02

@cquil11 cquil11 left a comment

Collaborator

Unsure if this is necessary; I've never experienced this before.
Can you link to runs where this actually happened?

…og headroom)

Review feedback: 30s was too short — scancel triggers the node epilog, which can be a slow/complex script, so a 30s cap could kill a cleanup that was still legitimately working. Raise scancel to 300s and the overall drain deadline to 300s; squeue stays at 30s (a hung squeue should give up fast so we proceed). A real not-yet-cleared job now gets a full 5min to drain before the force-KILL.

Proven live 2026-06-22: gb300-nv_2 answered squeue in 37ms; 14min later, the same runner's cleanup squeue hung for >6min, with gb300-nv_0 hanging concurrently. This points to an intermittent cluster-wide slurmctld/munge/network hang, not a stuck job. Unbounded, the drain loop froze dsr1 multinode legs 15-20min+ (observed up to 8h).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


sleep 5
break
fi
timeout 30 squeue --name="${{ runner.name }}" 2>/dev/null || true


Hung squeue skips force-KILL

Medium Severity

The drain loop treats a timed-out squeue in the while test the same as an empty queue, so it can exit without running the _drain_deadline force-KILL block. After jobs were seen and the five-minute window may have elapsed, a hung squeue still skips scancel --signal=KILL, leaving named jobs and risking a colliding sbatch.
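
One possible way to address this finding, sketched under assumptions: GNU `timeout` exits with status 124 when the wrapped command times out, so the loop can tell "squeue hung" apart from "queue is empty". The helper name `queue_state`, its parameters, and the `--noheader` flag are hypothetical and not taken from the PR diff.

```shell
# Sketch only: classify the queue as "empty", "busy", or "unknown".
# queue_state NAME [CALL_TIMEOUT_S]   (hypothetical helper)
queue_state() {
  local name="$1"
  local call_timeout_s="${2:-30}"
  local out rc
  # Assign separately from `local` so $? reflects the timeout/squeue status.
  out=$(timeout "$call_timeout_s" squeue --name="$name" --noheader 2>/dev/null)
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo unknown   # squeue hung and was terminated by timeout
  elif [ "$rc" -ne 0 ]; then
    echo unknown   # squeue failed for some other reason
  elif [ -n "$out" ]; then
    echo busy      # at least one job with this name is still listed
  else
    echo empty     # squeue answered and listed nothing
  fi
}
```

A drain loop built on this could exit early only on `empty`, and treat `unknown` like `busy` so that the deadline check and the `scancel --signal=KILL` block still run.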


arygupt commented Jun 22, 2026

Collaborator Author

Root-cause correction from live, non-GPU probes on 2026-06-22:

  • Probe 27988714413: exact squeue query returned immediately, full queue was empty, and there were no CG/CF/CD jobs. The host process table instead contained dozens of stuck timeout 60 sudo rm -rf .../benchmark_logs / root sudo pairs for gharunner0/1/2, some >14 days old. sudo -n true hung; timeout --kill-after=3s 10s was required to kill it.
  • Probe 27988922419: sudo -V and normal user/group/host lookups were immediate, but policy-loading sudo -n -l hung. The host has sudoers: files sss and a running sssd_sudo responder, so the remaining infra fault is in the sudo/SSSD policy path.
  • Wedged run 27987975870 already contained timeout-wrapped Slurm calls. While it was live on gharunner2, the probe saw the exact stuck timeout 60 sudo rm .../benchmark_logs process whose elapsed time matched the Actions cleanup step.

The misleading Actions step contains both Slurm cleanup and a later privileged workspace cleanup. The latter came from unmerged measured-power PR #1574, is inherited by run-only PR #1791, and is absent from main; NVIDIA launchers do not create benchmark_logs. Therefore this PR does not unblock the observed incident. The immediate repo fix is to skip that AMD-only sudo cleanup on NVIDIA and use sudo -n plus timeout --kill-after on AMD. Slurm timeouts can remain separate defensive hardening, but the current incident rationale and claims should be removed.
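
A rough sketch of what that repo fix could look like, assuming the step can tell which vendor's cluster it is running on. The function name `cleanup_logs` and the `vendor` argument are hypothetical; the `timeout --kill-after=3s 10s` values are borrowed from the probe above, not from any merged change.

```shell
# Sketch only: privileged workspace cleanup that cannot wedge the step.
# cleanup_logs VENDOR DIR   (hypothetical helper; VENDOR is "amd" or "nvidia")
cleanup_logs() {
  local vendor="$1"
  local dir="$2"
  local rc
  if [ "$vendor" != "amd" ]; then
    # NVIDIA launchers do not create benchmark_logs, so skip sudo entirely.
    echo "skip: no privileged cleanup on $vendor"
    return 0
  fi
  # sudo -n never prompts; timeout sends TERM after 10s and KILL 3s later.
  timeout --kill-after=3s 10s sudo -n rm -rf -- "$dir"
  rc=$?
  case "$rc" in
    0) echo "removed $dir" ;;
    124|137) echo "sudo cleanup timed out (rc=$rc); continuing" ;;
    *) echo "sudo cleanup failed (rc=$rc); continuing" ;;
  esac
  return 0
}
```

The function always returns 0 so that a failed or hung cleanup is logged but does not fail the benchmark leg; whether that is the right policy is a judgment call for the workflow owners.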
