Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 23 additions & 3 deletions .github/workflows/benchmark-multinode-tmpl.yml
Original file line number Diff line number Diff line change
Expand Up @@ -172,9 +172,29 @@ jobs:
run: &slurm-cleanup |
if command -v squeue >/dev/null 2>&1; then
echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} ..."
scancel --name="${{ runner.name }}" || true
while [ -n "$(squeue --name='${{ runner.name }}' --noheader --format='%i')" ]; do
squeue --name="${{ runner.name }}"
# scancel can legitimately run a while: it triggers the node epilog,
# which may be a slow/complex script. Give it 5min before giving up
# (30s was too short — it could kill an epilog that was still working).
timeout 300 scancel --name="${{ runner.name }}" 2>/dev/null || true
# Bound the drain: on the NVIDIA clusters squeue/scancel intermittently
# HANG (unresponsive slurmctld / munge / network — NOT a stuck job),
# wedging this step for 15-20min+ (observed up to 8h) and failing dsr1
# multinode legs on gb300-nv, gb200, b200 (CoreWeave unaffected). Proven
# live 2026-06-22: gb300-nv_2 answered squeue in 37ms, then the same
# runner's cleanup squeue hung >6min only 14min later; gb300-nv_0 hung
# concurrently. So: timeout-wrap every slurm call (a hung squeue returns
# empty -> the while-condition is false -> loop exits and we proceed),
# and cap the whole drain at 5min with a force-KILL instead of looping
# forever. A real not-yet-cleared job still gets the full 5min to drain.
_drain_deadline=$((SECONDS + 300))
while [ -n "$(timeout 30 squeue --name='${{ runner.name }}' --noheader --format='%i' 2>/dev/null)" ]; do
if [ "$SECONDS" -ge "$_drain_deadline" ]; then
echo "[Slurm] drain exceeded 5min; force-cancelling (KILL) and proceeding"
timeout 60 scancel --signal=KILL --name="${{ runner.name }}" 2>/dev/null || true
sleep 5
break
fi
timeout 30 squeue --name="${{ runner.name }}" 2>/dev/null || true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hung squeue skips force-KILL

Medium Severity

The drain loop treats a timed-out squeue in the while test the same as an empty queue, so it can exit without running the _drain_deadline force-KILL block. After jobs were seen and the five-minute window may have elapsed, a hung squeue still skips scancel --signal=KILL, leaving named jobs and risking a colliding sbatch.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 00a040f. Configure here.

sleep 5
done
fi
Expand Down