Commit f9eb7e5

Author: Mark Saroufim
Add troubleshooting docs for failed ARC ephemeral runners
Document the issue where failed ephemeral runners consume maxRunners slots without being garbage-collected, causing jobs to queue forever. Includes diagnosis commands and the one-liner fix.
1 parent 5d39c3d commit f9eb7e5

1 file changed: 34 additions & 0 deletions

.claude/skills/arc-gpu-runners.md
@@ -181,6 +181,40 @@ sudo k3s kubectl describe node <new-node-name> | grep amd.com/gpu
- **Listener not starting**: Check controller logs: `kubectl logs -n arc-systems -l app.kubernetes.io/name=gha-rs-controller`
- **Runner image issues**: The image must have `/home/runner/run.sh` (GitHub Actions runner binary)

### Jobs queuing forever / failed ephemeral runners

ARC does **not** garbage-collect failed ephemeral runners. If pods fail to start (transient image pull errors, node issues, resource contention), ARC retries 5 times then marks the ephemeral runner as `Failed` with `TooManyPodFailures`. These zombie runners still count against `maxRunners`, so the autoscaler thinks the cluster is full even though the GPUs are idle.

**Symptoms:**

- GitHub Actions jobs stuck in "Queued" indefinitely
- `kubectl get autoscalingrunnerset -n arc-runners` shows `CURRENT RUNNERS` at max but `RUNNING RUNNERS` much lower
- `kubectl get ephemeralrunner -n arc-runners` shows runners with status `Failed`

**Diagnose:**

```bash
# Count failed vs running runners
sudo k3s kubectl get ephemeralrunner -n arc-runners --no-headers | grep -c Failed
sudo k3s kubectl get ephemeralrunner -n arc-runners --no-headers | grep -c Running

# Check GPU availability across nodes
for node in mia1-p02-g29 mia1-p02-g52 mia1-p02-g53 mia1-p02-g55 mia1-p02-g56; do
  used=$(sudo k3s kubectl describe node $node | grep "amd.com/gpu" | tail -1 | awk '{print $2}')
  echo "$node: $used/8 GPUs used"
done
```
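To see *why* a particular runner is stuck in `Failed`, its status reason can also be inspected with `describe`. A minimal sketch — the `Status` output format shown below is simulated for illustration, so verify the exact field names against your cluster:

```shell
# On the cluster you would run (runner name is a placeholder):
#   sudo k3s kubectl describe ephemeralrunner -n arc-runners <runner-name>
# Simulated Status section of the describe output (format is an assumption):
describe_out='Status:
  Phase:    Failed
  Reason:   TooManyPodFailures
  Message:  ephemeral runner has failed more than 5 times'

# Pull out the failure reason line
printf '%s\n' "$describe_out" | grep -i reason
```

A `Reason` of `TooManyPodFailures` confirms the runner hit the 5-retry limit described above and will never recover on its own.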

**Fix — delete the failed ephemeral runners:**

```bash
sudo k3s kubectl get ephemeralrunner -n arc-runners --no-headers \
  | grep Failed \
  | awk '{print $1}' \
  | xargs sudo k3s kubectl delete ephemeralrunner -n arc-runners
```
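The one-liner only needs the NAME column of the `Failed` rows before handing them to `delete`. A quick sanity check of that selection logic on simulated `kubectl get` output (the runner names here are fabricated for illustration):

```shell
# Simulated `kubectl get ephemeralrunner --no-headers` output (fabricated names)
sample='gpu-runner-abc12   Failed    2h13m
gpu-runner-def34   Running   5m2s
gpu-runner-ghi56   Failed    47m'

# Same selection step as the fix one-liner, minus the actual delete:
# keeps only Failed rows, then prints the first (NAME) column
printf '%s\n' "$sample" | grep Failed | awk '{print $1}'
```

Only the two `Failed` names reach `xargs`, so `Running` runners are never touched.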

ARC will immediately create new ephemeral runners for the queued jobs, and Kubernetes will schedule them onto the freed GPU slots. No `helm upgrade` or restart is needed.

## Current Cluster Info

- **Nodes**: mia1-p02-g29, mia1-p02-g52, mia1-p02-g53, mia1-p02-g55, mia1-p02-g56 (5-node k3s cluster)
