Add troubleshooting docs for failed ARC ephemeral runners
Document the issue where failed ephemeral runners consume maxRunners
slots without being garbage-collected, causing jobs to queue forever.
Includes diagnosis commands and the one-liner fix.

- **Runner image issues**: The image must have `/home/runner/run.sh` (the script that launches the GitHub Actions runner); a quick check is sketched below
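
One way to verify an image before pointing ARC at it (a sketch; `your-registry/runner:tag` is a placeholder for your actual image reference):

```bash
# A missing /home/runner/run.sh makes runner pods crash-loop at startup,
# which is one way ephemeral runners end up Failed (see below).
docker run --rm --entrypoint /bin/ls your-registry/runner:tag -l /home/runner/run.sh
```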
### Jobs queuing forever / failed ephemeral runners
ARC does **not** garbage-collect failed ephemeral runners. If pods fail to start (transient image pull errors, node issues, resource contention), ARC retries 5 times, then marks the ephemeral runner as `Failed` with the reason `TooManyPodFailures`. These zombie runners still count against `maxRunners`, so the autoscaler thinks the cluster is full even though the GPUs sit idle.
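
To confirm this failure mode, inspect each runner's phase and reason directly. A minimal sketch, assuming the standard EphemeralRunner status fields (`status.phase`, `status.reason`):

```bash
# Zombie runners show PHASE=Failed with REASON=TooManyPodFailures
# while still occupying a maxRunners slot.
kubectl get ephemeralrunner -n arc-runners \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,REASON:.status.reason'
```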
**Symptoms:**
- GitHub Actions jobs stuck in "Queued" indefinitely
- `kubectl get autoscalingrunnerset -n arc-runners` shows `CURRENT RUNNERS` at max but `RUNNING RUNNERS` much lower
- `kubectl get ephemeralrunner -n arc-runners` shows runners with status `Failed`
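
**Fix:** delete the `Failed` ephemeral runners so their `maxRunners` slots are released. A sketch of the kind of one-liner the commit message refers to, assuming the `Failed` phase shown above is exposed at `status.phase`:

```bash
# Delete every EphemeralRunner stuck in phase Failed; each deletion
# frees one slot counted against maxRunners.
kubectl get ephemeralrunner -n arc-runners \
  -o jsonpath='{range .items[?(@.status.phase=="Failed")]}{.metadata.name}{"\n"}{end}' \
  | xargs -r kubectl delete ephemeralrunner -n arc-runners
```
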
Once the failed runners are deleted, ARC immediately creates new ephemeral runners for the queued jobs, and k8s schedules them onto the freed GPU slots. No `helm upgrade` or restart needed.