SWT-Bench image build workflows hanging indefinitely (2+ hours) #400

@juanmichelini

Problem

SWT-Bench image build workflows are hanging indefinitely at the "Build and push SWT-Bench images" step, blocking evaluation runs.

Evidence

Stuck SWT-Bench Builds

| Run ID | Started | Duration | Expected | Status | Runner |
|---|---|---|---|---|---|
| 21754330186 | 14:37 UTC | 2h 52m+ | ~10 min | 🚫 Stuck | blacksmith-01kgsp5fsx8n9gacjyqhx4cmxk-32vcpu |
| 21755926698 | 15:27 UTC | 2h 2m+ | ~10 min | 🚫 Stuck | blacksmith-01kgss09n9rdbzzsaf6kv4nkvh-32vcpu |

Configuration for stuck builds:

  • Stuck at step: "Build and push SWT-Bench images"
  • No progress updates since start time (no heartbeats/updates from runner)
  • Commit: 680ce0f564174aecd74394ef083bd421e6dbe5e1
  • Max workers: 4
  • Dataset: eth-sri/SWT-bench_Verified_bm25_27k_zsp
  • Split: test
  • Target: source-minimal

Last Successful SWT-Bench Build

| Run ID | Started | Duration | Status |
|---|---|---|---|
| 21753220072 | 14:01 UTC | 9m 42s | ✅ Success |

Configuration for successful build:

  • Commit: 680ce0f564174aecd74394ef083bd421e6dbe5e1 (same)
  • Max workers: 4 (same)
  • Dataset: eth-sri/SWT-bench_Verified_bm25_27k_zsp (same)
  • Split: test (same)
  • Target: source-minimal (same)
  • Completed normally in expected time

Timeline

14:01 UTC - SWT-Bench build 21753220072 starts
14:12 UTC - SWT-Bench build 21753220072 completes (9m 42s) ✅

[25-minute gap]

14:37 UTC - SWT-Bench build 21754330186 starts
          - Gets stuck immediately, no progress
          - Still running 2h 52m+ later

15:27 UTC - SWT-Bench build 21755926698 starts
          - Gets stuck immediately, no progress
          - Still running 2h 2m+ later

The same commit and configuration succeeded at 14:01 UTC but froze at 14:37 UTC, which suggests that something outside the repository changed between 14:12 UTC and 14:37 UTC.

Impact

2 evaluation pods blocked waiting for SWT-Bench builds:

| Pod | Model | Benchmark | Waiting For | Time Wasted |
|---|---|---|---|---|
| eval-21754233398-claude-4-6-mr4zr | Claude Opus 4.6 | swtbench | Run 21754330186 | 2h 52m+ |
| eval-21755837737-claude-son-9gsdl | Claude Sonnet 4.5 | swtbench | Run 21755926698 | 2h 2m+ |

These pods are stuck polling for build completion every 60 seconds and cannot start evaluation.
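The pods wait forever because the polling loop has no deadline. Below is a minimal sketch of a bounded wait. It is not the actual eval pod code: `wait_for_build`, `BuildTimeout`, and the injected `get_status` callable are hypothetical names, and the loop assumes the status string becomes `"completed"` when the run finishes, as the GitHub Actions API reports.

```python
import time
from typing import Callable


class BuildTimeout(Exception):
    """Raised when the build has not completed before the deadline."""


def wait_for_build(
    get_status: Callable[[], str],
    timeout_s: float = 30 * 60,
    interval_s: float = 60,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns "completed" or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "completed":
            return status
        if clock() >= deadline:
            # Give up instead of polling forever on a frozen build.
            raise BuildTimeout(f"build still {status!r} after {timeout_s:.0f}s")
        sleep(interval_s)
```

With a 30-minute deadline against an expected ~10-minute build, a pod would fail fast and free its slot rather than wait for hours. `clock` and `sleep` are parameters only so the loop can be exercised without real delays.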

Analysis

What's Identical Between Working and Stuck Builds

  • ✅ Same code commit: 680ce0f564174aecd74394ef083bd421e6dbe5e1
  • ✅ Same workflow file (no changes)
  • ✅ Same configuration (max-workers, dataset, split, target)
  • ✅ Same runner type (Blacksmith 32vCPU Ubuntu 22.04)
  • ✅ Same Docker/BuildKit setup

The only known difference is timing: both builds that started after 14:12 UTC froze.

Evidence of Complete Freeze

GitHub Actions API shows no progress updates:

Stuck build 21754330186:

{
  "status": "in_progress",
  "started_at": "2026-02-06T14:37:57Z",
  "completed_at": null,
  "updated_at": "2026-02-06T14:37:29Z"  // No updates in 2h 52m+!
}

Stuck build 21755926698:

{
  "status": "in_progress",
  "started_at": "2026-02-06T15:27:48Z",
  "completed_at": null,
  "updated_at": "2026-02-06T15:27:16Z"  // No updates in 2h 2m+!
}

The absence of heartbeat updates indicates a complete freeze, not slow progress.
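The freeze check can be automated by comparing `updated_at` with the current time. The sketch below uses the field names from the API responses shown above. The helper names, the 30-minute threshold, and the observation time `now` are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(ts: str) -> datetime:
    # Replace the trailing "Z" so fromisoformat also works on older Pythons.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def staleness(run: dict, now: datetime) -> timedelta:
    """Time elapsed since the run last reported an update."""
    return now - parse_utc(run["updated_at"])


def is_frozen(run: dict, now: datetime,
              threshold: timedelta = timedelta(minutes=30)) -> bool:
    """True if the run claims to be in progress but has gone quiet."""
    return run["status"] == "in_progress" and staleness(run, now) > threshold


# Response for stuck build 21754330186, as shown above.
run = {
    "status": "in_progress",
    "started_at": "2026-02-06T14:37:57Z",
    "completed_at": None,
    "updated_at": "2026-02-06T14:37:29Z",
}
# Assumed observation time, chosen to match the reported 2h 52m+.
now = datetime(2026, 2, 6, 17, 30, tzinfo=timezone.utc)

print(staleness(run, now))   # 2:52:31
print(is_frozen(run, now))   # True
```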

Likely Causes

  1. Blacksmith runner infrastructure issue: Something changed on Blacksmith's side between 14:12 and 14:37 UTC

    • Runner allocation changed
    • Network/registry connectivity issue
    • Storage/disk issues
  2. Docker/BuildKit state corruption:

    • BuildKit cache corruption affecting new builds
    • Docker daemon hung/deadlocked
    • Registry (ghcr.io) connection timeout
  3. Concurrent build interference:

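    • First build (14:01) ran alone → succeeded
    • Third build (15:27) started while the second (14:37) was still hung, so the two may be interfering with each other
    • Potential resource contention or lock contention (this would not explain the second build's freeze, since per the timeline the first build had already completed by 14:37)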
  4. GitHub Actions infrastructure:

    • Runner communication issue
    • Job orchestration problem
    • Workflow dispatch timing issue

What to Check

  1. Blacksmith runner status between 14:12 and 14:37 UTC: Were there any incidents/changes?
  2. GitHub Container Registry (ghcr.io) status: Any outages or rate limiting?
  3. Concurrent builds: Is there lock contention in build_images.py with parallel workers?
  4. Runner disk space: BuildKit cache may have filled up
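The disk-space check in item 4 could run as a pre-flight step before the build. Below is a minimal sketch using only the standard library. The function name, the path, and the 20 GiB threshold are assumptions, and the location of the BuildKit cache on the Blacksmith runners is not known from this issue.

```python
import shutil


def assert_free_space(path: str = "/", min_free_gib: float = 20.0) -> float:
    """Return free space in GiB at `path`, raising if it is below the threshold."""
    free_gib = shutil.disk_usage(path).free / 2**30
    if free_gib < min_free_gib:
        raise RuntimeError(
            f"only {free_gib:.1f} GiB free at {path!r}, need {min_free_gib:.1f} GiB"
        )
    return free_gib


if __name__ == "__main__":
    # Hypothetical threshold; tune it to the size of the image set being built.
    print(f"{assert_free_space('/', 20.0):.1f} GiB free")
```

Failing here gives an explicit error in the job log, which is easier to diagnose than a silent hang later in the build.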

Recommendations

Immediate Actions

  1. Cancel stuck workflows (they're consuming runner resources):

    gh run cancel 21754330186 --repo All-Hands-AI/benchmarks
    gh run cancel 21755926698 --repo All-Hands-AI/benchmarks
  2. Delete blocked eval pods (they'll never complete):

    kubectl delete pod eval-21754233398-claude-4-6-mr4zr -n evaluation-jobs
    kubectl delete pod eval-21755837737-claude-son-9gsdl -n evaluation-jobs

Investigation

  1. Review runner logs (if accessible) for both stuck and successful builds
  2. Check Blacksmith runner status around 14:12-14:37 UTC
  3. Test single vs concurrent builds: Try running one SWT-Bench build in isolation
  4. Check ghcr.io rate limits: Verify if registry pushes are throttled
  5. Inspect BuildKit cache: Look for corruption or disk space issues

Preventive Measures

  1. Add timeout to build step:

    - name: Build and push SWT-Bench images
      timeout-minutes: 30  # Fail after 30 min instead of hanging forever
      run: |
        ...
  2. Add progress monitoring:

    # In build_images.py or wrapper script (pseudocode; X, Y, and N are placeholders):
    # every N seconds, print "Progress: Building image X of Y"
  3. Add health checks before build:

    - name: Verify Docker/BuildKit health
      run: |
        docker info
        docker buildx inspect
        df -h
  4. Serialize SWT-Bench builds (prevent concurrent runs):

    concurrency:
      group: build-swt-bench-images  # Global, not per-ref
      cancel-in-progress: true  # Cancels the running build when a new one starts; use false to queue new runs instead
  5. Add retry logic: If build hangs, auto-cancel and retry once
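Items 2 and 5 (progress monitoring and retry) can be combined in a small wrapper around the build command. This is a sketch under stated assumptions, not the project's implementation: `run_with_heartbeat` is a hypothetical helper. It reports elapsed time only, because per-image "X of Y" progress would require hooks inside build_images.py. It retries only on timeout, not on a non-zero exit code.

```python
import subprocess
import sys
import threading
import time


def run_with_heartbeat(cmd, timeout_s, heartbeat_s=30.0, retries=1):
    """Run cmd with a periodic heartbeat; kill it on timeout and retry.

    Returns the exit code of the first attempt that finishes in time.
    Raises subprocess.TimeoutExpired if every attempt times out.
    """
    for attempt in range(1, retries + 2):
        done = threading.Event()
        start = time.monotonic()

        def beat():
            # Event.wait returns False on timeout and True once `done` is set.
            while not done.wait(heartbeat_s):
                elapsed = time.monotonic() - start
                print(f"[heartbeat] attempt {attempt}: {elapsed:.0f}s elapsed",
                      flush=True)

        thread = threading.Thread(target=beat, daemon=True)
        thread.start()
        try:
            # subprocess.run kills the child process when the timeout expires.
            return subprocess.run(cmd, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt} timed out after {timeout_s}s", flush=True)
            if attempt > retries:
                raise
        finally:
            done.set()
            thread.join()


if __name__ == "__main__":
    # Build script path taken from the Environment section; its flags are omitted.
    cmd = [sys.executable, "benchmarks/swtbench/build_images.py"]
    sys.exit(run_with_heartbeat(cmd, timeout_s=30 * 60))
```

The timeout kills only the direct child process, so builds that spawn their own workers may need process-group handling as well. The step-level `timeout-minutes` remains a useful backstop in case the wrapper itself hangs.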


Environment:

  • Workflow: .github/workflows/build-swtbench-images.yml
  • Runner: Blacksmith 32vCPU Ubuntu 22.04
  • Docker Buildx: enabled
  • BuildKit: enabled (plain progress)
  • Dataset: eth-sri/SWT-bench_Verified_bm25_27k_zsp
  • Build script: benchmarks/swtbench/build_images.py

Labels: bug (Something isn't working)