Skip to content

Add background BuildKit pruning for completed batches#489

Open
neubig wants to merge 1 commit intomainfrom
background-batch-buildkit-prune
Open

Add background BuildKit pruning for completed batches#489
neubig wants to merge 1 commit intomainfrom
background-batch-buildkit-prune

Conversation

@neubig
Copy link
Contributor

@neubig neubig commented Mar 8, 2026

Summary

  • add targeted BuildKit pruning filters for completed image batches
  • run completed-batch prune work in the background with a one-batch delay and timeout
  • keep the existing threshold-based prune as a synchronous fallback when no background prune is active
  • add focused tests for the new pruning behavior

Details

  • Historical investigation across the Jan 8, Feb 5, Feb 19-20, and Mar 3-4 SWE-bench image-build runs suggests the worst failures were not a single monotonic code regression.
  • Low-footprint and high-footprint runs both included true full or near-full rebuilds; the strongest explanation is warm vs. cold registry-backed BuildKit cache rather than just final-image skip behavior.
  • The problematic March runs reached roughly 594-599 GiB of BuildKit usage by batch 19 and were still making progress when canceled (artifacts contained 471 and 285 manifest entries), which points to extreme slowdown with long synchronous docker buildx prune stalls rather than a hard deadlock.
  • This PR moves that prune work off the critical path by pruning completed batches asynchronously, with a one-batch delay so the just-finished batch is eligible while the currently running batch is not.

Testing

  • uv run pytest tests/test_buildx_utils.py
  • uv run pre-commit run --files benchmarks/utils/build_utils.py benchmarks/utils/buildx_utils.py tests/test_buildx_utils.py
  • Triggered a full minimax-m2.5 SWE-bench Verified run against OpenHands/software-agent-sdk using this benchmarks branch.

Evidence

Verification link: View conversation

Full validation workflows completed successfully:

$ gh run view 22824268127 --repo OpenHands/software-agent-sdk --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Run Eval (swebench) PR-489 background BuildKit prune validation","status":"completed","url":"https://github.com/OpenHands/software-agent-sdk/actions/runs/22824268127"}

$ gh run view 22824279365 --repo OpenHands/evaluation --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Eval Job (swebench) PR-489 background BuildKit prune validation","status":"completed","url":"https://github.com/OpenHands/evaluation/actions/runs/22824279365"}

Checklist

  • CI passing
  • Tests are minimal and pass
  • No unnecessary code
  • Evidence from live run (with conversation link if available)
  • All review comments resolved
  • Documentation updated (if applicable)

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig
Copy link
Contributor Author

neubig commented Mar 8, 2026

@OpenHands this has not been completely tested, try testing a run using this by triggering a full run with minimax-m2.5 and swe-bench-verified on OpenHands/software-agent-sdk

@openhands-ai
Copy link

openhands-ai bot commented Mar 8, 2026

I'm on it! neubig can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Mar 8, 2026

I triggered the requested full validation run for this PR using minimax-m2.5 on SWE-Bench Verified against OpenHands/software-agent-sdk.

Summary:

  • Reviewed PR Add background BuildKit pruning for completed batches #489 context and the diff against main.
  • Confirmed the correct eval path is the SDK run-eval.yml workflow with:
    • benchmark=swebenchprinceton-nlp/SWE-bench_Verified
    • eval_limit=500 for a full run
    • benchmarks_branch=background-batch-buildkit-prune to test this PR branch
  • Dispatched the run with:
    • sdk_ref=main
    • allow_unreleased_branches=true
    • model_ids=minimax-m2.5
    • eval_limit=500

Status:

Code changes:

  • None. No files were modified, committed, or pushed.

@neubig neubig marked this pull request as ready for review March 9, 2026 17:44
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Pragmatic Solution with Minor Issues

THE GOOD:
Solves a real problem (synchronous BuildKit prune stalls during large batch builds) by moving expensive I/O off the critical path. The one-batch delay strategy ensures we don't prune layers from the currently running batch - this is good taste.

THE BAD:
Code duplication in wait() and poll() violates DRY. Tests are overly mocked and don't validate that the actual prune filters work correctly.

VERDICT: ✅ Worth merging - the core logic is sound and the evidence from the full SWE-bench run proves it works. The duplication is annoying but not blocking.

KEY INSIGHT: Moving synchronous prune work to background threads is textbook "make the slow parts async" - pragmatic engineering that solves the actual problem without over-engineering.

except Exception as e:
logger.warning("Background BuildKit prune failed: %s", e)
finally:
self._future = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: Code duplication with poll() method.

Lines 136-145 here are identical to lines 115-124 in poll(). Extract this into a helper method:

Suggested change
self._future = None
def _handle_future_result(self) -> None:
"""Process completed prune future and log result."""
try:
pruned_count = self._future.result()
logger.info("Background BuildKit prune completed for %d completed images", pruned_count)
except Exception as e:
logger.warning("Background BuildKit prune failed: %s", e)
finally:
self._future = None

Then both methods can call self._handle_future_result() instead of duplicating this logic.

completed_batch_pruner.poll()
if prunable_batch:
completed_batch_pruner.enqueue_completed_batch(prunable_batch)
prunable_batch = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: Add a comment explaining the one-batch delay strategy.

This is a key design decision - we prune batch N-1 while building batch N - but it's not documented in the code. Consider:

Suggested change
prunable_batch = []
# Prune the previous batch (one-batch delay ensures we don't prune
# layers from the currently running batch)
completed_batch_pruner.poll()
if prunable_batch:
completed_batch_pruner.enqueue_completed_batch(prunable_batch)
prunable_batch = []



@patch("benchmarks.utils.build_utils.tqdm", FakeTqdm)
def test_build_all_images_enqueues_previous_successful_batch_for_pruning(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: Tests are heavily mocked and don't validate real behavior.

Consider adding a test that validates buildkit_prune_filters_for_completed_images() generates the correct filter format without mocking everything:

def test_buildkit_prune_filters_generates_valid_docker_filter_syntax():
    """Ensure filter output matches docker buildx prune --filter format."""
    filters = buildkit_prune_filters_for_completed_images([
        "docker.io/repo/image:tag",
        "another/image:latest"
    ])
    
    # Verify filter can be used in actual docker command
    assert len(filters) == 1
    assert filters[0].startswith("description~=")
    # Verify special chars are escaped
    assert "\\." in filters[0]

The current test verifies orchestration but not that the filters will work with real docker buildx prune.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants