Add background BuildKit pruning for completed batches by neubig · Pull Request #489 · OpenHands/benchmarks

neubig · 2026-03-08T14:42:21Z

Summary

add targeted BuildKit pruning filters for completed image batches
run completed-batch prune work in the background with a one-batch delay and timeout
keep the existing threshold-based prune as a synchronous fallback when no background prune is active
add focused tests for the new pruning behavior

Details

Historical investigation across the Jan 8, Feb 5, Feb 19-20, and Mar 3-4 SWE-bench image-build runs suggests the worst failures were not a single monotonic code regression.
Low-footprint and high-footprint runs both included true full or near-full rebuilds; the strongest explanation is warm vs. cold registry-backed BuildKit cache rather than just final-image skip behavior.
The problematic March runs reached roughly 594-599 GiB of BuildKit usage by batch 19 and were still making progress when canceled (artifacts contained 471 and 285 manifest entries), which points to extreme slowdown with long synchronous docker buildx prune stalls rather than a hard deadlock.
This PR moves that prune work off the critical path by pruning completed batches asynchronously, with a one-batch delay so the just-finished batch is eligible while the currently running batch is not.

Testing

uv run pytest tests/test_buildx_utils.py
uv run pre-commit run --files benchmarks/utils/build_utils.py benchmarks/utils/buildx_utils.py tests/test_buildx_utils.py
Triggered a full minimax-m2.5 SWE-bench Verified run against OpenHands/software-agent-sdk using this benchmarks branch.

Evidence

Verification link: View conversation

Full validation workflows completed successfully:

$ gh run view 22824268127 --repo OpenHands/software-agent-sdk --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Run Eval (swebench) PR-489 background BuildKit prune validation","status":"completed","url":"https://github.com/OpenHands/software-agent-sdk/actions/runs/22824268127"}

$ gh run view 22824279365 --repo OpenHands/evaluation --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Eval Job (swebench) PR-489 background BuildKit prune validation","status":"completed","url":"https://github.com/OpenHands/evaluation/actions/runs/22824279365"}

Checklist

CI passing
Tests are minimal and pass
No unnecessary code
Evidence from live run (with conversation link if available)
All review comments resolved
Documentation updated (if applicable)

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-03-08T15:37:49Z

@OpenHands this has not been completely tested, try testing a run using this by triggering a full run with minimax-m2.5 and swe-bench-verified on OpenHands/software-agent-sdk

openhands-ai · 2026-03-08T15:38:09Z

I'm on it! neubig can track my progress at all-hands.dev

openhands-ai · 2026-03-08T15:42:06Z

I triggered the requested full validation run for this PR using minimax-m2.5 on SWE-Bench Verified against OpenHands/software-agent-sdk.

Summary:

Reviewed PR Add background BuildKit pruning for completed batches #489 context and the diff against main.
Confirmed the correct eval path is the SDK run-eval.yml workflow with:
- benchmark=swebench → princeton-nlp/SWE-bench_Verified
- eval_limit=500 for a full run
- benchmarks_branch=background-batch-buildkit-prune to test this PR branch
Dispatched the run with:
- sdk_ref=main
- allow_unreleased_branches=true
- model_ids=minimax-m2.5
- eval_limit=500

Status:

SDK dispatch workflow completed successfully:
- OpenHands/software-agent-sdk run 22824268127
- https://github.com/OpenHands/software-agent-sdk/actions/runs/22824268127
It successfully launched the actual evaluation:
- OpenHands/evaluation run 22824279365
- https://github.com/OpenHands/evaluation/actions/runs/22824279365
- Status at last check: in_progress

Code changes:

None. No files were modified, committed, or pushed.

all-hands-bot

🟡 Acceptable - Pragmatic Solution with Minor Issues

THE GOOD:
Solves a real problem (synchronous BuildKit prune stalls during large batch builds) by moving expensive I/O off the critical path. The one-batch delay strategy ensures we don't prune layers from the currently running batch - this is good taste.

THE BAD:
Code duplication in wait() and poll() violates DRY. Tests are overly mocked and don't validate that the actual prune filters work correctly.

VERDICT: ✅ Worth merging - the core logic is sound and the evidence from the full SWE-bench run proves it works. The duplication is annoying but not blocking.

KEY INSIGHT: Moving synchronous prune work to background threads is textbook "make the slow parts async" - pragmatic engineering that solves the actual problem without over-engineering.

all-hands-bot · 2026-03-09T17:47:37Z

benchmarks/utils/buildx_utils.py

+                except Exception as e:
+                    logger.warning("Background BuildKit prune failed: %s", e)
+                finally:
+                    self._future = None


🟠 Important: Code duplication with poll() method.

Lines 136-145 here are identical to lines 115-124 in poll(). Extract this into a helper method:

Suggested change

self._future = None

def _handle_future_result(self) -> None:

"""Process completed prune future and log result."""

try:

pruned_count = self._future.result()

logger.info("Background BuildKit prune completed for %d completed images", pruned_count)

except Exception as e:

logger.warning("Background BuildKit prune failed: %s", e)

finally:

self._future = None

Then both methods can call self._handle_future_result() instead of duplicating this logic.

all-hands-bot · 2026-03-09T17:47:37Z

benchmarks/utils/build_utils.py

+                completed_batch_pruner.poll()
+                if prunable_batch:
+                    completed_batch_pruner.enqueue_completed_batch(prunable_batch)
+                    prunable_batch = []


🟡 Suggestion: Add a comment explaining the one-batch delay strategy.

This is a key design decision - we prune batch N-1 while building batch N - but it's not documented in the code. Consider:

Suggested change

prunable_batch = []

# Prune the previous batch (one-batch delay ensures we don't prune

# layers from the currently running batch)

completed_batch_pruner.poll()

if prunable_batch:

completed_batch_pruner.enqueue_completed_batch(prunable_batch)

prunable_batch = []

all-hands-bot · 2026-03-09T17:47:37Z

tests/test_buildx_utils.py

+
+
+@patch("benchmarks.utils.build_utils.tqdm", FakeTqdm)
+def test_build_all_images_enqueues_previous_successful_batch_for_pruning(


🟡 Suggestion: Tests are heavily mocked and don't validate real behavior.

Consider adding a test that validates buildkit_prune_filters_for_completed_images() generates the correct filter format without mocking everything:

def test_buildkit_prune_filters_generates_valid_docker_filter_syntax(): """Ensure filter output matches docker buildx prune --filter format.""" filters = buildkit_prune_filters_for_completed_images([ "docker.io/repo/image:tag", "another/image:latest" ]) # Verify filter can be used in actual docker command assert len(filters) == 1 assert filters[0].startswith("description~=") # Verify special chars are escaped assert "\\." in filters[0]

The current test verifies orchestration but not that the filters will work with real docker buildx prune.

Add background pruning for completed build batches

6053062

Co-authored-by: openhands <openhands@all-hands.dev>

neubig marked this pull request as ready for review March 9, 2026 17:44

all-hands-bot reviewed Mar 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add background BuildKit pruning for completed batches#489

Add background BuildKit pruning for completed batches#489
neubig wants to merge 1 commit intomainfrom
background-batch-buildkit-prune

neubig commented Mar 8, 2026 •

edited

Loading

Uh oh!

neubig commented Mar 8, 2026

Uh oh!

openhands-ai bot commented Mar 8, 2026

Uh oh!

openhands-ai bot commented Mar 8, 2026

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot Mar 9, 2026

Uh oh!

all-hands-bot Mar 9, 2026

Uh oh!

all-hands-bot Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants



		@patch("benchmarks.utils.build_utils.tqdm", FakeTqdm)
		def test_build_all_images_enqueues_previous_successful_batch_for_pruning(

Conversation

neubig commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Testing

Evidence

Checklist

Uh oh!

neubig commented Mar 8, 2026

Uh oh!

openhands-ai bot commented Mar 8, 2026

Uh oh!

openhands-ai bot commented Mar 8, 2026

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

🟡 Acceptable - Pragmatic Solution with Minor Issues

Uh oh!

all-hands-bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

neubig commented Mar 8, 2026 •

edited

Loading