Add background BuildKit pruning for completed batches#489
Add background BuildKit pruning for completed batches#489
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
|
@OpenHands this has not been completely tested, try testing a run using this by triggering a full run with minimax-m2.5 and swe-bench-verified on OpenHands/software-agent-sdk |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
I triggered the requested full validation run for this PR using Summary:
Status:
Code changes:
|
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Acceptable - Pragmatic Solution with Minor Issues
THE GOOD:
Solves a real problem (synchronous BuildKit prune stalls during large batch builds) by moving expensive I/O off the critical path. The one-batch delay strategy ensures we don't prune layers from the currently running batch - this is good taste.
THE BAD:
Code duplication in wait() and poll() violates DRY. Tests are overly mocked and don't validate that the actual prune filters work correctly.
VERDICT: ✅ Worth merging - the core logic is sound and the evidence from the full SWE-bench run proves it works. The duplication is annoying but not blocking.
KEY INSIGHT: Moving synchronous prune work to background threads is textbook "make the slow parts async" - pragmatic engineering that solves the actual problem without over-engineering.
| except Exception as e: | ||
| logger.warning("Background BuildKit prune failed: %s", e) | ||
| finally: | ||
| self._future = None |
There was a problem hiding this comment.
🟠 Important: Code duplication with poll() method.
Lines 136-145 here are identical to lines 115-124 in poll(). Extract this into a helper method:
| self._future = None | |
| def _handle_future_result(self) -> None: | |
| """Process completed prune future and log result.""" | |
| try: | |
| pruned_count = self._future.result() | |
| logger.info("Background BuildKit prune completed for %d completed images", pruned_count) | |
| except Exception as e: | |
| logger.warning("Background BuildKit prune failed: %s", e) | |
| finally: | |
| self._future = None |
Then both methods can call self._handle_future_result() instead of duplicating this logic.
| completed_batch_pruner.poll() | ||
| if prunable_batch: | ||
| completed_batch_pruner.enqueue_completed_batch(prunable_batch) | ||
| prunable_batch = [] |
There was a problem hiding this comment.
🟡 Suggestion: Add a comment explaining the one-batch delay strategy.
This is a key design decision - we prune batch N-1 while building batch N - but it's not documented in the code. Consider:
| prunable_batch = [] | |
| # Prune the previous batch (one-batch delay ensures we don't prune | |
| # layers from the currently running batch) | |
| completed_batch_pruner.poll() | |
| if prunable_batch: | |
| completed_batch_pruner.enqueue_completed_batch(prunable_batch) | |
| prunable_batch = [] |
|
|
||
|
|
||
| @patch("benchmarks.utils.build_utils.tqdm", FakeTqdm) | ||
| def test_build_all_images_enqueues_previous_successful_batch_for_pruning( |
There was a problem hiding this comment.
🟡 Suggestion: Tests are heavily mocked and don't validate real behavior.
Consider adding a test that validates buildkit_prune_filters_for_completed_images() generates the correct filter format without mocking everything:
def test_buildkit_prune_filters_generates_valid_docker_filter_syntax():
"""Ensure filter output matches docker buildx prune --filter format."""
filters = buildkit_prune_filters_for_completed_images([
"docker.io/repo/image:tag",
"another/image:latest"
])
# Verify filter can be used in actual docker command
assert len(filters) == 1
assert filters[0].startswith("description~=")
# Verify special chars are escaped
assert "\\." in filters[0]The current test verifies orchestration but not that the filters will work with real docker buildx prune.
Summary
Details
docker buildx prunestalls rather than a hard deadlock.Testing
uv run pytest tests/test_buildx_utils.pyuv run pre-commit run --files benchmarks/utils/build_utils.py benchmarks/utils/buildx_utils.py tests/test_buildx_utils.pyminimax-m2.5SWE-bench Verified run againstOpenHands/software-agent-sdkusing this benchmarks branch.Evidence
Verification link: View conversation
Full validation workflows completed successfully:
$ gh run view 22824268127 --repo OpenHands/software-agent-sdk --json status,conclusion,displayTitle,url {"conclusion":"success","displayTitle":"Run Eval (swebench) PR-489 background BuildKit prune validation","status":"completed","url":"https://github.com/OpenHands/software-agent-sdk/actions/runs/22824268127"} $ gh run view 22824279365 --repo OpenHands/evaluation --json status,conclusion,displayTitle,url {"conclusion":"success","displayTitle":"Eval Job (swebench) PR-489 background BuildKit prune validation","status":"completed","url":"https://github.com/OpenHands/evaluation/actions/runs/22824279365"}Checklist