Recycle worker processes to prevent OOM from heap fragmentation #442
Open

simonrosenberg wants to merge 2 commits into main from
Conversation
Workers in `ProcessPoolExecutor` accumulate fragmented memory over time because CPython's pymalloc does not return freed memory to the OS. With 30 workers processing 500+ instances over 5+ hours, RSS grows monotonically until the container hits its memory limit and is OOMKilled.

Add `max_tasks_per_child=10` to recycle workers every 10 instances, releasing accumulated heap fragmentation back to the OS.

Fixes #441

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 1cd2110 to 5aa36b6
Temporarily remove max_tasks_per_child to determine if the startup OOM is caused by the code change or by the fresh Docker image build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Add `max_tasks_per_child=10` to `ProcessPoolExecutor` to recycle worker processes, releasing accumulated heap fragmentation back to the OS
- Replace `model_copy(deep=True)` with `model_copy()` (shallow copy) for `EvalOutput.metadata`: only `.lmnr` is mutated on the copy, so deep copying nested LLM config, env vars, etc. is unnecessary
Context

PR #434 fixed parent process memory accumulation by releasing `EvalOutput.history` after disk write. However, worker processes (30 of them) each have their own Python heap that grows independently. CPython's pymalloc allocator marks freed blocks as reusable but does not return them to the OS, so each worker's RSS grows monotonically.

With 30 workers running for 5+ hours (e.g. `eval-22318536157-qwen3-code-fjvslw`, which OOMed after 5h4m), the combined RSS exceeds the 8Gi container limit.

Performance tradeoff of `MAX_TASKS_PER_CHILD`

Set to 10 as a constant. With ~17 instances per worker (500 total / 30 workers), each worker restarts once or twice during the run. The restart cost (~1-2 s for process spawn plus module re-import) is negligible compared to per-instance runtime (minutes).
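A back-of-envelope check of the restart overhead, using the figures stated above (the per-restart cost is an assumption, taken at the high end of the ~1-2 s estimate):

```python
import math

# Figures from the PR description; RESTART_COST_S is an assumed upper bound.
INSTANCES = 500
WORKERS = 30
MAX_TASKS_PER_CHILD = 10
RESTART_COST_S = 2

tasks_per_worker = math.ceil(INSTANCES / WORKERS)                 # 17
restarts = math.ceil(tasks_per_worker / MAX_TASKS_PER_CHILD) - 1  # 1
overhead_s = restarts * RESTART_COST_S                            # ~2 s per worker

print(tasks_per_worker, restarts, overhead_s)  # → 17 1 2
```

So each worker pays on the order of seconds across a multi-hour run, which supports the "negligible" claim.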
Test plan
Fixes #441
🤖 Generated with Claude Code