
Recycle worker processes to prevent OOM from heap fragmentation#442

Open
simonrosenberg wants to merge 2 commits into main from fix/recycle-workers-reduce-memory

Conversation

@simonrosenberg
Collaborator

Summary

  • Add max_tasks_per_child=10 to ProcessPoolExecutor to recycle worker processes, releasing accumulated heap fragmentation back to the OS
  • Replace model_copy(deep=True) with model_copy() (shallow copy) for EvalOutput.metadata — only .lmnr is mutated on the copy, so deep copying nested LLM config, env vars, etc. is unnecessary

Context

PR #434 fixed parent process memory accumulation by releasing EvalOutput.history after disk write. However, worker processes (30 of them) each have their own Python heap that grows independently. CPython's pymalloc allocator marks freed blocks as reusable but does not return them to the OS, so each worker's RSS grows monotonically.

With 30 workers running for 5+ hours (e.g. eval-22318536157-qwen3-code-fjvsl which OOMed after 5h4m), the combined RSS exceeds the 8Gi container limit:

  • 8Gi / 30 workers = ~267MB per worker
  • After 10+ instances with fragmentation: 400-800MB+ per worker
  • 30 × 500MB = ~15GB >> 8Gi limit

Performance tradeoff of MAX_TASKS_PER_CHILD

MAX_TASKS_PER_CHILD is set to 10 as a constant. With ~17 instances per worker (500 total / 30 workers), each worker restarts ~1-2 times during the run. The restart cost (~1-2s for process spawn + module re-import) is negligible compared to per-instance runtime (minutes).

Test plan

  • Run a SWTBench evaluation with 30 workers and verify it completes without OOM
  • Verify evaluation results are identical to a run without this change
  • Monitor container memory usage — should show periodic drops as workers recycle instead of monotonic growth

Fixes #441

🤖 Generated with Claude Code

Workers in ProcessPoolExecutor accumulate fragmented memory over time
because CPython's pymalloc does not return freed memory to the OS.
With 30 workers processing 500+ instances over 5+ hours, RSS grows
monotonically until the container hits its memory limit and is OOMKilled.

Add max_tasks_per_child=10 to recycle workers every 10 instances,
releasing accumulated heap fragmentation back to the OS.

Fixes #441

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg force-pushed the fix/recycle-workers-reduce-memory branch from 1cd2110 to 5aa36b6 Compare February 24, 2026 10:38
Temporarily remove max_tasks_per_child to determine if the startup OOM
is caused by the code change or by the fresh Docker image build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

OOM failures persist in SWTBench evaluations despite fix in #433
