Fix three silent hang causes in unet3d training benchmark#394
Open
FileSystemGuy wants to merge 5 commits into
Open
Fix three silent hang causes in unet3d training benchmark#394FileSystemGuy wants to merge 5 commits into
FileSystemGuy wants to merge 5 commits into
Conversation
- utils.py: replace readline() with buffer.read1() in CommandExecutor to prevent blocking on partial/\r-terminated output from DLIO/MPI processes - cluster_collector.py: wrap collect_local_info() in try/except inside MPI_COLLECTOR_SCRIPT so every rank always reaches comm.gather(), preventing a deadlock when any rank exits early due to a collection error - tests/unit/test_cluster_collector.py: add TestMPICollectorScriptMain with 6 tests covering the gather-always-called invariant
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
idevasena
previously approved these changes
May 26, 2026
Contributor
idevasena
left a comment
There was a problem hiding this comment.
Tested it on my end and fix works @FileSystemGuy
Handoff for the third unet3d hang fix. The pipe-drain blocking .read() replacement (select+read1 with 0.5s timeout) is applied but not yet committed. See .planning/.continue-here.md for full context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…locking on orphaned DataLoader workers PyTorch DataLoader workers are forked inside DLIO MPI ranks after MPI_Init. When mpirun exits (its direct children done), those orphaned grandchildren still hold the pipe write-end open, causing TextIOWrapper.read() in the post-loop drain to block indefinitely. Replacing the blocking .read() calls with a select()+read1() loop (0.5s timeout) un-sticks the orchestrator; the normal (no-orphan) case completes immediately on EOF with zero added latency.
idevasena
approved these changes
May 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
utils.py: Replacereadline()withbuffer.read1()inCommandExecutorto prevent an indefinite block when DLIO or MPI writes\r-terminated progress output or any partial line without a trailing newline.cluster_collector.py: Wrapcollect_local_info()in try/except insideMPI_COLLECTOR_SCRIPTso every rank unconditionally reachescomm.gather(), preventing a deadlock when any rank exits early due to a collection error.utils.py(2e74360): Replace the post-loop blockingTextIOWrapper.read()drain calls with aselect()+read1()loop (0.5s timeout). PyTorch DataLoader workers are forked inside DLIO MPI ranks afterMPI_Init; when mpirun exits, those orphaned grandchildren still hold the pipe write-end open, causing the originalread()to block indefinitely. Theselect()approach returns immediately on EOF in the normal (no-orphan) case — zero added latency — and times out after 0.5s when orphans are present, un-sticking the orchestrator.tests/unit/test_cluster_collector.py: AddTestMPICollectorScriptMain(6 tests) covering the gather-always-called invariant, using exec() + mocked mpi4py — no real MPI processes required.Test plan
uv run pytest tests/unit/test_cluster_collector.py::TestMPICollectorScriptMain -v— all 6 new tests passuv run pytest tests/unit -q— no regressions (956 passed, 8 pre-existing failures unrelated to these changes)uv run pytest tests/unit -qafter 2e74360 — 176 tests pass, no regressions