Skip to content

Fix three silent hang causes in unet3d training benchmark#394

Open
FileSystemGuy wants to merge 5 commits into
mainfrom
FileSystemGuy-une3d-hang
Open

Fix three silent hang causes in unet3d training benchmark#394
FileSystemGuy wants to merge 5 commits into
mainfrom
FileSystemGuy-une3d-hang

Conversation

@FileSystemGuy
Copy link
Copy Markdown
Contributor

@FileSystemGuy FileSystemGuy commented May 26, 2026

Summary

  • utils.py: Replace readline() with buffer.read1() in CommandExecutor to prevent an indefinite block when DLIO or MPI writes \r-terminated progress output or any partial line without a trailing newline.
  • cluster_collector.py: Wrap collect_local_info() in try/except inside MPI_COLLECTOR_SCRIPT so every rank unconditionally reaches comm.gather(), preventing a deadlock when any rank exits early due to a collection error.
  • utils.py (2e74360): Replace the post-loop blocking TextIOWrapper.read() drain calls with a select()+read1() loop (0.5s timeout). PyTorch DataLoader workers are forked inside DLIO MPI ranks after MPI_Init; when mpirun exits, those orphaned grandchildren still hold the pipe write-end open, causing the original read() to block indefinitely. The select() approach returns immediately on EOF in the normal (no-orphan) case — zero added latency — and times out after 0.5s when orphans are present, un-sticking the orchestrator.
  • tests/unit/test_cluster_collector.py: Add TestMPICollectorScriptMain (6 tests) covering the gather-always-called invariant, using exec() + mocked mpi4py — no real MPI processes required.

Test plan

  • uv run pytest tests/unit/test_cluster_collector.py::TestMPICollectorScriptMain -v — all 6 new tests pass
  • uv run pytest tests/unit -q — no regressions (956 passed, 8 pre-existing failures unrelated to these changes)
  • uv run pytest tests/unit -q after 2e74360 — 176 tests pass, no regressions

- utils.py: replace readline() with buffer.read1() in CommandExecutor to
  prevent blocking on partial/\r-terminated output from DLIO/MPI processes

- cluster_collector.py: wrap collect_local_info() in try/except inside
  MPI_COLLECTOR_SCRIPT so every rank always reaches comm.gather(), preventing
  a deadlock when any rank exits early due to a collection error

- tests/unit/test_cluster_collector.py: add TestMPICollectorScriptMain with
  6 tests covering the gather-always-called invariant
@FileSystemGuy FileSystemGuy requested a review from a team May 26, 2026 18:58
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 26, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

idevasena
idevasena previously approved these changes May 26, 2026
Copy link
Copy Markdown
Contributor

@idevasena idevasena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested it on my end and fix works @FileSystemGuy

FileSystemGuy and others added 2 commits May 26, 2026 12:53
Handoff for the third unet3d hang fix. The pipe-drain blocking .read()
replacement (select+read1 with 0.5s timeout) is applied but not yet
committed. See .planning/.continue-here.md for full context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…locking on orphaned DataLoader workers

PyTorch DataLoader workers are forked inside DLIO MPI ranks after MPI_Init.
When mpirun exits (its direct children done), those orphaned grandchildren
still hold the pipe write-end open, causing TextIOWrapper.read() in the
post-loop drain to block indefinitely. Replacing the blocking .read() calls
with a select()+read1() loop (0.5s timeout) un-sticks the orchestrator;
the normal (no-orphan) case completes immediately on EOF with zero added latency.
@FileSystemGuy FileSystemGuy changed the title Fix two silent hang causes in unet3d training benchmark Fix three silent hang causes in unet3d training benchmark May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants