fix(megatron): pre-initialize NCCL communicator for MoE expert DP group to prevent lazy-init deadlock #9486
zb2313 wants to merge 2 commits into
…avoid deadlock

For large MoE models (e.g. Qwen3.5-122B-A10B) trained with Expert Parallelism, INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP is lazily initialized by PyTorch/NCCL: dist.new_group() only registers metadata, and the actual ncclCommInitRankConfig bootstrap fires on first collective use. For this group, the first use is the first optimizer step, by which point GPU memory is near its limit (~125-130 GiB on H200). If any rank cannot allocate the NCCL bootstrap buffer at that moment, its thread stalls silently (NCCL_TIMEOUT does NOT cover the bootstrap phase), causing all other ranks to wait forever (deadlock with no error output).

Fix: call a no-op barrier() on this group immediately after initialize_model_parallel() returns, while GPU memory is still empty, forcing NCCL bootstrap at a safe time. The guard on get_inter_distributed_optimizer_instance_group(check_initialized=False) returns None for dense models or EP=1, so this change is a no-op for all non-MoE configurations.
Code Review
This pull request pre-initializes the INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP NCCL communicator during MPU initialization to prevent a lazy-initialization deadlock during the first optimizer step when GPU memory is near its limit. The review feedback suggests checking if the get_inter_distributed_optimizer_instance_group attribute exists on mpu before calling it, ensuring backward compatibility with older or customized versions of megatron-core and preventing potential AttributeError crashes.
Thanks for the feedback. I've added a compatibility check using `hasattr()` before calling `get_inter_distributed_optimizer_instance_group()` and pushed an update. Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
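The guarded pre-initialization discussed in this thread can be sketched roughly as follows. This is not the PR's actual diff: the helper name `preinit_expert_dp_communicator` and its injectable `barrier` parameter are hypothetical, added only so the three code paths (MoE group present, group is `None`, getter missing) can be exercised without a GPU cluster. The only real APIs assumed are `torch.distributed.barrier(group=...)` and the `mpu` getter named in the PR.

```python
from types import SimpleNamespace


def preinit_expert_dp_communicator(mpu, barrier=None):
    """Issue one barrier on the expert DP group so NCCL bootstraps early.

    Hypothetical helper. Returns True if a barrier was issued, else False.
    """
    # Older or customized megatron-core builds may not expose this getter,
    # so look it up defensively instead of calling it directly.
    getter = getattr(mpu, "get_inter_distributed_optimizer_instance_group", None)
    if getter is None:
        return False

    # Per the PR description, the getter returns None for dense models or EP=1.
    group = getter(check_initialized=False)
    if group is None:
        return False

    if barrier is None:
        # Imported lazily so the sketch can run without torch installed.
        import torch.distributed as dist
        barrier = dist.barrier

    # The first collective on the group is what triggers communicator setup.
    barrier(group=group)
    return True


# Stand-in objects (assumptions, not real megatron-core modules).
calls = []
fake_barrier = lambda group=None: calls.append(group)

moe_mpu = SimpleNamespace(
    get_inter_distributed_optimizer_instance_group=(
        lambda check_initialized=True: "expert_dp_group"
    )
)
dense_mpu = SimpleNamespace(
    get_inter_distributed_optimizer_instance_group=(
        lambda check_initialized=True: None
    )
)
old_mpu = SimpleNamespace()  # simulates a build without the getter

results = [
    preinit_expert_dp_communicator(m, barrier=fake_barrier)
    for m in (moe_mpu, dense_mpu, old_mpu)
]
```

In real use the helper would presumably be called once, right after `mpu.initialize_model_parallel()` returns, with `barrier` left at its default so that `torch.distributed.barrier` runs on the actual process group.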
Problem
When training large MoE models (e.g. Qwen3.5-122B-A10B) with Expert Parallelism (EP=8, 4 nodes × 8 GPUs), training hangs permanently at the first optimizer step with no error output. After NCCL_TIMEOUT, the watchdog outputs:
Watchdog caught collective operation timeout.
Root cause: PyTorch lazily initializes NCCL communicators. `dist.new_group()` only registers metadata; the actual `ncclCommInitRankConfig` bootstrap fires on the first collective use. For `INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP` (the cross-node expert DP group), this first use is the very first optimizer step, by which point GPU memory is near its limit (e.g. ~128/140 GiB on H200). If any rank cannot allocate the NCCL bootstrap buffer, its bootstrap thread stalls silently (`NCCL_TIMEOUT` does not cover the bootstrap phase) and all other ranks wait forever.
Fix
In `_initialize_mpu()`, immediately after `mpu.initialize_model_parallel()` returns (while GPU memory is still nearly empty), call a barrier on this group to force NCCL bootstrap at a safe time. This uses the public API `mpu.get_inter_distributed_optimizer_instance_group(check_initialized=False)`, which returns `None` for dense models or when `EP=1`, making this change a no-op for all non-MoE configurations.
Tested on