fix(megatron): pre-initialize NCCL communicator for MoE expert DP group to prevent lazy-init deadlock #9486

Open
zb2313 wants to merge 2 commits into modelscope:main from zb2313:fix/moe-nccl-lazy-init-deadlock

@zb2313 zb2313 commented Jun 3, 2026

Problem

When training large MoE models (e.g. Qwen3.5-122B-A10B) with Expert Parallelism (EP=8, 4 nodes × 8 GPUs), training hangs permanently at the first optimizer step with no error output. After NCCL_TIMEOUT, the watchdog outputs: Watchdog caught collective operation timeout.

Root cause: PyTorch lazily initializes NCCL communicators. dist.new_group() only registers metadata; the actual ncclCommInitRankConfig bootstrap fires on the first collective use. For INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP (the cross-node expert DP group), this first use is the very first optimizer step, by which point GPU memory is near its limit (e.g. ~128/140 GiB on H200). If any rank cannot allocate the NCCL bootstrap buffer, its bootstrap thread stalls silently (NCCL_TIMEOUT does not cover the bootstrap phase), and all other ranks wait forever.

Fix

In _initialize_mpu(), immediately after mpu.initialize_model_parallel() returns (while GPU memory is still nearly empty), call a barrier on this group to force the NCCL bootstrap at a safe time. The group is obtained through the public API mpu.get_inter_distributed_optimizer_instance_group(check_initialized=False), which returns None for dense models or when EP=1, so this change is a no-op for all non-MoE configurations.
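
The approach described above could look roughly like the following. This is a minimal sketch, not the actual diff in this PR: the helper name `pre_initialize_expert_dp_group` and its injectable `barrier` parameter are hypothetical, added only so the logic can be exercised without a cluster. The getter name, its `check_initialized=False` argument, and its None return for dense or EP=1 runs are taken from the description above; the `getattr` guard mirrors the compatibility check requested in review.

```python
from typing import Any, Callable, Optional


def pre_initialize_expert_dp_group(
    mpu: Any,
    barrier: Optional[Callable[..., Any]] = None,
) -> bool:
    """Issue one barrier on the cross-node expert DP group, if it exists.

    Returns True when a barrier was issued and False when the call was a
    no-op. The name and signature of this helper are illustrative only.
    """
    # Older or customized megatron-core builds may not expose this getter,
    # so look it up defensively instead of calling it directly.
    getter = getattr(mpu, "get_inter_distributed_optimizer_instance_group", None)
    if getter is None:
        return False

    # Per the PR description, this is None for dense models or when EP=1.
    group = getter(check_initialized=False)
    if group is None:
        return False

    if barrier is None:
        # Imported lazily so the sketch can be tested without PyTorch.
        import torch.distributed as dist

        barrier = dist.barrier

    # The first collective on the group is what triggers communicator
    # creation, so running it here moves that cost to start-up time.
    barrier(group=group)
    return True
```

In this sketch the helper would be called once, right after `mpu.initialize_model_parallel()` returns and before model weights or optimizer state are allocated.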

Tested on

  • Qwen3.5-122B-A10B, 4 nodes × 8 × H200, EP=8, TP=1, PP=1
  • megatron-core: 0.17
  • Without this fix: deadlock at step 1, every run
  • With this fix: training runs to completion, no deadlock

…avoid deadlock

  For large MoE models (e.g. Qwen3.5-122B-A10B) trained with Expert Parallelism,
  INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP is lazily initialized by PyTorch/NCCL:
  dist.new_group() only registers metadata; the actual ncclCommInitRankConfig
  bootstrap fires on first collective use, which happens at the first optimizer
  step — by which point GPU memory is near its limit (~125-130 GiB on H200).

  If any rank cannot allocate the NCCL bootstrap buffer at that moment, its
  thread stalls silently (NCCL_TIMEOUT does NOT cover the bootstrap phase),
  causing all other ranks to wait forever (deadlock with no error output).

  Fix: call a no-op barrier() on this group immediately after
  initialize_model_parallel() returns, while GPU memory is still empty, forcing
  NCCL bootstrap at a safe time. The guard on
  get_inter_distributed_optimizer_instance_group(check_initialized=False)
  returns None for dense models or EP=1, so this change is a no-op for all
  non-MoE configurations.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request pre-initializes the INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP NCCL communicator during MPU initialization to prevent a lazy-initialization deadlock during the first optimizer step when GPU memory is near its limit. The review feedback suggests checking if the get_inter_distributed_optimizer_instance_group attribute exists on mpu before calling it, ensuring backward compatibility with older or customized versions of megatron-core and preventing potential AttributeError crashes.

Comment thread swift/megatron/utils/megatron_lm_utils.py Outdated
Thanks for the feedback. I've added a compatibility check using `hasattr()` before calling `get_inter_distributed_optimizer_instance_group()` and pushed an update.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>