fix(megatron): pre-initialize NCCL communicator for MoE expert DP group to prevent lazy-init deadlock by zb2313 · Pull Request #9486 · modelscope/ms-swift

zb2313 · 2026-06-03T15:59:42Z

Problem

When training large MoE models (e.g. Qwen3.5-122B-A10B) with Expert Parallelism (EP=8, 4 nodes × 8 GPUs), training hangs permanently at the first optimizer step with no error output. After `NCCL_TIMEOUT` elapses, the watchdog reports: Watchdog caught collective operation timeout.

Root cause: PyTorch lazily initializes NCCL communicators. `dist.new_group()` only registers metadata; the actual `ncclCommInitRankConfig` bootstrap fires on the first collective use. For `INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP` (the cross-node expert DP group), this first use is the very first optimizer step, by which point GPU memory is near its limit (e.g. ~128/140 GiB on H200). If any rank cannot allocate the NCCL bootstrap buffer, its bootstrap thread stalls silently (`NCCL_TIMEOUT` does not cover the bootstrap phase) and all other ranks wait forever.
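The timing argument can be illustrated with a toy model in plain Python. This is not PyTorch or NCCL code: `ToyLazyComm`, the 1 GiB bootstrap size, and the free-memory figures are all made up for illustration, and the toy raises an exception where the real bootstrap is reported to stall silently.

```python
GIB = 1024 ** 3


class ToyLazyComm:
    """Toy stand-in for a lazily initialized communicator (hypothetical)."""

    def __init__(self, bootstrap_bytes):
        # Creating the object is cheap, like registering group metadata.
        self.bootstrap_bytes = bootstrap_bytes
        self.initialized = False

    def collective(self, free_bytes):
        # The first collective pays the bootstrap cost; later ones do not.
        if not self.initialized:
            if free_bytes < self.bootstrap_bytes:
                raise MemoryError("bootstrap buffer does not fit")
            self.initialized = True
        return "ok"


def first_use_under_pressure(prewarm):
    """Run a collective when little memory is free, optionally pre-warmed."""
    comm = ToyLazyComm(bootstrap_bytes=1 * GIB)  # made-up buffer size
    if prewarm:
        # Early barrier: plenty of free memory, so the bootstrap succeeds.
        comm.collective(free_bytes=100 * GIB)
    try:
        # "First optimizer step": only half a GiB is free in this toy.
        return comm.collective(free_bytes=GIB // 2)
    except MemoryError:
        return "stuck"


print(first_use_under_pressure(prewarm=False))
print(first_use_under_pressure(prewarm=True))
```

In this toy, the only difference between the two runs is when the first collective happens, which is the same lever the PR pulls.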

Fix

In `_initialize_mpu()`, immediately after `mpu.initialize_model_parallel()` returns (while GPU memory is still nearly empty), call a barrier on this group to force the NCCL bootstrap at a safe time. The group is obtained through the public API `mpu.get_inter_distributed_optimizer_instance_group(check_initialized=False)`, which returns None for dense models or when EP=1, so the change is a no-op for all non-MoE configurations.
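A minimal sketch of this approach is shown below; it is not the actual diff from this PR. The helper name `preinit_expert_dp_communicator` and the injectable `barrier` argument are assumptions added so the logic can be exercised without a GPU cluster. The only real calls used are `mpu.get_inter_distributed_optimizer_instance_group(check_initialized=False)`, as named in the description, and `torch.distributed.barrier(group=...)`. The `getattr` guard mirrors the compatibility check suggested in the review.

```python
from types import SimpleNamespace


def preinit_expert_dp_communicator(mpu, barrier=None):
    """Force the NCCL bootstrap for the expert DP group, if there is one.

    Returns True when a barrier was issued, False when the call was a no-op.
    """
    # Older or customized megatron-core builds may lack this accessor.
    getter = getattr(mpu, "get_inter_distributed_optimizer_instance_group", None)
    if getter is None:
        return False

    # Per the PR description, this is None for dense models or EP=1.
    group = getter(check_initialized=False)
    if group is None:
        return False

    if barrier is None:
        # Imported lazily so the sketch can be tested without torch.
        import torch.distributed as dist
        barrier = dist.barrier

    # The first collective on the group triggers communicator creation.
    barrier(group=group)
    return True


# Usage with stand-in objects (hypothetical; no distributed setup needed).
seen = []
moe_mpu = SimpleNamespace(
    get_inter_distributed_optimizer_instance_group=(
        lambda check_initialized=True: "expert-dp-group"
    )
)
dense_mpu = SimpleNamespace(
    get_inter_distributed_optimizer_instance_group=(
        lambda check_initialized=True: None
    )
)
print(preinit_expert_dp_communicator(moe_mpu, barrier=lambda group: seen.append(group)))
print(preinit_expert_dp_communicator(dense_mpu, barrier=lambda group: seen.append(group)))
```

In a real run the helper would be called with the Megatron `mpu` module and no `barrier` argument, right after `mpu.initialize_model_parallel()` returns. Every rank in the group has to reach the barrier for it to complete.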

Tested on

Qwen3.5-122B-A10B, 4 nodes × 8 × H200, EP=8, TP=1, PP=1
megatron-core: 0.17
Without this fix: deadlock at step 1, every run
With this fix: training runs to completion, no deadlock

…avoid deadlock

For large MoE models (e.g. Qwen3.5-122B-A10B) trained with Expert Parallelism, `INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP` is lazily initialized by PyTorch/NCCL: `dist.new_group()` only registers metadata; the actual `ncclCommInitRankConfig` bootstrap fires on first collective use, which happens at the first optimizer step, by which point GPU memory is near its limit (~125-130 GiB on H200). If any rank cannot allocate the NCCL bootstrap buffer at that moment, its thread stalls silently (`NCCL_TIMEOUT` does NOT cover the bootstrap phase), causing all other ranks to wait forever (deadlock with no error output).

Fix: call a no-op `barrier()` on this group immediately after `initialize_model_parallel()` returns, while GPU memory is still empty, forcing NCCL bootstrap at a safe time. The guard on `get_inter_distributed_optimizer_instance_group(check_initialized=False)` returns None for dense models or EP=1, so this change is a no-op for all non-MoE configurations.

gemini-code-assist

Code Review

This pull request pre-initializes the INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP NCCL communicator during MPU initialization to prevent a lazy-initialization deadlock during the first optimizer step when GPU memory is near its limit. The review feedback suggests checking if the get_inter_distributed_optimizer_instance_group attribute exists on mpu before calling it, ensuring backward compatibility with older or customized versions of megatron-core and preventing potential AttributeError crashes.

Thanks for the feedback. I've added a compatibility check using `hasattr()` before calling `get_inter_distributed_optimizer_instance_group()` and pushed an update.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

gemini-code-assist Bot reviewed Jun 3, 2026

Comment thread swift/megatron/utils/megatron_lm_utils.py Outdated

Jintao-Huang mentioned this pull request Jun 4, 2026

GLM5.1 MoE + PP training hangs at Train 0/100: batch_p2p_comm=True, but unbatched P2P send/recv is actually triggered, causing lazy NCCL communicator init #9451

Open

1 task

zb2313 wants to merge 2 commits into modelscope:main from zb2313:fix/moe-nccl-lazy-init-deadlock
