Skip to content

fix(base): delete outer Communicator in CommDestroy#15

Merged
Ziminli merged 1 commit into
InfiniTensor:masterfrom
GordonYang1:fix/delete-communicator-on-comm-destroy
May 20, 2026
Merged

fix(base): delete outer Communicator in CommDestroy#15
Ziminli merged 1 commit into
InfiniTensor:masterfrom
GordonYang1:fix/delete-communicator-on-comm-destroy

Conversation

@GordonYang1
Copy link
Copy Markdown
Collaborator

@GordonYang1 GordonYang1 commented May 19, 2026

Summary

Fixes the Communicator lifetime leak noted in the "Known Issues & Future Work" section of #10.

CommInitAll::Execute allocates the outer Communicator via new, but the previous CommDestroy only tore down the backend instance and never deleted the outer object. Every infiniCommInitAll / infiniCommDestroy pair leaked one Communicator.

Changes

  • src/base/comm_destroy.h:
    • Narrow Execute to a concrete void *comm_handle signature, mirroring CommInitAll::Execute's void **comm_handle.
    • Add a null-handle check, consistent with CommInitAll.
    • Delete the Communicator only after the backend Apply returns kSuccess.

Backend implementations are intentionally untouched: ownership of the outer Communicator now lives entirely in the base layer, symmetric with the allocation in CommInitAll. Future backends such as NCCL, MCCL, HCCL, and others get correct lifetime handling without each backend having to remember to delete the outer object.

Test environment

Validated on a heterogeneous 2-node cluster with container-to-container direct connection over RDMA:

Node Hardware Container IP
A NVIDIA A100 iccl-nvidia 192.168.163.40
B MetaX C550 iccl-metax-clean 192.168.162.49
  • OS / container: --network host --ipc host --privileged, /dev/infiniband mounted on both sides.
  • OpenMPI: 4.1.6 (/opt/openmpi-4.1.6), built with --with-ucx=/opt/ucx-1.17.0.
  • UCX: 1.17.0 (/opt/ucx-1.17.0), built with --with-verbs --with-rdmacm.
  • CUDA Toolkit on Node A; MACA SDK at /opt/maca on Node B.
  • Inter-container SSH direct connection on port 22222, with no docker exec wrapper required for mpirun or icclrun --build.
  • UCX transport configuration:
    • UCX_NET_DEVICES=mlx5_0:1
    • UCX_TLS=rc,rc_verbs,self,sm
    • UCX_RNDV_SCHEME=put_zcopy
  • World size: 16 ranks, 8 NVIDIA + 8 MetaX.

Logs & Screenshots

all_reduce test (MetaX-NVIDIA heterogeneous)
all_reduce.log

all_gather test (MetaX-NVIDIA heterogeneous)
all_gather.log

reduce_scatter test (MetaX-NVIDIA heterogeneous)
reduce_scatter.log

broadcast test (MetaX-NVIDIA heterogeneous)
broadcast.log

all_to_all test (MetaX-NVIDIA heterogeneous)
all_to_all.log

Copy link
Copy Markdown
Collaborator

@Ziminli Ziminli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同时麻烦 rebase 到最新,补充新增的示例程序运行日志文件。

Comment thread src/base/comm_destroy.h Outdated
Comment thread src/base/comm_destroy.h Outdated
@GordonYang1 GordonYang1 force-pushed the fix/delete-communicator-on-comm-destroy branch 2 times, most recently from f9e299e to a768919 Compare May 20, 2026 02:42
@GordonYang1 GordonYang1 force-pushed the fix/delete-communicator-on-comm-destroy branch from a768919 to 86ef0a1 Compare May 20, 2026 03:18
@Ziminli
Copy link
Copy Markdown
Collaborator

Ziminli commented May 20, 2026

麻烦修改一下 PR 标题的格式,可以参考其他已合入的 PR。

@GordonYang1 GordonYang1 changed the title fix/comm-destroy-communicator-leak fix(base): delete outer Communicator in CommDestroy May 20, 2026
@Ziminli Ziminli self-requested a review May 20, 2026 09:35
@Ziminli Ziminli merged commit 98e4be6 into InfiniTensor:master May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants