fix(base): delete outer Communicator in CommDestroy #15
Merged
Ziminli merged 1 commit on May 20, 2026
Ziminli (Collaborator) requested changes on May 20, 2026 and left a comment:
Also, please rebase onto the latest code and add the run log files for the newly added example programs.
f9e299e to a768919
a768919 to 86ef0a1
Collaborator
Please fix the format of the PR title; you can use other merged PRs as a reference.
Ziminli approved these changes on May 20, 2026
Summary
Fixes the `Communicator` lifetime leak noted in the "Known Issues & Future Work" section of #10. `CommInitAll::Execute` allocates the outer `Communicator` via `new`, but the previous `CommDestroy` only tore down the backend instance and never deleted the outer object. Every `infiniCommInitAll` / `infiniCommDestroy` pair leaked one `Communicator`.

Changes
- `src/base/comm_destroy.h`:
  - Changes `Execute` to a concrete `void *comm_handle` signature, mirroring `CommInitAll::Execute`'s `void **comm_handle`.
  - … `CommInitAll`.
  - Deletes the outer `Communicator` only after the backend `Apply` returns `kSuccess`.

Backend implementations are intentionally untouched: ownership of the outer `Communicator` now lives entirely in the base layer, symmetric with the allocation in `CommInitAll`. Future backends such as NCCL, MCCL, HCCL, and others get correct lifetime handling without each backend having to remember to `delete` the outer object.

Test environment
Validated on a heterogeneous 2-node cluster with container-to-container direct connection over RDMA:
- Node A: container `iccl-nvidia`, 192.168.163.40
- Node B: container `iccl-metax-clean`, 192.168.162.49
- Containers: `--network host --ipc host --privileged`, with `/dev/infiniband` mounted on both sides.
- Open MPI (`/opt/openmpi-4.1.6`), built with `--with-ucx=/opt/ucx-1.17.0`.
- UCX (`/opt/ucx-1.17.0`), built with `--with-verbs --with-rdmacm`.
- … `/opt/maca` on Node B.
- … `22222`, with no `docker exec` wrapper required for `mpirun` or `icclrun --build`.
- `UCX_NET_DEVICES=mlx5_0:1`, `UCX_TLS=rc,rc_verbs,self,sm`, `UCX_RNDV_SCHEME=put_zcopy`

Logs & Screenshots
all_reduce test (MetaX-NVIDIA heterogeneous)
all_reduce.log
all_gather test (MetaX-NVIDIA heterogeneous)
all_gather.log
reduce_scatter test (MetaX-NVIDIA heterogeneous)
reduce_scatter.log
broadcast test (MetaX-NVIDIA heterogeneous)
broadcast.log
all_to_all test (MetaX-NVIDIA heterogeneous)
all_to_all.log
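
For readers skimming the conversation, the ownership rule described under Changes can be illustrated with a minimal, standalone sketch. This is not the code from `src/base/comm_destroy.h`: the `Status` enum, the `Backend` struct, its `DestroyApply` method, the `live_count` counter, the null-handle check, and the free functions `CommInitAllExecute` / `CommDestroyExecute` are all hypothetical stand-ins chosen for illustration. Only the shape follows the description: the base layer allocates the outer `Communicator` with `new` through a `void **` handle, and deletes it through a `void *` handle only after the backend teardown reports success.

```cpp
// Hypothetical status codes; the PR only mentions kSuccess by name.
enum class Status { kSuccess, kInvalidHandle, kBackendError };

// Hypothetical stand-in for a backend instance (NCCL, MCCL, HCCL, ...).
struct Backend {
    bool fail_on_destroy = false;  // lets the sketch simulate a failed teardown

    // Stand-in for the backend Apply that tears down the backend instance.
    Status DestroyApply() {
        return fail_on_destroy ? Status::kBackendError : Status::kSuccess;
    }
};

// Outer object owned by the base layer.
struct Communicator {
    static int live_count;  // number of live outer objects, to make leaks visible
    Backend backend;

    Communicator() { ++live_count; }
    ~Communicator() { --live_count; }
};
int Communicator::live_count = 0;

// Allocation side: the outer object is created with new and handed back
// through a void ** handle.
Status CommInitAllExecute(void **comm_handle) {
    if (comm_handle == nullptr) {
        return Status::kInvalidHandle;  // assumption: null handles are rejected
    }
    *comm_handle = new Communicator();
    return Status::kSuccess;
}

// Destruction side: tear down the backend first, then delete the outer
// object only if the backend reported success.
Status CommDestroyExecute(void *comm_handle) {
    if (comm_handle == nullptr) {
        return Status::kInvalidHandle;  // assumption: null handles are rejected
    }
    auto *comm = static_cast<Communicator *>(comm_handle);
    Status status = comm->backend.DestroyApply();
    if (status == Status::kSuccess) {
        delete comm;  // base layer releases what the base layer allocated
    }
    return status;
}
```

In this sketch, a successful init/destroy pair returns `Communicator::live_count` to zero, while a failed backend teardown leaves the outer object alive so the caller can still inspect or retry it. Keeping the `delete` next to the `new` in the base layer is what lets backends stay unaware of the outer object.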