feat: implement Group.split() for JACCL and Ring backends#3245
Open
machiabeli wants to merge 1 commit into ml-explore:main from
Adds `Group.split()` (MPI_Comm_split semantics) to the JACCL mesh, JACCL ring, and TCP ring distributed backends, enabling mixed parallelism strategies (e.g. tensor parallelism + pipeline parallelism) on Apple Silicon clusters.

Key design decisions:

1. **TCPGroup fallback for JACCL sub-groups**: Apple's Thunderbolt 5 RDMA driver does not support multiple `ibv_context` instances on the same physical device simultaneously. Sub-groups derived from an RDMA parent would deadlock if they tried to open new RDMA connections. Solution: sub-groups use TCP (via SideChannel) for collective operations, keeping RDMA reserved for the parent group's high-bandwidth tensor parallelism.
2. **SideChannel synchronization**: All ranks must call the same sequence of `SideChannel::all_gather` operations before any rank branches or returns early. This matches MPI_Comm_split's collective semantics and prevents deadlocks from asymmetric participation.
3. **LocalGroup for size-1 sub-groups**: Ranks that end up alone in their color group get a lightweight no-op implementation that avoids both RDMA and TCP overhead.
4. **Coordinator derivation**: Sub-group coordinators are derived from the parent's coordinator address with port offsets based on the color parameter, ensuring no port collisions between sub-groups.

Backends modified:

- `jaccl/mesh.cpp`: MeshGroup::split() with RDMA-aware sub-group creation
- `jaccl/ring.cpp`: RingGroup::split() + TCPGroup (full collective ops over TCP) + RingLocalGroup
- `ring/ring.cpp`: RingGroup::split() using ring all-gather for info exchange and port-offset addressing for sub-ring creation

Tested on a 4-node Apple Silicon cluster (3x M3 Ultra + 1x M4 Max) connected via Thunderbolt 5 RDMA using the jaccl-ring backend.

Closes ml-explore#3205
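As a rough illustration of the coordinator derivation in decision 4, the sketch below applies the offset given in the PR description, `parent_port + 1000 + color`. The function name, the example host address, and the port-range check are assumptions added for illustration; they are not taken from the C++ implementation.

```python
def subgroup_coordinator(parent_host: str, parent_port: int, color: int) -> tuple[str, int]:
    """Derive a sub-group coordinator address from the parent's address.

    Applies the offset stated in the PR description: parent_port + 1000 + color.
    The range check is an illustrative addition, not part of the PR.
    """
    port = parent_port + 1000 + color
    if not 0 < port <= 65535:
        raise ValueError(f"derived port {port} is outside the valid TCP port range")
    return parent_host, port


# Hypothetical parent coordinator at 10.0.0.1:5000 with two colors in use.
for color in (0, 1):
    print(color, subgroup_coordinator("10.0.0.1", 5000, color))
```

Distinct colors map to distinct ports on the same host, which is the property the description relies on for concurrent sub-groups.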
Summary
Implements `Group.split()` (MPI_Comm_split semantics) for the JACCL mesh, JACCL ring, and TCP ring distributed backends. This enables mixed parallelism strategies, such as combining tensor parallelism with pipeline parallelism, on Apple Silicon clusters connected via Thunderbolt 5 RDMA.

Addresses #3205
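To make the mixed-parallelism use case concrete, here is a minimal sketch of how each rank could choose its `color` values when ranks are laid out as a row-major tensor-parallel by pipeline-parallel grid. The layout, the helper names, and `tp_size` are assumptions for illustration; the PR does not prescribe a particular mapping.

```python
def parallel_colors(rank: int, tp_size: int) -> tuple[int, int]:
    """Return (tp_color, pp_color) for a rank in a row-major TP x PP grid."""
    tp_color = rank // tp_size  # ranks in the same pipeline stage share a TP group
    pp_color = rank % tp_size   # ranks at the same TP position share a PP group
    return tp_color, pp_color


def members_by_color(world_size: int, tp_size: int) -> tuple[dict, dict]:
    """Group parent ranks by the colors they would pass to split()."""
    tp_groups: dict[int, list[int]] = {}
    pp_groups: dict[int, list[int]] = {}
    for rank in range(world_size):
        tp_color, pp_color = parallel_colors(rank, tp_size)
        tp_groups.setdefault(tp_color, []).append(rank)
        pp_groups.setdefault(pp_color, []).append(rank)
    return tp_groups, pp_groups


tp_groups, pp_groups = members_by_color(world_size=4, tp_size=2)
print(tp_groups)  # {0: [0, 1], 1: [2, 3]}
print(pp_groups)  # {0: [0, 2], 1: [1, 3]}
```

Each rank would call `Group.split()` twice, once with its `tp_color` and once with its `pp_color`, to obtain one sub-group per parallelism dimension.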
Problem
`Group.split()` was unimplemented for all Apple Silicon distributed backends (`throw std::runtime_error("Group split not supported")`), blocking any use of sub-group collectives and, with them, the mixed parallelism strategies this PR targets.

Key Design Decisions
1. TCPGroup Fallback for JACCL Sub-Groups
Apple's Thunderbolt 5 RDMA driver does not support multiple `ibv_context` instances on the same physical RDMA device simultaneously. If a sub-group tries to open new RDMA connections on devices already held by the parent group, the driver deadlocks.

Solution: Sub-groups created via `split()` on JACCL backends use `TCPGroup`, a new group implementation that performs all collective operations over TCP (via `SideChannel`). This is slower than RDMA but keeps RDMA reserved for the parent group and avoids the `ibv_context` concurrency deadlock entirely.

2. Strict Collective Synchronization
All `SideChannel::all_gather` calls in `split()` execute before any rank makes a branching decision (e.g., returning a `LocalGroup` for size-1 sub-groups). This matches MPI_Comm_split's collective semantics; the MPICH implementation internally uses `MPI_Allgather` with the same constraint.

3. LocalGroup for Size-1 Sub-Groups
Ranks that end up alone in their color group get a lightweight no-op implementation (identity for reductions, memcpy for gather). No RDMA or TCP connections opened.
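The interaction between decisions 2 and 3 can be sketched in a few lines: every rank takes part in the gather first, and only afterwards may a rank that is alone in its color return the no-op group. This is a simplified single-process model, not the C++ code from the PR. `LocalGroupSketch`, `split_sketch`, and `fake_all_gather` are hypothetical names, and `fake_all_gather` stands in for `SideChannel::all_gather`.

```python
class LocalGroupSketch:
    """Hypothetical size-1 group: every collective is a local copy."""

    def rank(self) -> int:
        return 0

    def size(self) -> int:
        return 1

    def all_sum(self, values: list) -> list:
        return list(values)  # reducing over a single rank is the identity

    def all_gather(self, values: list) -> list:
        return list(values)  # gathering from a single rank is a plain copy


def split_sketch(rank, color, key, all_gather):
    # Step 1: collective exchange. Every rank reaches this call, including
    # ranks that will turn out to be alone in their color.
    table = all_gather((color, key))
    # Step 2: only now may ranks diverge.
    members = sorted(
        (r for r, (c, _) in enumerate(table) if c == color),
        key=lambda r: (table[r][1], r),
    )
    if len(members) == 1:
        return LocalGroupSketch()
    return members, members.index(rank)  # stand-in for a multi-rank sub-group


inputs = [(0, 0), (0, 1), (1, 0)]  # illustrative (color, key) per parent rank
calls = []


def fake_all_gather(item):
    calls.append(item)  # record participation
    return inputs


results = [split_sketch(r, c, k, fake_all_gather) for r, (c, k) in enumerate(inputs)]
print(len(calls))  # 3
```

If the size-1 check ran before the gather, the lone rank would return early and the remaining ranks would wait on a collective that never completes.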
4. Coordinator Address Derivation
Sub-group coordinators reuse the parent's coordinator host with port offsets based on the `color` parameter: `parent_port + 1000 + color`. This ensures no port collisions between concurrent sub-groups.

Changes
- `mlx/distributed/jaccl/mesh.h`: `device_names_`, `coordinator_host_`, `coordinator_port_` for `split()`
- `mlx/distributed/jaccl/mesh.cpp`: `MeshGroup::split()` + `LocalGroup` class
- `mlx/distributed/jaccl/ring.h`: `all_devices_`, coordinator info for `split()`
- `mlx/distributed/jaccl/ring.cpp`: `RingGroup::split()` + `TCPGroup` (full collective ops over TCP) + `RingLocalGroup`
- `mlx/distributed/ring/ring.cpp`: `RingGroup::split()` using ring all-gather + port-offset sub-ring creation
- `tests/test_jaccl_split.py`

Testing
Tested on a 4-node Apple Silicon cluster (3x M3 Ultra + 1x M4 Max) connected via Thunderbolt 5 RDMA, using the jaccl-ring backend.
All 5 tests pass: same-color split, even/odd two-group split, three-group split with size-1 sub-groups, key-reversed ordering, and send/recv on sub-groups (gracefully skipped for `TCPGroup`, which doesn't support point-to-point).
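For readers unfamiliar with MPI_Comm_split semantics, the grouping that the first four scenarios exercise can be modelled in plain Python. This is a sketch of the expected rank assignment only; the concrete color and key values are illustrative and are not taken from `tests/test_jaccl_split.py`.

```python
def comm_split(colors: list[int], keys: list[int]) -> list[tuple[list[int], int]]:
    """Simulate MPI_Comm_split-style grouping for every parent rank.

    colors[r] and keys[r] are the arguments parent rank r passes to split().
    Returns, for each parent rank, (members, new_rank), where members lists
    the parent ranks of its sub-group in sub-group rank order.
    """
    result = []
    for rank, color in enumerate(colors):
        peers = [r for r, c in enumerate(colors) if c == color]
        # Order by key; ties fall back to the parent rank, as in MPI_Comm_split.
        peers.sort(key=lambda r: (keys[r], r))
        result.append((peers, peers.index(rank)))
    return result


ranks = [0, 1, 2, 3]
same_color = comm_split([0, 0, 0, 0], ranks)
even_odd = comm_split([r % 2 for r in ranks], ranks)
three_groups = comm_split([0, 0, 1, 2], ranks)      # ranks 2 and 3 end up alone
key_reversed = comm_split([0, 0, 0, 0], [3, 2, 1, 0])

print(even_odd[2])      # ([0, 2], 1)
print(three_groups[3])  # ([3], 0)
print(key_reversed[0])  # ([3, 2, 1, 0], 3)
```

In the three-group case, the two singleton results correspond to the ranks that would receive a `LocalGroup`.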
Hardware Context
This was developed and tested against Apple's first-generation Thunderbolt 5 RDMA driver (macOS 26.x, `infiniband/verbs.h` from Xcode SDK). The `ibv_context` concurrency limitation is specific to this driver; standard InfiniBand hardware (Mellanox ConnectX) supports multiple contexts via SR-IOV. The `TCPGroup` fallback is designed to be replaced with direct RDMA sub-group connections if/when Apple adds multi-context support.