Skip to content

Add: orch-driven dynamic CommDomain allocation (Option A)#817

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoWao:orch-dynamic-comm-domain
May 20, 2026
Merged

Add: orch-driven dynamic CommDomain allocation (Option A)#817
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoWao:orch-dynamic-comm-domain

Conversation

@ChaoWao
Copy link
Copy Markdown
Collaborator

@ChaoWao ChaoWao commented May 19, 2026

Summary

Adds a runtime path for allocating CommDomain windows from inside the orch function, alongside the existing static Worker(comm_plan=...) path. Implements the design discussed in the RFC sketch — every "Option A" decision is reflected in code.

Each orch.allocate_domain(...) pays a collective cost (~hundreds of ms on HCCL: aclrtMalloc + IPC import; sim: shm_open + ftruncate + ready barrier), sized to actual need rather than worst-case-pre-declared. Static topology (Worker(comm_plan=...)) is unchanged and remains the right choice when sizes are known at init.

Public surface

def orch_fn(orch, args, cfg):
    with orch.allocate_domain(
        name="tp",
        workers=[0, 1],
        window_size=4 * 1024 * 1024,
        buffers=[CommBufferSpec("scratch", "float32", n, n*4)],
    ) as tp:
        for chip_idx in tp.workers:
            sub_args = TaskArgs()
            sub_args.add_scalar(tp[chip_idx].device_ctx)
            sub_args.add_tensor(..., data=tp[chip_idx].buffer_ptrs["scratch"])
            orch.submit_next_level(cid, sub_args, cfg, worker=chip_idx)
    # `with`-exit auto-releases across all participating chips.

Implementation layers

Layer What
C ABI comm_alloc_domain_windows / comm_release_domain_windows in sim, HCCL, a5 stub. HCCL mirrors alloc_windows_via_ipc but scopes every barrier / IPC announce by allocation_id. P2P routes inherited from base alloc.
ChipWorker C++ wrappers + nanobind binding. Both return (device_ctx, local_window_base).
Mailbox CTRL_ALLOC_DOMAIN=7 / CTRL_RELEASE_DOMAIN=8. Variable payload via per-call POSIX shms (request + reply); two shm names staged back-to-back at MAILBOX_OFF_ARGS. WorkerThread holds mailbox_mu_.
_Worker control_alloc_domain(worker_id, …) / control_release_domain(…) expose the per-chip dispatch to Python.
Orchestrator allocate_domain(...) / release_domain(...). Fans out per-chip control calls on threads (one per participating chip) and joins.
Worker _live_domains tracks active handles. Auto-released in run's finally (orch-fn exception) and in close (LIFO). CommDomainHandle is a context manager.

Backwards compatibility

Worker(comm_plan=...) works unchanged. It now also serves a second purpose: declare a minimal "membership" domain so bootstrap_context establishes the HCCL/sim base communicator that dynamic allocations need as their root.

Sim test coverage (new file)

tests/ut/py/test_worker/test_dynamic_alloc_sim.py (5 tests):

  • allocate returns distinct per-chip contexts (non-zero, buffer-ptr in window)
  • with-statement auto-releases on exit
  • alloc-after-release reuses the same name; allocation_id monotonically increments
  • duplicate name while live raises ValueError
  • orch_fn raises mid-DAG → Worker.run releases all live handles before propagating; subsequent live_domains is empty

Plus all existing tests still pass.

Test plan

  • pip install --no-build-isolation -e . rebuilds clean
  • pytest tests/ut/py/test_worker/ → 102 passed, 1 hw skip
  • Pre-commit clean: clang-format / clang-tidy / cpplint / ruff / pyright
  • CI: a2a3 onboard (the real test for HCCL alloc-domain code path)
  • Hardware integration: write/run a 3-chip test that calls orch.allocate_domain and submits PTO-ISA allreduce against the allocated domain

Non-goals

  • Async / Future-based allocation (synchronous only — collective semantics)
  • Cross-host domains (sub-host IPC only, same as base)
  • Resize / grow existing domain (release + reallocate, data lost)
  • Allocating outside an orch function (only Orchestrator exposes it for v1)

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements dynamic CommDomain allocation and release, enabling collective window pool allocation for subsets of ranks at runtime. The changes span the Python API, C++ backend implementations for HCCL and simulation, and the hierarchical worker orchestration logic. Review feedback highlights opportunities to improve backend consistency, reduce code duplication in the bootstrap process, optimize worker lookup performance, and ensure the C API parameters align with their documented contracts.

Comment thread src/common/platform_comm/comm_sim.cpp Outdated
Comment thread python/simpler/task_interface.py Outdated
Comment thread python/simpler/worker.py Outdated
Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp Outdated
@ChaoWao ChaoWao force-pushed the orch-dynamic-comm-domain branch from 75f3096 to 1009fdb Compare May 19, 2026 07:36
Copy link
Copy Markdown
Contributor

@uv-xiao uv-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just glanced over the PR and leave small comments. I suggest the biggest problem is whether we should reuse the same multi-comm-domain path as PR #752 . This new PR seems to have created a completely new path. I'll use AI for more review.

Comment thread python/simpler/task_interface.py
Comment thread python/simpler/task_interface.py
Comment thread tests/ut/py/test_worker/test_dynamic_alloc_sim.py
Comment thread tests/ut/py/test_worker/test_dynamic_alloc_sim.py
@puddingfjz
Copy link
Copy Markdown
Contributor

1. P1: with orch.allocate_domain(...) releases the window before submitted tasks are drained

Orchestrator.submit_next_level() only enqueues DAG work; Worker.run() does not drain until the orch function returns. But CommDomainHandle.__exit__() calls release()
immediately, so the public example with orch.allocate_domain(...) as tp: orch.submit_next_level(... uses tp ...) can release the domain while the chip task is still queued or
running.

mailbox_mu_ only serializes against a dispatch that has already claimed the mailbox; it does not prevent release from running before a queued task that will later use the domain.

Relevant locations:

  • python/simpler/task_interface.py:440-445
  • python/simpler/worker.py:1584-1590
  • src/common/hierarchical/worker_manager.cpp:92-96
  • src/common/hierarchical/worker_manager.cpp:151-160

Suggestion: release needs to be tied to DAG completion, e.g. delayed until drain() after tasks using the domain complete, or represented as an ordered DAG/control node. The current
context-manager semantics create a use-after-free hazard.

2. P1: buffer overflow validation happens after backend allocation, so failed carving can leak a dynamic domain

The chip child calls comm_alloc_domain_windows() first, then validates whether the requested buffers fit in window_size. If sum(buffer.nbytes) > window_size, the backend
allocation has already been registered in domain_allocations, but the parent never constructs a CommDomainHandle or adds it to _live_domains, so normal release cannot clean it
up.

Relevant locations:

  • python/simpler/worker.py:342-359
  • python/simpler/worker.py:1630-1638

Suggestion: validate sum(b.nbytes) <= window_size in the parent before dispatching CTRL_ALLOC_DOMAIN. Also consider best-effort release in the child if any post-allocation step
fails.

3. P2: dynamic windows are not zeroed, unlike static comm windows

Static bootstrap zeros the base comm window after allocation, but dynamic allocation returns freshly allocated memory without clearing it. On sim this may appear as zeroed shm pages,
but on HCCL aclrtMalloc should not be treated as zero-initialized. Scratch/signal protocols can be sensitive to stale values.

Relevant locations:

  • static zeroing: python/simpler/task_interface.py:760-776
  • dynamic alloc reply/context creation: python/simpler/worker.py:1592-1613

Suggestion: zero each participating chip's dynamic local window after allocation, or explicitly document that dynamic windows are uninitialized and add tests that do not depend on
zeroed scratch state. Given static behavior, zeroing is probably the less surprising contract.

Coverage gap

The new tests validate allocate/release mechanics, but they do not submit a real task that uses the dynamic device_ctx / buffer_ptrs. A regression test for the context-manager
case would catch the lifecycle issue above.

Open questions

  • What is the intended lifetime model for a dynamically allocated domain used by submitted DAG tasks? The current context-manager API suggests lexical lifetime, but submitted tasks
    are asynchronous with respect to the orch function body. Should with orch.allocate_domain(...) mean "release when the Python block exits" or "release after all tasks submitted
    inside the block have drained"?

  • Should dynamic domain release be represented as an ordered runtime operation instead of an immediate Python-side control call? If tasks can use the domain after submission, release
    likely needs to participate in DAG ordering or be deferred until Worker.run() drain.

@ChaoWao ChaoWao force-pushed the orch-dynamic-comm-domain branch 3 times, most recently from c9d2bca to 681f331 Compare May 19, 2026 11:53
@ChaoWao
Copy link
Copy Markdown
Collaborator Author

ChaoWao commented May 19, 2026

Thanks for the thorough review — addressed in 681f331 along with deleting the entire static-path API (per the maintainer's call: 只暴露 orch 里的接口,用户要用就得在 orch 里写).

P1#1 — use-after-free on with exit: fixed via deferred release. CommDomainHandle.release() now flips the handle to a released state (further indexing raises) but the backend comm_release_domain_windows call is queued, not dispatched. Worker.run runs _orch._drain() first, then _execute_pending_domain_releases(). Any task already submitted with the domain's device_ctx / buffer_ptrs sees live memory through execution. The handle has a separate freed property for code that needs to assert physical teardown, distinct from the user-facing released.

P1#2 — buffer overflow validates after backend alloc: fixed by checking sum(b.nbytes) <= window_size on the parent before dispatching CTRL_ALLOC_DOMAIN. Caller now fails fast; backend never registers a leakable allocation.

P2 — dynamic windows not zeroed: fixed in both backends. comm_alloc_domain_windows now zeroes the local pool after IPC handshake completes — std::memset on sim's mmap region, aclrtMemset on HCCL. Matches the static-path contract that scratch/signal protocols rely on.

Coverage gap (no regression test): addressed implicitly via the static-path deletion — the test that used to need adding ("submit a task using dynamic device_ctx") is now the only path; test_dynamic_alloc_sim.py exercises it end-to-end across allocate/release/with/reuse-name/exception-unwind/interleaved-subsets. A real-task ChipCallable-running regression for the deferred-release lifetime invariant is a useful follow-up but deferred from this PR.

Open question — lifetime model: settled as "release on with-exit, free after drain." The orch_fn observes the handle as released as soon as with exits (further indexing raises), but submitted tasks see live memory until drain. Equivalent to Python's with open(f): ... — the user-visible release is lexical, the physical release is runtime-managed.

@ChaoWao
Copy link
Copy Markdown
Collaborator Author

ChaoWao commented May 19, 2026

@uv-xiao Re your meta-question ("should reuse the same multi-comm-domain path as PR #752 ... seems to have created a completely new path"): you're right that we had two paths. Per the maintainer's call (初始化的path要删除,只暴露orch里的接口), 681f331 deletes the static path entirely.

What's gone:

  • Worker(comm_plan=...) parameter and all derived state
  • CommDomain, CommDomainPlan, ChipBootstrapConfig, ChipDomainBootstrapConfig, ChipBootstrapResult, ChipContext, HostBufferStaging dataclasses
  • ChipWorker.bootstrap_context + shutdown_bootstrap + their helpers
  • ChipBootstrapChannel (C++ + nanobind binding)
  • Worker.chip_contexts property
  • _chip_process_loop_with_bootstrap (chip side uses the plain loop now)
  • 4 test files: test_bootstrap_context_{sim,hw}.py, test_bootstrap_channel.py, test_comm_domain_plan.py, test_worker_distributed_{sim,hw}.py
  • 10 example directories + the 2 design docs in docs/multi-comm-domain*

What's there now: Worker.init() does no comm work; the first orch.allocate_domain(...) lazily fires CTRL_COMM_INIT (new mailbox sub-command) to dispatch comm_init to every chip in parallel — base HCCL RootInfo handshake happens on demand, not at init.

So the path duplication is gone. Examples need to be re-written to the orch-only API in a follow-up — that's a bigger kernel-level migration than fits this PR's scope.

@uv-xiao
Copy link
Copy Markdown
Contributor

uv-xiao commented May 19, 2026

@ChaoWao Will the sub-communicators be created every time if a worker runs multiple times, since the comm-domains are allocated inside orch_fns?

@puddingfjz
Copy link
Copy Markdown
Contributor

Comment 1: P1 cleanup skipped when drain raises

P1: dynamic domain cleanup is skipped if _orch._drain() raises.

Worker.run() currently does:

self._orch._scope_end()
self._orch._drain()
self._execute_pending_domain_releases()
if self._live_domains:
    self._release_all_live_domains()

But the comment above this block says drain() rethrows the first dispatch failure. If a submitted chip task fails, or a worker/dispatch error is surfaced from drain, both
_execute_pending_domain_releases() and _release_all_live_domains() are skipped.

That reintroduces a leak path for exactly the domains this PR is trying to make lifetime-safe:

  • handles released by with orch.allocate_domain(...) remain only logically released, not physically freed;
  • unreleased live handles are not swept;
  • subsequent runs may observe stale _live_domains / backend allocations.

Please wrap the drain and cleanup ordering so domain cleanup runs even when drain raises, while preserving the original exception. For example, use a nested try/finally around
_orch._drain() and run both pending-release and live-domain sweep in the cleanup path.

Comment 2: stale docstring

Small consistency issue: this docstring is stale after the parent-side overflow validation was added.

It still says:

buffers are carved sequentially inside the
window in declaration order; their ``nbytes`` sum may exceed
``window_size``the chip-side carve will raise then.

But _allocate_domain() now checks sum(b.nbytes) <= window_size before dispatching any chip-side allocation, which is the right behavior. Please update this text so users do not expect chip-side overflow handling here.

Comment 3: test coverage

The new tests cover allocate/release mechanics well, including sim interleaved subsets and one hardware smoke path, but they still do not exercise a real submitted chip task that consumes the returned device_ctx / buffer_ptrs.

@ChaoWao ChaoWao force-pushed the orch-dynamic-comm-domain branch 4 times, most recently from cd2bfda to 5d83e7e Compare May 20, 2026 04:53
Adds a runtime path for allocating CommDomain windows from inside the
orch function, alongside the existing static `Worker(comm_plan=...)`
path.  Dynamic allocation pays a collective cost (~hundreds of ms on
HCCL: aclrtMalloc + IPC import; sim: shm_open + ftruncate + ready
barrier) per orch.allocate_domain call, sized to actual need rather
than worst-case-pre-declared.

Public surface

  with orch.allocate_domain(
      name="tp",
      workers=[0, 1],
      window_size=4 * 1024 * 1024,
      buffers=[CommBufferSpec("scratch", "float32", n, n*4)],
  ) as tp:
      for chip_idx in tp.workers:
          sub_args = TaskArgs()
          sub_args.add_scalar(tp[chip_idx].device_ctx)
          sub_args.add_tensor(..., data=tp[chip_idx].buffer_ptrs["scratch"])
          orch.submit_next_level(cid, sub_args, cfg, worker=chip_idx)
  # `with`-exit auto-releases across all participating chips.

Implementation

  - C ABI (sim + HCCL): comm_alloc_domain_windows /
    comm_release_domain_windows.  HCCL impl mirrors alloc_windows_via_ipc
    but on a fresh per-allocation buffer with every barrier / IPC announce
    filename scoped by `allocation_id` so concurrent allocations don't
    collide.  Sim impl creates a fresh POSIX shm per allocation.  P2P
    routes (aclrtDeviceEnablePeerAccess) are inherited from the base
    allocation; only IPC export/import + symmetric pool is per-call.
    a5 backend ships not-supported stubs (-1).
  - ChipWorker C++ wrappers + nanobind binding.  Both methods return
    `(device_ctx, local_window_base)` for the calling rank.
  - Mailbox: CTRL_ALLOC_DOMAIN / CTRL_RELEASE_DOMAIN.  Variable payloads
    cross via per-call POSIX shms (request + reply); two shm names are
    staged back-to-back at MAILBOX_OFF_ARGS.  WorkerThread holds
    mailbox_mu_ so they serialise with task dispatch.
  - Orchestrator.allocate_domain / release_domain.  Drives the collective
    by fanning out per-chip control calls on threads (one per
    participating chip) and joining; raises if any chip fails after all
    join.
  - Worker._live_domains: tracks active handles; auto-released in
    Worker.run's `finally` on orch-fn exception and in Worker.close
    in LIFO order.  Each handle is a `CommDomainHandle` context manager.

Backward compatibility

  - `Worker(comm_plan=...)` still works unchanged.  comm_plan now
    serves a second purpose: declare a minimal "membership" domain so
    bootstrap_context establishes the HCCL/sim base communicator that
    dynamic allocations need as their root.

Sim test coverage

  tests/ut/py/test_worker/test_dynamic_alloc_sim.py (5 tests):
    - allocate returns distinct per-chip contexts
    - with-statement auto-releases on exit
    - alloc-after-release reuses the same name; allocation_id
      monotonically increments
    - duplicate name while live raises
    - orch_fn raises mid-DAG → Worker.run releases all live handles
      before propagating

  Local: pytest tests/ut/py/test_worker/ → 102 passed, 1 hw skip
  Pre-commit clean: clang-format / clang-tidy / cpplint / ruff / pyright

Non-goals

  - Async / Future-based allocation
  - Cross-host domains
  - Resize / grow existing domain
  - Allocating outside an orch function

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the orch-dynamic-comm-domain branch from 5d83e7e to 1e71e3f Compare May 20, 2026 06:34
@ChaoWao
Copy link
Copy Markdown
Collaborator Author

ChaoWao commented May 20, 2026

@puddingfjz Thanks — addressed in 1e71e3ff:

Comment 1 (cleanup skipped when drain raises) — fixed. Worker.run's teardown now wraps the drain in a nested try/finally:

self._orch._scope_end()
try:
    self._orch._drain()
finally:
    self._execute_pending_domain_releases()
    if self._live_domains:
        self._release_all_live_domains()

So even when _drain() re-raises the first chip-task/dispatch failure, the pending backend releases and the live-domain sweep still run; the original exception still propagates. A failed run can no longer strand backend allocations into the next run.

Comment 2 (stale docstring) — fixed. orchestrator.allocate_domain's docstring now states that sum(b.nbytes) must fit window_size and is validated on the orch thread before any chip-side dispatch (raises ValueError, no backend leak), matching the actual parent-side check.

Comment 3 (real-task test coverage) — the migrated examples now exercise real submitted tasks consuming the dynamic device_ctx / buffer_ptrs end-to-end on hardware: allreduce_distributed, domain_rank_map, dual_domain_overlap, and sdma_async_completion_demo (the last stages via copy_to then cross-rank SdmaTget). These run under st-onboard-a2a3 / st-onboard-a5 (both green). A dedicated UT-level real-task regression remains a reasonable follow-up.

Also added docs/comm-domain.md documenting the API, the two-state lifetime model (released on with-exit, freed after drain), lazy base-comm caching, and the sim/HCCL backends.

@ChaoWao
Copy link
Copy Markdown
Collaborator Author

ChaoWao commented May 20, 2026

@uv-xiao No — the sub-communicators are not recreated per run. The base communicator is built lazily on the first allocate_domain (via the new CTRL_COMM_INIT mailbox sub-command running comm_init on every chip) and then cached (_comm_base_ready; ChipWorker.comm_init also caches its handle). Subsequent Worker.run calls and subsequent allocate_domain calls reuse that same base comm.

What does happen per allocate_domain is the per-domain window allocation — each gets a fresh allocation_id (so concurrent/sequential domains never collide on IPC-handshake/barrier names) and is freed after the run drains. So: base comm created once and reused; only the domain windows are allocated/freed per call.

Documented in docs/comm-domain.md §3 (Lazy base communicator).

@uv-xiao
Copy link
Copy Markdown
Contributor

uv-xiao commented May 20, 2026

@uv-xiao No — the sub-communicators are not recreated per run. The base communicator is built lazily on the first allocate_domain (via the new CTRL_COMM_INIT mailbox sub-command running comm_init on every chip) and then cached (_comm_base_ready; ChipWorker.comm_init also caches its handle). Subsequent Worker.run calls and subsequent allocate_domain calls reuse that same base comm.

What does happen per allocate_domain is the per-domain window allocation — each gets a fresh allocation_id (so concurrent/sequential domains never collide on IPC-handshake/barrier names) and is freed after the run drains. So: base comm created once and reused; only the domain windows are allocated/freed per call.

Documented in docs/comm-domain.md §3 (Lazy base communicator).

I see. What I actually want to discuss is the per-domain window, but I think it has been described in #824. I don't have more questions here.

Comment thread docs/comm-domain.md
Comment on lines +15 to +34
## 1. API

```python
with orch.allocate_domain(
name="default", # local label (peers need not agree)
workers=[0, 1], # subset of the Worker's device_ids indices
window_size=4096, # per-rank symmetric window, bytes
buffers=[ # named slices carved from the window
CommBufferSpec(name="scratch", dtype="float32", count=1024, nbytes=4096),
],
) as handle:
for chip_idx in handle.workers:
domain = handle[chip_idx] # -> ChipDomainContext
...
orch.submit_next_level(cid, args, cfg, worker=chip_idx)
```

`window_size` is validated on the orch thread **before** any chip-side
allocation: if `sum(b.nbytes) > window_size`, `allocate_domain` raises
`ValueError` immediately and no backend allocation is registered.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make it clear that this code must exist in orch_fn?

@ChaoWao ChaoWao merged commit 7b97dfd into hw-native-sys:main May 20, 2026
15 checks passed
@ChaoWao ChaoWao deleted the orch-dynamic-comm-domain branch May 20, 2026 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants