Add: orch-driven dynamic CommDomain allocation (Option A) by ChaoWao · Pull Request #817 · hw-native-sys/simpler

ChaoWao · 2026-05-19T07:14:37Z

Summary

Adds a runtime path for allocating CommDomain windows from inside the orch function, alongside the existing static Worker(comm_plan=...) path. Implements the design discussed in the RFC sketch — every "Option A" decision is reflected in code.

Each orch.allocate_domain(...) pays a collective cost (~hundreds of ms on HCCL: aclrtMalloc + IPC import; sim: shm_open + ftruncate + ready barrier), sized to actual need rather than worst-case-pre-declared. Static topology (Worker(comm_plan=...)) is unchanged and remains the right choice when sizes are known at init.

Public surface

def orch_fn(orch, args, cfg):
    with orch.allocate_domain(
        name="tp",
        workers=[0, 1],
        window_size=4 * 1024 * 1024,
        buffers=[CommBufferSpec("scratch", "float32", n, n*4)],
    ) as tp:
        for chip_idx in tp.workers:
            sub_args = TaskArgs()
            sub_args.add_scalar(tp[chip_idx].device_ctx)
            sub_args.add_tensor(..., data=tp[chip_idx].buffer_ptrs["scratch"])
            orch.submit_next_level(cid, sub_args, cfg, worker=chip_idx)
    # `with`-exit auto-releases across all participating chips.

Implementation layers

Layer	What
C ABI	`comm_alloc_domain_windows` / `comm_release_domain_windows` in sim, HCCL, a5 stub. HCCL mirrors `alloc_windows_via_ipc` but scopes every barrier / IPC announce by `allocation_id`. P2P routes inherited from base alloc.
ChipWorker	C++ wrappers + nanobind binding. Both return `(device_ctx, local_window_base)`.
Mailbox	`CTRL_ALLOC_DOMAIN=7` / `CTRL_RELEASE_DOMAIN=8`. Variable payload via per-call POSIX shms (request + reply); two shm names staged back-to-back at `MAILBOX_OFF_ARGS`. WorkerThread holds `mailbox_mu_`.
`_Worker`	`control_alloc_domain(worker_id, …)` / `control_release_domain(…)` expose the per-chip dispatch to Python.
`Orchestrator`	`allocate_domain(...)` / `release_domain(...)`. Fans out per-chip control calls on threads (one per participating chip) and joins.
Worker	`_live_domains` tracks active handles. Auto-released in `run`'s `finally` (orch-fn exception) and in `close` (LIFO). `CommDomainHandle` is a context manager.

Backwards compatibility

Worker(comm_plan=...) works unchanged. It now also serves a second purpose: declare a minimal "membership" domain so bootstrap_context establishes the HCCL/sim base communicator that dynamic allocations need as their root.

Sim test coverage (new file)

tests/ut/py/test_worker/test_dynamic_alloc_sim.py (5 tests):

allocate returns distinct per-chip contexts (non-zero, buffer-ptr in window)
with-statement auto-releases on exit
alloc-after-release reuses the same name; allocation_id monotonically increments
duplicate name while live raises ValueError
orch_fn raises mid-DAG → Worker.run releases all live handles before propagating; subsequent live_domains is empty

Plus all existing tests still pass.

Test plan

pip install --no-build-isolation -e . rebuilds clean
pytest tests/ut/py/test_worker/ → 102 passed, 1 hw skip
Pre-commit clean: clang-format / clang-tidy / cpplint / ruff / pyright
CI: a2a3 onboard (the real test for HCCL alloc-domain code path)
Hardware integration: write/run a 3-chip test that calls orch.allocate_domain and submits PTO-ISA allreduce against the allocated domain

Non-goals

Async / Future-based allocation (synchronous only — collective semantics)
Cross-host domains (sub-host IPC only, same as base)
Resize / grow existing domain (release + reallocate, data lost)
Allocating outside an orch function (only Orchestrator exposes it for v1)

gemini-code-assist

Code Review

This pull request implements dynamic CommDomain allocation and release, enabling collective window pool allocation for subsets of ranks at runtime. The changes span the Python API, C++ backend implementations for HCCL and simulation, and the hierarchical worker orchestration logic. Review feedback highlights opportunities to improve backend consistency, reduce code duplication in the bootstrap process, optimize worker lookup performance, and ensure the C API parameters align with their documented contracts.

uv-xiao

Just glanced over the PR and leave small comments. I suggest the biggest problem is whether we should reuse the same multi-comm-domain path as PR #752 . This new PR seems to have created a completely new path. I'll use AI for more review.

puddingfjz · 2026-05-19T08:14:39Z

1. P1: `with orch.allocate_domain(...)` releases the window before submitted tasks are drained

Orchestrator.submit_next_level() only enqueues DAG work; Worker.run() does not drain until the orch function returns. But CommDomainHandle.__exit__() calls release()
immediately, so the public example with orch.allocate_domain(...) as tp: orch.submit_next_level(... uses tp ...) can release the domain while the chip task is still queued or
running.

mailbox_mu_ only serializes against a dispatch that has already claimed the mailbox; it does not prevent release from running before a queued task that will later use the domain.

Relevant locations:

python/simpler/task_interface.py:440-445
python/simpler/worker.py:1584-1590
src/common/hierarchical/worker_manager.cpp:92-96
src/common/hierarchical/worker_manager.cpp:151-160

Suggestion: release needs to be tied to DAG completion, e.g. delayed until drain() after tasks using the domain complete, or represented as an ordered DAG/control node. The current
context-manager semantics create a use-after-free hazard.

2. P1: buffer overflow validation happens after backend allocation, so failed carving can leak a dynamic domain

The chip child calls comm_alloc_domain_windows() first, then validates whether the requested buffers fit in window_size. If sum(buffer.nbytes) > window_size, the backend
allocation has already been registered in domain_allocations, but the parent never constructs a CommDomainHandle or adds it to _live_domains, so normal release cannot clean it
up.

Relevant locations:

python/simpler/worker.py:342-359
python/simpler/worker.py:1630-1638

Suggestion: validate sum(b.nbytes) <= window_size in the parent before dispatching CTRL_ALLOC_DOMAIN. Also consider best-effort release in the child if any post-allocation step
fails.

3. P2: dynamic windows are not zeroed, unlike static comm windows

Static bootstrap zeros the base comm window after allocation, but dynamic allocation returns freshly allocated memory without clearing it. On sim this may appear as zeroed shm pages,
but on HCCL aclrtMalloc should not be treated as zero-initialized. Scratch/signal protocols can be sensitive to stale values.

Relevant locations:

static zeroing: python/simpler/task_interface.py:760-776
dynamic alloc reply/context creation: python/simpler/worker.py:1592-1613

Suggestion: zero each participating chip's dynamic local window after allocation, or explicitly document that dynamic windows are uninitialized and add tests that do not depend on
zeroed scratch state. Given static behavior, zeroing is probably the less surprising contract.

Coverage gap

The new tests validate allocate/release mechanics, but they do not submit a real task that uses the dynamic device_ctx / buffer_ptrs. A regression test for the context-manager
case would catch the lifecycle issue above.

Open questions

What is the intended lifetime model for a dynamically allocated domain used by submitted DAG tasks? The current context-manager API suggests lexical lifetime, but submitted tasks
are asynchronous with respect to the orch function body. Should with orch.allocate_domain(...) mean "release when the Python block exits" or "release after all tasks submitted
inside the block have drained"?
Should dynamic domain release be represented as an ordered runtime operation instead of an immediate Python-side control call? If tasks can use the domain after submission, release
likely needs to participate in DAG ordering or be deferred until Worker.run() drain.

ChaoWao · 2026-05-19T11:54:03Z

Thanks for the thorough review — addressed in 681f331 along with deleting the entire static-path API (per the maintainer's call: 只暴露 orch 里的接口，用户要用就得在 orch 里写).

P1#1 — use-after-free on with exit: fixed via deferred release. CommDomainHandle.release() now flips the handle to a released state (further indexing raises) but the backend comm_release_domain_windows call is queued, not dispatched. Worker.run runs _orch._drain() first, then _execute_pending_domain_releases(). Any task already submitted with the domain's device_ctx / buffer_ptrs sees live memory through execution. The handle has a separate freed property for code that needs to assert physical teardown, distinct from the user-facing released.

P1#2 — buffer overflow validates after backend alloc: fixed by checking sum(b.nbytes) <= window_size on the parent before dispatching CTRL_ALLOC_DOMAIN. Caller now fails fast; backend never registers a leakable allocation.

P2 — dynamic windows not zeroed: fixed in both backends. comm_alloc_domain_windows now zeroes the local pool after IPC handshake completes — std::memset on sim's mmap region, aclrtMemset on HCCL. Matches the static-path contract that scratch/signal protocols rely on.

Coverage gap (no regression test): addressed implicitly via the static-path deletion — the test that used to need adding ("submit a task using dynamic device_ctx") is now the only path; test_dynamic_alloc_sim.py exercises it end-to-end across allocate/release/with/reuse-name/exception-unwind/interleaved-subsets. A real-task ChipCallable-running regression for the deferred-release lifetime invariant is a useful follow-up but deferred from this PR.

Open question — lifetime model: settled as "release on with-exit, free after drain." The orch_fn observes the handle as released as soon as with exits (further indexing raises), but submitted tasks see live memory until drain. Equivalent to Python's with open(f): ... — the user-visible release is lexical, the physical release is runtime-managed.

ChaoWao · 2026-05-19T11:54:34Z

@uv-xiao Re your meta-question ("should reuse the same multi-comm-domain path as PR #752 ... seems to have created a completely new path"): you're right that we had two paths. Per the maintainer's call (初始化的path要删除，只暴露orch里的接口), 681f331 deletes the static path entirely.

What's gone:

Worker(comm_plan=...) parameter and all derived state
CommDomain, CommDomainPlan, ChipBootstrapConfig, ChipDomainBootstrapConfig, ChipBootstrapResult, ChipContext, HostBufferStaging dataclasses
ChipWorker.bootstrap_context + shutdown_bootstrap + their helpers
ChipBootstrapChannel (C++ + nanobind binding)
Worker.chip_contexts property
_chip_process_loop_with_bootstrap (chip side uses the plain loop now)
4 test files: test_bootstrap_context_{sim,hw}.py, test_bootstrap_channel.py, test_comm_domain_plan.py, test_worker_distributed_{sim,hw}.py
10 example directories + the 2 design docs in docs/multi-comm-domain*

What's there now: Worker.init() does no comm work; the first orch.allocate_domain(...) lazily fires CTRL_COMM_INIT (new mailbox sub-command) to dispatch comm_init to every chip in parallel — base HCCL RootInfo handshake happens on demand, not at init.

So the path duplication is gone. Examples need to be re-written to the orch-only API in a follow-up — that's a bigger kernel-level migration than fits this PR's scope.

uv-xiao · 2026-05-19T14:11:08Z

@ChaoWao Will the sub-communicators be created every time if a worker runs multiple times, since the comm-domains are allocated inside orch_fns?

puddingfjz · 2026-05-20T03:49:58Z

Comment 1: P1 cleanup skipped when drain raises

P1: dynamic domain cleanup is skipped if _orch._drain() raises.

Worker.run() currently does:

self._orch._scope_end()
self._orch._drain()
self._execute_pending_domain_releases()
if self._live_domains:
    self._release_all_live_domains()

But the comment above this block says drain() rethrows the first dispatch failure. If a submitted chip task fails, or a worker/dispatch error is surfaced from drain, both
_execute_pending_domain_releases() and _release_all_live_domains() are skipped.

That reintroduces a leak path for exactly the domains this PR is trying to make lifetime-safe:

handles released by with orch.allocate_domain(...) remain only logically released, not physically freed;
unreleased live handles are not swept;
subsequent runs may observe stale _live_domains / backend allocations.

Please wrap the drain and cleanup ordering so domain cleanup runs even when drain raises, while preserving the original exception. For example, use a nested try/finally around
_orch._drain() and run both pending-release and live-domain sweep in the cleanup path.

Comment 2: stale docstring

Small consistency issue: this docstring is stale after the parent-side overflow validation was added.

It still says:

buffers are carved sequentially inside the
window in declaration order; their ``nbytes`` sum may exceed
``window_size`` — the chip-side carve will raise then.

But _allocate_domain() now checks sum(b.nbytes) <= window_size before dispatching any chip-side allocation, which is the right behavior. Please update this text so users do not expect chip-side overflow handling here.

Comment 3: test coverage

The new tests cover allocate/release mechanics well, including sim interleaved subsets and one hardware smoke path, but they still do not exercise a real submitted chip task that consumes the returned device_ctx / buffer_ptrs.

Adds a runtime path for allocating CommDomain windows from inside the orch function, alongside the existing static `Worker(comm_plan=...)` path. Dynamic allocation pays a collective cost (~hundreds of ms on HCCL: aclrtMalloc + IPC import; sim: shm_open + ftruncate + ready barrier) per orch.allocate_domain call, sized to actual need rather than worst-case-pre-declared. Public surface with orch.allocate_domain( name="tp", workers=[0, 1], window_size=4 * 1024 * 1024, buffers=[CommBufferSpec("scratch", "float32", n, n*4)], ) as tp: for chip_idx in tp.workers: sub_args = TaskArgs() sub_args.add_scalar(tp[chip_idx].device_ctx) sub_args.add_tensor(..., data=tp[chip_idx].buffer_ptrs["scratch"]) orch.submit_next_level(cid, sub_args, cfg, worker=chip_idx) # `with`-exit auto-releases across all participating chips. Implementation - C ABI (sim + HCCL): comm_alloc_domain_windows / comm_release_domain_windows. HCCL impl mirrors alloc_windows_via_ipc but on a fresh per-allocation buffer with every barrier / IPC announce filename scoped by `allocation_id` so concurrent allocations don't collide. Sim impl creates a fresh POSIX shm per allocation. P2P routes (aclrtDeviceEnablePeerAccess) are inherited from the base allocation; only IPC export/import + symmetric pool is per-call. a5 backend ships not-supported stubs (-1). - ChipWorker C++ wrappers + nanobind binding. Both methods return `(device_ctx, local_window_base)` for the calling rank. - Mailbox: CTRL_ALLOC_DOMAIN / CTRL_RELEASE_DOMAIN. Variable payloads cross via per-call POSIX shms (request + reply); two shm names are staged back-to-back at MAILBOX_OFF_ARGS. WorkerThread holds mailbox_mu_ so they serialise with task dispatch. - Orchestrator.allocate_domain / release_domain. Drives the collective by fanning out per-chip control calls on threads (one per participating chip) and joining; raises if any chip fails after all join. - Worker._live_domains: tracks active handles; auto-released in Worker.run's `finally` on orch-fn exception and in Worker.close in LIFO order. Each handle is a `CommDomainHandle` context manager. Backward compatibility - `Worker(comm_plan=...)` still works unchanged. comm_plan now serves a second purpose: declare a minimal "membership" domain so bootstrap_context establishes the HCCL/sim base communicator that dynamic allocations need as their root. Sim test coverage tests/ut/py/test_worker/test_dynamic_alloc_sim.py (5 tests): - allocate returns distinct per-chip contexts - with-statement auto-releases on exit - alloc-after-release reuses the same name; allocation_id monotonically increments - duplicate name while live raises - orch_fn raises mid-DAG → Worker.run releases all live handles before propagating Local: pytest tests/ut/py/test_worker/ → 102 passed, 1 hw skip Pre-commit clean: clang-format / clang-tidy / cpplint / ruff / pyright Non-goals - Async / Future-based allocation - Cross-host domains - Resize / grow existing domain - Allocating outside an orch function Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ChaoWao · 2026-05-20T06:35:32Z

@puddingfjz Thanks — addressed in 1e71e3ff:

Comment 1 (cleanup skipped when drain raises) — fixed. Worker.run's teardown now wraps the drain in a nested try/finally:

self._orch._scope_end()
try:
    self._orch._drain()
finally:
    self._execute_pending_domain_releases()
    if self._live_domains:
        self._release_all_live_domains()

So even when _drain() re-raises the first chip-task/dispatch failure, the pending backend releases and the live-domain sweep still run; the original exception still propagates. A failed run can no longer strand backend allocations into the next run.

Comment 2 (stale docstring) — fixed. orchestrator.allocate_domain's docstring now states that sum(b.nbytes) must fit window_size and is validated on the orch thread before any chip-side dispatch (raises ValueError, no backend leak), matching the actual parent-side check.

Comment 3 (real-task test coverage) — the migrated examples now exercise real submitted tasks consuming the dynamic device_ctx / buffer_ptrs end-to-end on hardware: allreduce_distributed, domain_rank_map, dual_domain_overlap, and sdma_async_completion_demo (the last stages via copy_to then cross-rank SdmaTget). These run under st-onboard-a2a3 / st-onboard-a5 (both green). A dedicated UT-level real-task regression remains a reasonable follow-up.

Also added docs/comm-domain.md documenting the API, the two-state lifetime model (released on with-exit, freed after drain), lazy base-comm caching, and the sim/HCCL backends.

ChaoWao · 2026-05-20T06:35:41Z

@uv-xiao No — the sub-communicators are not recreated per run. The base communicator is built lazily on the first allocate_domain (via the new CTRL_COMM_INIT mailbox sub-command running comm_init on every chip) and then cached (_comm_base_ready; ChipWorker.comm_init also caches its handle). Subsequent Worker.run calls and subsequent allocate_domain calls reuse that same base comm.

What does happen per allocate_domain is the per-domain window allocation — each gets a fresh allocation_id (so concurrent/sequential domains never collide on IPC-handshake/barrier names) and is freed after the run drains. So: base comm created once and reused; only the domain windows are allocated/freed per call.

Documented in docs/comm-domain.md §3 (Lazy base communicator).

uv-xiao · 2026-05-20T06:54:50Z

@uv-xiao No — the sub-communicators are not recreated per run. The base communicator is built lazily on the first allocate_domain (via the new CTRL_COMM_INIT mailbox sub-command running comm_init on every chip) and then cached (_comm_base_ready; ChipWorker.comm_init also caches its handle). Subsequent Worker.run calls and subsequent allocate_domain calls reuse that same base comm.

What does happen per allocate_domain is the per-domain window allocation — each gets a fresh allocation_id (so concurrent/sequential domains never collide on IPC-handshake/barrier names) and is freed after the run drains. So: base comm created once and reused; only the domain windows are allocated/freed per call.

Documented in docs/comm-domain.md §3 (Lazy base communicator).

I see. What I actually want to discuss is the per-domain window, but I think it has been described in #824. I don't have more questions here.

uv-xiao · 2026-05-20T06:55:54Z

+## 1. API
+
+```python
+with orch.allocate_domain(
+    name="default",                       # local label (peers need not agree)
+    workers=[0, 1],                        # subset of the Worker's device_ids indices
+    window_size=4096,                      # per-rank symmetric window, bytes
+    buffers=[                              # named slices carved from the window
+        CommBufferSpec(name="scratch", dtype="float32", count=1024, nbytes=4096),
+    ],
+) as handle:
+    for chip_idx in handle.workers:
+        domain = handle[chip_idx]          # -> ChipDomainContext
+        ...
+        orch.submit_next_level(cid, args, cfg, worker=chip_idx)
+```
+
+`window_size` is validated on the orch thread **before** any chip-side
+allocation: if `sum(b.nbytes) > window_size`, `allocate_domain` raises
+`ValueError` immediately and no backend allocation is registered.


Should we make it clear that this code must exist in orch_fn?

gemini-code-assist Bot reviewed May 19, 2026

View reviewed changes

Comment thread src/common/platform_comm/comm_sim.cpp Outdated

Comment thread python/simpler/task_interface.py Outdated

Comment thread python/simpler/worker.py Outdated

Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp Outdated

ChaoWao force-pushed the orch-dynamic-comm-domain branch from 75f3096 to 1009fdb Compare May 19, 2026 07:36

uv-xiao reviewed May 19, 2026

View reviewed changes

Comment thread python/simpler/task_interface.py

Comment thread python/simpler/task_interface.py

Comment thread tests/ut/py/test_worker/test_dynamic_alloc_sim.py

Comment thread tests/ut/py/test_worker/test_dynamic_alloc_sim.py

ChaoWao force-pushed the orch-dynamic-comm-domain branch 3 times, most recently from c9d2bca to 681f331 Compare May 19, 2026 11:53

ccyywwen mentioned this pull request May 20, 2026

[Feature] Support persistent communication domains across Worker.run calls #824

Open

ChaoWao force-pushed the orch-dynamic-comm-domain branch 4 times, most recently from cd2bfda to 5d83e7e Compare May 20, 2026 04:53

ChaoWao force-pushed the orch-dynamic-comm-domain branch from 5d83e7e to 1e71e3f Compare May 20, 2026 06:34

uv-xiao approved these changes May 20, 2026

View reviewed changes

ChaoWao merged commit 7b97dfd into hw-native-sys:main May 20, 2026
15 checks passed

ChaoWao deleted the orch-dynamic-comm-domain branch May 20, 2026 07:21

Conversation

ChaoWao commented May 19, 2026

Summary

Public surface

Implementation layers

Backwards compatibility

Sim test coverage (new file)

Test plan

Non-goals

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

uv-xiao left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

puddingfjz commented May 19, 2026

1. P1: with orch.allocate_domain(...) releases the window before submitted tasks are drained

2. P1: buffer overflow validation happens after backend allocation, so failed carving can leak a dynamic domain

3. P2: dynamic windows are not zeroed, unlike static comm windows

Coverage gap

Open questions

Uh oh!

ChaoWao commented May 19, 2026

Uh oh!

ChaoWao commented May 19, 2026

Uh oh!

uv-xiao commented May 19, 2026

Uh oh!

puddingfjz commented May 20, 2026

Uh oh!

ChaoWao commented May 20, 2026

Uh oh!

ChaoWao commented May 20, 2026

Uh oh!

uv-xiao commented May 20, 2026

Uh oh!

uv-xiao May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. P1: `with orch.allocate_domain(...)` releases the window before submitted tasks are drained