Refactor: host-build trb runtime arena (a2a3 only) #846
Move the per-slot payload/task pointer assignments out of the RingSchedState::init() O(task_window_size) loop and into orch::prepare_task. Their value is per-slot constant (&task_payloads[slot] / &task_descriptors[slot]) but writing them at submit time, on the same 64B slot_state cache line prepare_task is already dirtying, is essentially free — while removing the only "scale-dependent" pointer assignments from the init path. ring_id stays in init (its value is per-ring constant, so rewriting it each submit would only add noise without removing a loop). Split PTO2TaskSlotState::bind() into bind_ring() (init-time) and bind_buffers() (per-submit) to make the two call-site shapes explicit. Mirrored across both a2a3 and a5 trb runtimes.
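The bind_ring() / bind_buffers() split described above can be illustrated with a minimal sketch. The types below (SlotState, Ring, TaskPayload, TaskDescriptor) are hypothetical stand-ins, not the actual PTO2 definitions, and the real slot state carries more fields; the sketch only shows which write happens at init time and which at submit time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real payload/descriptor types.
struct TaskPayload { std::uint64_t args[4]; };
struct TaskDescriptor { std::uint32_t kernel_id; };

struct SlotState {
    TaskPayload *payload = nullptr;
    TaskDescriptor *task = nullptr;
    std::uint8_t ring_id = 0;

    // Init-time bind: the value is constant per ring, so it is written once.
    void bind_ring(std::uint8_t id) { ring_id = id; }

    // Per-submit bind: the value is constant per slot, but writing it here
    // touches a slot the submit path is already modifying.
    void bind_buffers(TaskPayload *p, TaskDescriptor *t) {
        payload = p;
        task = t;
    }
};

struct Ring {
    std::vector<SlotState> slots;
    std::vector<TaskPayload> payloads;
    std::vector<TaskDescriptor> descriptors;

    Ring(std::size_t window, std::uint8_t ring_id)
        : slots(window), payloads(window), descriptors(window) {
        // Init only records the ring id; no payload/task pointers are set.
        for (SlotState &s : slots) s.bind_ring(ring_id);
    }

    // Submit path: bind the buffers for the slot being prepared.
    SlotState &prepare_task(std::size_t slot) {
        SlotState &s = slots[slot];
        s.bind_buffers(&payloads[slot], &descriptors[slot]);
        return s;
    }
};
```

In this sketch a slot that has never been submitted keeps null payload/task pointers, so any reader of those fields has to run after prepare_task for that slot.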
Previously the AICPU rebuilt the entire trb runtime arena (PTO2Runtime, orchestrator/scheduler/tensor_map sub-regions, sm_handle wrapper, mailbox) on every device boot via runtime_create_from_sm. This commit moves layout + data init onto the host so the AICPU only does a cheap arena-internal pointer wire pass plus the SM reset that can't run off-device. Multi-run boots reuse the pooled prebuilt image with a single rtMemcpy.

Mechanism
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout / wire_arena_pointers / finalize_after_wire. orchestrator / scheduler / tensor_map / ready_queue / spsc gain matching data+wire pairs; finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM field device addresses by pure offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask) moved from RingSchedState::init into PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it after the split.
- runtime/shared/pto_runtime2_init.cpp: new file holding the host-able pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp / pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay in place.

Host wiring (runtime_maker.cpp)
- DeviceRunner::setup_static_arena gains a third runtime_arena_size region (hbg passes 0). The prebuilt image lives in the same pooled backing allocation as gm_heap and SM, keeping worker lifetime to one rtMalloc.
- bind_prepared_to_runtime_impl reserves layout on a host arena, sizes the pooled regions, runs init_data + wire, stashes prebuilt metadata into the rt image, rtMemcpys to device, and records base/offset on Runtime so the AICPU boot can find it.

AICPU boot (aicpu_executor.cpp)
- attach the runtime arena to the pooled buffer, take rt from base+off_runtime, wire arena-internal pointers, sm_handle->init (SM reset including the per-slot fields above), mailbox reset, finalize_after_wire (ops table + cluster/aiv counts).

Tests
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state / task_state / wiring / tensormap UTs migrated to the data+wire API. task_allocator.init grew an optional initial_local_task_id (default 0) so UTs can still exercise task_id near INT32_MAX without reading the SM.
- a2a3sim trb: standalone (dynamic_register variants, L3 group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass (verifies the shared HostApi changes don't break hbg).
- a2a3 hardware: tests/st/.../paged_attention_unroll PASS on device 9 (--build with pto-isa commit pinned to CI).
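The phase split in this commit message (reserve layout, fill data, copy the image, wire pointers against the final base) can be sketched in a self-contained way. Everything below is an assumption for illustration: RuntimeHeader, Queue, Layout, kMagic and the function names are simplified stand-ins, std::memcpy stands in for the rtMemcpy upload, and pointers are only wired on the receiving side, whereas the PR also runs a wire pass on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

// Hypothetical arena contents: a header that points at a queue, which
// points at its slot storage. All three live inside one buffer.
struct Queue { std::uint32_t *slots; std::uint32_t capacity; std::uint32_t head; };
struct RuntimeHeader { Queue *ready; std::uint32_t magic; };

struct Layout { std::size_t off_runtime, off_queue, off_slots, total; };

constexpr std::uint32_t kMagic = 0x50544f32u;

inline std::size_t align_up(std::size_t v, std::size_t a) { return (v + a - 1) / a * a; }

// Phase 1: pure offset arithmetic, no memory is touched.
Layout reserve_layout(std::uint32_t capacity) {
    Layout l{};
    std::size_t cur = 0;
    l.off_runtime = cur; cur = align_up(cur + sizeof(RuntimeHeader), 64);
    l.off_queue = cur;   cur = align_up(cur + sizeof(Queue), 64);
    l.off_slots = cur;   cur = align_up(cur + capacity * sizeof(std::uint32_t), 64);
    l.total = cur;
    return l;
}

// Phase 2: fill base-independent data; pointer fields stay null.
void init_data(unsigned char *base, const Layout &l, std::uint32_t capacity) {
    new (base + l.off_runtime) RuntimeHeader{nullptr, kMagic};
    new (base + l.off_queue) Queue{nullptr, capacity, 0};
    std::memset(base + l.off_slots, 0, capacity * sizeof(std::uint32_t));
}

// Host side: build the image once so it can be reused across boots.
std::vector<unsigned char> build_host_image(const Layout &l, std::uint32_t capacity) {
    std::vector<unsigned char> image(l.total);
    init_data(image.data(), l, capacity);
    return image;
}

// Phase 3: after the image is copied, resolve arena-internal pointers
// against the address where the arena actually lives.
RuntimeHeader *wire_pointers(unsigned char *base, const Layout &l) {
    auto *rt = reinterpret_cast<RuntimeHeader *>(base + l.off_runtime);
    auto *q = reinterpret_cast<Queue *>(base + l.off_queue);
    q->slots = reinterpret_cast<std::uint32_t *>(base + l.off_slots);
    rt->ready = q;
    return rt;
}
```

Because the wire pass only rewrites pointer fields from offsets, it can be repeated on every boot after the image has been copied again, which is the property the re-attach path relies on.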
Code Review
This pull request implements a prebuilt-arena fast path for the PTO2 runtime, allowing the host to pre-compute the runtime arena image and upload it to the device. This optimization reduces AICPU boot time by replacing full initialization with a simple attachment and pointer "wiring" phase. Key changes include refactoring the initialization logic for the runtime, orchestrator, and scheduler into separate data-population and pointer-wiring stages, extending the DeviceRunner to manage a pooled runtime arena, and adding an attach method to DeviceArena for externally-owned buffers. Review feedback correctly identified potential undefined behavior in the new acquire_pooled_runtime_arena methods when the arena is not provisioned, suggesting defensive checks against SIZE_MAX offsets.
void *DeviceRunner::acquire_pooled_runtime_arena() {
    if (!static_arena_.is_committed()) return nullptr;
    return static_arena_.region_ptr(runtime_arena_region_off_);
}
The acquire_pooled_runtime_arena method should defensively check if the runtime arena region was actually provisioned. If setup_static_arena was called with runtime_arena_size == 0, runtime_arena_region_off_ remains SIZE_MAX. Calling region_ptr(SIZE_MAX) results in an invalid pointer calculation and undefined behavior. Following the repository's preference for avoiding undefined behavior when environment setup is incomplete, a silent return is preferred here.
void *DeviceRunner::acquire_pooled_runtime_arena() {
    if (!static_arena_.is_committed() || runtime_arena_region_off_ == SIZE_MAX) return nullptr;
    return static_arena_.region_ptr(runtime_arena_region_off_);
}
References
- Prefer a silent return over proceeding with operations that result in undefined behavior when prerequisite environment setup or conditions are not met.
void *DeviceRunner::acquire_pooled_runtime_arena() {
    if (!static_arena_.is_committed()) return nullptr;
    return static_arena_.region_ptr(runtime_arena_region_off_);
}
This method should check if runtime_arena_region_off_ is valid before calling region_ptr. If the arena wasn't provisioned, the offset remains SIZE_MAX, leading to undefined behavior. A defensive check and silent return ensure safety in cases where the environment is not fully initialized, aligning with repository safety standards.
- void *DeviceRunner::acquire_pooled_runtime_arena() {
-     if (!static_arena_.is_committed()) return nullptr;
-     return static_arena_.region_ptr(runtime_arena_region_off_);
- }
+ void *DeviceRunner::acquire_pooled_runtime_arena() {
+     if (!static_arena_.is_committed() || runtime_arena_region_off_ == SIZE_MAX) return nullptr;
+     return static_arena_.region_ptr(runtime_arena_region_off_);
+ }
References
- Prefer a silent return over proceeding with operations that result in undefined behavior when prerequisite environment setup or conditions are not met.
Summary
Two-commit refactor on the trb runtime, both authored by @poursoul. The PR
bundles them because the second is built on top of the first; squash-merge
gives a single coherent landing.
fe5d662 — Refactor: defer slot_state payload/task bind to orch::prepare_task
- Moves the per-slot payload/task pointer binds out of RingSchedState::init into per-submit prepare_task, making startup independent of window size.
- Mirrored on a2a3 and a5 (pto_orchestrator.cpp, pto_runtime2_types.h, pto_scheduler.cpp on each arch).
d33daa5 — Refactor: host-build trb runtime arena, AICPU does only wire + SM reset ⚠ a2a3 only
- Moves the layout + data init of runtime_create_from_sm onto the host. AICPU boot becomes a cheap arena-internal pointer wire pass + the SM reset that can't run off-device.
- The prebuilt image lives in the same pooled backing allocation as gm_heap and SM (one rtMalloc per worker), reused across all subsequent runs via a single rtMemcpy.
- src/a5/** is untouched in this commit (a5 keeps its current AICPU-side runtime_create_from_sm path). The plan is to mirror to a5 in a follow-up PR after this lands and stabilizes on a2a3 hardware/sim.
Mechanism (commit 2 / d33daa5)
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout / wire_arena_pointers / finalize_after_wire; orchestrator / scheduler / tensor_map / ready_queue / spsc gain matching data+wire pairs. finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM device-side field addresses by pure offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask) moved from RingSchedState::init into PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it.
- runtime/shared/pto_runtime2_init.cpp holds the host-able pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp / pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay put.
- DeviceRunner::setup_static_arena now takes a third runtime_arena_size region (hbg passes 0; hbg has no prebuilt runtime arena).
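The "pure offset arithmetic" idea behind the pto2_sm_layout helper can be sketched as follows. The header structs and the function name sm_active_mask_addr are hypothetical and do not reflect the real shared-memory layout; the point is only that the host derives a device-side field address from the SM base address and compile-time offsets, without ever reading through a device pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical shared-memory header; field names are illustrative only.
struct SmRingHeader {
    std::uint32_t head;
    std::uint32_t tail;
    std::uint64_t active_mask;
};

struct SmHeader {
    std::uint32_t version;
    std::uint32_t ring_count;
    SmRingHeader rings[4];
};

// A device address is treated as an opaque integer on the host.
using DevAddr = std::uint64_t;

// Address of rings[ring].active_mask, computed from the SM base only.
// Nothing is dereferenced, so this is safe to call on the host even
// though sm_base refers to device memory.
inline DevAddr sm_active_mask_addr(DevAddr sm_base, std::size_t ring) {
    return sm_base + offsetof(SmHeader, rings) + ring * sizeof(SmRingHeader) +
           offsetof(SmRingHeader, active_mask);
}
```

A sketch like this assumes host and device agree on struct layout (same field sizes, alignment and padding); if that cannot be guaranteed, the offsets would need to be pinned with static_assert checks or an explicit layout table.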
Why a5 is deliberately not touched in this PR
The host-build refactor is a non-trivial reshape of the runtime arena init
path. Keeping a5 on the old AICPU-side path until a2a3 has time on real
hardware lets us validate the new contract (layout/init/wire/finalize phases,
pooled image lifecycle, SM-reset boundary) without making a5 a moving target.
Once stable, the a5 mirror is a mechanical follow-up.
Test plan
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state / task_state / wiring / tensormap UTs migrated to the data+wire API. task_allocator.init grew an optional initial_local_task_id (default 0) so the near-INT32_MAX corner case is still exercised without an SM dereference.
- a2a3sim trb: standalone (dynamic_register variants, L3 group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass; verifies the shared HostApi changes (3-arg setup_static_arena, new acquire_pooled_runtime_arena field) don't break hbg.
- a2a3 hardware: tests/st/.../paged_attention_unroll passes on device 9 (--build, pto-isa commit pinned to CI).
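The optional initial_local_task_id mentioned in the test plan can be pictured with a small sketch of a seedable allocator. The TaskAllocator type below is hypothetical, and its wrap-to-zero behavior at INT32_MAX is an assumption made for the example; the real allocator's overflow handling is not described in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical allocator: hands out monotonically increasing task ids and
// maps each id to a slot in a power-of-two window.
struct TaskAllocator {
    std::int32_t next_id = 0;
    std::int32_t window = 1;

    // The starting id defaults to 0; a test can seed it near INT32_MAX
    // instead of reading the current id from shared memory.
    void init(std::int32_t window_size, std::int32_t initial_local_task_id = 0) {
        window = window_size;  // assumed to be a power of two
        next_id = initial_local_task_id;
    }

    std::int32_t alloc() {
        const std::int32_t id = next_id;
        // ASSUMPTION: wrap to 0 rather than overflow (signed overflow is UB).
        next_id = (id == std::numeric_limits<std::int32_t>::max()) ? 0 : id + 1;
        return id;
    }

    std::int32_t slot_of(std::int32_t id) const { return id & (window - 1); }
};
```

Seeding through init keeps the unit test independent of any shared-memory state, which matches the stated goal of exercising the corner case without an SM dereference.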