Refactor: host-build trb runtime arena (a2a3 only) by poursoul · Pull Request #846 · hw-native-sys/simpler

poursoul · 2026-05-22T10:01:15Z

Summary

Two-commit refactor on the trb runtime, both authored by @poursoul. The PR
bundles them because the second is built on top of the first; squash-merge
gives a single coherent landing.

fe5d662 — Refactor: defer slot_state payload/task bind to orch::prepare_task
- Lifts the O(task_window_size) slot bind loop out of RingSchedState::init
  into per-submit prepare_task, making startup independent of window size.
- Mirrored across both a2a3 and a5 trb runtimes (touches
  pto_orchestrator.cpp, pto_runtime2_types.h, pto_scheduler.cpp on
  each arch).
d33daa5 — Refactor: host-build trb runtime arena, AICPU does only wire + SM reset ⚠ a2a3 only
- Moves the entire trb runtime arena layout + data init from AICPU's
  runtime_create_from_sm onto the host. AICPU boot becomes a cheap
  arena-internal pointer wire pass + the SM reset that can't run off-device.
- Pooled prebuilt image lives in the same DeviceRunner static_arena as
  gm_heap and SM (one rtMalloc per worker), reused across all subsequent
  runs via a single rtMemcpy.
- Scope is intentionally a2a3-only: src/a5/** is untouched in this
  commit (a5 keeps its current AICPU-side runtime_create_from_sm path).
  The plan is to mirror to a5 in a follow-up PR after this lands and
  stabilizes on a2a3 hardware/sim.

Mechanism (commit 2 / `d33daa5`)

- DeviceArena::attach() wraps an externally-owned buffer; re-attach is
  permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm is split into reserve_layout / init_data_from_layout
  / wire_arena_pointers / finalize_after_wire; orchestrator / scheduler /
  tensor_map / ready_queue / spsc gain matching data+wire pairs.
  finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM device-side field addresses by pure
  offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset moved from RingSchedState::init into
  PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it.
- New file runtime/shared/pto_runtime2_init.cpp holds the host-able pieces
  lifted out of pto_runtime2.cpp / pto_orchestrator.cpp /
  pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay put.
- DeviceRunner::setup_static_arena now takes a third runtime_arena_size
  region (hbg passes 0, since it has no prebuilt runtime arena).
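
To make the phase split above concrete, here is a minimal sketch of the reserve / init-data / wire idea on a toy arena. It is not the PR's code: `Layout`, `MiniRuntime`, `align_up` and `boot_from_pooled_image` are hypothetical names, a `std::memcpy` between two host buffers stands in for the rtMemcpy to the device, and the AICPU-only finalize_after_wire step is left out. The point it illustrates is that the host stores only offsets in the image, so the image stays valid at whatever base address it is copied to, and the boot side turns offsets into pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

// Hypothetical layout record: offsets only, no addresses.
struct Layout {
    std::size_t off_runtime;
    std::size_t off_queue;
    std::size_t total;
};

// Hypothetical stand-in for the runtime object stored inside the arena.
struct MiniRuntime {
    std::size_t off_queue;         // position-independent, written by the host
    std::uint32_t queue_capacity;  // plain data, written by the host
    std::uint32_t *queue;          // arena-internal pointer, valid only after wiring
};

constexpr std::size_t align_up(std::size_t n, std::size_t a) { return (n + a - 1) / a * a; }

// Phase 1: decide where each sub-region lives.
Layout reserve_layout(std::uint32_t capacity) {
    Layout l{};
    l.off_runtime = 0;
    l.off_queue = align_up(sizeof(MiniRuntime), alignof(std::uint32_t));
    l.total = l.off_queue + capacity * sizeof(std::uint32_t);
    return l;
}

// Phase 2 (host): fill in data and offsets, leave pointers unset.
void init_data_from_layout(unsigned char *base, const Layout &l, std::uint32_t capacity) {
    MiniRuntime *rt = new (base + l.off_runtime) MiniRuntime{};
    rt->off_queue = l.off_queue;
    rt->queue_capacity = capacity;
    rt->queue = nullptr;
    std::memset(base + l.off_queue, 0, capacity * sizeof(std::uint32_t));
}

// Phase 3 (boot side): turn offsets into pointers relative to the new base.
MiniRuntime *wire_arena_pointers(unsigned char *base, std::size_t off_runtime) {
    MiniRuntime *rt = reinterpret_cast<MiniRuntime *>(base + off_runtime);
    rt->queue = reinterpret_cast<std::uint32_t *>(base + rt->off_queue);
    return rt;
}

// Build once on the "host" buffer, copy, then wire on the "device" buffer.
std::uint32_t boot_from_pooled_image() {
    const std::uint32_t capacity = 4;
    const Layout l = reserve_layout(capacity);
    std::vector<unsigned char> host(l.total);
    init_data_from_layout(host.data(), l, capacity);
    std::vector<unsigned char> device(l.total);
    std::memcpy(device.data(), host.data(), l.total);  // stands in for rtMemcpy
    MiniRuntime *rt = wire_arena_pointers(device.data(), l.off_runtime);
    rt->queue[0] = 42;
    return rt->queue[0] + rt->queue_capacity;
}
```

Because the host image never holds a live pointer, the same bytes can be copied again on a later run and re-wired, which is the property the pooled image relies on.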

Why a5 is deliberately not touched in this PR

The host-build refactor is a non-trivial reshape of the runtime arena init
path. Keeping a5 on the old AICPU-side path until a2a3 has time on real
hardware lets us validate the new contract (layout/init/wire/finalize phases,
pooled image lifecycle, SM-reset boundary) without making a5 a moving target.
Once stable, the a5 mirror is a mechanical follow-up.

Test plan

- cpput: 25/25 pass. The ready_queue / spsc_queue / scheduler_state /
  task_state / wiring / tensormap UTs are migrated to the data+wire API.
  task_allocator.init grew an optional initial_local_task_id (default
  0) so the near-INT32_MAX corner case is still exercised without an SM
  dereference.
- a2a3sim trb: standalone (dynamic_register variants, L3
  group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass. This verifies that the shared HostApi
  changes (3-arg setup_static_arena, new acquire_pooled_runtime_arena
  field) don't break hbg.
- a2a3 hardware: tests/st/.../paged_attention_unroll passes on
  device 9 (--build, pto-isa commit pinned to CI).

Move the per-slot payload/task pointer assignments out of the O(task_window_size) loop in RingSchedState::init() and into orch::prepare_task. Their values are per-slot constants (&task_payloads[slot] / &task_descriptors[slot]), but writing them at submit time is essentially free because prepare_task is already dirtying the same 64B slot_state cache line, and doing so removes the only scale-dependent pointer assignments from the init path. ring_id stays in init: its value is a per-ring constant, so rewriting it on each submit would only add noise without removing a loop.

Split PTO2TaskSlotState::bind() into bind_ring() (init-time) and bind_buffers() (per-submit) to make the two call-site shapes explicit. Mirrored across both a2a3 and a5 trb runtimes.
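
The bind_ring() / bind_buffers() split can be pictured with a small model. This is a sketch under assumptions, not the runtime's actual types: `SlotState`, `Ring`, `TaskPayload`, `TaskDescriptor` and the modulo slot mapping are hypothetical, and the real slot state carries more fields on its 64B line. The init path here writes only the per-ring constant, while the per-slot pointers are written when a task is prepared for that slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the payload and descriptor arrays.
struct TaskPayload { std::uint64_t args[4]; };
struct TaskDescriptor { std::int32_t task_id; };

struct SlotState {
    TaskPayload *payload = nullptr;
    TaskDescriptor *task = nullptr;
    std::uint8_t ring_id = 0;

    // Init-time bind: the value is the same for every slot of a ring.
    void bind_ring(std::uint8_t ring) { ring_id = ring; }

    // Per-submit bind: the value depends only on the slot index, so it can be
    // written lazily by whoever is already touching this slot.
    void bind_buffers(TaskPayload *p, TaskDescriptor *t) {
        payload = p;
        task = t;
    }
};

struct Ring {
    std::vector<SlotState> slots;
    std::vector<TaskPayload> payloads;
    std::vector<TaskDescriptor> descriptors;

    Ring(std::uint8_t ring_id, std::size_t window)
        : slots(window), payloads(window), descriptors(window) {
        // Only the ring id is written here; no payload/task pointers.
        for (SlotState &s : slots) s.bind_ring(ring_id);
    }
};

// Submit path: bind the buffers for the slot this task maps to.
SlotState &prepare_task(Ring &ring, std::int32_t task_id) {
    const std::size_t slot = static_cast<std::size_t>(task_id) % ring.slots.size();
    SlotState &s = ring.slots[slot];
    s.bind_buffers(&ring.payloads[slot], &ring.descriptors[slot]);
    ring.descriptors[slot].task_id = task_id;
    return s;
}
```

In this model a slot that has never been submitted to still has null payload/task pointers, so any consumer has to go through the submit path first; whether the real runtime relies on the same invariant is not stated in the commit message.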

Previously the AICPU rebuilt the entire trb runtime arena (PTO2Runtime, orchestrator/scheduler/tensor_map sub-regions, sm_handle wrapper, mailbox) on every device boot via runtime_create_from_sm. This commit moves layout + data init onto the host so the AICPU only does a cheap arena-internal pointer wire pass plus the SM reset that can't run off-device. Multi-run boots reuse the pooled prebuilt image with a single rtMemcpy.

Mechanism
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout / wire_arena_pointers / finalize_after_wire. orchestrator / scheduler / tensor_map / ready_queue / spsc gain matching data+wire pairs; finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM field device addresses by pure offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask) moved from RingSchedState::init into PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it after the split.
- runtime/shared/pto_runtime2_init.cpp: new file holding the host-able pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp / pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay in place.

Host wiring (runtime_maker.cpp)
- DeviceRunner::setup_static_arena gains a third runtime_arena_size region (hbg passes 0). The prebuilt image lives in the same pooled backing allocation as gm_heap and SM, keeping worker lifetime to one rtMalloc.
- bind_prepared_to_runtime_impl reserves layout on a host arena, sizes the pooled regions, runs init_data + wire, stashes prebuilt metadata into the rt image, rtMemcpys to device, and records base/offset on Runtime so the AICPU boot can find it.

AICPU boot (aicpu_executor.cpp)
- attach the runtime arena to the pooled buffer, take rt from base+off_runtime, wire arena-internal pointers, sm_handle->init (SM reset including the per-slot fields above), mailbox reset, finalize_after_wire (ops table + cluster/aiv counts).

Tests
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state / task_state / wiring / tensormap UTs migrated to the data+wire API. task_allocator.init grew an optional initial_local_task_id (default 0) so UTs can still exercise task_id near INT32_MAX without reading the SM.
- a2a3sim trb: standalone (dynamic_register variants, L3 group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass (verifies the shared HostApi changes don't break hbg).
- a2a3 hardware: tests/st/.../paged_attention_unroll PASS on device 9 (--build with pto-isa commit pinned to CI).
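
The "pure offset arithmetic" idea behind pto2_sm_layout can be sketched as follows. The struct shapes and function names below are invented for illustration and do not reflect the real shared-memory header; the sketch only shows how a host can compute the device address of a field from a base address and `offsetof`, treating the address as an integer so it is never dereferenced on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical shared-memory layout; the real header has different fields.
struct RingHeader {
    std::uint32_t head;
    std::uint32_t tail;
    std::uint64_t active_mask;
};

struct SharedMemoryHeader {
    std::uint64_t magic;
    RingHeader rings[4];
};

// A device address is kept as an integer so host code cannot dereference it
// by accident.
using DevAddr = std::uint64_t;

// Address of a field inside rings[ring], computed from offsets alone.
constexpr DevAddr ring_field_addr(DevAddr sm_base, std::size_t ring, std::size_t field_off) {
    return sm_base + offsetof(SharedMemoryHeader, rings) + ring * sizeof(RingHeader) + field_off;
}

constexpr DevAddr ring_tail_addr(DevAddr sm_base, std::size_t ring) {
    return ring_field_addr(sm_base, ring, offsetof(RingHeader, tail));
}

constexpr DevAddr ring_active_mask_addr(DevAddr sm_base, std::size_t ring) {
    return ring_field_addr(sm_base, ring, offsetof(RingHeader, active_mask));
}
```

A host-side init routine could store such addresses into the prebuilt image as plain integers or pointers; only code running on the device would ever read through them.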

gemini-code-assist

Code Review

This pull request implements a prebuilt-arena fast path for the PTO2 runtime, allowing the host to pre-compute the runtime arena image and upload it to the device. This optimization reduces AICPU boot time by replacing full initialization with a simple attachment and pointer "wiring" phase. Key changes include refactoring the initialization logic for the runtime, orchestrator, and scheduler into separate data-population and pointer-wiring stages, extending the DeviceRunner to manage a pooled runtime arena, and adding an attach method to DeviceArena for externally-owned buffers. Review feedback correctly identified potential undefined behavior in the new acquire_pooled_runtime_arena methods when the arena is not provisioned, suggesting defensive checks against SIZE_MAX offsets.

gemini-code-assist · 2026-05-22T10:03:48Z

+void *DeviceRunner::acquire_pooled_runtime_arena() {
+    if (!static_arena_.is_committed()) return nullptr;
+    return static_arena_.region_ptr(runtime_arena_region_off_);
+}


The acquire_pooled_runtime_arena method should defensively check if the runtime arena region was actually provisioned. If setup_static_arena was called with runtime_arena_size == 0, runtime_arena_region_off_ remains SIZE_MAX. Calling region_ptr(SIZE_MAX) results in an invalid pointer calculation and undefined behavior. Following the repository's preference for avoiding undefined behavior when environment setup is incomplete, a silent return is preferred here.

void *DeviceRunner::acquire_pooled_runtime_arena() {
    if (!static_arena_.is_committed() || runtime_arena_region_off_ == SIZE_MAX) return nullptr;
    return static_arena_.region_ptr(runtime_arena_region_off_);
}

References

Prefer a silent return over proceeding with operations that result in undefined behavior when prerequisite environment setup or conditions are not met.
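
The hazard described in this comment can be reproduced on a toy model. The classes below are hypothetical and much simpler than the real DeviceRunner and static arena; they assume, as the review states, that an unprovisioned region keeps a SIZE_MAX offset and that region_ptr does no validation of its own. The guarded accessor returns nullptr in both the "not committed" and "region never reserved" cases instead of computing a pointer from the sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pooled arena: regions are reserved first, then committed.
class MiniStaticArena {
public:
    std::size_t reserve(std::size_t size) {
        const std::size_t off = total_;
        total_ += size;
        return off;
    }
    void commit() {
        storage_.assign(total_, 0);
        committed_ = true;
    }
    bool is_committed() const { return committed_; }
    // No validation here: passing a sentinel offset would be out of bounds.
    void *region_ptr(std::size_t off) { return storage_.data() + off; }

private:
    std::vector<unsigned char> storage_;
    std::size_t total_ = 0;
    bool committed_ = false;
};

class MiniRunner {
public:
    void setup_static_arena(std::size_t heap_size, std::size_t runtime_arena_size) {
        heap_off_ = arena_.reserve(heap_size);
        // A zero-sized runtime arena is never reserved, so its offset keeps
        // the sentinel value.
        if (runtime_arena_size != 0) runtime_off_ = arena_.reserve(runtime_arena_size);
        arena_.commit();
    }

    void *acquire_pooled_runtime_arena() {
        // Guard both preconditions before doing any pointer arithmetic.
        if (!arena_.is_committed() || runtime_off_ == kUnset) return nullptr;
        return arena_.region_ptr(runtime_off_);
    }

private:
    static constexpr std::size_t kUnset = SIZE_MAX;
    MiniStaticArena arena_;
    std::size_t heap_off_ = kUnset;
    std::size_t runtime_off_ = kUnset;
};
```

A caller that passes 0 for the runtime arena size, as the PR says hbg does, then gets a null pointer it can test for rather than an address derived from the sentinel.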

gemini-code-assist · 2026-05-22T10:03:49Z

+void *DeviceRunner::acquire_pooled_runtime_arena() {
+    if (!static_arena_.is_committed()) return nullptr;
+    return static_arena_.region_ptr(runtime_arena_region_off_);
+}


This method should check if runtime_arena_region_off_ is valid before calling region_ptr. If the arena wasn't provisioned, the offset remains SIZE_MAX, leading to undefined behavior. A defensive check and silent return ensure safety in cases where the environment is not fully initialized, aligning with repository safety standards.

Suggested change

-void *DeviceRunner::acquire_pooled_runtime_arena() {
-    if (!static_arena_.is_committed()) return nullptr;
-    return static_arena_.region_ptr(runtime_arena_region_off_);
-}
+void *DeviceRunner::acquire_pooled_runtime_arena() {
+    if (!static_arena_.is_committed() || runtime_arena_region_off_ == SIZE_MAX) return nullptr;
+    return static_arena_.region_ptr(runtime_arena_region_off_);
+}

References

Prefer a silent return over proceeding with operations that result in undefined behavior when prerequisite environment setup or conditions are not met.

poursoul added 2 commits May 22, 2026 12:22

poursoul wants to merge 2 commits into hw-native-sys:main from poursoul:refactor-defer-slot-state-bind-to-prepare-task
