Refactor: host-build trb runtime arena (a2a3 only) #846

Open
poursoul wants to merge 2 commits into hw-native-sys:main from poursoul:refactor-defer-slot-state-bind-to-prepare-task

@poursoul
Collaborator

Summary

Two-commit refactor on the trb runtime, both authored by @poursoul. The PR
bundles them because the second is built on top of the first; squash-merge
gives a single coherent landing.

  1. fe5d662 — Refactor: defer slot_state payload/task bind to orch::prepare_task

    • Lifts the O(task_window_size) slot bind loop out of RingSchedState::init
      into per-submit prepare_task, making startup independent of window size.
    • Mirrored across both a2a3 and a5 trb runtimes (touches
      pto_orchestrator.cpp, pto_runtime2_types.h, pto_scheduler.cpp on
      each arch).
  2. d33daa5 — Refactor: host-build trb runtime arena, AICPU does only wire + SM reset (a2a3 only)

    • Moves the entire trb runtime arena layout + data init from AICPU's
      runtime_create_from_sm onto the host. AICPU boot becomes a cheap
      arena-internal pointer wire pass + the SM reset that can't run off-device.
    • Pooled prebuilt image lives in the same DeviceRunner static_arena as
      gm_heap and SM (one rtMalloc per worker), reused across all subsequent
      runs via a single rtMemcpy.
    • Scope is intentionally a2a3-only: src/a5/** is untouched in this
      commit (a5 keeps its current AICPU-side runtime_create_from_sm path).
      The plan is to mirror to a5 in a follow-up PR after this lands and
      stabilizes on a2a3 hardware/sim.

Mechanism (commit 2 / d33daa5)

  • DeviceArena::attach() wraps an externally-owned buffer; re-attach is
    permitted so each AICPU boot can reuse the pooled image.
  • runtime_create_from_sm split into reserve_layout / init_data_from_layout
    / wire_arena_pointers / finalize_after_wire; orchestrator / scheduler /
    tensor_map / ready_queue / spsc gain matching data+wire pairs.
    finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
  • pto2_sm_layout helper computes SM device-side field addresses by pure
    offset arithmetic so host init never dereferences SM.
  • Per-slot SM-side reset moved from RingSchedState::init into
    PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it.
  • New file runtime/shared/pto_runtime2_init.cpp holds the host-able pieces
    lifted out of pto_runtime2.cpp / pto_orchestrator.cpp /
    pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay put.
  • DeviceRunner::setup_static_arena now takes a third runtime_arena_size
    region (hbg passes 0 — hbg has no prebuilt runtime arena).
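
For reviewers, the layout / init-data / wire split described above can be pictured with a toy arena. This is a minimal sketch, not the code in this PR: `ToyQueue`, `ToyLayout`, and the three free functions are hypothetical stand-ins that only borrow the phase names. The host stores arena-relative offsets in the image, and the device-side pass turns them into absolute pointers once it knows where the image landed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one arena-resident object (not a real PTO2 type).
struct ToyQueue {
    uint64_t slots_off;  // arena-relative offset, written by the host
    uint32_t capacity;   // plain data, written by the host
    int32_t *slots;      // absolute pointer, valid only after the wire pass
};

struct ToyLayout {
    size_t off_queue;
    size_t off_slots;
    size_t total_size;
};

inline size_t align_up(size_t v, size_t a) { return (v + a - 1) / a * a; }

// Phase 1: pure size/offset arithmetic, no memory touched.
inline ToyLayout reserve_layout(uint32_t capacity) {
    ToyLayout l{};
    l.off_queue = 0;
    l.off_slots = align_up(sizeof(ToyQueue), 64);
    l.total_size = l.off_slots + capacity * sizeof(int32_t);
    return l;
}

// Phase 2 (host): fill data and offsets only; no absolute pointers are stored.
inline void init_data_from_layout(unsigned char *image, const ToyLayout &l, uint32_t capacity) {
    assert(image != nullptr);
    std::memset(image, 0, l.total_size);
    ToyQueue q{};
    q.slots_off = l.off_slots;
    q.capacity = capacity;
    q.slots = nullptr;
    std::memcpy(image + l.off_queue, &q, sizeof q);
}

// Phase 3 (device): convert offsets to pointers relative to the actual base.
// The base is assumed to be suitably aligned for ToyQueue.
inline ToyQueue *wire_arena_pointers(unsigned char *base, const ToyLayout &l) {
    auto *q = reinterpret_cast<ToyQueue *>(base + l.off_queue);
    q->slots = reinterpret_cast<int32_t *>(base + q->slots_off);
    return q;
}
```

In this sketch the image can be copied to any base address and wired again, which is the property the pooled image relies on. A plain `memcpy` into a second buffer stands in for the `rtMemcpy` upload.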

Why a5 is deliberately not touched in this PR

The host-build refactor is a non-trivial reshape of the runtime arena init
path. Keeping a5 on the old AICPU-side path until a2a3 has time on real
hardware lets us validate the new contract (layout/init/wire/finalize phases,
pooled image lifecycle, SM-reset boundary) without making a5 a moving target.
Once stable, the a5 mirror is a mechanical follow-up.

Test plan

  • cpput: 25/25 pass — ready_queue / spsc_queue / scheduler_state /
    task_state / wiring / tensormap UTs migrated to the data+wire API.
    task_allocator.init grew an optional initial_local_task_id (default
    0) so the near-INT32_MAX corner case is still exercised without an SM
    dereference.
  • a2a3sim trb: standalone (dynamic_register variants, L3
    group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
  • a2a3sim host_build_graph: 9/9 pass — verifies the shared HostApi
    changes (3-arg setup_static_arena, new acquire_pooled_runtime_arena
    field) don't break hbg.
  • a2a3 hardware: tests/st/.../paged_attention_unroll passes on
    device 9 (--build, pto-isa commit pinned to CI).

poursoul added 2 commits May 22, 2026 12:22

fe5d662: Refactor: defer slot_state payload/task bind to orch::prepare_task

Move the per-slot payload/task pointer assignments out of the
RingSchedState::init() O(task_window_size) loop and into orch::prepare_task.
Their value is per-slot constant (&task_payloads[slot] /
&task_descriptors[slot]) but writing them at submit time, on the same 64B
slot_state cache line prepare_task is already dirtying, is essentially
free — while removing the only "scale-dependent" pointer assignments from
the init path. ring_id stays in init (its value is per-ring constant, so
rewriting it each submit would only add noise without removing a loop).

Split PTO2TaskSlotState::bind() into bind_ring() (init-time) and
bind_buffers() (per-submit) to make the two call-site shapes explicit.

Mirrored across both a2a3 and a5 trb runtimes.
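
A minimal sketch of the bind_ring() / bind_buffers() split, for illustration only. `ToySlotState`, `ToyRing`, and their fields are hypothetical simplifications rather than the real PTO2TaskSlotState or RingSchedState. The point is the shape of the two call sites: the per-ring constant is bound at init, and the per-slot buffer pointers are bound when a task is prepared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical payload/descriptor types, only here to give the pointers a target.
struct ToyPayload { uint64_t args[4]; };
struct ToyDescriptor { int32_t task_id; };

struct ToySlotState {
    ToyPayload *payload = nullptr;
    ToyDescriptor *task = nullptr;
    uint8_t ring_id = 0;

    // Init-time: the value is constant per ring.
    void bind_ring(uint8_t id) { ring_id = id; }
    // Per-submit: the value is constant per slot, so rewriting it is harmless.
    void bind_buffers(ToyPayload *p, ToyDescriptor *t) { payload = p; task = t; }
};

struct ToyRing {
    std::vector<ToySlotState> slots;
    std::vector<ToyPayload> payloads;
    std::vector<ToyDescriptor> descriptors;
    int32_t next_task_id = 0;

    ToyRing(uint8_t ring_id, size_t window)
        : slots(window), payloads(window), descriptors(window) {
        assert(window > 0);
        // Init only binds ring_id; no payload/task pointers are written here.
        for (auto &s : slots) s.bind_ring(ring_id);
    }

    // Submit path: the slot is already being written, so the pointer
    // assignments ride along with that write.
    ToySlotState &prepare_task() {
        int32_t id = next_task_id++;
        size_t slot = static_cast<size_t>(id) % slots.size();
        ToySlotState &s = slots[slot];
        s.bind_buffers(&payloads[slot], &descriptors[slot]);
        descriptors[slot].task_id = id;
        return s;
    }
};
```

Until a slot is first used, its `payload` and `task` pointers stay null in this toy; anything that reads them before `prepare_task` would have to tolerate that.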

d33daa5: Refactor: host-build trb runtime arena, AICPU does only wire + SM reset

Previously the AICPU rebuilt the entire trb runtime arena (PTO2Runtime,
orchestrator/scheduler/tensor_map sub-regions, sm_handle wrapper,
mailbox) on every device boot via runtime_create_from_sm. This commit
moves layout + data init onto the host so the AICPU only does a cheap
arena-internal pointer wire pass plus the SM reset that can't run
off-device. Multi-run boots reuse the pooled prebuilt image with a
single rtMemcpy.

Mechanism
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is
  permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout
  / wire_arena_pointers / finalize_after_wire. orchestrator / scheduler /
  tensor_map / ready_queue / spsc gain matching data+wire pairs;
  finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM field device addresses by pure
  offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask)
  moved from RingSchedState::init into
  PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns
  it after the split.
- runtime/shared/pto_runtime2_init.cpp — new file holding the host-able
  pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp /
  pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch
  stay in place.
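
The "pure offset arithmetic" idea behind the pto2_sm_layout helper can be sketched as follows. The struct layout and the function name `sm_active_mask_addr` are assumptions made up for this example; the real SM header is not shown in this PR description. The helper derives a field's device address from the SM base address and compile-time offsets, so the host never has to read through a device pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical shared-memory header layout (not the real PTO2 SM layout).
struct ToyRingHeader {
    uint32_t head;
    uint32_t tail;
    uint64_t active_mask;
};

constexpr size_t kToyRingCount = 4;

struct ToySharedMemory {
    uint64_t magic;
    ToyRingHeader rings[kToyRingCount];
};

// Computes the device address of rings[ring].active_mask from the SM base
// address alone. Nothing is dereferenced, so this is safe to run on the host
// with a base address that is only valid on the device.
inline uint64_t sm_active_mask_addr(uint64_t sm_dev_base, size_t ring) {
    assert(ring < kToyRingCount);
    return sm_dev_base + offsetof(ToySharedMemory, rings) +
           ring * sizeof(ToyRingHeader) + offsetof(ToyRingHeader, active_mask);
}
```

The resulting address can be stored in the prebuilt image as a plain integer and only interpreted as a pointer on the device side.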

Host wiring (runtime_maker.cpp)
- DeviceRunner::setup_static_arena gains a third runtime_arena_size
  region (hbg passes 0). The prebuilt image lives in the same pooled
  backing allocation as gm_heap and SM, keeping worker lifetime to one
  rtMalloc.
- bind_prepared_to_runtime_impl reserves layout on a host arena, sizes
  the pooled regions, runs init_data + wire, stashes prebuilt metadata
  into the rt image, rtMemcpys to device, and records base/offset on
  Runtime so the AICPU boot can find it.
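
A rough sketch of the three-region pooled allocation, assuming a simple bump-style arena. `ToyStaticArena` and `ToyRunner` are hypothetical and much simpler than DeviceRunner; a `std::vector` stands in for the single rtMalloc. The sketch treats a zero-sized region as "not provisioned" by leaving its offset at SIZE_MAX and checks for that sentinel before computing a pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class ToyStaticArena {
public:
    // Reserves a region and returns its offset, or SIZE_MAX when size == 0.
    size_t reserve(size_t size, size_t align = 64) {
        assert(!committed_);
        if (size == 0) return SIZE_MAX;
        size_t off = (total_ + align - 1) / align * align;
        total_ = off + size;
        return off;
    }
    // One backing allocation for all regions (stand-in for one rtMalloc).
    void commit() {
        backing_.resize(total_);
        committed_ = true;
    }
    bool is_committed() const { return committed_; }
    unsigned char *region_ptr(size_t off) {
        assert(off < backing_.size());
        return backing_.data() + off;
    }

private:
    std::vector<unsigned char> backing_;
    size_t total_ = 0;
    bool committed_ = false;
};

struct ToyRunner {
    ToyStaticArena arena;
    size_t gm_heap_off = SIZE_MAX;
    size_t sm_off = SIZE_MAX;
    size_t runtime_arena_off = SIZE_MAX;

    void setup_static_arena(size_t gm_heap_size, size_t sm_size, size_t runtime_arena_size) {
        gm_heap_off = arena.reserve(gm_heap_size);
        sm_off = arena.reserve(sm_size);
        runtime_arena_off = arena.reserve(runtime_arena_size);  // 0 in the hbg case
        arena.commit();
    }

    // Returns nullptr when the arena is uncommitted or the region was never provisioned.
    void *acquire_pooled_runtime_arena() {
        if (!arena.is_committed() || runtime_arena_off == SIZE_MAX) return nullptr;
        return arena.region_ptr(runtime_arena_off);
    }
};
```

With sizes (1024, 256, 512) the toy hands back a pointer into the shared backing buffer; with a third size of 0 it returns nullptr instead of doing pointer arithmetic on the sentinel.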

AICPU boot (aicpu_executor.cpp)
- attach the runtime arena to the pooled buffer, take rt from
  base+off_runtime, wire arena-internal pointers, sm_handle->init
  (SM reset including the per-slot fields above), mailbox reset,
  finalize_after_wire (ops table + cluster/aiv counts).
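
The boot order above, sketched with toy types. Everything here (`ToyDeviceArena`, `ToyRuntime`, `host_prebuild`, `aicpu_boot`) is hypothetical and only mirrors the sequence of steps, not the code in aicpu_executor.cpp. The arena is a non-owning view, so attaching it again on the next boot just re-points it at the same pooled buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Non-owning view over a buffer that someone else allocated.
class ToyDeviceArena {
public:
    // Re-attach is allowed: it simply replaces the previous view.
    void attach(void *buf, size_t size) {
        base_ = static_cast<unsigned char *>(buf);
        size_ = size;
    }
    void *at(size_t off) const {
        assert(base_ != nullptr && off < size_);
        return base_ + off;
    }

private:
    unsigned char *base_ = nullptr;
    size_t size_ = 0;
};

struct ToyRuntime {
    uint64_t off_mailbox;  // arena-relative, written by the host
    uint32_t *mailbox;     // wired at boot
    int (*run_op)(int);    // stand-in for the ops table, bound at boot only
};

struct ToySharedMemory { uint32_t head; uint32_t tail; };

inline int toy_run_op(int x) { return x + 1; }

// Host side: write the runtime object with offsets only into the image.
inline void host_prebuild(unsigned char *image, size_t off_runtime, size_t off_mailbox) {
    ToyRuntime rt{};
    rt.off_mailbox = off_mailbox;
    std::memcpy(image + off_runtime, &rt, sizeof rt);
}

// Device side: attach, locate rt, wire, reset SM, reset mailbox, finalize.
// The pooled buffer is assumed to be suitably aligned for ToyRuntime.
inline ToyRuntime *aicpu_boot(ToyDeviceArena &arena, void *pooled, size_t size,
                              size_t off_runtime, ToySharedMemory &sm) {
    arena.attach(pooled, size);
    auto *rt = static_cast<ToyRuntime *>(arena.at(off_runtime));
    rt->mailbox = static_cast<uint32_t *>(arena.at(rt->off_mailbox));
    sm = ToySharedMemory{};    // SM reset stays on the device
    *rt->mailbox = 0;          // mailbox reset
    rt->run_op = &toy_run_op;  // function pointers are only meaningful on the device
    return rt;
}
```

Binding the function pointer last reflects why finalize_after_wire stays AICPU-only: an address of device code cannot be produced by the host.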

Tests
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state /
  task_state / wiring / tensormap UTs migrated to the data+wire API.
  task_allocator.init grew an optional initial_local_task_id (default
  0) so UTs can still exercise task_id near INT32_MAX without reading
  the SM.
- a2a3sim trb: standalone (dynamic_register variants, L3
  group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass (verifies the shared HostApi
  changes don't break hbg).
- a2a3 hardware: tests/st/.../paged_attention_unroll PASS on device 9
  (--build with pto-isa commit pinned to CI).

@gemini-code-assist (bot) left a comment


Code Review

This pull request implements a prebuilt-arena fast path for the PTO2 runtime, allowing the host to pre-compute the runtime arena image and upload it to the device. This optimization reduces AICPU boot time by replacing full initialization with a simple attachment and pointer "wiring" phase. Key changes include refactoring the initialization logic for the runtime, orchestrator, and scheduler into separate data-population and pointer-wiring stages, extending the DeviceRunner to manage a pooled runtime arena, and adding an attach method to DeviceArena for externally-owned buffers. Review feedback correctly identified potential undefined behavior in the new acquire_pooled_runtime_arena methods when the arena is not provisioned, suggesting defensive checks against SIZE_MAX offsets.

Comment on lines +299 to +302
void *DeviceRunner::acquire_pooled_runtime_arena() {
    if (!static_arena_.is_committed()) return nullptr;
    return static_arena_.region_ptr(runtime_arena_region_off_);
}

medium

The acquire_pooled_runtime_arena method should defensively check if the runtime arena region was actually provisioned. If setup_static_arena was called with runtime_arena_size == 0, runtime_arena_region_off_ remains SIZE_MAX. Calling region_ptr(SIZE_MAX) results in an invalid pointer calculation and undefined behavior. Following the repository's preference for avoiding undefined behavior when environment setup is incomplete, a silent return is preferred here.

void *DeviceRunner::acquire_pooled_runtime_arena() {
    if (!static_arena_.is_committed() || runtime_arena_region_off_ == SIZE_MAX) return nullptr;
    return static_arena_.region_ptr(runtime_arena_region_off_);
}
References
  1. Prefer a silent return over proceeding with operations that result in undefined behavior when prerequisite environment setup or conditions are not met.

Comment on lines +171 to +174
void *DeviceRunner::acquire_pooled_runtime_arena() {
    if (!static_arena_.is_committed()) return nullptr;
    return static_arena_.region_ptr(runtime_arena_region_off_);
}

medium

This method should check if runtime_arena_region_off_ is valid before calling region_ptr. If the arena wasn't provisioned, the offset remains SIZE_MAX, leading to undefined behavior. A defensive check and silent return ensure safety in cases where the environment is not fully initialized, aligning with repository safety standards.

Suggested change
-void *DeviceRunner::acquire_pooled_runtime_arena() {
-    if (!static_arena_.is_committed()) return nullptr;
-    return static_arena_.region_ptr(runtime_arena_region_off_);
-}
+void *DeviceRunner::acquire_pooled_runtime_arena() {
+    if (!static_arena_.is_committed() || runtime_arena_region_off_ == SIZE_MAX) return nullptr;
+    return static_arena_.region_ptr(runtime_arena_region_off_);
+}
References
  1. Prefer a silent return over proceeding with operations that result in undefined behavior when prerequisite environment setup or conditions are not met.
