Skip to content

feat(runtime): add L3/L2 host-device communication#803

Open
PKUZHOU wants to merge 2 commits into
hw-native-sys:mainfrom
PKUZHOU:feat/l3-l2-send-recv-v1
Open

feat(runtime): add L3/L2 host-device communication#803
PKUZHOU wants to merge 2 commits into
hw-native-sys:mainfrom
PKUZHOU:feat/l3-l2-send-recv-v1

Conversation

@PKUZHOU
Copy link
Copy Markdown
Contributor

@PKUZHOU PKUZHOU commented May 18, 2026

Summary

  • add L3/L2 send/recv channels with multi-lane SPSC shared-ring semantics
  • add host/device shared-memory read/write regions with software 64-bit notify/wait signals
  • expose both features through the runtime C ABI, ChipWorker, Worker, Orchestrator, and Python bindings
  • wire a2a3/a5 sim backends, a2a3 onboard host-mapped memory, and a5 onboard unsupported stubs
  • add focused C++ coverage for channel and shared-memory core behavior

Tests

  • git diff --check
  • python3 -m py_compile python/simpler/worker.py python/simpler/orchestrator.py python/simpler/task_interface.py
  • cmake -S tests/ut/cpp -B /tmp/simpler-ut-cpp-hdmem
  • cmake --build /tmp/simpler-ut-cpp-hdmem --target test_host_device_memory -j4
  • /tmp/simpler-ut-cpp-hdmem/test_host_device_memory
  • cmake --build /tmp/simpler-ut-cpp-hdmem --target test_host_device_channel -j4 && /tmp/simpler-ut-cpp-hdmem/test_host_device_channel
  • cmake -S . -B /tmp/simpler-bind-hdmem -Dnanobind_DIR=/home/ntlab/miniconda3/lib/python3.13/site-packages/nanobind/cmake
  • cmake --build /tmp/simpler-bind-hdmem --target _task_interface -j4
  • python3 simpler_setup/build_runtimes.py --lib-dir /tmp/simpler-hdmem-lib --cache-dir /tmp/simpler-hdmem-cache
  • PYTHONPATH=python python3 -S a2a3 sim shared-memory smoke

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a bounded host-device message channel (HDCH) using a ring-buffer design in shared memory, providing platform-specific backends for hardware and simulation along with Python bindings for Orchestrator and ChipWorker. The review feedback identifies critical security vulnerabilities, specifically potential buffer overflows and TOCTOU issues when reading from shared memory. Additionally, recommendations were made to optimize performance by eliminating redundant data copies in the binding and orchestration layers and to resolve a data race in the HAL library initialization using atomic synchronization.

uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us
) {
if (lanes == nullptr || cursor == nullptr || (data == nullptr && nbytes != 0)) return HDCH_ERR_INVALID;
if (nbytes > max_message_bytes) return HDCH_ERR_MSG_TOO_LARGE;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-critical critical

The nbytes check only validates against max_message_bytes, which is read from shared memory and could be corrupted. To prevent a buffer overflow in the subsequent memcpy into slot->inline_data, you must also explicitly validate nbytes against the hard-coded HDCH_MAX_INLINE_BYTES capacity.

Suggested change
if (nbytes > max_message_bytes) return HDCH_ERR_MSG_TOO_LARGE;
if (nbytes > max_message_bytes || nbytes > HDCH_MAX_INLINE_BYTES) return HDCH_ERR_MSG_TOO_LARGE;
References
  1. Defensively clamp counts or sizes read from shared memory to the maximum capacity of local buffers to prevent stack overflows caused by potential memory corruption or builder bypasses.

Comment on lines +120 to +124
if (slot->payload_bytes > dst_capacity) return HDCH_ERR_MSG_TOO_LARGE;
if (slot->payload_bytes != 0) {
memcpy(dst, slot->inline_data, slot->payload_bytes);
}
*out_nbytes = slot->payload_bytes;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-critical critical

Reading slot->payload_bytes multiple times from shared memory is vulnerable to TOCTOU issues if the device-side modifies it. Furthermore, it must be validated against HDCH_MAX_INLINE_BYTES to prevent reading past the end of the inline_data buffer. In this polling context, clamping to a safe range is preferred to avoid noisy failures.

Suggested change
if (slot->payload_bytes > dst_capacity) return HDCH_ERR_MSG_TOO_LARGE;
if (slot->payload_bytes != 0) {
memcpy(dst, slot->inline_data, slot->payload_bytes);
}
*out_nbytes = slot->payload_bytes;
uint32_t n = slot->payload_bytes;
if (n > HDCH_MAX_INLINE_BYTES || n > dst_capacity) n = 0;
if (n != 0) {
memcpy(dst, slot->inline_data, n);
}
*out_nbytes = n;
References
  1. Defensively clamp counts or sizes read from shared memory to the maximum capacity of local buffers to prevent stack overflows.
  2. In a polling loop that reads from shared memory, prefer clamping invalid values to a safe range over throwing an exception for transient inconsistencies.

Comment on lines +43 to +66
namespace {
void *g_hdch_hal_handle = nullptr;
constexpr unsigned int HDCH_DEV_SVM_MAP_HOST = 2U;

using HdchHalHostRegisterFn = int (*)(void *dev_ptr, size_t size, unsigned int flags, int device_id, void **host_ptr);
using HdchHalHostUnregisterFn = int (*)(void *host_ptr, int device_id);

int hdch_load_hal_if_needed() {
if (g_hdch_hal_handle != nullptr) return 0;
g_hdch_hal_handle = dlopen("libascend_hal.so", RTLD_NOW | RTLD_LOCAL);
return g_hdch_hal_handle == nullptr ? -1 : 0;
}

HdchHalHostRegisterFn hdch_get_halHostRegister() {
if (g_hdch_hal_handle == nullptr) return nullptr;
return reinterpret_cast<HdchHalHostRegisterFn>(dlsym(g_hdch_hal_handle, "halHostRegister"));
}

HdchHalHostUnregisterFn hdch_get_halHostUnregister() {
if (g_hdch_hal_handle == nullptr) return nullptr;
return reinterpret_cast<HdchHalHostUnregisterFn>(dlsym(g_hdch_hal_handle, "halHostUnregister"));
}

} // namespace
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The initialization of g_hdch_hal_handle lacks synchronization, leading to a data race. Use compare_exchange_strong for concurrent idempotent initialization. Per repository rules, use RTLD_GLOBAL with dlopen for symbol visibility and avoid scope-bound RAII for cached handles, managing their lifecycle explicitly. Also, ensure dlclose is called if initialization fails and cache the results of dlsym calls to minimize overhead.

static std::atomic<void*> g_hdch_hal_handle{nullptr};
static HdchHalHostRegisterFn g_halHostRegister = nullptr;
static HdchHalHostUnregisterFn g_halHostUnregister = nullptr;

int hdch_load_hal_if_needed() {
    if (g_hdch_hal_handle.load() != nullptr) return 0;
    void* handle = dlopen("libascend_hal.so", RTLD_NOW | RTLD_GLOBAL);
    if (!handle) return -1;
    try {
        auto reg = reinterpret_cast<HdchHalHostRegisterFn>(dlsym(handle, "halHostRegister"));
        auto unreg = reinterpret_cast<HdchHalHostUnregisterFn>(dlsym(handle, "halHostUnregister"));
        void* expected = nullptr;
        if (g_hdch_hal_handle.compare_exchange_strong(expected, handle)) {
            g_halHostRegister = reg;
            g_halHostUnregister = unreg;
            return 0;
        }
        dlclose(handle);
        return 0;
    } catch (...) {
        dlclose(handle);
        return -1;
    }
}
References
  1. Use compare_exchange_strong for concurrent idempotent initialization of shared members to ensure the value is published exactly once.
  2. Use the RTLD_GLOBAL flag with dlopen when symbols from the loaded library need to be resolved by other shared libraries that are loaded later.
  3. Do not use scope-bound RAII for resources (such as dlopen handles) that are intentionally cached for reuse across multiple execution runs.
  4. To prevent resource leaks with handles acquired via dlopen, use a try-catch block to ensure dlclose is called if any subsequent initialization steps fail.
  5. Cache the results of extern "C" function calls outside of performance-critical loops to minimize overhead from repeated calls across the language boundary.

Comment on lines +762 to +763
std::string payload(data.c_str(), data.size());
self.channel_send(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating an intermediate std::string from nb::bytes is redundant and incurs an unnecessary copy. nb::bytes provides direct access to the underlying buffer via c_str() and size(), which can be passed directly to self.channel_send.

Suggested change
std::string payload(data.c_str(), data.size());
self.channel_send(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);
self.channel_send(ch, route, data.c_str(), data.size(), correlation_id, timeout_us);

Comment on lines +782 to +783
std::string payload(data.c_str(), data.size());
self.channel_send_l2_for_test(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating an intermediate std::string from nb::bytes is redundant and incurs an unnecessary copy. nb::bytes provides direct access to the underlying buffer via c_str() and size(), which can be passed directly to self.channel_send_l2_for_test.

Suggested change
std::string payload(data.c_str(), data.size());
self.channel_send_l2_for_test(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);
self.channel_send_l2_for_test(ch, route, data.c_str(), data.size(), correlation_id, timeout_us);

Comment on lines +185 to +187
std::string payload(data.c_str(), data.size());
std::vector<uint8_t> bytes(payload.begin(), payload.end());
self.channel_send(worker_id, channel, route, bytes, correlation_id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This implementation performs multiple redundant copies (nb::bytes -> std::string -> std::vector<uint8_t>). By updating the Orchestrator::channel_send signature to take a raw pointer and size, you can pass the data directly from nb::bytes without any intermediate allocations.

                self.channel_send(worker_id, channel, route, data.c_str(), data.size(), correlation_id);

uint32_t max_message_bytes
);
void close_channel(int worker_id, uint64_t ch);
void channel_send(int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Passing std::vector<uint8_t> by value (or even by const reference when constructed from a binding) forces unnecessary data copies. Changing the signature to take const void* and size_t allows for zero-copy pass-through from the Python bindings to the underlying WorkerThread control command.

Suggested change
void channel_send(int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id);
void channel_send(int worker_id, uint64_t ch, uint32_t route, const void *data, size_t size, uint64_t correlation_id);

Comment on lines +72 to +78
void Orchestrator::channel_send(
int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id
) {
auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id);
if (!wt) throw std::runtime_error("Orchestrator::channel_send: invalid worker_id");
wt->control_channel_send(ch, route, data.data(), data.size(), correlation_id);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the implementation to match the more efficient const void* signature, avoiding the overhead of std::vector construction.

Suggested change
void Orchestrator::channel_send(
int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id
) {
auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id);
if (!wt) throw std::runtime_error("Orchestrator::channel_send: invalid worker_id");
wt->control_channel_send(ch, route, data.data(), data.size(), correlation_id);
}
void Orchestrator::channel_send(
int worker_id, uint64_t ch, uint32_t route, const void *data, size_t size, uint64_t correlation_id
) {
auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id);
if (!wt) throw std::runtime_error("Orchestrator::channel_send: invalid worker_id");
wt->control_channel_send(ch, route, data, size, correlation_id);
}

@PKUZHOU PKUZHOU changed the title feat(runtime): add L3 L2 host-device send recv channel feat(runtime): add L3/L2 host-device communication May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant