Add: port comm + deferred completion to a5 onboard #823
Open
jvjhfhg wants to merge 1 commit into
Code Review
This pull request implements the HCCL backend for distributed communication on the a5 platform, replacing previous stubs with a functional implementation using ACL IPC primitives. It introduces a symmetric memory pool, updates the DeviceRunner for ACL lifecycle management, and refactors the runtime scheduler to support both counter-based and SDMA event record completion types. Additionally, header guards are modernized to "#pragma once" across several files. Feedback identifies a high-severity issue in the scheduler where the async_ctx.completion_entries array lacks necessary cache invalidation before processing, potentially leading to stale data reads from Global Memory.
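The stale-read concern raised in the review can be illustrated with a small host-side model. This is only a sketch: `CompletionEntry`, `CachedEntries`, and `process_entries` are hypothetical names, and `invalidate()` merely stands in for whatever cache-invalidate primitive the a5 scheduler actually uses; none of this is taken from the PR's code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for one completion entry; field names are illustrative.
struct CompletionEntry {
    uint64_t addr;
    uint32_t expected;
};

constexpr std::size_t kMaxEntries = 4;

// Models Global Memory plus a scheduler-side cached view that can go stale.
struct CachedEntries {
    std::array<CompletionEntry, kMaxEntries> global{};  // what the producer wrote
    std::array<CompletionEntry, kMaxEntries> cached{};  // what the scheduler reads
    uint32_t global_count = 0;
    uint32_t cached_count = 0;

    // Producer side: the write lands in "Global Memory" only.
    void publish(const CompletionEntry& e) {
        assert(global_count < kMaxEntries);
        global[global_count++] = e;
    }

    // Stand-in for a real invalidate: refresh the cached view from Global Memory.
    void invalidate() {
        cached = global;
        cached_count = global_count;
    }
};

// Scheduler side: returns how many entries were visible when processing.
// Without the invalidate, the loop only sees whatever was cached earlier.
inline uint32_t process_entries(CachedEntries& mem, bool invalidate_first) {
    if (invalidate_first) {
        mem.invalidate();
    }
    uint32_t seen = 0;
    for (uint32_t i = 0; i < mem.cached_count; ++i) {
        (void)mem.cached[i];  // a real scheduler would register the entry here
        ++seen;
    }
    return seen;
}
```

In this model, publishing two entries and then processing without invalidation sees none of them, while processing with invalidation sees both, which is the behavior the review asks the scheduler to guarantee.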
c6524eb to 3f5d0d5
- Mirror comm_hccl.cpp from a2a3 onboard host (HCCL backend with DIY
IPC windows). SDMA workspace overlay is added in the follow-up
commit so this base alone does not depend on PTO_ISA_ROOT or
libnnopbase, and does not invoke aclnnShmemSdmaStarsQuery at
comm_init -- which keeps non-SDMA comm demos unaffected by the
current CANN-9.x SDMA-on-a5 gap.
- Graft ensure_acl_ready / create_comm_stream / destroy_comm_stream
into a5 DeviceRunner and gate aclrtResetDevice + aclFinalize on
acl_ready_ in finalize(); preserve raw rtDeviceReset for pure
rt-layer callers.
- Replace pto_runtime_c_api.cpp comm/ACL stubs with forwarding
implementations; comm_* C ABI now comes from comm_hccl.cpp.
- Upgrade a5 trb deferred-completion runtime from counter-only to
pluggable backend-ops design: CompletionCondition gains
completion_type/addr/retired fields, CompletionBackendOps table
routes COMPLETION_TYPE_{COUNTER,SDMA_EVENT_RECORD}, scheduler
invalidates counter cache lines before polling and retires
satisfied conditions.
- Copy backend/sdma/{kernel,scheduler}.h to a5 (kernel-side, dormant
until a kernel registers an SDMA condition; a5 pto-isa already
exposes SDMA via PTO_NPU_ARCH_A5).
- a5 onboard CMakeLists adds hcomm find_library (FATAL_ERROR on
miss).
- Fix Stride ambiguity in async_notify_demo kernels (pto:: qualifier
to disambiguate from bisheng's enum class Stride).
- Enable a5 in allreduce_distributed and test_platform_comm platform
marks; parametrize the latter via st_platform.
- Convert ported runtime headers to #pragma once on both arches so
aicore_completion_mailbox.h / pto_completion_token.h /
pto_async_{wait,kernel_api}.h / backend/sdma/*.h are now byte-
identical across a2a3 and a5.
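The pluggable backend-ops design from the deferred-completion bullet above might look roughly like the following. This is a sketch under assumptions: only `CompletionCondition`, its `completion_type`/`addr`/`retired` fields, `CompletionBackendOps`, and the two `COMPLETION_TYPE_*` names come from the commit message. The `expected` field, the `is_satisfied` hook, the comparison rules, and `poll_and_retire` are hypothetical, and the cache invalidation is indicated only by a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum CompletionType : uint32_t {
    COMPLETION_TYPE_COUNTER = 0,
    COMPLETION_TYPE_SDMA_EVENT_RECORD = 1,
    COMPLETION_TYPE_COUNT  // hypothetical sentinel used to size the table
};

struct CompletionCondition {
    CompletionType completion_type;
    volatile uint32_t* addr;  // location the backend polls
    uint32_t expected;        // hypothetical: threshold or flag value
    bool retired;
};

// One entry per completion type; the scheduler never switches on the type itself.
struct CompletionBackendOps {
    bool (*is_satisfied)(const CompletionCondition&);
};

inline bool counter_satisfied(const CompletionCondition& c) {
    // A real scheduler would invalidate the counter's cache line before this read.
    return *c.addr >= c.expected;
}

inline bool sdma_event_satisfied(const CompletionCondition& c) {
    // Assumed rule: the event record equals the expected flag once the transfer is done.
    return *c.addr == c.expected;
}

static const CompletionBackendOps kBackendOps[COMPLETION_TYPE_COUNT] = {
    {counter_satisfied},     // COMPLETION_TYPE_COUNTER
    {sdma_event_satisfied},  // COMPLETION_TYPE_SDMA_EVENT_RECORD
};

// Polls every live condition once, retires the satisfied ones,
// and returns how many are still pending.
inline std::size_t poll_and_retire(CompletionCondition* conds, std::size_t n) {
    std::size_t pending = 0;
    for (std::size_t i = 0; i < n; ++i) {
        CompletionCondition& c = conds[i];
        if (c.retired) continue;
        assert(c.completion_type < COMPLETION_TYPE_COUNT);
        if (kBackendOps[c.completion_type].is_satisfied(c)) {
            c.retired = true;
        } else {
            ++pending;
        }
    }
    return pending;
}
```

Routing through a table keeps the scheduler loop unchanged when a new completion type is added; only a new table row and predicate are needed.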
Verified: a2a3 onboard, a5 onboard, a5 sim trb runtime builds all
clean. No hardware tests run.
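For reviewers, the `acl_ready_` gating described in the DeviceRunner bullet can be pictured with the mock below. It is a sketch under assumptions, not the PR's code: `DeviceRunnerSketch` is a hypothetical class that records call names as strings instead of invoking ACL or the runtime, the calls listed inside `ensure_acl_ready` are a guess, and placing `rtDeviceReset` in the `else` branch is one possible reading of "preserve raw rtDeviceReset for pure rt-layer callers".

```cpp
#include <string>
#include <vector>

// Hypothetical mock: records which teardown calls would be issued.
struct DeviceRunnerSketch {
    bool acl_ready_ = false;
    std::vector<std::string> calls;

    // Idempotent: ACL is brought up at most once, on first comm use.
    void ensure_acl_ready() {
        if (acl_ready_) return;
        calls.push_back("aclInit");         // assumed bring-up step
        calls.push_back("aclrtSetDevice");  // assumed bring-up step
        acl_ready_ = true;
    }

    // ACL teardown runs only if ACL was brought up; otherwise the
    // rt-layer reset path is used (assumed to be mutually exclusive).
    void finalize() {
        if (acl_ready_) {
            calls.push_back("aclrtResetDevice");
            calls.push_back("aclFinalize");
            acl_ready_ = false;
        } else {
            calls.push_back("rtDeviceReset");
        }
    }
};
```

A runner that never touches comm records only `rtDeviceReset` on `finalize()`, while one that called `ensure_acl_ready()` records the ACL teardown pair instead.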