Change GAP9 L3 dma to Async#198
Conversation
📝 WalkthroughWalkthrough
ChangesGAP9 DMA updates
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
After switching gap9L3DmaHack to the async GAP9L3Dma(), the blocking adapter is no longer referenced. Remove the now-unused import and the dead commented-out line to satisfy autoflake/yapf linting.
A transfer whose rank exceeds the DMA's supported rank is decomposed by AnydimAsyncDmaTransferAdapter into a loop of lower-rank copies that all share the single per-tensor request handle (future). With the GAP9 L3 DMA now asynchronous, those copies are issued back-to-back and the in-flight ones overwrite the shared handle before it is waited -> request reuse -> UDMA overrun, observed as a cluster hang spinning in pi_cl_ram_copy_wait. Wait for each decomposed sub-transfer before issuing the next so only one copy is ever in flight on the shared handle. Non-decomposed transfers (the common rank-2 case) are untouched and stay fully asynchronous, so the double-buffer compute/transfer overlap is preserved for them. Add a Transpose3D regression test (small-L1 NCHW->NHWC transpose) whose L3 tile tiles into a rank-3 transfer and exercises the decomposition path; it hangs without this fix and passes with it. Extend the GAP9 L3 double-buffer model list with the validated MLPerf / Attention / Transformer networks.
Victor-Jung
left a comment
There was a problem hiding this comment.
Thanks for the fix, I also suspect the mchan_transfer_t struct to be accessed by the mchan_transfer_push_2d or other runtime functions while being overwritten...
Is the change to AnydimAsyncDmaTransferAdapter only affecting L3 transfers or also L2 transfers?
| # transpose tiles into a rank-3 transfer that the Anydim adapter splits into a loop | ||
| # of 2D copies sharing one request handle. Hangs under a naive async L3 DMA; passes | ||
| # once the decomposition waits each sub-copy. | ||
| "Kernels/FP32/Transpose3D": [2000], |
There was a problem hiding this comment.
This is not a model and should be moved into a L3_DOUBLEBUFFER_KERNELS category.
Port of pulp-platform/Deeploy#198. AnydimAsyncDmaTransferAdapter decomposes a transfer whose rank exceeds the DMA's supported rank into a loop of sub-transfers (e.g. rank-3 L3 tile -> loop of 2D pi_cl_ram_copy_2d). All the sub-transfers share the single per-tensor request handle (future). Issued through an async DMA, the in-flight sub-transfers overwrite the handle before it is waited -> request reuse -> UDMA overrun -> hang in the downstream kernel (observed on GAP9 as im2col_conv2d_fw_kernel hang under async L3 double-buffer for CHW convs). Fix at the adapter level: emit a future.wait() after each decomposed sub-transfer so only one sub-copy is ever in flight on the shared handle. Only serializes the decomposition; non-decomposed (rank <= kernelRank) transfers stay fully async. This makes async L3 DMA correct for CHW convs and strided ConvGrad transfers without needing the per-target blocking wrappers or per-target multi-request pools. Once pulp-platform/Deeploy#198 merges and we rebase upstream, this commit can be dropped in favour of the upstream version.
…c DMA With the Anydim decomposition fix in place (see previous commit, port of pulp-platform/Deeploy#198), the GAP9 L3 async DMA is safe for every transfer shape, including CHW conv per-channel decompositions and strided 2D ConvGrad. The BlockingDmaFromAsyncDmaAdapter wrapper around gap9L3DmaHack was there only to sidestep the Anydim bug; drop it. - gap9L3DmaHack now == GAP9L3Dma() directly (async). - BlockingDmaFromAsyncDmaAdapter import removed from L3Dma.py. - The two PULPL3Tiling call sites in GAP9(Cluster)Transformer collapse from PULPL3Tiling(..., gap9L3DmaHack, dbDma = GAP9L3Dma()) to plain PULPL3Tiling(..., gap9L3DmaHack) — SB and DB share the same async DMA, no more split. The now-unused GAP9ClusterBlockingDBTransformer (routing CHW convs through a blocking DB hop) is left alone in this commit for reviewability and dropped in the next one — that way, if some CI job trips on this change, only this commit needs to be reverted, not the transformer deletion + routing swap on top.
…onvs GAP9ClusterBlockingDBTransformer existed solely to route CHW forward convs through a fully blocking L3 hop, because the per-channel decomposition under an async DB hop reused the shared per-tensor request handle -> UDMA overrun -> hang in im2col_conv2d_fw_kernel. With the Anydim adapter now serializing decomposed sub-transfers on the shared handle (port of pulp-platform/Deeploy#198, two commits back), the async L3 hop is safe for CHW convs too. - Delete GAP9ClusterBlockingDBTransformer entirely. - Reroute GAP9FloatConv2DCHWBindings and GAP9FloatDWConv2DCHWBindings from GAP9ClusterBlockingDBTransformer to plain GAP9ClusterTransformer, so CHW convs now get the same async L3 DB overlap as HWC convs. Expected on MobileNetV1 training DB: recovers the ~3% CHW-DB perf that the parent revert of PR #39 gave up, and pulls even with (or ahead of) the MultiReq pool approach — but through the simpler ~7-line adapter fix instead of ~90 lines of pool + stamping infrastructure.
Same story as the previous GAP9 commit, on the PULPOpen (Siracusa) side. The l3DmaBlocking variable existed only to give SB hops a blocking L3 DMA because 'strided-2D deferred-wait corruption/deadlock risk' — that risk is the same Anydim decomposition bug the adapter fix (port of pulp-platform/Deeploy#198) now covers. - Delete l3DmaBlocking + drop BlockingDmaFromAsyncDmaAdapter import from PULPOpen/DMA/L3Dma.py. - Both PULPL3Tiling call sites in PULPOpen/Bindings.py (Transformer, ClusterTransformer) collapse to PULPL3Tiling('L3', 'L2', l3DmaHack) — SB and DB share the same async DMA. Siracusa L3 SB tests (ResNet8/MobileNetV1/CCT training + inference) were the only consumers of the old blocking path. They now run through the async DMA + the serialize-in-decomposition adapter fix. If any of them regress in CI, drop this commit and the preceding cleanup(gap9) commits in reverse order until the failure clears — the adapter fix commit itself should be kept regardless since it is required by the rest.
…platform/Deeploy#198 (#43) * revert(gap9): drop MultiReq L3 DMA — superseded by pulp-platform/Deeploy#198 Reverts c799815 (PR #39, "multi-request async L3 DMA — unlock double- buffering for CHW convs"). Upstream pulp-platform/Deeploy#198 ("Change GAP9 L3 dma to Async") fixes the same underlying bug in seven lines: * root cause (both PRs agree): AnydimAsyncDmaTransferAdapter decomposes a rank>2 L3 transfer into a loop of 2D pi_cl_ram_copy_2d sub-copies that all share a single pi_cl_ram_req_t handle. Under async double-buffering the in-flight sub-copies overwrite the handle before it is waited -> UDMA overrun -> hang in im2col_conv2d_fw_kernel. * #198's fix: append future.wait() after each decomposed sub-copy inside AnydimAsyncDmaTransferAdapter.transfer. Serializes only the decomposition on the shared handle; non-decomposed transfers stay fully async. gap9L3DmaHack drops the BlockingDmaFromAsyncDmaAdapter and becomes plain GAP9L3Dma() -- correct for every path, including CHW convs, without any codegen or binding changes. * PR #39's fix (this revert): a per-tensor pool of pi_cl_ram_req_t sized to the exact decomposition count, plus new stamping fields (_maxReq / _poolSize / _reqIndexExpr) on futures, plus a dedicated GAP9ClusterBlockingDBTransformer routing CHW convs through the pool. ~90 lines of codegen + a new transformer + a new binding path. Both fixes are correct. On GAP9's single-channel HyperRAM the "pool of requests" concurrency is not realizable in hardware -- sub-copies still queue on the one DMA engine -- so the measured MobileNetV1 DB win in PR #39 (264.7M -> 256.2M cyc, -3.2%) is the "blocking -> async" gain, which #198 achieves the simpler way. Cost of this revert until upstream #198 lands and we rebase: * MobileNetV1 training L3 DB regresses ~3% (CHW conv's dbDma falls back to the blocking gap9L3DmaHack, same state as before PR #39). * ResNet8 (HWC) and CCT L3 DB: unaffected (never used MultiReq). * All models remain functionally correct. Once #198 is merged, a follow-up will also collapse the SB=blocking / DB=async split in GAP9Transformer / GAP9ClusterTransformer (currently justified by "strided 2D ConvGrad L3 transfers"), and delete the whole GAP9ClusterBlockingDBTransformer variant + its routing. * fix(dma): serialize Anydim decomposition on shared future handle Port of pulp-platform/Deeploy#198. AnydimAsyncDmaTransferAdapter decomposes a transfer whose rank exceeds the DMA's supported rank into a loop of sub-transfers (e.g. rank-3 L3 tile -> loop of 2D pi_cl_ram_copy_2d). All the sub-transfers share the single per-tensor request handle (future). Issued through an async DMA, the in-flight sub-transfers overwrite the handle before it is waited -> request reuse -> UDMA overrun -> hang in the downstream kernel (observed on GAP9 as im2col_conv2d_fw_kernel hang under async L3 double-buffer for CHW convs). Fix at the adapter level: emit a future.wait() after each decomposed sub-transfer so only one sub-copy is ever in flight on the shared handle. Only serializes the decomposition; non-decomposed (rank <= kernelRank) transfers stay fully async. This makes async L3 DMA correct for CHW convs and strided ConvGrad transfers without needing the per-target blocking wrappers or per-target multi-request pools. Once pulp-platform/Deeploy#198 merges and we rebase upstream, this commit can be dropped in favour of the upstream version. * cleanup(gap9): drop L3 blocking wrapper, collapse SB/DB into one async DMA With the Anydim decomposition fix in place (see previous commit, port of pulp-platform/Deeploy#198), the GAP9 L3 async DMA is safe for every transfer shape, including CHW conv per-channel decompositions and strided 2D ConvGrad. The BlockingDmaFromAsyncDmaAdapter wrapper around gap9L3DmaHack was there only to sidestep the Anydim bug; drop it. - gap9L3DmaHack now == GAP9L3Dma() directly (async). - BlockingDmaFromAsyncDmaAdapter import removed from L3Dma.py. - The two PULPL3Tiling call sites in GAP9(Cluster)Transformer collapse from PULPL3Tiling(..., gap9L3DmaHack, dbDma = GAP9L3Dma()) to plain PULPL3Tiling(..., gap9L3DmaHack) — SB and DB share the same async DMA, no more split. The now-unused GAP9ClusterBlockingDBTransformer (routing CHW convs through a blocking DB hop) is left alone in this commit for reviewability and dropped in the next one — that way, if some CI job trips on this change, only this commit needs to be reverted, not the transformer deletion + routing swap on top. * cleanup(gap9): delete GAP9ClusterBlockingDBTransformer, reroute CHW convs GAP9ClusterBlockingDBTransformer existed solely to route CHW forward convs through a fully blocking L3 hop, because the per-channel decomposition under an async DB hop reused the shared per-tensor request handle -> UDMA overrun -> hang in im2col_conv2d_fw_kernel. With the Anydim adapter now serializing decomposed sub-transfers on the shared handle (port of pulp-platform/Deeploy#198, two commits back), the async L3 hop is safe for CHW convs too. - Delete GAP9ClusterBlockingDBTransformer entirely. - Reroute GAP9FloatConv2DCHWBindings and GAP9FloatDWConv2DCHWBindings from GAP9ClusterBlockingDBTransformer to plain GAP9ClusterTransformer, so CHW convs now get the same async L3 DB overlap as HWC convs. Expected on MobileNetV1 training DB: recovers the ~3% CHW-DB perf that the parent revert of PR #39 gave up, and pulls even with (or ahead of) the MultiReq pool approach — but through the simpler ~7-line adapter fix instead of ~90 lines of pool + stamping infrastructure. * cleanup(pulpopen): drop l3DmaBlocking, PULPL3Tiling now single async DMA Same story as the previous GAP9 commit, on the PULPOpen (Siracusa) side. The l3DmaBlocking variable existed only to give SB hops a blocking L3 DMA because 'strided-2D deferred-wait corruption/deadlock risk' — that risk is the same Anydim decomposition bug the adapter fix (port of pulp-platform/Deeploy#198) now covers. - Delete l3DmaBlocking + drop BlockingDmaFromAsyncDmaAdapter import from PULPOpen/DMA/L3Dma.py. - Both PULPL3Tiling call sites in PULPOpen/Bindings.py (Transformer, ClusterTransformer) collapse to PULPL3Tiling('L3', 'L2', l3DmaHack) — SB and DB share the same async DMA. Siracusa L3 SB tests (ResNet8/MobileNetV1/CCT training + inference) were the only consumers of the old blocking path. They now run through the async DMA + the serialize-in-decomposition adapter fix. If any of them regress in CI, drop this commit and the preceding cleanup(gap9) commits in reverse order until the failure clears — the adapter fix commit itself should be kept regardless since it is required by the rest. * style: drop unused GAP9L3Dma import from Bindings.py Left over from the previous cleanup commit which removed the last GAP9L3Dma() call site (the SB/DB split's dbDma argument). Only gap9L3DmaHack is still imported from L3Dma. Fixes: pre-commit / autoflake (Lint & Licenses CI). * chore: apply pre-commit auto-fixes for pre-existing drift on devel devel is already lint-red on these two files; pre-commit --all-files on CI catches them for every PR. Applying yapf + clang-format outputs here so this PR passes Lint & Licenses without waiting for a separate housekeeping PR. Both fixes are format-only: * MemoryAllocation.py: yapf drops one duplicated space in a keyword arg. * deeploytraintest.c: clang-format re-wraps two comment blocks (SLAVESTACKSIZE and stack-placement explanation) at a shorter width. No functional change.
Switch the GAP9 L3 (HyperRAM) DMA from the blocking adapter to the native asynchronous GAP9L3Dma, so the L3↔L2 hop overlaps with cluster compute under double-buffering instead of stalling on every tile transfer.
AnydimAsyncDmaTransferAdapterdecomposes a rank-N transfer into a loop of rank-Ksub-transfers, and reuses one per-tensor Future — i.e. one pi_cl_ram_req_t struct
— across every iteration:
pi_cl_ram_req_t tensor_input = {0}; // one struct, shared by all iterationsThe src / dst / size / stride / length arguments do differ per iteration, but
pi_cl_ram_req_t is not a passive argument container. PMSIS uses it as the
driver's own queue node:
Every pi_cl_ram_copy_2d(&req, …) (a) stamps the parameter fields into the struct,
(b) links req->next = queue_head; queue_head = &req, so the user-owned struct
becomes part of the driver's request queue, then (c) programs µDMA and returns.
When iter-1 fires while iter-0 is still in flight, three things break at once on
the same memory:
iter-1's values → the pending µDMA descriptor is corrupted;
self-referential;
can't distinguish which iteration finished.
Result: µDMA reads a garbled descriptor, the queue traversal wedges, and the
cluster hangs in pi_cl_ram_copy_wait — the hang observed on GAP9 CHW convs and
strided 2D ConvGrad L3 transfers.
Generated C:
Added
Changed
Fixed
Trade-off: the decomposed rank-3 sub-copies are serialized, so those specific transfers don't overlap with compute. This affects only rank-3 (e.g. channels-first) L3 tiles, which no upstream network produces — every upstream GAP9/Siracusa network uses rank-2 L3 transfers and keeps full async overlap. If overlap is ever needed for the decomposed case, a per-sub-transfer request pool can be layered on top without changing this path.
PR Merge Checklist