Change GAP9 L3 dma to Async by runwangdl · Pull Request #198 · pulp-platform/Deeploy

runwangdl · 2026-06-29T21:25:07Z

Switch the GAP9 L3 (HyperRAM) DMA from the blocking adapter to the native asynchronous GAP9L3Dma, so the L3↔L2 hop overlaps with cluster compute under double-buffering instead of stalling on every tile transfer.

AnydimAsyncDmaTransferAdapter decomposes a rank-N transfer into a loop of rank-K
sub-transfers, and reuses one per-tensor Future — i.e. one pi_cl_ram_req_t struct
— across every iteration:
pi_cl_ram_req_t tensor_input = {0}; // one struct, shared by all iterations

  for (int i = 0; i < N; i++) {
      void*    dst_i = local    + i * loc_stride;
      uint32_t src_i = external + i * ext_stride;
      pi_cl_ram_copy_2d(get_ram_ptr(),
                        src_i, dst_i, size, stride, length, ext2loc,
                        &tensor_input);       // fires async, returns immediately
  }
  pi_cl_ram_copy_wait(&tensor_input);         // one wait after the whole loop

The src / dst / size / stride / length arguments do differ per iteration, but
pi_cl_ram_req_t is not a passive argument container. PMSIS uses it as the
driver's own queue node:

  struct pi_cl_ram_req_s {
      void            *addr;         // transfer params
      uint32_t         pi_ram_addr;  //   (overwritten by every pi_cl_ram_copy_2d)
      uint32_t         size;
      int32_t          stride;
      uint32_t         length;
      pi_evt_t         event;        // driver-owned completion event
      struct pi_cl_ram_req_s *next;  // driver's linked-list link
      uint8_t          done;         // driver-owned completion flag
      // ...
  };

Every pi_cl_ram_copy_2d(&req, …) (a) stamps the parameter fields into the struct,
(b) links req->next = queue_head; queue_head = &req, so the user-owned struct
becomes part of the driver's request queue, then (c) programs µDMA and returns.

When iter-1 fires while iter-0 is still in flight, three things break at once on
the same memory:

iter-0's addr / pi_ram_addr / size / stride / length are overwritten with
iter-1's values → the pending µDMA descriptor is corrupted;
req->next = queue_head = &req → the driver's linked list becomes
self-referential;
done / event can only track one completion → the final pi_cl_ram_copy_wait
can't distinguish which iteration finished.

Result: µDMA reads a garbled descriptor, the queue traversal wedges, and the
cluster hangs in pi_cl_ram_copy_wait — the hang observed on GAP9 CHW convs and
strided 2D ConvGrad L3 transfers.

Generated C:

  for (int i = 0; i < N; i++) {
      pi_cl_ram_copy_2d(..., &tensor_input);          // fire
      if (tensor_input.size != 0) {                   // ← inline wait
          pi_cl_ram_copy_wait(&tensor_input);
      }
  }

Added

Tests/Kernels/FP32/Transpose3D regression test: a small-L1 NCHW→NHWC transpose whose L3 tile becomes a rank-3 transfer and exercises the Anydim decomposition path. Hangs without the fix, passes with it.
GAP9 L3 double-buffer coverage: added the validated MLPerf (KeywordSpotting / ImageClassification / AnomalyDetection), Attention and Transformer networks to L3_DOUBLEBUFFER_MODELS.

Changed

gap9L3DmaHack now uses the asynchronous GAP9L3Dma() directly instead of wrapping it in BlockingDmaFromAsyncDmaAdapter, so the L3 double-buffer hop overlaps transfers with compute.

Fixed

AnydimAsyncDmaTransferAdapter.transfer: when a transfer is decomposed into a loop of lower-rank copies, wait for each sub-copy before issuing the next, so only one copy is ever in flight on the shared request handle. This prevents the async request-reuse overrun/hang on decomposed (rank > kernel-rank) transfers. Non-decomposed (rank-2) transfers are untouched and stay fully asynchronous, preserving the double-buffer overlap.
Dropped the now-unused BlockingDmaFromAsyncDmaAdapter import in GAP9/DMA/L3Dma.py.

Trade-off: the decomposed rank-3 sub-copies are serialized, so those specific transfers don't overlap with compute. This affects only rank-3 (e.g. channels-first) L3 tiles, which no upstream network produces — every upstream GAP9/Siracusa network uses rank-2 L3 transfers and keeps full async overlap. If overlap is ever needed for the decomposed case, a per-sub-transfer request pool can be layered on top without changing this path.

PR Merge Checklist

The PR is rebased on the latest devel commit and pointing to devel.
Your PR reviewed and approved.
All checks are passing.
The CHANGELOG.md file has been updated.
If the docker was modified, change back its link after review.

coderabbitai · 2026-06-29T21:28:33Z

📝 Walkthrough

Walkthrough

gap9L3DmaHack now uses GAP9L3Dma() directly, and decomposed async DMA transfers now wait on the shared future after each sub-transfer. GAP9 tiled test config entries were extended for the affected models and kernels.

Changes

GAP9 DMA updates

Layer / File(s)	Summary
Direct GAP9 L3 DMA export `Deeploy/Targets/GAP9/DMA/L3Dma.py`	`gap9L3DmaHack` is assigned `GAP9L3Dma()` directly and the removed adapter import is dropped.
Wait after decomposed DMA issue `Deeploy/TilingExtension/AsyncDma.py`, `DeeployTest/test_gap9_tiled_config.py`	`AnydimAsyncDmaTransferAdapter.transfer` appends `future.wait()` in the decomposition path, and GAP9 tiled test config adds model and kernel entries for the updated async-L3 cases.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

pulp-platform/Deeploy#143: Overlaps the GAP9 L3 DMA export change by updating gap9L3DmaHack to use GAP9L3Dma() directly.

Suggested labels

Feature

Suggested reviewers

Victor-Jung
Xeratec

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly matches the main change: switching GAP9 L3 DMA to the async path.
Description check	✅ Passed	The description matches the changes, covering the async GAP9 L3 DMA switch, decomposed DMA wait fix, and added regression/test coverage.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

After switching gap9L3DmaHack to the async GAP9L3Dma(), the blocking adapter is no longer referenced. Remove the now-unused import and the dead commented-out line to satisfy autoflake/yapf linting.

A transfer whose rank exceeds the DMA's supported rank is decomposed by AnydimAsyncDmaTransferAdapter into a loop of lower-rank copies that all share the single per-tensor request handle (future). With the GAP9 L3 DMA now asynchronous, those copies are issued back-to-back and the in-flight ones overwrite the shared handle before it is waited -> request reuse -> UDMA overrun, observed as a cluster hang spinning in pi_cl_ram_copy_wait. Wait for each decomposed sub-transfer before issuing the next so only one copy is ever in flight on the shared handle. Non-decomposed transfers (the common rank-2 case) are untouched and stay fully asynchronous, so the double-buffer compute/transfer overlap is preserved for them. Add a Transpose3D regression test (small-L1 NCHW->NHWC transpose) whose L3 tile tiles into a rank-3 transfer and exercises the decomposition path; it hangs without this fix and passes with it. Extend the GAP9 L3 double-buffer model list with the validated MLPerf / Attention / Transformer networks.

Victor-Jung

Thanks for the fix, I also suspect the mchan_transfer_t struct to be accessed by the mchan_transfer_push_2d or other runtime functions while being overwritten...

Is the change to AnydimAsyncDmaTransferAdapter only affecting L3 transfers or also L2 transfers?

Victor-Jung · 2026-07-01T09:10:05Z

+    # transpose tiles into a rank-3 transfer that the Anydim adapter splits into a loop
+    # of 2D copies sharing one request handle. Hangs under a naive async L3 DMA; passes
+    # once the decomposition waits each sub-copy.
+    "Kernels/FP32/Transpose3D": [2000],


This is not a model and should be moved into a L3_DOUBLEBUFFER_KERNELS category.

Port of pulp-platform/Deeploy#198. AnydimAsyncDmaTransferAdapter decomposes a transfer whose rank exceeds the DMA's supported rank into a loop of sub-transfers (e.g. rank-3 L3 tile -> loop of 2D pi_cl_ram_copy_2d). All the sub-transfers share the single per-tensor request handle (future). Issued through an async DMA, the in-flight sub-transfers overwrite the handle before it is waited -> request reuse -> UDMA overrun -> hang in the downstream kernel (observed on GAP9 as im2col_conv2d_fw_kernel hang under async L3 double-buffer for CHW convs). Fix at the adapter level: emit a future.wait() after each decomposed sub-transfer so only one sub-copy is ever in flight on the shared handle. Only serializes the decomposition; non-decomposed (rank <= kernelRank) transfers stay fully async. This makes async L3 DMA correct for CHW convs and strided ConvGrad transfers without needing the per-target blocking wrappers or per-target multi-request pools. Once pulp-platform/Deeploy#198 merges and we rebase upstream, this commit can be dropped in favour of the upstream version.

…c DMA With the Anydim decomposition fix in place (see previous commit, port of pulp-platform/Deeploy#198), the GAP9 L3 async DMA is safe for every transfer shape, including CHW conv per-channel decompositions and strided 2D ConvGrad. The BlockingDmaFromAsyncDmaAdapter wrapper around gap9L3DmaHack was there only to sidestep the Anydim bug; drop it. - gap9L3DmaHack now == GAP9L3Dma() directly (async). - BlockingDmaFromAsyncDmaAdapter import removed from L3Dma.py. - The two PULPL3Tiling call sites in GAP9(Cluster)Transformer collapse from PULPL3Tiling(..., gap9L3DmaHack, dbDma = GAP9L3Dma()) to plain PULPL3Tiling(..., gap9L3DmaHack) — SB and DB share the same async DMA, no more split. The now-unused GAP9ClusterBlockingDBTransformer (routing CHW convs through a blocking DB hop) is left alone in this commit for reviewability and dropped in the next one — that way, if some CI job trips on this change, only this commit needs to be reverted, not the transformer deletion + routing swap on top.

…onvs GAP9ClusterBlockingDBTransformer existed solely to route CHW forward convs through a fully blocking L3 hop, because the per-channel decomposition under an async DB hop reused the shared per-tensor request handle -> UDMA overrun -> hang in im2col_conv2d_fw_kernel. With the Anydim adapter now serializing decomposed sub-transfers on the shared handle (port of pulp-platform/Deeploy#198, two commits back), the async L3 hop is safe for CHW convs too. - Delete GAP9ClusterBlockingDBTransformer entirely. - Reroute GAP9FloatConv2DCHWBindings and GAP9FloatDWConv2DCHWBindings from GAP9ClusterBlockingDBTransformer to plain GAP9ClusterTransformer, so CHW convs now get the same async L3 DB overlap as HWC convs. Expected on MobileNetV1 training DB: recovers the ~3% CHW-DB perf that the parent revert of PR #39 gave up, and pulls even with (or ahead of) the MultiReq pool approach — but through the simpler ~7-line adapter fix instead of ~90 lines of pool + stamping infrastructure.

Same story as the previous GAP9 commit, on the PULPOpen (Siracusa) side. The l3DmaBlocking variable existed only to give SB hops a blocking L3 DMA because 'strided-2D deferred-wait corruption/deadlock risk' — that risk is the same Anydim decomposition bug the adapter fix (port of pulp-platform/Deeploy#198) now covers. - Delete l3DmaBlocking + drop BlockingDmaFromAsyncDmaAdapter import from PULPOpen/DMA/L3Dma.py. - Both PULPL3Tiling call sites in PULPOpen/Bindings.py (Transformer, ClusterTransformer) collapse to PULPL3Tiling('L3', 'L2', l3DmaHack) — SB and DB share the same async DMA. Siracusa L3 SB tests (ResNet8/MobileNetV1/CCT training + inference) were the only consumers of the old blocking path. They now run through the async DMA + the serialize-in-decomposition adapter fix. If any of them regress in CI, drop this commit and the preceding cleanup(gap9) commits in reverse order until the failure clears — the adapter fix commit itself should be kept regardless since it is required by the rest.

…platform/Deeploy#198 (#43) * revert(gap9): drop MultiReq L3 DMA — superseded by pulp-platform/Deeploy#198 Reverts c799815 (PR #39, "multi-request async L3 DMA — unlock double- buffering for CHW convs"). Upstream pulp-platform/Deeploy#198 ("Change GAP9 L3 dma to Async") fixes the same underlying bug in seven lines: * root cause (both PRs agree): AnydimAsyncDmaTransferAdapter decomposes a rank>2 L3 transfer into a loop of 2D pi_cl_ram_copy_2d sub-copies that all share a single pi_cl_ram_req_t handle. Under async double-buffering the in-flight sub-copies overwrite the handle before it is waited -> UDMA overrun -> hang in im2col_conv2d_fw_kernel. * #198's fix: append future.wait() after each decomposed sub-copy inside AnydimAsyncDmaTransferAdapter.transfer. Serializes only the decomposition on the shared handle; non-decomposed transfers stay fully async. gap9L3DmaHack drops the BlockingDmaFromAsyncDmaAdapter and becomes plain GAP9L3Dma() -- correct for every path, including CHW convs, without any codegen or binding changes. * PR #39's fix (this revert): a per-tensor pool of pi_cl_ram_req_t sized to the exact decomposition count, plus new stamping fields (_maxReq / _poolSize / _reqIndexExpr) on futures, plus a dedicated GAP9ClusterBlockingDBTransformer routing CHW convs through the pool. ~90 lines of codegen + a new transformer + a new binding path. Both fixes are correct. On GAP9's single-channel HyperRAM the "pool of requests" concurrency is not realizable in hardware -- sub-copies still queue on the one DMA engine -- so the measured MobileNetV1 DB win in PR #39 (264.7M -> 256.2M cyc, -3.2%) is the "blocking -> async" gain, which #198 achieves the simpler way. Cost of this revert until upstream #198 lands and we rebase: * MobileNetV1 training L3 DB regresses ~3% (CHW conv's dbDma falls back to the blocking gap9L3DmaHack, same state as before PR #39). * ResNet8 (HWC) and CCT L3 DB: unaffected (never used MultiReq). * All models remain functionally correct. Once #198 is merged, a follow-up will also collapse the SB=blocking / DB=async split in GAP9Transformer / GAP9ClusterTransformer (currently justified by "strided 2D ConvGrad L3 transfers"), and delete the whole GAP9ClusterBlockingDBTransformer variant + its routing. * fix(dma): serialize Anydim decomposition on shared future handle Port of pulp-platform/Deeploy#198. AnydimAsyncDmaTransferAdapter decomposes a transfer whose rank exceeds the DMA's supported rank into a loop of sub-transfers (e.g. rank-3 L3 tile -> loop of 2D pi_cl_ram_copy_2d). All the sub-transfers share the single per-tensor request handle (future). Issued through an async DMA, the in-flight sub-transfers overwrite the handle before it is waited -> request reuse -> UDMA overrun -> hang in the downstream kernel (observed on GAP9 as im2col_conv2d_fw_kernel hang under async L3 double-buffer for CHW convs). Fix at the adapter level: emit a future.wait() after each decomposed sub-transfer so only one sub-copy is ever in flight on the shared handle. Only serializes the decomposition; non-decomposed (rank <= kernelRank) transfers stay fully async. This makes async L3 DMA correct for CHW convs and strided ConvGrad transfers without needing the per-target blocking wrappers or per-target multi-request pools. Once pulp-platform/Deeploy#198 merges and we rebase upstream, this commit can be dropped in favour of the upstream version. * cleanup(gap9): drop L3 blocking wrapper, collapse SB/DB into one async DMA With the Anydim decomposition fix in place (see previous commit, port of pulp-platform/Deeploy#198), the GAP9 L3 async DMA is safe for every transfer shape, including CHW conv per-channel decompositions and strided 2D ConvGrad. The BlockingDmaFromAsyncDmaAdapter wrapper around gap9L3DmaHack was there only to sidestep the Anydim bug; drop it. - gap9L3DmaHack now == GAP9L3Dma() directly (async). - BlockingDmaFromAsyncDmaAdapter import removed from L3Dma.py. - The two PULPL3Tiling call sites in GAP9(Cluster)Transformer collapse from PULPL3Tiling(..., gap9L3DmaHack, dbDma = GAP9L3Dma()) to plain PULPL3Tiling(..., gap9L3DmaHack) — SB and DB share the same async DMA, no more split. The now-unused GAP9ClusterBlockingDBTransformer (routing CHW convs through a blocking DB hop) is left alone in this commit for reviewability and dropped in the next one — that way, if some CI job trips on this change, only this commit needs to be reverted, not the transformer deletion + routing swap on top. * cleanup(gap9): delete GAP9ClusterBlockingDBTransformer, reroute CHW convs GAP9ClusterBlockingDBTransformer existed solely to route CHW forward convs through a fully blocking L3 hop, because the per-channel decomposition under an async DB hop reused the shared per-tensor request handle -> UDMA overrun -> hang in im2col_conv2d_fw_kernel. With the Anydim adapter now serializing decomposed sub-transfers on the shared handle (port of pulp-platform/Deeploy#198, two commits back), the async L3 hop is safe for CHW convs too. - Delete GAP9ClusterBlockingDBTransformer entirely. - Reroute GAP9FloatConv2DCHWBindings and GAP9FloatDWConv2DCHWBindings from GAP9ClusterBlockingDBTransformer to plain GAP9ClusterTransformer, so CHW convs now get the same async L3 DB overlap as HWC convs. Expected on MobileNetV1 training DB: recovers the ~3% CHW-DB perf that the parent revert of PR #39 gave up, and pulls even with (or ahead of) the MultiReq pool approach — but through the simpler ~7-line adapter fix instead of ~90 lines of pool + stamping infrastructure. * cleanup(pulpopen): drop l3DmaBlocking, PULPL3Tiling now single async DMA Same story as the previous GAP9 commit, on the PULPOpen (Siracusa) side. The l3DmaBlocking variable existed only to give SB hops a blocking L3 DMA because 'strided-2D deferred-wait corruption/deadlock risk' — that risk is the same Anydim decomposition bug the adapter fix (port of pulp-platform/Deeploy#198) now covers. - Delete l3DmaBlocking + drop BlockingDmaFromAsyncDmaAdapter import from PULPOpen/DMA/L3Dma.py. - Both PULPL3Tiling call sites in PULPOpen/Bindings.py (Transformer, ClusterTransformer) collapse to PULPL3Tiling('L3', 'L2', l3DmaHack) — SB and DB share the same async DMA. Siracusa L3 SB tests (ResNet8/MobileNetV1/CCT training + inference) were the only consumers of the old blocking path. They now run through the async DMA + the serialize-in-decomposition adapter fix. If any of them regress in CI, drop this commit and the preceding cleanup(gap9) commits in reverse order until the failure clears — the adapter fix commit itself should be kept regardless since it is required by the rest. * style: drop unused GAP9L3Dma import from Bindings.py Left over from the previous cleanup commit which removed the last GAP9L3Dma() call site (the SB/DB split's dbDma argument). Only gap9L3DmaHack is still imported from L3Dma. Fixes: pre-commit / autoflake (Lint & Licenses CI). * chore: apply pre-commit auto-fixes for pre-existing drift on devel devel is already lint-red on these two files; pre-commit --all-files on CI catches them for every PR. Applying yapf + clang-format outputs here so this PR passes Lint & Licenses without waiting for a separate housekeeping PR. Both fixes are format-only: * MemoryAllocation.py: yapf drops one duplicated space in a keyword arg. * deeploytraintest.c: clang-format re-wraps two comment blocks (SLAVESTACKSIZE and stack-placement explanation) at a shorter width. No functional change.

change gap9 L3 dma to async

387745f

runwangdl requested review from Victor-Jung and Xeratec as code owners June 29, 2026 21:25

runwangdl added 2 commits June 29, 2026 22:06

fix(gap9): drop unused BlockingDmaFromAsyncDmaAdapter import

14f8a9f

After switching gap9L3DmaHack to the async GAP9L3Dma(), the blocking adapter is no longer referenced. Remove the now-unused import and the dead commented-out line to satisfy autoflake/yapf linting.

runwangdl changed the title ~~Change gap9 L3 dma to async~~ Change GAP9 L3 dma to Async Jun 29, 2026

Victor-Jung requested changes Jul 1, 2026

View reviewed changes

runwangdl mentioned this pull request Jul 2, 2026

cleanup(gap9/pulpopen): remove L3 DMA workarounds superseded by pulp-platform/Deeploy#198 runwangdl/TrainDeeploy#43

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Change GAP9 L3 dma to Async#198

Change GAP9 L3 dma to Async#198
runwangdl wants to merge 3 commits into
pulp-platform:develfrom
runwangdl:asyncdma

runwangdl commented Jun 29, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

Victor-Jung left a comment

Uh oh!

Victor-Jung Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

runwangdl commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

Victor-Jung left a comment

Choose a reason for hiding this comment

Uh oh!

Victor-Jung Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

runwangdl commented Jun 29, 2026 •

edited

Loading

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading