Skip to content

Change GAP9 L3 dma to Async#198

Open
runwangdl wants to merge 3 commits into
pulp-platform:develfrom
runwangdl:asyncdma
Open

Change GAP9 L3 dma to Async#198
runwangdl wants to merge 3 commits into
pulp-platform:develfrom
runwangdl:asyncdma

Conversation

@runwangdl

@runwangdl runwangdl commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Switch the GAP9 L3 (HyperRAM) DMA from the blocking adapter to the native asynchronous GAP9L3Dma, so the L3↔L2 hop overlaps with cluster compute under double-buffering instead of stalling on every tile transfer.

AnydimAsyncDmaTransferAdapter decomposes a rank-N transfer into a loop of rank-K
sub-transfers, and reuses one per-tensor Future — i.e. one pi_cl_ram_req_t struct
— across every iteration:
pi_cl_ram_req_t tensor_input = {0}; // one struct, shared by all iterations

  for (int i = 0; i < N; i++) {
      void*    dst_i = local    + i * loc_stride;
      uint32_t src_i = external + i * ext_stride;
      pi_cl_ram_copy_2d(get_ram_ptr(),
                        src_i, dst_i, size, stride, length, ext2loc,
                        &tensor_input);       // fires async, returns immediately
  }
  pi_cl_ram_copy_wait(&tensor_input);         // one wait after the whole loop

The src / dst / size / stride / length arguments do differ per iteration, but
pi_cl_ram_req_t is not a passive argument container. PMSIS uses it as the
driver's own queue node:

  struct pi_cl_ram_req_s {
      void            *addr;         // transfer params
      uint32_t         pi_ram_addr;  //   (overwritten by every pi_cl_ram_copy_2d)
      uint32_t         size;
      int32_t          stride;
      uint32_t         length;
      pi_evt_t         event;        // driver-owned completion event
      struct pi_cl_ram_req_s *next;  // driver's linked-list link
      uint8_t          done;         // driver-owned completion flag
      // ...
  };

Every pi_cl_ram_copy_2d(&req, …) (a) stamps the parameter fields into the struct,
(b) links req->next = queue_head; queue_head = &req, so the user-owned struct
becomes part of the driver's request queue, then (c) programs µDMA and returns.

When iter-1 fires while iter-0 is still in flight, three things break at once on
the same memory:

  • iter-0's addr / pi_ram_addr / size / stride / length are overwritten with
    iter-1's values → the pending µDMA descriptor is corrupted;
  • req->next = queue_head = &req → the driver's linked list becomes
    self-referential;
  • done / event can only track one completion → the final pi_cl_ram_copy_wait
    can't distinguish which iteration finished.

Result: µDMA reads a garbled descriptor, the queue traversal wedges, and the
cluster hangs in pi_cl_ram_copy_wait — the hang observed on GAP9 CHW convs and
strided 2D ConvGrad L3 transfers.

Generated C:

  for (int i = 0; i < N; i++) {
      pi_cl_ram_copy_2d(..., &tensor_input);          // fire
      if (tensor_input.size != 0) {                   // ← inline wait
          pi_cl_ram_copy_wait(&tensor_input);
      }
  }

Added

  • Tests/Kernels/FP32/Transpose3D regression test: a small-L1 NCHW→NHWC transpose whose L3 tile becomes a rank-3 transfer and exercises the Anydim decomposition path. Hangs without the fix, passes with it.
  • GAP9 L3 double-buffer coverage: added the validated MLPerf (KeywordSpotting / ImageClassification / AnomalyDetection), Attention and Transformer networks to L3_DOUBLEBUFFER_MODELS.

Changed

  • gap9L3DmaHack now uses the asynchronous GAP9L3Dma() directly instead of wrapping it in BlockingDmaFromAsyncDmaAdapter, so the L3 double-buffer hop overlaps transfers with compute.

Fixed

  • AnydimAsyncDmaTransferAdapter.transfer: when a transfer is decomposed into a loop of lower-rank copies, wait for each sub-copy before issuing the next, so only one copy is ever in flight on the shared request handle. This prevents the async request-reuse overrun/hang on decomposed (rank > kernel-rank) transfers. Non-decomposed (rank-2) transfers are untouched and stay fully asynchronous, preserving the double-buffer overlap.
  • Dropped the now-unused BlockingDmaFromAsyncDmaAdapter import in GAP9/DMA/L3Dma.py.

Trade-off: the decomposed rank-3 sub-copies are serialized, so those specific transfers don't overlap with compute. This affects only rank-3 (e.g. channels-first) L3 tiles, which no upstream network produces — every upstream GAP9/Siracusa network uses rank-2 L3 transfers and keeps full async overlap. If overlap is ever needed for the decomposed case, a per-sub-transfer request pool can be layered on top without changing this path.

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

gap9L3DmaHack now uses GAP9L3Dma() directly, and decomposed async DMA transfers now wait on the shared future after each sub-transfer. GAP9 tiled test config entries were extended for the affected models and kernels.

Changes

GAP9 DMA updates

Layer / File(s) Summary
Direct GAP9 L3 DMA export
Deeploy/Targets/GAP9/DMA/L3Dma.py
gap9L3DmaHack is assigned GAP9L3Dma() directly and the removed adapter import is dropped.
Wait after decomposed DMA issue
Deeploy/TilingExtension/AsyncDma.py, DeeployTest/test_gap9_tiled_config.py
AnydimAsyncDmaTransferAdapter.transfer appends future.wait() in the decomposition path, and GAP9 tiled test config adds model and kernel entries for the updated async-L3 cases.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

Feature

Suggested reviewers

  • Victor-Jung
  • Xeratec
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly matches the main change: switching GAP9 L3 DMA to the async path.
Description check ✅ Passed The description matches the changes, covering the async GAP9 L3 DMA switch, decomposed DMA wait fix, and added regression/test coverage.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

After switching gap9L3DmaHack to the async GAP9L3Dma(), the blocking
adapter is no longer referenced. Remove the now-unused import and the
dead commented-out line to satisfy autoflake/yapf linting.
A transfer whose rank exceeds the DMA's supported rank is decomposed by
AnydimAsyncDmaTransferAdapter into a loop of lower-rank copies that all
share the single per-tensor request handle (future). With the GAP9 L3 DMA
now asynchronous, those copies are issued back-to-back and the in-flight
ones overwrite the shared handle before it is waited -> request reuse ->
UDMA overrun, observed as a cluster hang spinning in pi_cl_ram_copy_wait.

Wait for each decomposed sub-transfer before issuing the next so only one
copy is ever in flight on the shared handle. Non-decomposed transfers
(the common rank-2 case) are untouched and stay fully asynchronous, so the
double-buffer compute/transfer overlap is preserved for them.

Add a Transpose3D regression test (small-L1 NCHW->NHWC transpose) whose L3
tile tiles into a rank-3 transfer and exercises the decomposition path; it
hangs without this fix and passes with it. Extend the GAP9 L3 double-buffer
model list with the validated MLPerf / Attention / Transformer networks.
@runwangdl runwangdl changed the title Change gap9 L3 dma to async Change GAP9 L3 dma to Async Jun 29, 2026

@Victor-Jung Victor-Jung left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, I also suspect the mchan_transfer_t struct to be accessed by the mchan_transfer_push_2d or other runtime functions while being overwritten...

Is the change to AnydimAsyncDmaTransferAdapter only affecting L3 transfers or also L2 transfers?

# transpose tiles into a rank-3 transfer that the Anydim adapter splits into a loop
# of 2D copies sharing one request handle. Hangs under a naive async L3 DMA; passes
# once the decomposition waits each sub-copy.
"Kernels/FP32/Transpose3D": [2000],

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a model and should be moved into a L3_DOUBLEBUFFER_KERNELS category.

runwangdl added a commit to runwangdl/TrainDeeploy that referenced this pull request Jul 2, 2026
Port of pulp-platform/Deeploy#198.

AnydimAsyncDmaTransferAdapter decomposes a transfer whose rank exceeds the
DMA's supported rank into a loop of sub-transfers (e.g. rank-3 L3 tile ->
loop of 2D pi_cl_ram_copy_2d). All the sub-transfers share the single
per-tensor request handle (future). Issued through an async DMA, the
in-flight sub-transfers overwrite the handle before it is waited -> request
reuse -> UDMA overrun -> hang in the downstream kernel (observed on GAP9 as
im2col_conv2d_fw_kernel hang under async L3 double-buffer for CHW convs).

Fix at the adapter level: emit a future.wait() after each decomposed
sub-transfer so only one sub-copy is ever in flight on the shared handle.
Only serializes the decomposition; non-decomposed (rank <= kernelRank)
transfers stay fully async. This makes async L3 DMA correct for CHW convs
and strided ConvGrad transfers without needing the per-target blocking
wrappers or per-target multi-request pools.

Once pulp-platform/Deeploy#198 merges and we rebase upstream, this commit
can be dropped in favour of the upstream version.
runwangdl added a commit to runwangdl/TrainDeeploy that referenced this pull request Jul 2, 2026
…c DMA

With the Anydim decomposition fix in place (see previous commit, port of
pulp-platform/Deeploy#198), the GAP9 L3 async DMA is safe for every
transfer shape, including CHW conv per-channel decompositions and strided
2D ConvGrad. The BlockingDmaFromAsyncDmaAdapter wrapper around
gap9L3DmaHack was there only to sidestep the Anydim bug; drop it.

- gap9L3DmaHack now == GAP9L3Dma() directly (async).
- BlockingDmaFromAsyncDmaAdapter import removed from L3Dma.py.
- The two PULPL3Tiling call sites in GAP9(Cluster)Transformer collapse
  from PULPL3Tiling(..., gap9L3DmaHack, dbDma = GAP9L3Dma()) to plain
  PULPL3Tiling(..., gap9L3DmaHack) — SB and DB share the same async DMA,
  no more split.

The now-unused GAP9ClusterBlockingDBTransformer (routing CHW convs
through a blocking DB hop) is left alone in this commit for reviewability
and dropped in the next one — that way, if some CI job trips on this
change, only this commit needs to be reverted, not the transformer
deletion + routing swap on top.
runwangdl added a commit to runwangdl/TrainDeeploy that referenced this pull request Jul 2, 2026
…onvs

GAP9ClusterBlockingDBTransformer existed solely to route CHW forward convs
through a fully blocking L3 hop, because the per-channel decomposition
under an async DB hop reused the shared per-tensor request handle -> UDMA
overrun -> hang in im2col_conv2d_fw_kernel. With the Anydim adapter now
serializing decomposed sub-transfers on the shared handle (port of
pulp-platform/Deeploy#198, two commits back), the async L3 hop is safe
for CHW convs too.

- Delete GAP9ClusterBlockingDBTransformer entirely.
- Reroute GAP9FloatConv2DCHWBindings and GAP9FloatDWConv2DCHWBindings from
  GAP9ClusterBlockingDBTransformer to plain GAP9ClusterTransformer, so CHW
  convs now get the same async L3 DB overlap as HWC convs.

Expected on MobileNetV1 training DB: recovers the ~3% CHW-DB perf that
the parent revert of PR #39 gave up, and pulls even with (or ahead of) the
MultiReq pool approach — but through the simpler ~7-line adapter fix
instead of ~90 lines of pool + stamping infrastructure.
runwangdl added a commit to runwangdl/TrainDeeploy that referenced this pull request Jul 2, 2026
Same story as the previous GAP9 commit, on the PULPOpen (Siracusa) side.
The l3DmaBlocking variable existed only to give SB hops a blocking L3 DMA
because 'strided-2D deferred-wait corruption/deadlock risk' — that risk is
the same Anydim decomposition bug the adapter fix (port of
pulp-platform/Deeploy#198) now covers.

- Delete l3DmaBlocking + drop BlockingDmaFromAsyncDmaAdapter import from
  PULPOpen/DMA/L3Dma.py.
- Both PULPL3Tiling call sites in PULPOpen/Bindings.py (Transformer,
  ClusterTransformer) collapse to PULPL3Tiling('L3', 'L2', l3DmaHack) —
  SB and DB share the same async DMA.

Siracusa L3 SB tests (ResNet8/MobileNetV1/CCT training + inference) were
the only consumers of the old blocking path. They now run through the
async DMA + the serialize-in-decomposition adapter fix. If any of them
regress in CI, drop this commit and the preceding cleanup(gap9) commits
in reverse order until the failure clears — the adapter fix commit itself
should be kept regardless since it is required by the rest.
runwangdl added a commit to runwangdl/TrainDeeploy that referenced this pull request Jul 2, 2026
…platform/Deeploy#198 (#43)

* revert(gap9): drop MultiReq L3 DMA — superseded by pulp-platform/Deeploy#198

Reverts c799815 (PR #39, "multi-request async L3 DMA — unlock double-
buffering for CHW convs").

Upstream pulp-platform/Deeploy#198 ("Change GAP9 L3 dma to Async") fixes
the same underlying bug in seven lines:

  * root cause (both PRs agree):
    AnydimAsyncDmaTransferAdapter decomposes a rank>2 L3 transfer into a
    loop of 2D pi_cl_ram_copy_2d sub-copies that all share a single
    pi_cl_ram_req_t handle. Under async double-buffering the in-flight
    sub-copies overwrite the handle before it is waited -> UDMA overrun
    -> hang in im2col_conv2d_fw_kernel.

  * #198's fix: append future.wait() after each decomposed sub-copy
    inside AnydimAsyncDmaTransferAdapter.transfer. Serializes only the
    decomposition on the shared handle; non-decomposed transfers stay
    fully async. gap9L3DmaHack drops the BlockingDmaFromAsyncDmaAdapter
    and becomes plain GAP9L3Dma() -- correct for every path, including
    CHW convs, without any codegen or binding changes.

  * PR #39's fix (this revert): a per-tensor pool of pi_cl_ram_req_t
    sized to the exact decomposition count, plus new stamping fields
    (_maxReq / _poolSize / _reqIndexExpr) on futures, plus a dedicated
    GAP9ClusterBlockingDBTransformer routing CHW convs through the pool.
    ~90 lines of codegen + a new transformer + a new binding path.

Both fixes are correct. On GAP9's single-channel HyperRAM the "pool of
requests" concurrency is not realizable in hardware -- sub-copies still
queue on the one DMA engine -- so the measured MobileNetV1 DB win in
PR #39 (264.7M -> 256.2M cyc, -3.2%) is the "blocking -> async" gain,
which #198 achieves the simpler way.

Cost of this revert until upstream #198 lands and we rebase:

  * MobileNetV1 training L3 DB regresses ~3% (CHW conv's dbDma falls
    back to the blocking gap9L3DmaHack, same state as before PR #39).
  * ResNet8 (HWC) and CCT L3 DB: unaffected (never used MultiReq).
  * All models remain functionally correct.

Once #198 is merged, a follow-up will also collapse the SB=blocking /
DB=async split in GAP9Transformer / GAP9ClusterTransformer (currently
justified by "strided 2D ConvGrad L3 transfers"), and delete the whole
GAP9ClusterBlockingDBTransformer variant + its routing.

* fix(dma): serialize Anydim decomposition on shared future handle

Port of pulp-platform/Deeploy#198.

AnydimAsyncDmaTransferAdapter decomposes a transfer whose rank exceeds the
DMA's supported rank into a loop of sub-transfers (e.g. rank-3 L3 tile ->
loop of 2D pi_cl_ram_copy_2d). All the sub-transfers share the single
per-tensor request handle (future). Issued through an async DMA, the
in-flight sub-transfers overwrite the handle before it is waited -> request
reuse -> UDMA overrun -> hang in the downstream kernel (observed on GAP9 as
im2col_conv2d_fw_kernel hang under async L3 double-buffer for CHW convs).

Fix at the adapter level: emit a future.wait() after each decomposed
sub-transfer so only one sub-copy is ever in flight on the shared handle.
Only serializes the decomposition; non-decomposed (rank <= kernelRank)
transfers stay fully async. This makes async L3 DMA correct for CHW convs
and strided ConvGrad transfers without needing the per-target blocking
wrappers or per-target multi-request pools.

Once pulp-platform/Deeploy#198 merges and we rebase upstream, this commit
can be dropped in favour of the upstream version.

* cleanup(gap9): drop L3 blocking wrapper, collapse SB/DB into one async DMA

With the Anydim decomposition fix in place (see previous commit, port of
pulp-platform/Deeploy#198), the GAP9 L3 async DMA is safe for every
transfer shape, including CHW conv per-channel decompositions and strided
2D ConvGrad. The BlockingDmaFromAsyncDmaAdapter wrapper around
gap9L3DmaHack was there only to sidestep the Anydim bug; drop it.

- gap9L3DmaHack now == GAP9L3Dma() directly (async).
- BlockingDmaFromAsyncDmaAdapter import removed from L3Dma.py.
- The two PULPL3Tiling call sites in GAP9(Cluster)Transformer collapse
  from PULPL3Tiling(..., gap9L3DmaHack, dbDma = GAP9L3Dma()) to plain
  PULPL3Tiling(..., gap9L3DmaHack) — SB and DB share the same async DMA,
  no more split.

The now-unused GAP9ClusterBlockingDBTransformer (routing CHW convs
through a blocking DB hop) is left alone in this commit for reviewability
and dropped in the next one — that way, if some CI job trips on this
change, only this commit needs to be reverted, not the transformer
deletion + routing swap on top.

* cleanup(gap9): delete GAP9ClusterBlockingDBTransformer, reroute CHW convs

GAP9ClusterBlockingDBTransformer existed solely to route CHW forward convs
through a fully blocking L3 hop, because the per-channel decomposition
under an async DB hop reused the shared per-tensor request handle -> UDMA
overrun -> hang in im2col_conv2d_fw_kernel. With the Anydim adapter now
serializing decomposed sub-transfers on the shared handle (port of
pulp-platform/Deeploy#198, two commits back), the async L3 hop is safe
for CHW convs too.

- Delete GAP9ClusterBlockingDBTransformer entirely.
- Reroute GAP9FloatConv2DCHWBindings and GAP9FloatDWConv2DCHWBindings from
  GAP9ClusterBlockingDBTransformer to plain GAP9ClusterTransformer, so CHW
  convs now get the same async L3 DB overlap as HWC convs.

Expected on MobileNetV1 training DB: recovers the ~3% CHW-DB perf that
the parent revert of PR #39 gave up, and pulls even with (or ahead of) the
MultiReq pool approach — but through the simpler ~7-line adapter fix
instead of ~90 lines of pool + stamping infrastructure.

* cleanup(pulpopen): drop l3DmaBlocking, PULPL3Tiling now single async DMA

Same story as the previous GAP9 commit, on the PULPOpen (Siracusa) side.
The l3DmaBlocking variable existed only to give SB hops a blocking L3 DMA
because 'strided-2D deferred-wait corruption/deadlock risk' — that risk is
the same Anydim decomposition bug the adapter fix (port of
pulp-platform/Deeploy#198) now covers.

- Delete l3DmaBlocking + drop BlockingDmaFromAsyncDmaAdapter import from
  PULPOpen/DMA/L3Dma.py.
- Both PULPL3Tiling call sites in PULPOpen/Bindings.py (Transformer,
  ClusterTransformer) collapse to PULPL3Tiling('L3', 'L2', l3DmaHack) —
  SB and DB share the same async DMA.

Siracusa L3 SB tests (ResNet8/MobileNetV1/CCT training + inference) were
the only consumers of the old blocking path. They now run through the
async DMA + the serialize-in-decomposition adapter fix. If any of them
regress in CI, drop this commit and the preceding cleanup(gap9) commits
in reverse order until the failure clears — the adapter fix commit itself
should be kept regardless since it is required by the rest.

* style: drop unused GAP9L3Dma import from Bindings.py

Left over from the previous cleanup commit which removed the last
GAP9L3Dma() call site (the SB/DB split's dbDma argument). Only
gap9L3DmaHack is still imported from L3Dma.

Fixes: pre-commit / autoflake (Lint & Licenses CI).

* chore: apply pre-commit auto-fixes for pre-existing drift on devel

devel is already lint-red on these two files; pre-commit --all-files on
CI catches them for every PR. Applying yapf + clang-format outputs here
so this PR passes Lint & Licenses without waiting for a separate housekeeping PR.

Both fixes are format-only:
  * MemoryAllocation.py: yapf drops one duplicated space in a keyword arg.
  * deeploytraintest.c: clang-format re-wraps two comment blocks
    (SLAVESTACKSIZE and stack-placement explanation) at a shorter width.

No functional change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants