GAP9: -O3 hot kernels + hoist L2->L1 tile-control tables to L2 #199

Open
runwangdl wants to merge 2 commits into pulp-platform:devel from runwangdl:fix-gap9-l3-board
@runwangdl runwangdl commented Jul 2, 2026

Contributor

Fix

1. -O3 on the hot forward kernels

TargetLibraries/GAP9/CMakeLists.txt — compile Conv / DWConv / Gemm at -O3, appended last so it wins over the SDK's default -Os. This turns on the RISC-V (XpulpV2) hardware loops on the tight inner loops.

2. Hoist L2→L1 tile-control tables to L2

Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py::_hoistValues — one-line change:

cb._memoryLevel = "L2" if self.memory == "L1" else self.memory

The hoisted per-tile control tables (numTiles, DMA cmd, size, dim_im_*, padding_*, offsets) are read-only lookup tables that the cluster controller uses to drive the tiling loop and program DMAs. Previously the L2→L1 pass emitted them with _memoryLevel = self.memory, which is "L1" for that pass, so they landed in GAP9 L1 TCDM next to the cluster master stack.
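
To make the effect of the one-line change concrete, here is a minimal standalone sketch of the same selection rule. The function name `control_table_memory_level` and the level name `"TCM"` are hypothetical and used only for illustration; in the PR the expression sits inside `_hoistValues` and assigns to `cb._memoryLevel`.

```python
def control_table_memory_level(pass_memory: str) -> str:
    """Return the memory level for hoisted tile-control tables.

    pass_memory stands in for self.memory, i.e. the level the tiling
    pass tiles into. This mirrors the expression in the PR; it is not
    the Deeploy implementation itself.
    """
    # Only a pass that targets "L1" is redirected; any other level is kept.
    return "L2" if pass_memory == "L1" else pass_memory


# "TCM" is a made-up level name standing in for a platform without an "L1".
for level in ("L1", "L2", "TCM"):
    print(level, "->", control_table_memory_level(level))
```

Under this rule the L2→L1 pass places its tables in L2, the L3→L2 pass is unchanged, and a platform whose target level is not literally named "L1" sees no difference.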

L1 freed (FP32 training): ResNet8 ~2.5 KB · MobileNetV1 ~7.0 KB · CCT ~11.6 KB.

runwangdl added 2 commits July 2, 2026 12:08
Compile the conv / depthwise-conv / Gemm translation units at -O3, appended
last so it wins over the SDK's default -Os on the same files. Everything else
stays at -Os. -O3 turns on the RISC-V (XpulpV2) hardware loops on the kernels'
tight inner loops; on a forward conv the -O3 object has 18 lp.setup HW-loop
instructions vs 0 at -Os, at the cost of roughly 50% more .text on those files.

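
One way to reproduce a count like the one quoted above is to scan a textual disassembly of the object file for hardware-loop setup mnemonics. The sketch below is an assumption about how such a check could be scripted, not the method used in this PR: the helper name `count_hw_loop_setups` is hypothetical, it counts both `lp.setup` and `lp.setupi`, and the two disassembly snippets are synthetic rather than taken from the GAP9 build.

```python
import re

# XpulpV2 hardware-loop setup mnemonics. Counting both forms is an assumption
# about what "lp.setup HW-loop instructions" covers.
HW_LOOP_SETUP = re.compile(r"\blp\.setupi?\b")


def count_hw_loop_setups(disassembly: str) -> int:
    """Count disassembly lines that contain a hardware-loop setup instruction."""
    return sum(1 for line in disassembly.splitlines() if HW_LOOP_SETUP.search(line))


# Synthetic snippets for illustration only; real input would be the text
# produced by disassembling the kernel object files.
sample_o3 = """
   0:  lp.setup  x0, a5, 1c
   4:  p.lw      a4, 4(a0!)
   8:  lp.setupi x1, 8, 24
"""
sample_os = """
   0:  lw    a4, 0(a0)
   4:  addi  a0, a0, 4
   8:  bne   a0, a5, 0
"""

print(count_hw_loop_setups(sample_o3), count_hw_loop_setups(sample_os))
```

Running the same count on the -Os and -O3 objects of one kernel gives a quick check that the per-file flag actually took effect.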
The hoisted tile-control tables (numTiles / DMA cmd / size / dims / padding /
offsets) are read-only lookup tables the cluster controller uses to drive the
tiling loop and program DMAs -- not bulk tile data. Previously the L2->L1
tiling pass emitted them with _memoryLevel=self.memory="L1", so they landed
in the GAP9 L1 TCDM next to the cluster master stack. On memory-tight nets
this both wastes scarce L1 (~11.6 KB on CCT, ~7.0 KB on MobileNetV1, ~2.5 KB
on ResNet8) and creates a correctness hazard: a deep master-stack write can
clobber a single table entry, turning a DMA cmd into a garbage code pointer
so mchan_transfer_wait() hangs forever (observed on MobileNetV1 training).

Redirect only the L2->L1 pass to emit these tables in L2. The L3->L2 pass
keeps its tables in L2 (== self.memory, unchanged). Platforms that don't
tile into a level named "L1" are unaffected. Tile *data* buffers still go
to L1 as before -- only the constant control tables move.
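
The clobbering hazard described above can be illustrated with a toy model. Everything in this sketch is hypothetical: a 64-byte `bytearray` stands in for L1, the table holds four 32-bit words, and the addresses are invented. It only shows why a table placed directly below a downward-growing stack is exposed to a deep write, and why a table held in a separate region is not.

```python
import struct

# Toy layout, not the GAP9 memory map: all sizes and addresses are made up.
L1_SIZE = 64
TABLE_BASE = 16   # table occupies bytes 16..31 (four 32-bit words)
STACK_TOP = 64    # stack grows downward from here, toward the table


def make_region(table, size=L1_SIZE, base=TABLE_BASE):
    """Create a zeroed memory region with the table stored at `base`."""
    mem = bytearray(size)
    struct.pack_into("<4I", mem, base, *table)
    return mem


def push_words(mem, depth_words, value=0xDEADBEEF):
    """Write `depth_words` 32-bit words downward from STACK_TOP, unchecked."""
    for i in range(1, depth_words + 1):
        struct.pack_into("<I", mem, STACK_TOP - 4 * i, value)


def read_table(mem, base=TABLE_BASE):
    return list(struct.unpack_from("<4I", mem, base))


table = [0x10, 0x20, 0x30, 0x40]

l1 = make_region(table)
push_words(l1, 8)            # fills bytes 32..63, stops just above the table
shallow = read_table(l1)

push_words(l1, 9)            # one word deeper overwrites the last table entry
deep = read_table(l1)

l2 = make_region(table)      # table kept in a separate region instead
print(shallow, deep, read_table(l2))
```

In the model, one extra stack word corrupts exactly one table entry while the rest stay intact, which matches the single-entry corruption the commit message describes. The copy in the separate region is never touched by the stack writes.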
coderabbitai Bot commented Jul 2, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 6818f31 and 1e91ff7.

📒 Files selected for processing (2)
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py
  • TargetLibraries/GAP9/CMakeLists.txt
👮 Files not reviewed due to content moderation or server errors (2)
  • TargetLibraries/GAP9/CMakeLists.txt
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py

📝 Walkthrough

Warning: Walkthrough skipped. File diffs could not be summarized.
