GAP9: -O3 hot kernels + hoist L2->L1 tile-control tables to L2 #199
Open
runwangdl wants to merge 2 commits into
Compile the conv / depthwise-conv / Gemm translation units at -O3, with the flag appended last so that it overrides the SDK's default -Os on the same files. Everything else stays at -Os. -O3 enables the RISC-V (XpulpV2) hardware loops on the kernels' tight inner loops: on a forward conv, the -O3 object contains 18 lp.setup HW-loop instructions versus 0 at -Os, at the cost of roughly 50% more .text on those files.
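The "appended last so it wins" behaviour relies on GCC-style drivers honouring only the last -O option on the command line. A minimal sketch of that rule, useful for sanity-checking a compile command taken from a build log; the helper name and the example flag list are hypothetical and not part of this PR:

```python
import re
from typing import List, Optional

# Matches -O, -O0..-O9, -Os, -Oz, -Og, -Ofast (optimization-level flags only).
_OPT_FLAG = re.compile(r"^-O(\d|s|z|g|fast)?$")


def effective_opt_level(args: List[str]) -> Optional[str]:
    """Return the -O flag that takes effect, i.e. the last one in args.

    Hypothetical helper: it models the "last -O option wins" rule and
    does not invoke any compiler.
    """
    effective = None
    for arg in args:
        if _OPT_FLAG.match(arg):
            effective = arg  # a later flag overrides an earlier one
    return effective


# Assumed flag order: SDK default first, per-file override appended last.
print(effective_opt_level(["-Os", "-Wall", "-O3"]))  # -O3
```

If the override were prepended instead of appended, the same check would report -Os, which is why the ordering matters.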
The hoisted tile-control tables (numTiles / DMA cmd / size / dims / padding / offsets) are read-only lookup tables that the cluster controller uses to drive the tiling loop and program DMAs -- they are not bulk tile data. Previously the L2->L1 tiling pass emitted them with _memoryLevel = self.memory, which is "L1" for that pass, so they landed in the GAP9 L1 TCDM next to the cluster master stack.

On memory-tight nets this wastes scarce L1 (~11.6 KB on CCT, ~7.0 KB on MobileNetV1, ~2.5 KB on ResNet8) and creates a correctness hazard: a deep master-stack write can clobber a single table entry, turning a DMA cmd into a garbage code pointer so that mchan_transfer_wait() hangs forever (observed on MobileNetV1 training).

Redirect only the L2->L1 pass to emit these tables in L2. The L3->L2 pass keeps its tables in L2 (== self.memory, unchanged). Platforms that don't tile into a level named "L1" are unaffected. Tile *data* buffers still go to L1 as before -- only the constant control tables move.
Fix

1. -O3 on the hot forward kernels

TargetLibraries/GAP9/CMakeLists.txt: compile Conv / DWConv / Gemm at -O3, appended last so it wins over the SDK's default -Os. This turns on the RISC-V (XpulpV2) hardware loops on the tight inner loops.

2. Hoist L2→L1 tile-control tables to L2

Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py::_hoistValues (one-line change)

The hoisted per-tile control tables (numTiles, DMA cmd, size, dim_im_*, padding_*, offsets) are read-only lookup tables the cluster controller uses to drive the tiling loop and program DMAs. Previously the L2→L1 pass emitted them with _memoryLevel = self.memory = "L1", landing them in GAP9 L1 TCDM next to the cluster master stack.

L1 freed (FP32 training): ResNet8 ~2.5 KB · MobileNetV1 ~7.0 KB · CCT ~11.6 KB.
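The placement rule described for the hoisted control tables can be sketched as below. This is an illustration under stated assumptions, not the actual _hoistValues code: the function name, its parameters, and the "L2" fallback argument are hypothetical, and only the behaviour (L1-targeting passes redirect their tables to L2, every other pass keeps self.memory) is taken from the description.

```python
def control_table_memory_level(pass_memory: str,
                               l1_name: str = "L1",
                               fallback: str = "L2") -> str:
    """Pick the memory level for hoisted tile-control tables.

    Hypothetical helper modelling the described rule:
    - a pass that tiles into "L1" places its tables in L2 instead;
    - any other pass keeps its own target level unchanged.
    Tile data buffers are not covered here; they stay in the pass's level.
    """
    if pass_memory == l1_name:
        return fallback      # L2->L1 pass: keep tables out of L1 TCDM
    return pass_memory       # e.g. L3->L2 pass: tables stay in L2


print(control_table_memory_level("L1"))  # L2
print(control_table_memory_level("L2"))  # L2
```

Keying the redirect on the level name is what leaves platforms without a level called "L1" unaffected.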