Conversation

Review updated until commit 881b1e1

Relevant files

- Enhancement
- Tests

PR Reviewer Guide

Here are some key observations to aid the review process:

- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Heuristics Need Tuning
Greptile Summary

This PR introduces a new TMA-based auto-scheduler for outer (non-fastest-dim) reductions, complementing the existing inner TMA scheduler from PR#5926. The scheduler uses a fixed 2D thread block (32×16), fixed 128×128 TMA tiles, and a simple heuristic for grid-level parallelism (clamped power-of-2 of outer_size / 256).

Key changes: a new outer_tma::getReductionHeuristics / outer_tma::scheduleReduction path gated by mayUseTmaOuter, a TmaOuterReductionParams struct, and dynamic_cast-based dispatch in ReductionScheduler::schedule (see the flowchart below).

Critical issue: Both mayUseTma and mayUseTmaOuter reference a non-existent field max_dtype_size_bit, causing a compilation failure (see the review comments below).

Minor issue: redu_unroll_factor is computed by the heuristics and stored in the params, but scheduleReduction never reads it.

Confidence Score: 2/5
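To make the grid-size heuristic concrete, here is a minimal standalone C++ sketch; the local `lastPow2` is a stand-in for `scheduler_utils::lastPow2`, whose semantics (largest power of two ≤ n) are assumed here rather than taken from the PR:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

// Stand-in for scheduler_utils::lastPow2 (assumed semantics:
// largest power of two <= n, returning 1 for n <= 1).
int64_t lastPow2(int64_t n) {
  int64_t p = 1;
  while (p * 2 <= n) {
    p *= 2;
  }
  return p;
}

int main() {
  // grdim = clamp(lastPow2(outer_size / 256), 1, 8), per the PR's heuristic.
  for (int64_t outer_size : {128, 512, 2048, 100000}) {
    const int64_t grdim = std::max<int64_t>(
        1, std::min<int64_t>(8, lastPow2(outer_size / 256)));
    std::cout << "outer_size=" << outer_size << " -> grdim=" << grdim << "\n";
  }
  return 0;
}
```

So small reductions (outer_size ≤ 511) run on a single CTA, and anything with outer_size ≥ 2048 saturates at the cap of 8 CTAs.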
Important Files Changed
Flowchart

```mermaid
flowchart TD
A[ReductionScheduler::computeHeuristics] --> B{tma_enabled?}
B -- No --> F[non_tma::getReductionHeuristics]
B -- Yes --> C{mayUseTmaOuter?}
C -- "No: inner reduction / small size / misaligned / multi-input" --> D{mayUseTma?}
C -- "Yes: outer reduction, large enough, SM>=9, single input, aligned" --> E[outer_tma::getReductionHeuristics]
D -- No --> F
D -- Yes --> G[tma::getReductionHeuristics]
E --> H[TmaOuterReductionParams]
G --> I[TmaInnerReductionParams]
F --> J[ReductionParams]
H --> K[ReductionScheduler::schedule]
I --> K
J --> K
K --> L{dynamic_cast}
L -- TmaOuterReductionParams --> M[outer_tma::scheduleReduction]
L -- TmaInnerReductionParams --> N[tma::scheduleReduction]
L -- ReductionParams --> O[non_tma::scheduleReduction]
subgraph "outer_tma::scheduleReduction"
M --> P1[Phase 1: cacheInputs - set CpAsyncBulkTensorTile]
P1 --> P2[Phase 2: TMA 2D tiling on smem TV - split R,I by tma_tile_r,tma_tile_i]
P2 --> P3[Phase 3: Propagate TMA transforms to all TVs]
P3 --> P4[Phase 4: Parallelize TMA TV - BIDy/Serial/Bulk/BIDx/Bulk]
P4 --> P5[Phase 5: Sub-split TMA tiles into TIDx,TIDy]
P5 --> P6[Phase 6: Parallelize reduction TV]
P6 --> P7[Phase 7: rFactor non-thread reduction axes for grid reduction]
P7 --> P8[Phase 8: Propagate to non-TMA TVs]
P8 --> P9[Phase 9: propagateParallelization with iter-grouped reduction]
P9 --> P10[Phase 10: inlineMost + refineCachePolicy]
end
```

Last reviewed commit: 881b1e1
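The dispatch step in the flowchart (K → L) selects a scheduler by the runtime type of the params object. Below is a minimal self-contained C++ sketch of that pattern, with stub structs standing in for the PR's actual param classes and assuming both TMA param types derive from ReductionParams:

```cpp
#include <iostream>
#include <memory>

// Stub hierarchy standing in for the PR's param classes (names taken from
// the flowchart; the real definitions live in the scheduler sources).
struct ReductionParams {
  virtual ~ReductionParams() = default;
};
struct TmaInnerReductionParams : ReductionParams {};
struct TmaOuterReductionParams : ReductionParams {};

// Sketch of the dynamic_cast dispatch shown in the flowchart: the most
// derived types are tested first, with plain ReductionParams as fallback.
void schedule(const ReductionParams* rparams) {
  if (dynamic_cast<const TmaOuterReductionParams*>(rparams) != nullptr) {
    std::cout << "outer_tma::scheduleReduction\n";
  } else if (dynamic_cast<const TmaInnerReductionParams*>(rparams) != nullptr) {
    std::cout << "tma::scheduleReduction\n";
  } else {
    std::cout << "non_tma::scheduleReduction\n";
  }
}

int main() {
  auto outer = std::make_unique<TmaOuterReductionParams>();
  auto plain = std::make_unique<ReductionParams>();
  schedule(outer.get()); // outer_tma::scheduleReduction
  schedule(plain.get()); // non_tma::scheduleReduction
  return 0;
}
```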
```cpp
  }

  // TMA requires 16-byte alignment (128 bits) for memory transactions
  if (props.vectorize_factor * props.max_dtype_size_bit % 128 != 0) {
```
max_dtype_size_bit field does not exist — compile error
reduction_scheduler_utils::FusionRuntimeProperties has no field named max_dtype_size_bit; the correct field is max_dtype_size_bit_for_vectorization. This causes a compilation failure. The intent of the check (16-byte TMA alignment) matches max_dtype_size_bit_for_vectorization.
```diff
- if (props.vectorize_factor * props.max_dtype_size_bit % 128 != 0) {
+ if (props.vectorize_factor * props.max_dtype_size_bit_for_vectorization % 128 != 0) {
```
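As a sanity check on the intent of this condition, here is a small standalone C++ sketch of the alignment arithmetic; the dtype sizes and vectorization factors below are illustrative, not taken from the PR:

```cpp
#include <cstdint>
#include <iostream>

// TMA wants each transfer to cover a multiple of 16 bytes (128 bits), so
// vectorize_factor * dtype_size_bit must divide evenly by 128.
bool tmaAligned(int64_t vectorize_factor, int64_t dtype_size_bit) {
  return vectorize_factor * dtype_size_bit % 128 == 0;
}

int main() {
  std::cout << tmaAligned(4, 32) << "\n"; // 4 x fp32 = 128 bits -> 1 (aligned)
  std::cout << tmaAligned(8, 16) << "\n"; // 8 x fp16 = 128 bits -> 1 (aligned)
  std::cout << tmaAligned(2, 32) << "\n"; // 2 x fp32 =  64 bits -> 0 (rejected)
  return 0;
}
```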
```cpp
  }

  // TMA requires 16-byte alignment (128 bits) for memory transactions
  if (props.vectorize_factor * props.max_dtype_size_bit % 128 != 0) {
```
max_dtype_size_bit field does not exist — compile error
Same issue as in mayUseTma above. reduction_scheduler_utils::FusionRuntimeProperties has no max_dtype_size_bit field. The field should be max_dtype_size_bit_for_vectorization.
```diff
- if (props.vectorize_factor * props.max_dtype_size_bit % 128 != 0) {
+ if (props.vectorize_factor * props.max_dtype_size_bit_for_vectorization % 128 != 0) {
```
```cpp
  const int64_t redu_unroll_factor = tma_tile_r / bdimy;

  // Grid dimension for parallelizing the outer reduction across CTAs.
  // Modeled after the manual test: clamp lastPow2(outer_size / 256) to [1, 8].
  const int64_t outer_size = props.total_reduction_numel;
  int64_t grdim = std::max<int64_t>(
      1, std::min<int64_t>(8, scheduler_utils::lastPow2(outer_size / 256)));

  auto params = std::make_unique<TmaOuterReductionParams>();
  params->bdimx = bdimx;
  params->bdimy = bdimy;
  params->tma_tile_i = tma_tile_i;
  params->tma_tile_r = tma_tile_r;
  params->iter_unroll_factor = iter_unroll_factor;
  params->redu_unroll_factor = redu_unroll_factor;
  params->grdim = grdim;
```
redu_unroll_factor computed but never used in scheduleReduction
redu_unroll_factor is computed in getReductionHeuristics and stored in params->redu_unroll_factor, but scheduleReduction never reads rparams->redu_unroll_factor. Instead, scheduleReduction re-derives the same split factor implicitly via redu_tv->split(2, bdimy) (giving tma_tile_r / bdimy as the outer = unroll count). In contrast, iter_unroll_factor is explicitly re-derived and used (redu_tv->split(4, iter_unroll_factor)). Consider either removing redu_unroll_factor from the params struct and the heuristics computation, or making scheduleReduction read rparams->redu_unroll_factor explicitly for consistency.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
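To illustrate why the two derivations coincide: an inner split of an axis of extent tma_tile_r by bdimy leaves an outer extent of tma_tile_r / bdimy, which is exactly the stored redu_unroll_factor. A minimal arithmetic sketch follows; the splitExtent helper is hypothetical, standing in for the extent bookkeeping of TensorView::split, and the 128/16 values mirror the fixed tile and block sizes described in the summary:

```cpp
#include <cstdint>
#include <iostream>
#include <utility>

// Hypothetical helper mirroring an inner split's extent bookkeeping:
// splitting extent e by factor f gives {outer = ceilDiv(e, f), inner = f}.
std::pair<int64_t, int64_t> splitExtent(int64_t extent, int64_t factor) {
  return {(extent + factor - 1) / factor, factor};
}

int main() {
  const int64_t tma_tile_r = 128; // fixed TMA tile size from the PR
  const int64_t bdimy = 16;       // fixed block dim from the PR

  // Heuristics side: redu_unroll_factor = tma_tile_r / bdimy.
  const int64_t redu_unroll_factor = tma_tile_r / bdimy;

  // Scheduling side: redu_tv->split(2, bdimy) implicitly re-derives the
  // same value as the outer extent of the split.
  auto [outer, inner] = splitExtent(tma_tile_r, bdimy);

  std::cout << redu_unroll_factor << " == " << outer << "\n"; // 8 == 8
  return 0;
}
```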
Add auto-scheduler for TMA outer reduction. Similar schedule to PR#5926. Fixed block dims, TMA tile sizes, and unroll factor. Always target grid reduction with a very simple heuristic.