[DLight] Add CPU Reduction schedule rule for softmax-like operators#19374
swjng wants to merge 2 commits into apache:main
Conversation
Code Review
This pull request introduces a new CPU reduction schedule rule for DLight, specifically targeting operators like softmax, layer norm, and RMS norm. The implementation focuses on parallelizing leading spatial axes, vectorizing injective blocks, and applying split-and-unroll strategies to reduction blocks to optimize LLVM backend performance, particularly for RISC-V Vector (RVV) targets. Feedback suggests improving the rule by dynamically inferring data type bit widths to support half-precision formats and enabling support for dynamic shapes by removing restrictive integer checks on loop extents.
Force-pushed ad2edb4 to 407b45a
Add a DLight schedule rule targeting CPU reduction patterns (softmax, layer norm, RMS norm) that previously had no CPU-specific schedule, causing LLVM auto-vectorization to produce suboptimal code. This addresses apache#18569, where RVV softmax is 1.34x slower than scalar due to:

- Excessive loop unrolling (2345 -> 1193 LLVM IR lines, -49%)
- Harmful fixed-width vector usage on scalable-vector targets
- No parallelization of the batch axis

The rule applies the following schedule:

1. Parallelize leading spatial axes (batch dimension)
2. Compute all blocks under the spatial loop for locality
3. Vectorize injective blocks (exp, norm) on the inner axis
4. Split the reduction inner axis into VLEN-sized chunks with an unroll annotation to guide LLVM codegen

Assembly instruction count comparison (shape=(14,185), fast_softmax):

- RV scalar baseline: 1463 instructions
- RVV unscheduled: 3282 instructions (2.2x bloat, the bug)
- RVV with schedule: 1111 instructions (-66% vs unscheduled)

Tested with softmax and fast_softmax across 7 shapes from (1,10) to (1,30522).
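The four-phase schedule described in the commit message can be sketched as a helper over TVM's `tir.Schedule` primitives. This is a minimal illustration, not the PR's actual implementation: `apply_reduction_schedule`, `vector_lanes`, and the block-list parameters are hypothetical names, and the real rule infers blocks and dtypes from the IR rather than taking them as arguments.

```python
# Hedged sketch of the four-phase schedule, assuming TVM's tir.Schedule API
# (get_loops / parallel / compute_at / split / vectorize / unroll).
# All names below are illustrative, not the PR's actual code.

def vector_lanes(vlen_bits: int, dtype_bits: int) -> int:
    """Elements per vector register, e.g. 128-bit VLEN with float32 -> 4."""
    return max(vlen_bits // dtype_bits, 1)

def apply_reduction_schedule(sch, spatial_blocks, reduction_blocks, last_block,
                             vlen_bits=128, dtype_bits=32):
    # Phase 1: parallelize the leading spatial (batch) axis of the last block.
    batch_loop = sch.get_loops(last_block)[0]
    sch.parallel(batch_loop)
    # Phase 2: compute all other blocks under that spatial loop for locality.
    for blk in spatial_blocks + reduction_blocks:
        sch.compute_at(blk, batch_loop)
    lanes = vector_lanes(vlen_bits, dtype_bits)
    # Phase 3: vectorize injective blocks (exp, norm) on the inner axis.
    for blk in spatial_blocks + [last_block]:
        _, vec = sch.split(sch.get_loops(blk)[-1], factors=[None, lanes])
        sch.vectorize(vec)
    # Phase 4: split reduction inner axes into VLEN-sized chunks and unroll
    # the chunk, guiding LLVM codegen instead of relying on auto-unrolling.
    for blk in reduction_blocks:
        _, chunk = sch.split(sch.get_loops(blk)[-1], factors=[None, lanes])
        sch.unroll(chunk)
    return sch
```

The key design point is that reductions are only split and unrolled, not vectorized, since scalar accumulation remains until the follow-up RVV-intrinsic work.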
Force-pushed 407b45a to 3b34388
Follow-up plan: Vectorized reduction via RVV intrinsics

This PR improves loop structure and vectorizes injective blocks, but reduction blocks (max, sum) still use scalar accumulation with split+unroll. If this PR is okay to merge, I would submit a follow-up PR that vectorizes the reductions themselves.
tlopex
left a comment
A few concerns/comments:

- The exception handling around `llvm_get_vector_width` is too broad. The docstring says it returns -1 on failure, but that case is not handled here, and `except Exception` may silently hide unrelated issues.
- `num_leading_s` is inferred from only the first reduction block. The current check does not verify that all relevant blocks have the same number of leading spatial axes, which could make the later scheduling invalid.
- The last block is vectorized in Phase 2 before Phase 3 applies `compute_at`. Please confirm that this transformation order is safe and does not invalidate the earlier vectorization.
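The first review point, checking the documented -1 failure value instead of catching everything, can be illustrated with a small wrapper. `get_vector_width` and the 128-bit fallback are hypothetical names for this sketch; the real helper takes the target from TVM internals.

```python
# Illustrative fix for the review comment: handle llvm_get_vector_width's
# documented -1 failure value with a return-value check rather than a broad
# `except Exception`. The fallback width and wrapper name are assumptions.
DEFAULT_VECTOR_WIDTH_BITS = 128  # assumed fallback for this sketch

def get_vector_width(target, llvm_get_vector_width) -> int:
    # The helper signals failure by returning -1, not by raising, so a
    # try/except here would only hide unrelated bugs.
    width = llvm_get_vector_width(target)
    return width if width > 0 else DEFAULT_VECTOR_WIDTH_BITS
```

A broad `except Exception` around such a call would also swallow, say, an `AttributeError` from a typo, which is exactly the "silently hide unrelated issues" concern above.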
@tlopex Thanks for the comment! All three points addressed in the latest push.
Also extracted `_vectorize_inner` and `_unroll_reduction_inner` as static methods to reduce `apply()` complexity.
…atial axes

- Infer dtype_bits from the last block's write buffer instead of hardcoding 32, so float16/bfloat16 get correct vector lane counts.
- Support dynamic extents in split+vectorize by removing isinstance(extent, int) guards where TVM primitives handle them.
- Replace broad `except Exception` around `llvm_get_vector_width` with a return-value check (returns -1 on failure, not an exception).
- Compute `num_leading_s` as the min across ALL blocks, not just the first reduction block, ensuring `compute_at` safety.
- Extract `_vectorize_inner` and `_unroll_reduction_inner` as static methods to reduce `apply()` complexity.
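The dtype-bits change in this commit can be sketched in isolation: derive the bit width from the buffer's dtype string so half-precision formats get twice the lanes of float32. `dtype_bits`, `lanes_for`, and the 128-bit default are illustrative names for this sketch, not the PR's exact helpers.

```python
# Sketch of inferring the element bit width from a dtype string instead of
# hardcoding 32, so float16/bfloat16 get correct vector lane counts.
# Function names and the VLEN default are assumptions for illustration.
import re

def dtype_bits(dtype: str) -> int:
    # "float32" -> 32, "float16" -> 16, "bfloat16" -> 16, "int8" -> 8
    m = re.search(r"(\d+)$", dtype)
    if not m:
        raise ValueError(f"cannot infer bit width from dtype {dtype!r}")
    return int(m.group(1))

def lanes_for(dtype: str, vlen_bits: int = 128) -> int:
    """Vector lanes for one register of the given dtype."""
    return max(vlen_bits // dtype_bits(dtype), 1)
```

With a hardcoded 32, a float16 workload would be scheduled with half the lanes its registers can actually hold, which is what the review fix addresses.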
Force-pushed 4f7d06e to 4598374
This PR resolves part of #18569
Description
This PR adds a DLight CPU schedule rule targeting reduction patterns
(softmax, layer norm, RMS norm) that previously had no CPU-specific
schedule.
Without this rule, LLVM auto-vectorization produces suboptimal code for
RVV targets — #18569 reports RVV softmax is 1.34x slower than scalar.
The root causes are:

- Excessive loop unrolling that bloats the generated code
- Harmful fixed-width vector usage (`<8 x float>` = 256-bit) on a 128-bit VLEN target
- No parallelization of the batch axis
Schedule strategy

1. Parallelize leading spatial axes (batch dimension)
2. Compute all blocks under the spatial loop for locality
3. Vectorize injective blocks (exp, norm) on the inner axis
4. Split the reduction inner axis into VLEN-sized chunks with an unroll annotation to guide LLVM codegen
Results (shape=(14, 185), float32, fast_softmax)

- RV scalar baseline: 1463 instructions
- RVV unscheduled: 3282 instructions
- RVV with this schedule: 1111 instructions (-66% vs unscheduled)

With fast_softmax, the polynomial exp fully vectorizes into RVV `vfsub`/`vfmul`/`vfmax` instructions with zero scalar `exp()` calls.

Limitations (follow-up work)
Reduction blocks (max, sum) still use scalar accumulation. We cannot apply
partial reduction via `rfactor`, because TVM's `rfactor` primitive
requires the reduction block to be the first child of its enclosing
loop — incompatible with `compute_at` when multiple blocks share one
spatial loop. A follow-up PR will register RVV reduction intrinsics
(`vfredmax`/`vfredusum`) and use `tensorize` to vectorize reductions.

Testing

- softmax and fast_softmax tested across 7 shapes from (1, 10) to (1, 30522)
- `test_cpu_gemv.py` unaffected (10/10 pass)
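The numerical side of the testing described above can be illustrated with a NumPy reference check: every row of a softmax output must sum to 1 regardless of shape. This is a generic correctness sketch, not the PR's actual test file, and the shapes are a subset of the seven mentioned.

```python
# Hedged sketch of the kind of numerical check used when validating a
# softmax schedule: compare row sums against 1.0 across several shapes.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the row max first for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for shape in [(1, 10), (14, 185), (1, 30522)]:
    x = np.random.randn(*shape).astype("float32")
    y = softmax(x)
    assert y.shape == shape
    assert np.allclose(y.sum(axis=-1), 1.0, atol=1e-5)
```

In a real schedule test, `y` would instead come from the compiled TVM module and be compared against this NumPy reference with `np.testing.assert_allclose`.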