[DLight] Add CPU Reduction schedule rule for softmax-like operators#19374
swjng wants to merge 2 commits into apache:main
Conversation
Code Review
This pull request introduces a new CPU reduction schedule rule for DLight, specifically targeting operators like softmax, layer norm, and RMS norm. The implementation focuses on parallelizing leading spatial axes, vectorizing injective blocks, and applying split-and-unroll strategies to reduction blocks to optimize LLVM backend performance, particularly for RISC-V Vector (RVV) targets. Feedback suggests improving the rule by dynamically inferring data type bit widths to support half-precision formats and enabling support for dynamic shapes by removing restrictive integer checks on loop extents.
Force-pushed ad2edb4 to 407b45a
Add a DLight schedule rule targeting CPU reduction patterns (softmax, layer norm, RMS norm) that previously had no CPU-specific schedule, causing LLVM auto-vectorization to produce suboptimal code. This addresses apache#18569, where RVV softmax is 1.34x slower than scalar due to:

- Excessive loop unrolling (2345 -> 1193 LLVM IR lines, -49%)
- Harmful fixed-width vector usage on scalable-vector targets
- No parallelization of the batch axis

The rule applies the following schedule:

1. Parallelize leading spatial axes (batch dimension)
2. Compute all blocks under the spatial loop for locality
3. Vectorize injective blocks (exp, norm) on the inner axis
4. Split the reduction inner axis into VLEN-sized chunks with an unroll annotation to guide LLVM codegen

Assembly instruction count comparison (shape=(14,185), fast_softmax):

- RV scalar baseline: 1463 instructions
- RVV unscheduled: 3282 instructions (2.2x bloat, the bug)
- RVV with schedule: 1111 instructions (-66% vs unscheduled)

Tested with softmax and fast_softmax across 7 shapes from (1,10) to (1,30522).
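The four-phase schedule described in the commit message can be sketched as a helper over TVM's `tir.Schedule` primitives. This is a minimal illustration, not the PR's actual implementation: `apply_reduction_schedule`, `vector_lanes`, and the block-list parameters are hypothetical names, and the real rule infers blocks and dtypes from the IR rather than taking them as arguments.

```python
# Hedged sketch of the four-phase schedule, assuming TVM's tir.Schedule API
# (get_loops / parallel / compute_at / split / vectorize / unroll).
# All names below are illustrative, not the PR's actual code.

def vector_lanes(vlen_bits: int, dtype_bits: int) -> int:
    """Elements per vector register, e.g. 128-bit VLEN with float32 -> 4."""
    return max(vlen_bits // dtype_bits, 1)

def apply_reduction_schedule(sch, spatial_blocks, reduction_blocks, last_block,
                             vlen_bits=128, dtype_bits=32):
    # Phase 1: parallelize the leading spatial (batch) axis of the last block.
    batch_loop = sch.get_loops(last_block)[0]
    sch.parallel(batch_loop)
    # Phase 2: compute all other blocks under that spatial loop for locality.
    for blk in spatial_blocks + reduction_blocks:
        sch.compute_at(blk, batch_loop)
    lanes = vector_lanes(vlen_bits, dtype_bits)
    # Phase 3: vectorize injective blocks (exp, norm) on the inner axis.
    for blk in spatial_blocks + [last_block]:
        _, vec = sch.split(sch.get_loops(blk)[-1], factors=[None, lanes])
        sch.vectorize(vec)
    # Phase 4: split reduction inner axes into VLEN-sized chunks and unroll
    # the chunk, guiding LLVM codegen instead of relying on auto-unrolling.
    for blk in reduction_blocks:
        _, chunk = sch.split(sch.get_loops(blk)[-1], factors=[None, lanes])
        sch.unroll(chunk)
    return sch
```

The key design point is that reductions are only split and unrolled, not vectorized, since scalar accumulation remains until the follow-up RVV-intrinsic work.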
Force-pushed 407b45a to 3b34388
Follow-up plan: Vectorized reduction via RVV intrinsics

This PR improves loop structure and vectorizes injective blocks, but reduction blocks (max, sum) still use scalar accumulation with split+unroll. If this PR is okay to merge, I would submit a follow-up PR that vectorizes the reductions themselves.
tlopex
left a comment
A few concerns/comments:

- The exception handling around `llvm_get_vector_width` is too broad. The docstring says it returns -1 on failure, but that case is not handled here, and `except Exception` may silently hide unrelated issues.
- `num_leading_s` is inferred from only the first reduction block. The current check does not verify that all relevant blocks have the same number of leading spatial axes, which could make the later scheduling invalid.
- The last block is vectorized in Phase 2 before Phase 3 applies `compute_at`. Please confirm that this transformation order is safe and does not invalidate the earlier vectorization.
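The first review point, checking the documented -1 failure value instead of catching everything, can be illustrated with a small wrapper. `get_vector_width` and the 128-bit fallback are hypothetical names for this sketch; the real helper takes the target from TVM internals.

```python
# Illustrative fix for the review comment: handle llvm_get_vector_width's
# documented -1 failure value with a return-value check rather than a broad
# `except Exception`. The fallback width and wrapper name are assumptions.
DEFAULT_VECTOR_WIDTH_BITS = 128  # assumed fallback for this sketch

def get_vector_width(target, llvm_get_vector_width) -> int:
    # The helper signals failure by returning -1, not by raising, so a
    # try/except here would only hide unrelated bugs.
    width = llvm_get_vector_width(target)
    return width if width > 0 else DEFAULT_VECTOR_WIDTH_BITS
```

A broad `except Exception` around such a call would also swallow, say, an `AttributeError` from a typo, which is exactly the "silently hide unrelated issues" concern above.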
@tlopex Thanks for the comment! All three points addressed in the latest push.
Also extracted `_vectorize_inner` and `_unroll_reduction_inner` as static methods to reduce `apply()` complexity.
…atial axes

- Infer dtype_bits from the last block's write buffer instead of hardcoding 32, so float16/bfloat16 get correct vector lane counts.
- Support dynamic extents in split+vectorize by removing isinstance(extent, int) guards where TVM primitives handle them.
- Replace broad `except Exception` around `llvm_get_vector_width` with a return-value check (returns -1 on failure, not an exception).
- Compute `num_leading_s` as the min across ALL blocks, not just the first reduction block, ensuring `compute_at` safety.
- Extract `_vectorize_inner` and `_unroll_reduction_inner` as static methods to reduce `apply()` complexity.
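The dtype-bits change in this commit can be sketched in isolation: derive the bit width from the buffer's dtype string so half-precision formats get twice the lanes of float32. `dtype_bits`, `lanes_for`, and the 128-bit default are illustrative names for this sketch, not the PR's exact helpers.

```python
# Sketch of inferring the element bit width from a dtype string instead of
# hardcoding 32, so float16/bfloat16 get correct vector lane counts.
# Function names and the VLEN default are assumptions for illustration.
import re

def dtype_bits(dtype: str) -> int:
    # "float32" -> 32, "float16" -> 16, "bfloat16" -> 16, "int8" -> 8
    m = re.search(r"(\d+)$", dtype)
    if not m:
        raise ValueError(f"cannot infer bit width from dtype {dtype!r}")
    return int(m.group(1))

def lanes_for(dtype: str, vlen_bits: int = 128) -> int:
    """Vector lanes for one register of the given dtype."""
    return max(vlen_bits // dtype_bits(dtype), 1)
```

With a hardcoded 32, a float16 workload would be scheduled with half the lanes its registers can actually hold, which is what the review fix addresses.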
Force-pushed 4f7d06e to 4598374
This PR resolves part of #18569
Description
This PR adds a DLight CPU schedule rule targeting reduction patterns
(softmax, layer norm, RMS norm) that previously had no CPU-specific
schedule.
Without this rule, LLVM auto-vectorization produces suboptimal code for
RVV targets — #18569 reports RVV softmax is 1.34x slower than scalar.
The root causes are:

- Excessive loop unrolling that bloats the generated code
- Harmful fixed-width vector usage (`<8 x float>` = 256-bit) on a 128-bit VLEN target
- No parallelization of the batch axis
Schedule strategy

1. Parallelize leading spatial axes (batch dimension)
2. Compute all blocks under the spatial loop for locality
3. Vectorize injective blocks (exp, norm) on the inner axis
4. Split the reduction inner axis into VLEN-sized chunks with an unroll annotation to guide LLVM codegen
Results (shape=(14, 185), float32, fast_softmax)

- RV scalar baseline: 1463 instructions
- RVV unscheduled: 3282 instructions
- RVV with this schedule: 1111 instructions (-66% vs unscheduled)

With fast_softmax, the polynomial exp fully vectorizes into RVV `vfsub`/`vfmul`/`vfmax` instructions with zero scalar `exp()` calls.

Limitations (follow-up work)
Reduction blocks (max, sum) still use scalar accumulation. We cannot apply
partial reduction via `rfactor`, because TVM's `rfactor` primitive
requires the reduction block to be the first child of its enclosing
loop — incompatible with `compute_at` when multiple blocks share one
spatial loop. A follow-up PR will register RVV reduction intrinsics
(`vfredmax`/`vfredusum`) and use `tensorize` to vectorize reductions.

Testing

- softmax and fast_softmax tested across 7 shapes from (1, 10) to (1, 30522)
- `test_cpu_gemv.py` unaffected (10/10 pass)
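The numerical side of the testing described above can be illustrated with a NumPy reference check: every row of a softmax output must sum to 1 regardless of shape. This is a generic correctness sketch, not the PR's actual test file, and the shapes are a subset of the seven mentioned.

```python
# Hedged sketch of the kind of numerical check used when validating a
# softmax schedule: compare row sums against 1.0 across several shapes.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the row max first for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for shape in [(1, 10), (14, 185), (1, 30522)]:
    x = np.random.randn(*shape).astype("float32")
    y = softmax(x)
    assert y.shape == shape
    assert np.allclose(y.sum(axis=-1), 1.0, atol=1e-5)
```

In a real schedule test, `y` would instead come from the compiled TVM module and be compared against this NumPy reference with `np.testing.assert_allclose`.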