[DLight] Add CPU Reduction schedule rule for softmax-like operators#19374

Open
swjng wants to merge 2 commits into apache:main from swjng:feat/add-cpu-dlight-reduction-rule

Conversation

@swjng
Contributor

@swjng swjng commented Apr 9, 2026

This PR resolves part of #18569

Description

This PR adds a DLight CPU schedule rule targeting reduction patterns
(softmax, layer norm, RMS norm) that previously had no CPU-specific
schedule.

Without this rule, LLVM auto-vectorization produces suboptimal code for
RVV targets — #18569 reports RVV softmax is 1.34x slower than scalar.
The root cause is:

  • LLVM fully unrolls the 185-element reduction loop
  • Generates harmful fixed-width vectors (<8 x float> = 256-bit) on a
    128-bit VLEN target
  • No parallelization of the batch axis

Schedule strategy

  1. Parallelize leading spatial axes (batch dimension)
  2. Compute-at all blocks under the spatial loop for data locality
  3. Vectorize injective blocks (exp, delta, norm) on inner axis
  4. Split + unroll reduction inner axis to VLEN-sized chunks
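The arithmetic behind step 4 can be sketched in plain Python (this is illustrative only, not the DLight rule itself; the helper names here are assumptions):

```python
# Sketch: computing the VLEN-sized split factor used in step 4.

def split_factor(vector_bits: int, dtype_bits: int) -> int:
    """Number of lanes that fit in one vector register."""
    return max(vector_bits // dtype_bits, 1)

def chunked(extent: int, factor: int):
    """Loop trip counts after split: (outer, inner), outer rounded up."""
    outer = -(-extent // factor)  # ceiling division
    return outer, factor

lanes = split_factor(128, 32)       # 128-bit VLEN, float32 -> 4 lanes
outer, inner = chunked(185, lanes)  # reduction extent 185 -> 47 x 4
```

With the inner loop fixed at the lane count, LLVM no longer needs to guess a fixed vector width for the whole 185-element reduction.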

Results (shape=(14, 185), float32, fast_softmax)

Config                      ASM Instructions   Vector Insns   LLVM IR Lines
RV scalar (baseline)        1,463              0              1,792
RVV unscheduled (the bug)   3,282              960            2,345
RVV + this schedule         1,111              105            1,338
  • 66% fewer instructions vs unscheduled RVV
  • 24% fewer instructions vs scalar baseline
  • fast_softmax polynomial exp fully vectorizes into RVV vfsub/vfmul/vfmax instructions with zero scalar exp() calls

Limitations (follow-up work)

  • Reduction blocks (max, sum) use split+unroll rather than vectorized
    partial reduction via rfactor, because TVM's rfactor primitive
    requires the reduction block to be the first child of its enclosing
    loop — incompatible with compute_at when multiple blocks share one
    spatial loop. A follow-up PR will register RVV reduction intrinsics
    (vfredmax/vfredusum) and use tensorize to vectorize reductions.
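The difference between the two reduction strategies can be illustrated with plain Python lists rather than TIR (a sketch, not TVM code):

```python
# Illustrative only: scalar accumulation (what split+unroll produces)
# vs. rfactor-style partial reduction (what a vector unit could do).

def scalar_reduce_max(xs):
    """split+unroll: one accumulator, a sequential dependency chain."""
    acc = float("-inf")
    for x in xs:
        acc = max(acc, x)
    return acc

def rfactor_style_max(xs, lanes=4):
    """rfactor-style: `lanes` independent partial accumulators that a
    vector unit can update in parallel, followed by one cross-lane
    merge (the role vfredmax.vs plays in hardware)."""
    partial = [float("-inf")] * lanes
    for i, x in enumerate(xs):
        partial[i % lanes] = max(partial[i % lanes], x)
    return max(partial)

data = [3.0, -1.0, 7.5, 2.0, 7.4]
assert scalar_reduce_max(data) == rfactor_style_max(data) == 7.5
```

The rfactor form breaks the sequential dependency, which is what the follow-up tensorize PR would recover without needing the reduction block to be the first child of its loop.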

Testing

pytest tests/python/s_tir/dlight/test_cpu_reduction.py -v
  • 14 shape x operator applicability tests
  • 2 TIR structure verification tests
  • 4 RVV codegen quality tests (code size, exp vectorization, instruction count)
  • Existing test_cpu_gemv.py unaffected (10/10 pass)

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new CPU reduction schedule rule for DLight, specifically targeting operators like softmax, layer norm, and RMS norm. The implementation focuses on parallelizing leading spatial axes, vectorizing injective blocks, and applying split-and-unroll strategies to reduction blocks to optimize LLVM backend performance, particularly for RISC-V Vector (RVV) targets. Feedback suggests improving the rule by dynamically inferring data type bit widths to support half-precision formats and enabling support for dynamic shapes by removing restrictive integer checks on loop extents.

@swjng swjng force-pushed the feat/add-cpu-dlight-reduction-rule branch from ad2edb4 to 407b45a on April 9, 2026 07:31
Add a DLight schedule rule targeting CPU reduction patterns
(softmax, layer norm, RMS norm) that previously had no CPU-specific
schedule, causing LLVM auto-vectorization to produce suboptimal code.

This addresses apache#18569 where RVV softmax is 1.34x slower
than scalar due to:
- Excessive loop unrolling (2345 -> 1193 LLVM IR lines, -49%)
- Harmful fixed-width vector usage on scalable-vector targets
- No parallelization of the batch axis

The rule applies the following schedule:
1. Parallelize leading spatial axes (batch dimension)
2. Compute all blocks under the spatial loop for locality
3. Vectorize injective blocks (exp, norm) on inner axis
4. Split reduction inner axis to VLEN-sized chunks with unroll
   annotation to guide LLVM codegen

Assembly instruction count comparison (shape=(14,185), fast_softmax):
- RV scalar baseline:  1463 instructions
- RVV unscheduled:     3282 instructions (2.2x bloat, the bug)
- RVV with schedule:   1111 instructions (-66% vs unscheduled)

Tested with softmax and fast_softmax across 7 shapes from (1,10)
to (1,30522).
@swjng swjng force-pushed the feat/add-cpu-dlight-reduction-rule branch from 407b45a to 3b34388 on April 9, 2026 07:31
@swjng
Contributor Author

swjng commented Apr 9, 2026

Follow-up plan: Vectorized reduction via RVV intrinsics

This PR improves loop structure and vectorizes injective blocks, but reduction blocks (max, sum) still use scalar accumulation with split+unroll.

If this PR is okay to be merged, I would submit a follow-up PR that vectorizes the reductions themselves:

  1. Register RVV reduction intrinsics in s_tir/tensor_intrin/riscv_cpu.py:

    • vfredmax.vs (vector float max reduction)
    • vfredusum.vs (vector float unordered sum reduction)
  2. Use tensorize in the Reduction schedule rule to map reduction blocks to these intrinsics,
    bypassing the rfactor first-child constraint discovered during this PR's development.

Member

@tlopex tlopex left a comment


A few concerns:

  1. The exception handling around llvm_get_vector_width is too broad. The docstring says it returns -1 on failure, but that case is not handled here, and except Exception may silently hide unrelated issues.

  2. num_leading_s is inferred from only the first reduction block. The current check does not verify that all relevant blocks have the same number of leading spatial axes, which could make the later scheduling invalid.

  3. The last block is vectorized in Phase 2 before Phase 3 applies compute_at. Please confirm that this transformation order is safe and does not invalidate the earlier vectorization.

@swjng swjng marked this pull request as draft April 10, 2026 04:39
@swjng
Contributor Author

swjng commented Apr 10, 2026

@tlopex Thanks for the comment! All three points addressed in the latest push.

  1. Exception handling: Replaced except Exception with a return-value check (<= 0). Verified that llvm_get_vector_width returns -1 for non-LLVM targets and never throws for valid LLVM targets.

  2. Leading spatial axes: Now computed as min(leading_S) across ALL blocks rather than reading from just the first reduction block. In softmax-like patterns, reduction blocks have dom_kind=SR (1 leading S) while injective blocks have dom_kind=SS (2 leading S). Taking the minimum ensures the fuse/parallel scope is valid for every block that will be compute_at'd into it.

  3. Phase 2 -> 3 order: Confirmed safe. Phase 2 vectorizes only the inner loop of the last block (splitting loop[-1] into [outer, vec]), while Phase 3 compute_at moves other blocks under spatial (the outermost loop of the last block). These operate on disjoint parts of the loop nest. Verified with bit-exact correctness checks across 6 shape × 2 operator combinations.

Also extracted _vectorize_inner and _unroll_reduction_inner as static methods to reduce apply() complexity.
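Points 1 and 2 can be sketched in self-contained Python. The vector-width query is stubbed here (the real rule calls TVM's llvm_get_vector_width), and the fallback constant is an assumption for illustration, not a value from the PR:

```python
DEFAULT_VECTOR_BITS = 128  # assumed fallback, not taken from the PR

def llvm_get_vector_width_stub(target_kind: str) -> int:
    # Stub mimicking the documented contract: positive bit width for
    # LLVM targets, -1 on failure -- never an exception.
    return 256 if target_kind == "llvm" else -1

def vector_width(target_kind: str) -> int:
    # Point 1: a return-value check instead of `except Exception`.
    width = llvm_get_vector_width_stub(target_kind)
    return width if width > 0 else DEFAULT_VECTOR_BITS

def leading_s(dom_kind: str) -> int:
    # Number of leading spatial ("S") axes in a block's dom_kind string.
    n = 0
    for ch in dom_kind:
        if ch != "S":
            break
        n += 1
    return n

def num_leading_spatial(dom_kinds) -> int:
    # Point 2: the safe fuse/parallel depth is the minimum over ALL
    # blocks, e.g. min over ["SR", "SS"] in a softmax-like pattern is 1.
    return min(leading_s(k) for k in dom_kinds)
```

Taking the minimum guarantees the parallelized scope exists in every block that will later be moved under it with compute_at.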

@swjng swjng marked this pull request as ready for review April 10, 2026 06:51
…atial axes

- Infer dtype_bits from the last block's write buffer instead of
  hardcoding 32, so float16/bfloat16 get correct vector lane counts.
- Support dynamic extents in split+vectorize by removing
  isinstance(extent, int) guards where TVM primitives handle them.
- Replace broad except Exception around llvm_get_vector_width with
  return-value check (returns -1 on failure, not an exception).
- Compute num_leading_s as min across ALL blocks, not just the first
  reduction block, ensuring compute_at safety.
- Extract _vectorize_inner and _unroll_reduction_inner as static
  methods to reduce apply() complexity.
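The dtype-bits inference mentioned in the first bullet can be sketched as follows (illustrative only; the actual rule reads the width from the block's write buffer dtype):

```python
import re

def dtype_bits(dtype: str) -> int:
    """Parse the trailing bit count out of a TIR-style dtype string,
    e.g. "float32" -> 32, "bfloat16" -> 16."""
    m = re.search(r"(\d+)$", dtype)
    if m is None:
        raise ValueError(f"cannot infer bit width from {dtype!r}")
    return int(m.group(1))

# On a 128-bit VLEN target, float16 now gets 8 lanes instead of the
# 4 that a hardcoded 32-bit assumption would produce.
assert 128 // dtype_bits("float16") == 8
```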
@swjng swjng force-pushed the feat/add-cpu-dlight-reduction-rule branch from 4f7d06e to 4598374 on April 10, 2026 06:52
@swjng swjng requested a review from tlopex April 10, 2026 08:38