
Documentation for linalg.softmax lowering in lighthouse + IREE attention lowering walkthrough #2

Open
charithaintc wants to merge 42 commits into main from softmax_doc

Conversation

@charithaintc
Owner

No description provided.

@charithaintc charithaintc changed the title Documentation for linalg.softmax lowering in lighthouse. Documentation for linalg.softmax lowering in lighthouse. Apr 3, 2026
@charithaintc charithaintc changed the title Documentation for linalg.softmax lowering in lighthouse. Documentation for linalg.softmax lowering in lighthouse + IREE attention lowering walkthrough Apr 8, 2026
- **Vectorization** via IREE's vector distribution pipeline
- **Mapping to MMA intrinsics** (e.g., MFMA on MI300X) for the two matmuls (Steps 1 and 5)
- **Register-level tiling** and shared memory promotion for GPU targets
- The `scf.for` loop around these ops implements the streaming/online iteration over K/V chunks

How does IREE fuse these scf.for loops?

| 3a | `P = exp(S - new_max)` | Elementwise | `[16, 16]` |
| 3b | `alpha = exp(old_max - new_max)` | Elementwise | `[16]` |
| 4 | `new_sum = alpha * old_sum + Σ P` | Scale + row reduction | `[16]` |
| 5 | `new_acc = alpha * old_acc + P @ V` | Scale + matmul | `[16, 64]` ← `[16, 16] × [16, 64]` |
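The per-chunk update steps in the table can be sketched in NumPy. This is an illustrative reimplementation of the online-softmax recurrence, not IREE's generated code; the chunk size and shapes follow the `[16, 16]` / `[16, 64]` example, and the final normalization by `row_sum` is assumed to happen after the loop:

```python
import numpy as np

def online_attention(Q, K, V, chunk=16):
    """Streaming (online-softmax) attention over K/V chunks."""
    n = Q.shape[0]
    acc = np.zeros((n, V.shape[1]))   # old_acc, [16, 64]
    row_max = np.full(n, -np.inf)     # old_max, [16]
    row_sum = np.zeros(n)             # old_sum, [16]
    for j in range(0, K.shape[0], chunk):
        Kj, Vj = K[j:j + chunk], V[j:j + chunk]
        S = Q @ Kj.T                                   # step 1:  [16, 16]
        new_max = np.maximum(row_max, S.max(axis=1))   # step 2:  running row max
        P = np.exp(S - new_max[:, None])               # step 3a: [16, 16]
        alpha = np.exp(row_max - new_max)              # step 3b: correction factor
        row_sum = alpha * row_sum + P.sum(axis=1)      # step 4:  [16]
        acc = alpha[:, None] * acc + P @ Vj            # step 5:  [16, 64]
        row_max = new_max
    return acc / row_sum[:, None]                      # final normalization

# Check against a plain (non-streaming) softmax(Q K^T) V reference.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(16, 64)), rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
S = Q @ K.T
E = np.exp(S - S.max(axis=1, keepdims=True))
ref = (E / E.sum(axis=1, keepdims=True)) @ V
assert np.allclose(online_attention(Q, K, V), ref)
```

The `for` loop here plays the role of the `scf.for` over K/V chunks; each iteration touches only one `[16, 64]` slice of K and V, which is what makes the two matmuls in steps 1 and 5 amenable to per-chunk MMA mapping.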

[MLIR] Fusible Softmax with Following Matrix Multiplication · Issue #1617 · intel-innersource/frame…
describes a high-level idea that decomposes softmax into steps 2/3a/3b/4/5' (with V replaced by the identity I, so using P@I instead of P@V), which allows P@V to be fused. Since the last step has the same loop structure, the second GEMM loop could then be fused into the softmax. But it is not clear how linalg tile/fusion can be enhanced to support this fusion.
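The decomposition suggested in the comment, with V replaced by the identity so that step 5 degenerates to P @ I = P, can be written out in NumPy (illustrative only; function name is an assumption, not code from the issue):

```python
import numpy as np

def softmax_via_attention_steps(X):
    """Softmax expressed with the attention step structure, taking V = I."""
    m = X.max(axis=1, keepdims=True)   # step 2:  row max
    P = np.exp(X - m)                  # step 3a: exponentials
    s = P.sum(axis=1, keepdims=True)   # step 4:  row sum
    acc = P @ np.eye(X.shape[1])       # step 5 with V = I: P @ I == P
    return acc / s

X = np.arange(6.0).reshape(2, 3)
ref = np.exp(X - X.max(axis=1, keepdims=True))
ref /= ref.sum(axis=1, keepdims=True)
assert np.allclose(softmax_via_attention_steps(X), ref)
```

Because step 5 keeps the same loop structure whether V is the identity or a real value matrix, swapping I back for V is what would let a following GEMM fuse into the softmax loop nest.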

**Notes**
- Sets the layout for anchor xegpu ops. Each Wg consists of [8, 1] subgroups
doing an 8x64 softmax slice.
- Only sets the layotu for `store_nd`. Layout propagation does the rest.

typo: layotu
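The layout note above implies a per-subgroup tile; a quick sketch of the arithmetic (the variable names are illustrative, not xegpu attributes):

```python
# Deriving the per-subgroup tile from the workgroup-level layout in the notes.
wg_tile = (8, 64)    # workgroup-level softmax slice
sg_layout = (8, 1)   # [8, 1] subgroups per workgroup
sg_tile = tuple(t // n for t, n in zip(wg_tile, sg_layout))
assert sg_tile == (1, 64)  # each subgroup handles one 64-wide row
```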
