Add stream-dse fused SwiGLU-prefill operator #122

Open
asyms wants to merge 1 commit into amd:devel from KULeuven-MICAS:stream-dse-fused-swiglu
asyms commented Jun 18, 2026

Adds SwiGLUPrefillStream, a fused SwiGLU-prefill operator whose single MLIR design (gate/up GEMMs + SiLU + elementwise-mul + down GEMM) is generated by stream-dse and compiled to one xclbin, instead of chaining separately-compiled sub-operators.
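
For readers unfamiliar with the block being fused, the following is a minimal pure-Python reference of the computation described above. It assumes the conventional SwiGLU formulation, in which SiLU is applied to the gate projection before the elementwise multiply; the function and parameter names (`swiglu_reference`, `w_gate`, `w_up`, `w_down`) are illustrative and are not taken from the PR.

```python
import math


def matmul(a, b):
    """Naive dense matrix multiply on nested lists (rows of a, columns of b)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_reference(x, w_gate, w_up, w_down):
    """Unfused reference: gate/up GEMMs, SiLU, elementwise mul, down GEMM."""
    gate = matmul(x, w_gate)   # gate GEMM
    up = matmul(x, w_up)       # up GEMM
    hidden = [
        [silu(g) * u for g, u in zip(gate_row, up_row)]
        for gate_row, up_row in zip(gate, up)
    ]                          # SiLU on the gate branch, then elementwise mul
    return matmul(hidden, w_down)  # down GEMM
```

An unfused implementation runs each of these stages as a separate operator; the fused design covers all of them in one xclbin.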

Its per-kernel operand layouts (the tiled-strided DMA tiling) are authored on the IRON side and injected into stream-dse code generation via optimize_allocation_co(kernels=...) — so IRON owns the layouts while stream keeps kernel construction and the MLIR rewrite, instead of the layouts being hand-copied on both sides.
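
The idea of keeping a generator's kernel object intact while replacing only its `operand_layouts()` method can be sketched as follows. This is a generic Python illustration under stated assumptions: `StandInKernel` and `with_operand_layouts` are hypothetical names, and nothing here reproduces stream-dse's kernel classes or the signature of `optimize_allocation_co`.

```python
import copy


class StandInKernel:
    """Hypothetical stand-in for a kernel object owned by the code generator."""

    def __init__(self, name):
        self.name = name

    def operand_layouts(self):
        # Layouts the generator would use if nothing is injected.
        return {"A": "generator-default", "B": "generator-default"}

    def emit(self):
        # Everything except operand_layouts() stays under the generator's control.
        return f"{self.name} using {self.operand_layouts()}"


def with_operand_layouts(kernel, layouts):
    """Return a copy of `kernel` whose operand_layouts() returns `layouts`."""
    base = type(kernel)
    frozen = dict(layouts)

    def operand_layouts(self):
        return dict(frozen)

    # Subclass on the fly so every other method is inherited unchanged.
    patched_cls = type(
        f"{base.__name__}WithLayouts", (base,), {"operand_layouts": operand_layouts}
    )
    patched = copy.copy(kernel)
    patched.__class__ = patched_cls
    return patched
```

The original kernel is left untouched, so the layouts live in one place (the caller) and the generator's construction logic is reused rather than copied.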

Added

  • SwiGLUPrefillStream (iron/operators/swiglu_prefill_stream/): fused stream-dse design → one xclbin; MLIR generated at build time by stream_design.py.
  • iron.common.layout: a TiledStridedLayout type (with to_snaxc()) for handing IRON-authored operand layouts to stream-dse.
  • stream_kernels.py: injects IRON's operand layouts into codegen through the kernels= override, replacing only operand_layouts() on stream's own kernels (requires stream-dse ≥ 1.13.4).
  • requirements_stream.txt (optional dependency stream-dse>=1.13.4); the operator's test skips when stream-dse is absent.
  • Minimal demo under demos/swiglu_prefill_stream/.
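
To make "tiled-strided layout" concrete, here is a small sketch of how such a layout can map a logical index to a memory offset. It is not the `TiledStridedLayout` class added in this PR and does not model `to_snaxc()`; the class names, fields, and the 4×4 example are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stride:
    """One tiling level: `bound` positions, each `step` elements apart."""
    step: int
    bound: int


@dataclass(frozen=True)
class TiledDim:
    """One logical dimension, split into tiling levels (outermost first)."""
    levels: tuple

    @property
    def size(self):
        total = 1
        for level in self.levels:
            total *= level.bound
        return total

    def offset(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        result = 0
        # Peel off the innermost level first (mixed-radix decomposition).
        for level in reversed(self.levels):
            index, digit = divmod(index, level.bound)
            result += digit * level.step
        return result


@dataclass(frozen=True)
class TiledStridedLayoutSketch:
    """Hypothetical layout: the offset is the sum of per-dimension offsets."""
    dims: tuple

    def offset(self, *indices):
        if len(indices) != len(self.dims):
            raise ValueError("expected one index per dimension")
        return sum(dim.offset(i) for dim, i in zip(self.dims, indices))


# A 4x4 matrix stored as row-major 2x2 tiles, each tile row-major internally.
tiled_4x4 = TiledStridedLayoutSketch(
    dims=(
        TiledDim((Stride(step=8, bound=2), Stride(step=2, bound=2))),  # rows
        TiledDim((Stride(step=4, bound=2), Stride(step=1, bound=2))),  # columns
    )
)
```

With this layout, the four elements of each 2×2 tile are contiguous in memory, which is the kind of access pattern a tiled DMA transfer describes.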

Changed

  • Importing iron.operators no longer requires an NPU runtime: lazy XRT/pyxrt import and PEP 562 lazy operator exports, so the package loads (and tests collect) on hosts without XRT/pyxrt.
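
The PEP 562 mechanism mentioned above can be sketched as follows. In a real package the `__getattr__` function would sit at the top level of `__init__.py`; this sketch builds a throwaway module object instead so it runs standalone. The export table and the `demo_ops` name are illustrative and do not reflect the contents of `iron.operators`.

```python
import importlib
import types


def make_lazy_module(name, exports):
    """Build a module whose attributes are imported on first access (PEP 562).

    `exports` maps an exported name to a (module_name, attribute_name) pair.
    """
    module = types.ModuleType(name)

    def __getattr__(attr):
        # Only called when normal attribute lookup on the module fails.
        try:
            target_module, target_attr = exports[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(target_module), target_attr)
        setattr(module, attr, value)  # cache so later lookups skip this hook
        return value

    module.__getattr__ = __getattr__
    module.__all__ = sorted(exports)
    return module


# Stand-in exports; a real package would list its own operator classes here.
demo_ops = make_lazy_module(
    "demo_ops", {"dumps": ("json", "dumps"), "sqrt": ("math", "sqrt")}
)
```

Because the heavy import happens only when an attribute is first touched, importing the package itself (and collecting tests) does not need the runtime that individual operators depend on.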

Removed

  • None.

Running the demo

Prerequisites: the XDNA driver + XRT installed (/opt/xilinx/xrt) and an npu2 device. From a fresh clone of this branch:

python3 -m venv .venv && source .venv/bin/activate
source /opt/xilinx/xrt/setup.sh            # provides pyxrt
pip install --upgrade pip
pip install -r requirements.txt            # IRON + mlir_aie/llvm-aie toolchain + torch
pip install -r requirements_stream.txt     # stream-dse>=1.13.4 (PyPI)
stream-setup-aie                           # required: installs snaxc / xdsl-aie / aie-python-extras
python demos/swiglu_prefill_stream/demo.py

This generates the fused design with stream-dse, compiles it to an xclbin, and runs it once on the NPU (≈2 ms for the 256×512×2048 shape). stream-setup-aie is required: it installs the AIE codegen packages stream-dse needs that cannot be plain PyPI dependencies.

Licensing note

The new IRON-side files — iron/common/layout.py, iron/operators/swiglu_prefill_stream/stream_kernels.py, and demos/swiglu_prefill_stream/demo.py — carry a KU Leuven (MICAS) copyright header (Apache-2.0), as they were authored by MICAS; all other touched files keep their existing AMD headers. We can discuss this further.

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.

andrej (Collaborator) commented Jun 22, 2026

Hi Arne, sorry for the CI failures. If you rebase on #125 once it's merged, these should hopefully pass.

SwiGLUPrefillStream compiles the whole SwiGLU-prefill block (gate/up GEMMs +
SiLU + elementwise-mul + down GEMM) as a single fused MLIR design generated by
stream-dse, producing one xclbin instead of chaining separately-compiled
sub-operators. The design is generated at build time by stream_design.py and
compiled through IRON's normal flow.

The fused design's per-kernel operand layouts (the tiled-strided DMA tiling) are
authored on the IRON side and fed into stream-dse code generation rather than
hand-copied inside stream: iron.common.layout provides a TiledStridedLayout type,
and swiglu_prefill_stream/stream_kernels.py injects IRON's layouts through
optimize_allocation_co(kernels=...) -- the override hook added in stream-dse
1.13.4 -- keeping stream's kernel construction and replacing only
operand_layouts().

stream-dse is an optional dependency (requirements_stream.txt); the operator's
test skips when it is absent. Importing iron.operators no longer requires an NPU
runtime (lazy XRT import), so the package loads on hosts without XRT/pyxrt.
Includes a minimal k=1 demo under demos/swiglu_prefill_stream/.
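
One way to skip a test when an optional dependency is missing, using only the standard library, is sketched below. The import name `stream` for the stream-dse package is an assumption, as are the test class and method names; the PR's actual test may be written differently (for example with pytest's `importorskip`).

```python
import importlib
import importlib.util
import unittest


def has_module(name):
    """Return True if `name` can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# Assumed import name for the optional stream-dse dependency.
STREAM_MODULE = "stream"


@unittest.skipUnless(has_module(STREAM_MODULE), "stream-dse is not installed")
class FusedSwiGLUSmokeTest(unittest.TestCase):
    def test_dependency_imports(self):
        # Runs only when the optional dependency is available.
        module = importlib.import_module(STREAM_MODULE)
        self.assertIsNotNone(module)
```

Checking with `find_spec` avoids importing the package at collection time, so hosts without the dependency report the test as skipped rather than failing.
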
asyms force-pushed the stream-dse-fused-swiglu branch from 7f54de3 to 5103ce5 on June 22, 2026 19:55