Add stream-dse fused SwiGLU-prefill operator #122
Open
asyms wants to merge 1 commit into
Collaborator
Hi Arne, sorry for the CI failures. If you rebase on #125 once it is merged, these should hopefully pass.
SwiGLUPrefillStream compiles the whole SwiGLU-prefill block (gate/up GEMMs + SiLU + elementwise-mul + down GEMM) as a single fused MLIR design generated by stream-dse, producing one xclbin instead of chaining separately compiled sub-operators. The design is generated at build time by stream_design.py and compiled through IRON's normal flow.

The fused design's per-kernel operand layouts (the tiled-strided DMA tiling) are authored on the IRON side and fed into stream-dse code generation rather than hand-copied inside stream: iron.common.layout provides a TiledStridedLayout type, and swiglu_prefill_stream/stream_kernels.py injects IRON's layouts through optimize_allocation_co(kernels=...), the override hook added in stream-dse 1.13.4. This keeps stream's kernel construction and replaces only operand_layouts().

stream-dse is an optional dependency (requirements_stream.txt); the operator's test skips when it is absent. Importing iron.operators no longer requires an NPU runtime (lazy XRT import), so the package loads on hosts without XRT/pyxrt. Includes a minimal k=1 demo under demos/swiglu_prefill_stream/.
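To make "tiled-strided layout" concrete, here is a minimal sketch of how such a layout maps a matrix element to a linear memory offset. The class `TiledStride2D` and its fields are hypothetical and exist only for illustration; this is not IRON's `TiledStridedLayout`, and the actual tile shapes and stride order used by the operator are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TiledStride2D:
    """Hypothetical 2-D tiled-strided layout (illustration only, not IRON's type)."""

    rows: int
    cols: int
    tile_rows: int
    tile_cols: int

    def __post_init__(self):
        if self.rows % self.tile_rows or self.cols % self.tile_cols:
            raise ValueError("matrix shape must be a multiple of the tile shape")

    def strides(self):
        """Element strides for (tile row, tile column, row in tile, column in tile)."""
        tile_elems = self.tile_rows * self.tile_cols
        tiles_per_row = self.cols // self.tile_cols
        return (tiles_per_row * tile_elems, tile_elems, self.tile_cols, 1)

    def offset(self, r, c):
        """Linear offset of element (r, c) when tiles are stored contiguously."""
        tr, ir = divmod(r, self.tile_rows)
        tc, ic = divmod(c, self.tile_cols)
        s = self.strides()
        return tr * s[0] + tc * s[1] + ir * s[2] + ic * s[3]


# A 4x8 matrix split into 2x4 tiles; each tile occupies 8 consecutive elements.
layout = TiledStride2D(rows=4, cols=8, tile_rows=2, tile_cols=4)
offsets = [[layout.offset(r, c) for c in range(layout.cols)]
           for r in range(layout.rows)]
for row in offsets:
    print(row)
```

In this sketch every element receives a distinct offset, so the layout is a permutation of the row-major buffer. A DMA descriptor can express the same mapping as a short list of (stride, size) pairs, which is the kind of information the operand layouts carry.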
Force-pushed from 7f54de3 to 5103ce5.
Adds SwiGLUPrefillStream, a fused SwiGLU-prefill operator whose single MLIR design (gate/up GEMMs + SiLU + elementwise-mul + down GEMM) is generated by stream-dse and compiled to one xclbin, instead of chaining separately compiled sub-operators.

Its per-kernel operand layouts (the tiled-strided DMA tiling) are authored on the IRON side and injected into stream-dse code generation via optimize_allocation_co(kernels=...), so IRON owns the layouts while stream keeps kernel construction and the MLIR rewrite, instead of the layouts being hand-copied on both sides.

Added
- SwiGLUPrefillStream (iron/operators/swiglu_prefill_stream/): fused stream-dse design → one xclbin; MLIR generated at build time by stream_design.py.
- iron.common.layout: a TiledStridedLayout type (with to_snaxc()) for handing IRON-authored operand layouts to stream-dse.
- stream_kernels.py: injects IRON's operand layouts into codegen through the kernels= override, replacing only operand_layouts() on stream's own kernels (requires stream-dse ≥ 1.13.4).
- requirements_stream.txt (optional dependency stream-dse>=1.13.4); the operator's test skips when stream-dse is absent.
- demos/swiglu_prefill_stream/.

Changed
- iron.operators no longer requires an NPU runtime: lazy XRT/pyxrt import and PEP 562 lazy operator exports, so the package loads (and tests collect) on hosts without XRT/pyxrt.

Removed
Running the demo
Prerequisites: the XDNA driver and XRT installed (/opt/xilinx/xrt) and an npu2 device. From a fresh clone of this branch:

This generates the fused design with stream-dse, compiles it to an xclbin, and runs it once on the NPU (≈2 ms for the 256×512×2048 shape).
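For orientation, the computation that the fused design performs can be written as a small host-side reference. The sketch below is dependency-free and is not the demo's code. The function names, the tiny shapes, and the convention that SiLU is applied to the gate branch (the usual SwiGLU definition) are assumptions.

```python
import math


def matmul(a, b):
    """Naive row-major product of two matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_reference(x, w_gate, w_up, w_down):
    """out = (SiLU(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = matmul(x, w_gate)  # gate GEMM
    up = matmul(x, w_up)      # up GEMM
    # SiLU on the gate branch, then elementwise multiply with the up branch.
    hidden = [[silu(g) * u for g, u in zip(g_row, u_row)]
              for g_row, u_row in zip(gate, up)]
    return matmul(hidden, w_down)  # down GEMM


# Tiny illustrative shapes: 1 token, embedding size 2, hidden size 3.
x = [[1.0, 2.0]]
w_gate = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_up = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
w_down = [[1.0], [1.0], [1.0]]
out = swiglu_reference(x, w_gate, w_up, w_down)
print(out)
```

A reference of this kind is useful mainly for comparing against the NPU output within a tolerance, since a fused low-precision pipeline will generally not match a float reference bit for bit.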
stream-setup-aie is required: it installs the AIE codegen packages stream-dse needs that cannot be plain PyPI dependencies.

Licensing note
The new IRON-side files (iron/common/layout.py, iron/operators/swiglu_prefill_stream/stream_kernels.py, and demos/swiglu_prefill_stream/demo.py) carry a KU Leuven (MICAS) copyright header (Apache-2.0), as they were authored by MICAS; all other touched files keep their existing AMD headers. We can discuss this further.

PR Merge Checklist
devel commit and pointing to devel.