transpose: add num_batches to batch independent transposes into one dispatch by atassis · Pull Request #124 · amd/IRON

atassis · 2026-06-18T15:58:02Z

GEMV and StridedCopy already take num_batches to batch B independent same-shape operations into a single dispatch; Transpose did not, forcing callers to unroll B per-head/per-batch transposes into B separate dispatches for identical kernel work (a common multi-head-attention pattern).

Added

num_batches on Transpose (default 1). num_batches>1 lays B contiguous (M,N) matrices back-to-back and streams them through the same ObjectFifos (one task group per batch); the core still only sees s×s sub-tiles, so the kernel is unchanged.
num_batches>1 test coverage (the batched path was previously untested), with a batched golden reference.
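
As an illustration of what a batched golden reference could look like, here is a minimal NumPy sketch. It assumes the input is a flat buffer holding `num_batches` contiguous row-major (M, N) matrices, as described above. The function name `batched_transpose_reference` is hypothetical and is not taken from `iron/operators/transpose/reference.py`.

```python
import numpy as np


def batched_transpose_reference(flat_in, M, N, num_batches=1):
    """Transpose num_batches contiguous row-major (M, N) matrices.

    Hypothetical helper for illustration; not the PR's reference code.
    """
    # View the flat buffer as a stack of B matrices of shape (M, N).
    stacked = np.asarray(flat_in).reshape(num_batches, M, N)
    # Swap the last two axes so each batch is transposed independently,
    # then make the result contiguous so the output is B row-major (N, M) matrices.
    transposed = np.ascontiguousarray(stacked.transpose(0, 2, 1))
    return transposed.reshape(-1)
```

With `num_batches=1` this reduces to an ordinary single-matrix transpose, which matches the claim that the default path is the non-batched case.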

Changed

get_arg_spec prepends a batch dim only when num_batches>1; num_batches=1 is byte-identical to the previous single-transpose schedule.
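
The shape rule described above can be sketched in plain Python. This is only an illustration of "prepend a batch dim only when num_batches>1"; `transpose_arg_shapes` is a hypothetical helper, and the real `get_arg_spec` signature and return type are not shown in this conversation.

```python
def transpose_arg_shapes(M, N, num_batches=1):
    """Return (input_shape, output_shape) for a possibly batched transpose.

    Hypothetical helper for illustration; not IRON's get_arg_spec.
    """
    if num_batches < 1:
        raise ValueError("num_batches must be >= 1")
    in_shape, out_shape = (M, N), (N, M)
    if num_batches > 1:
        # Only the batched case gains a leading dimension, so the
        # num_batches == 1 shapes stay exactly (M, N) and (N, M).
        in_shape = (num_batches,) + in_shape
        out_shape = (num_batches,) + out_shape
    return in_shape, out_shape
```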

Removed

None.

Verified on device (NPU2): num_batches in {1, 2, 4} pass. Mirrors GEMV's existing num_batches.

andrej

Thank you for the contribution! This will be a useful addition. Just a couple nitpicks, then please rebase on devel and we can go ahead and merge this.

andrej · 2026-06-22T17:36:16Z

+    # For num_batches>1 the L3 tensors hold that many contiguous (M,N) matrices, stacked along
+    # the row dimension: in-dims (num_batches*M, N), out-dims (num_batches*N, M). At num_batches==1
+    # these reduce to (M,N)/(N,M) — identical to the original single-transpose patterns. Each (i,j)
+    # column/channel gets one TAP per batch (offset += batch*num_elements); the per-batch internal
+    # sizes/strides are unchanged because each matrix is contiguous and row-major.


Some of this comment reads a bit like a commit message / PR description, since it refers to the "original" implementation and "unchanged" things. This will be confusing when reading the code as-is once merged. Can you please reword this?
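
For readers following the layout in the quoted comment, the index arithmetic can be sketched as below. It assumes row-major storage and the stacked dims stated there: in-dims (num_batches*M, N), out-dims (num_batches*N, M), with a per-batch offset of batch*num_elements. Both function names are hypothetical and do not appear in `design.py`.

```python
def batch_layout(M, N, num_batches=1):
    """Return stacked in-dims, out-dims, and the flat offset of each batch.

    Hypothetical helper for illustration only.
    """
    num_elements = M * N  # elements in one (M, N) matrix
    in_dims = (num_batches * M, N)
    out_dims = (num_batches * N, M)
    offsets = [batch * num_elements for batch in range(num_batches)]
    return in_dims, out_dims, offsets


def transposed_index(flat_index, M, N):
    """Map a flat index in the stacked input to its flat index in the stacked output."""
    num_elements = M * N
    batch, local = divmod(flat_index, num_elements)
    row, col = divmod(local, N)
    # Inside a batch, element (row, col) of the (M, N) input lands at
    # (col, row) of the (N, M) output; the batch offset is the same on both sides.
    return batch * num_elements + col * M + row
```

Because every batch has the same size on the input and output side, the same offset list serves both tensors.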

andrej · 2026-06-22T17:58:49Z

Note the CI failures we're seeing should disappear after a rebase. :)

…ispatch GEMV and StridedCopy already take num_batches to batch B independent same-shape operations into a single dispatch; Transpose did not, forcing callers to unroll B per-head/per-batch transposes into B separate dispatches for identical kernel work (a common multi-head-attention pattern). num_batches>1 lays B contiguous (M,N) matrices back-to-back and streams them through the same ObjectFifos (one task group per batch); the core still only sees s*s sub-tiles, so the kernel is unchanged. num_batches=1 (default) is byte-identical to the previous single-transpose schedule.

Adds num_batches=2 (default suite) and num_batches=4 (extensive) cases to the transpose test, with a batched golden reference. The operator's batched path was previously untested. Verified on device (NPU2): num_batches in {1,2,4} pass.

Co-authored-by: André Rösti <androsti@amd.com>

atassis · 2026-06-22T19:37:02Z

Hi there! Thanks for the feedback, have applied it in both PRs.
I have also opened Xilinx/mlir-aie#3178, a compile-time fix (O(n^2) to O(1) in several places), which doesn't seem to have received any human attention yet. I'd be grateful for any help moving it forward. I have a few more generic fixes queued after that, if this one proves useful =)
BTW, I am trying to build a general-purpose, multi-precision NPU inference engine on IRON for consumer Ryzen AI / XDNA2 laptops (I might later make it work with other hardware too, though I have none to test against), and this work is how I found these fixes.
So, no promises, but I might open some more PRs in the near future!

thomthehound · 2026-06-22T20:18:06Z

I think both of these contributions are valuable, so thank you for them!

Your Xilinx/mlir-aie#3178 PR has had human eyes on it, but, speaking for myself, I would prefer to see the Copilot review issues there explained or resolved before commenting further.

atassis · 2026-06-22T20:46:26Z

@thomthehound good point, resolved them

atassis requested review from andrej, hunhoffe and jgmelber as code owners June 18, 2026 15:58

jgmelber approved these changes Jun 22, 2026

andrej requested changes Jun 22, 2026

atassis and others added 6 commits June 22, 2026 22:05

Update iron/operators/transpose/op.py

93e120d

Co-authored-by: André Rösti <androsti@amd.com>

Update iron/operators/transpose/design.py

ed7f78b

Co-authored-by: André Rösti <androsti@amd.com>

Update iron/operators/transpose/design.py

12ea37f

Co-authored-by: André Rösti <androsti@amd.com>

Update iron/operators/transpose/reference.py

4aa4c91

Co-authored-by: André Rösti <androsti@amd.com>

atassis force-pushed the iron-transpose-numbatches branch from 722c6f9 to 4aa4c91 Compare June 22, 2026 19:06

transpose: reword L3 layout comment to be self-contained

a90732d

Drop the diff-relative phrasing ('original'/'unchanged') flagged in review; the comment now describes the access-pattern layout as-is. Rationale moved to the PR description.

atassis wants to merge 7 commits into amd:devel from atassis:iron-transpose-numbatches
