transpose: add num_batches to batch independent transposes into one dispatch #124

Open: atassis wants to merge 7 commits into amd:devel from atassis:iron-transpose-numbatches

atassis commented Jun 18, 2026

GEMV and StridedCopy already take num_batches to batch B independent same-shape operations into a single dispatch; Transpose did not, forcing callers to unroll B per-head/per-batch transposes into B separate dispatches for identical kernel work (a common multi-head-attention pattern).

Added

  • num_batches on Transpose (default 1). num_batches>1 lays B contiguous (M,N) matrices back-to-back and streams them through the same ObjectFifos (one task group per batch); the core still only sees s×s sub-tiles, so the kernel is unchanged.
  • num_batches>1 test coverage (the batched path was previously untested), with a batched golden reference.
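A batched golden reference of the kind described above can be sketched in NumPy. This is only an illustration of the semantics, not the test code in this PR: the function name `batched_transpose_reference` and its signature are hypothetical, and it assumes the input holds `num_batches` contiguous row-major (M, N) matrices.

```python
import numpy as np


def batched_transpose_reference(x, num_batches, M, N):
    """Hypothetical golden reference: transpose each (M, N) matrix independently.

    Assumes x holds num_batches contiguous, row-major (M, N) matrices.
    """
    batched = np.asarray(x).reshape(num_batches, M, N)
    # Swap only the last two axes; the batch axis stays in place.
    return np.ascontiguousarray(batched.transpose(0, 2, 1))


# Two (2, 3) matrices laid back-to-back.
x = np.arange(2 * 2 * 3).reshape(2, 2, 3)
out = batched_transpose_reference(x, num_batches=2, M=2, N=3)
```

Each output slice `out[b]` is the plain transpose of `x[b]`, so a batched run should match B independent single-transpose runs.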

Changed

  • get_arg_spec prepends a batch dim only when num_batches>1; num_batches=1 is byte-identical to the previous single-transpose schedule.
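The shape rule stated above can be illustrated with a small stand-alone helper. This is a sketch under assumptions: `sketch_arg_shapes` is a hypothetical name and does not reproduce the real `get_arg_spec` signature or return type; it only shows the "prepend a batch dim when num_batches>1" behaviour.

```python
def sketch_arg_shapes(M, N, num_batches=1):
    """Hypothetical helper: return (input_shape, output_shape) for a transpose.

    A leading batch dimension is added only when num_batches > 1, so the
    default case keeps the plain (M, N) -> (N, M) shapes.
    """
    in_shape, out_shape = (M, N), (N, M)
    if num_batches > 1:
        in_shape = (num_batches,) + in_shape
        out_shape = (num_batches,) + out_shape
    return in_shape, out_shape


single = sketch_arg_shapes(4, 8)                  # no batch dim
batched = sketch_arg_shapes(4, 8, num_batches=2)  # batch dim prepended
```

Keeping the default case free of a size-1 batch dimension is what lets existing single-transpose callers stay unaffected.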

Removed

  • None.

Verified on device (NPU2): num_batches in {1, 2, 4} pass. Mirrors GEMV's existing num_batches.

andrej (Collaborator) left a comment

Thank you for the contribution! This will be a useful addition. Just a couple nitpicks, then please rebase on devel and we can go ahead and merge this.

Comment thread iron/operators/transpose/design.py Outdated
Comment on lines +53 to +57
# For num_batches>1 the L3 tensors hold that many contiguous (M,N) matrices, stacked along
# the row dimension: in-dims (num_batches*M, N), out-dims (num_batches*N, M). At num_batches==1
# these reduce to (M,N)/(N,M) — identical to the original single-transpose patterns. Each (i,j)
# column/channel gets one TAP per batch (offset += batch*num_elements); the per-batch internal
# sizes/strides are unchanged because each matrix is contiguous and row-major.
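The stacked layout and per-batch offsets that this code comment describes can be checked with a small NumPy model. It is a host-side sketch only, with illustrative sizes; it does not use IRON's access-pattern API, and the variable names are assumptions.

```python
import numpy as np

M, N, num_batches = 2, 3, 3
num_elements = M * N  # elements per (M, N) matrix

# Input viewed as (num_batches*M, N): matrices stacked along the row dimension.
stacked_in = np.arange(num_batches * num_elements).reshape(num_batches * M, N)
flat_in = stacked_in.ravel()
flat_out = np.empty_like(flat_in)

# One offset per batch: batch b starts at b * num_elements in the flat buffer.
offsets = [b * num_elements for b in range(num_batches)]
for off in offsets:
    mat = flat_in[off:off + num_elements].reshape(M, N)
    flat_out[off:off + num_elements] = mat.T.ravel()

# Output viewed as (num_batches*N, M).
stacked_out = flat_out.reshape(num_batches * N, M)
```

Because every matrix is contiguous and row-major, the only per-batch difference is the starting offset; the within-matrix indexing is the same for each batch.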

Some of this comment reads a bit like a commit message / PR description, since it refers to the "original" implementation and "unchanged" things. This will be confusing when reading the code as-is once merged. Can you please reword this?

Comment thread iron/operators/transpose/design.py Outdated
Comment thread iron/operators/transpose/design.py Outdated
Comment thread iron/operators/transpose/op.py Outdated
Comment thread iron/operators/transpose/reference.py Outdated
andrej (Collaborator) commented Jun 22, 2026

Note the CI failures we're seeing should disappear after a rebase. :)

atassis and others added 6 commits June 22, 2026 22:05
…ispatch

GEMV and StridedCopy already take num_batches to batch B independent same-shape
operations into a single dispatch; Transpose did not, forcing callers to unroll
B per-head/per-batch transposes into B separate dispatches for identical kernel
work (a common multi-head-attention pattern).

num_batches>1 lays B contiguous (M,N) matrices back-to-back and streams them
through the same ObjectFifos (one task group per batch); the core still only
sees s*s sub-tiles, so the kernel is unchanged. num_batches=1 (default) is
byte-identical to the previous single-transpose schedule.
Adds num_batches=2 (default suite) and num_batches=4 (extensive) cases to the
transpose test, with a batched golden reference. The operator's batched path was
previously untested. Verified on device (NPU2): num_batches in {1,2,4} pass.
Co-authored-by: André Rösti <androsti@amd.com>
Co-authored-by: André Rösti <androsti@amd.com>
Co-authored-by: André Rösti <androsti@amd.com>
Co-authored-by: André Rösti <androsti@amd.com>
atassis force-pushed the iron-transpose-numbatches branch from 722c6f9 to 4aa4c91 on June 22, 2026 19:06
Drop the diff-relative phrasing ('original'/'unchanged') flagged in review;
the comment now describes the access-pattern layout as-is. Rationale moved to
the PR description.
atassis (Author) commented Jun 22, 2026

Hi there! Thanks for the feedback; I have applied it in both PRs.
I have also opened a PR, Xilinx/mlir-aie#3178, with compile-time fixes (O(n^2) to O(1)) in several places, which doesn't seem to have received any human attention yet. I'd be grateful for any help moving it forward. I have a few more generic fixes queued after that, if this one proves useful =)
BTW, I am trying to build a general-purpose, multi-precision NPU inference engine on IRON for consumer Ryzen AI / XDNA2 laptops (I might later make it work with other hardware too, though I have none to test against), and this work helped me find these fixes.
So, no promises, but I might open some more PRs in the near future!

@thomthehound

I think both of these contributions are valuable, so thank you for them!

Your Xilinx/mlir-aie#3178 PR has had human eyes on it but, speaking for myself, I would prefer to see the Copilot review issues there explained or resolved before commenting further.

atassis (Author) commented Jun 22, 2026

@thomthehound good point, resolved them
