transpose: add num_batches to batch independent transposes into one dispatch#124
transpose: add num_batches to batch independent transposes into one dispatch#124atassis wants to merge 7 commits into
Conversation
andrej
left a comment
There was a problem hiding this comment.
Thank you for the contribution! This will be a useful addition. Just a couple nitpicks, then please rebase on devel and we can go ahead and merge this.
| # For num_batches>1 the L3 tensors hold that many contiguous (M,N) matrices, stacked along | ||
| # the row dimension: in-dims (num_batches*M, N), out-dims (num_batches*N, M). At num_batches==1 | ||
| # these reduce to (M,N)/(N,M) — identical to the original single-transpose patterns. Each (i,j) | ||
| # column/channel gets one TAP per batch (offset += batch*num_elements); the per-batch internal | ||
| # sizes/strides are unchanged because each matrix is contiguous and row-major. |
There was a problem hiding this comment.
Some of this comment reads a bit like a commit message / PR description, since it refers to the "original" implementation and "unchanged" things. This will be confusing when reading the code as-is once merged. Can you please reword this?
|
Note the CI failures we're seeing should disappear after a rebase. :) |
…ispatch GEMV and StridedCopy already take num_batches to batch B independent same-shape operations into a single dispatch; Transpose did not, forcing callers to unroll B per-head/per-batch transposes into B separate dispatches for identical kernel work (a common multi-head-attention pattern). num_batches>1 lays B contiguous (M,N) matrices back-to-back and streams them through the same ObjectFifos (one task group per batch); the core still only sees s*s sub-tiles, so the kernel is unchanged. num_batches=1 (default) is byte-identical to the previous single-transpose schedule.
Adds num_batches=2 (default suite) and num_batches=4 (extensive) cases to the
transpose test, with a batched golden reference. The operator's batched path was
previously untested. Verified on device (NPU2): num_batches in {1,2,4} pass.
Co-authored-by: André Rösti <androsti@amd.com>
Co-authored-by: André Rösti <androsti@amd.com>
Co-authored-by: André Rösti <androsti@amd.com>
Co-authored-by: André Rösti <androsti@amd.com>
722c6f9 to
4aa4c91
Compare
Drop the diff-relative phrasing ('original'/'unchanged') flagged in review;
the comment now describes the access-pattern layout as-is. Rationale moved to
the PR description.
|
Hi there! Thanks for the feedback, have applied it in both PRs. |
I think both of these contributions are valuable, so thank you for them! Your [Xilinx/mlir-aie#3178] PR has had human eyes on it, but, speaking for myself, I would prefer to see the CoPilot review issues there explained or resolved before commenting further. |
|
@thomthehound good point, resolved them |
GEMV and StridedCopy already take
num_batchesto batch B independent same-shape operations into a single dispatch; Transpose did not, forcing callers to unroll B per-head/per-batch transposes into B separate dispatches for identical kernel work (a common multi-head-attention pattern).Added
num_batchesonTranspose(default 1).num_batches>1lays B contiguous (M,N) matrices back-to-back and streams them through the same ObjectFifos (one task group per batch); the core still only sees s×s sub-tiles, so the kernel is unchanged.num_batches>1test coverage (the batched path was previously untested), with a batched golden reference.Changed
get_arg_specprepends a batch dim only whennum_batches>1;num_batches=1is byte-identical to the previous single-transpose schedule.Removed
Verified on device (NPU2):
num_batchesin {1, 2, 4} pass. Mirrors GEMV's existingnum_batches.