cranelift(aarch64): lower the i8 relaxed dot product to NEON SDOT #13640

Merged: alexcrichton merged 1 commit into bytecodealliance:main from darmie:aarch64-sdot-relaxed-dot on Jun 15, 2026
darmie commented Jun 15, 2026

On AArch64, i32x4.relaxed_dot_i8x16_i7x16_add_s currently lowers to a
smull/smull2/addp/saddlp/add widening chain, even when FEAT_DotProd is
available. This adds a has_dotprod setting (auto-detected via
cranelift-native) and, when it is enabled, lowers the op to a single sdot,
the instruction this relaxed op was designed to map to. Without the
feature, the existing fallback is unchanged.
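
As a rough illustration of what host feature detection for FEAT_DotProd can look like, the sketch below uses the standard library's `is_aarch64_feature_detected!` macro. This is an assumption for illustration only: it is not the code cranelift-native uses, and `host_has_dotprod` is a hypothetical helper name.

```rust
// Hypothetical helper (not from the PR): reports whether the host CPU
// advertises the AArch64 dot-product extension.
#[cfg(target_arch = "aarch64")]
fn host_has_dotprod() -> bool {
    // Runtime query through the standard library's feature-detection macro.
    std::arch::is_aarch64_feature_detected!("dotprod")
}

// On any other architecture the feature cannot be present.
#[cfg(not(target_arch = "aarch64"))]
fn host_has_dotprod() -> bool {
    false
}
```

A setting like has_dotprod would then be enabled only when a check of this kind succeeds, so the sdot rule never fires on hardware that lacks the instruction.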

There's no dedicated dot CLIF opcode (removed in #5889), so the wasm→CLIF
translator expands the op into a swiden/imul/iadd_pairwise/iadd tree; a
new ISLE rule contracts that tree back into sdot c, a, b.
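
To make the contraction concrete, here is a scalar sketch of the two shapes. It assumes the tree is exactly the swiden/imul/iadd_pairwise/iadd sequence named above, with a second widening step to i32 before the final pairwise add; the function names `clif_tree_model` and `sdot_model` are hypothetical and do not appear in Cranelift.

```rust
/// Scalar model of the expanded tree: widen i8 to i16, multiply, pairwise add
/// in i16, widen to i32, pairwise add in i32, then add the accumulator.
fn clif_tree_model(a: [i8; 16], b: [i8; 16], c: [i32; 4]) -> [i32; 4] {
    // swiden + imul: sixteen i8*i8 products, each held in an i16 lane.
    let mut prod = [0i16; 16];
    for i in 0..16 {
        prod[i] = (a[i] as i16).wrapping_mul(b[i] as i16);
    }
    // iadd_pairwise on i16 lanes: adjacent products summed with i16 wraparound.
    let mut pair = [0i16; 8];
    for m in 0..8 {
        pair[m] = prod[2 * m].wrapping_add(prod[2 * m + 1]);
    }
    // Widen to i32, pairwise add, then iadd with the accumulator.
    let mut out = [0i32; 4];
    for j in 0..4 {
        let quad = (pair[2 * j] as i32).wrapping_add(pair[2 * j + 1] as i32);
        out[j] = c[j].wrapping_add(quad);
    }
    out
}

/// Scalar model of `sdot c, a, b`: each i32 lane accumulates four signed
/// i8*i8 products computed at full i32 width.
fn sdot_model(a: [i8; 16], b: [i8; 16], c: [i32; 4]) -> [i32; 4] {
    let mut out = c;
    for j in 0..4 {
        for k in 0..4 {
            let p = (a[4 * j + k] as i32) * (b[4 * j + k] as i32);
            out[j] = out[j].wrapping_add(p);
        }
    }
    out
}
```

In both models, output lane j depends only on input bytes 4j through 4j+3 and on accumulator lane j, which is what lets a single pattern match replace the whole tree.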

Correctness

sdot is the signed, full-width, no-saturation dot product — exactly the
deterministic CLIF the translator already emits on AArch64 (the x86
pmaddubsw path is gated off here), so this is instruction selection, not a
semantic change. For the in-i7 range the op guarantees, the i16 pair sums
can't overflow, so sdot is bit-identical to the fallback; behavior differs
only in the spec's already-implementation-defined out-of-range zone, and the
relaxed-simd deterministic profile picks the same signed dot. Mirrors the
existing relaxed_madd → FMLA pattern of gating a native instruction behind a
feature.
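
The overflow claim can be checked with a small brute-force sketch. It assumes "in-i7 range" means second-operand bytes in 0..=127, which is my reading rather than something stated in this PR; `pair_sum_bounds` and `fits_in_i16` are hypothetical helper names.

```rust
/// Returns (min, max) of p0 + p1, where each p is a product a * b with `a`
/// any i8 value and `b` drawn from `b_range`. The two products are
/// independent, so the extremes are twice the extreme single product.
fn pair_sum_bounds(b_range: std::ops::RangeInclusive<i32>) -> (i32, i32) {
    let mut lo = i32::MAX;
    let mut hi = i32::MIN;
    for a in -128..=127 {
        for b in b_range.clone() {
            lo = lo.min(a * b);
            hi = hi.max(a * b);
        }
    }
    (2 * lo, 2 * hi)
}

/// True when both bounds are representable in an i16 lane.
fn fits_in_i16((lo, hi): (i32, i32)) -> bool {
    lo >= i16::MIN as i32 && hi <= i16::MAX as i32
}
```

With `b` restricted to 0..=127 the bounds are (-32512, 32258), which fit in i16, so the i16 pairwise add in the fallback never wraps. With `b` spanning the full i8 range the upper bound becomes 32768 (from -128 * -128 twice), one past i16::MAX, which is the out-of-range zone where the two lowerings can differ.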

Performance

cargo bench --bench relaxed_dot, Apple M1 Pro, wasmtime 47.0.0. The same
module is compiled twice toggling has_dotprod; the kernel is 8 independent
relaxed-dot accumulator chains (40M i8 4-way dots per call), criterion, 100
samples.

| has_dotprod | Lowering | Time / call | Speedup |
|---|---|---|---|
| false | smull/smull2/addp/saddlp/add | 17.12 ms | 1.00× |
| true | sdot | 9.32 ms | 1.84× |
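
The speedup figure follows directly from the two timings. The sketch below reproduces that arithmetic and also derives a dots-per-second rate, assuming "40M i8 4-way dots per call" means 40,000,000 dot operations per timed call; the helper names are hypothetical.

```rust
/// Ratio of baseline time to optimized time.
fn speedup(baseline_ms: f64, optimized_ms: f64) -> f64 {
    baseline_ms / optimized_ms
}

/// Throughput in dots per second given the work per call and the time per call.
fn dots_per_second(dots_per_call: f64, ms_per_call: f64) -> f64 {
    dots_per_call / (ms_per_call / 1000.0)
}
```

`speedup(17.12, 9.32)` is about 1.84, matching the reported figure, and under the stated assumption `dots_per_second(40e6, 9.32)` is roughly 4.3 billion dots per second for the sdot path versus about 2.3 billion for the fallback.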

Tests / bench

  • cranelift/filetests/filetests/isa/aarch64/simd-sdot.clif (precise-output lowering).
  • cranelift/filetests/filetests/runtests/simd-sdot.clif (executes the dot via the sdot path and the fallback across ISAs; results must agree).
  • tests/disas/aarch64-relaxed-simd-dotprod.wat (golden disas, compared against the fallback in aarch64-relaxed-simd.wat).
  • benches/relaxed_dot.rs (the benchmark above).

darmie requested review from a team as code owners on June 15, 2026 13:51
darmie requested review from alexcrichton and removed the request for a team on June 15, 2026 13:51

alexcrichton left a comment

Thanks for this!

Comment thread benches/relaxed_dot/kernel.wat Outdated

Member

We don't typically host micro-benchmarks like this for individual instruction/architecture pairs, so it's ok to omit changes here in the benches directory

Contributor Author

Removed!

Member

Could this include a `test interpret` directive as well, to verify that it runs against the interpreter?

Contributor Author

Done! It passes.

Comment on lines +7 to +10
;; Signed i8 4-way dot product with accumulate (the shape wasm's
;; `i32x4.relaxed_dot_i8x16_i7x16_add_s` lowers to). On aarch64 `has_dotprod`
;; this runs through the new `sdot` lowering; the plain `aarch64` target (and
;; the other ISAs) run the smull/saddlp widening fallback. All must agree.

Member

It's ok to drop comments like this. LLMs generate text explaining what a test is, but that text becomes irrelevant immediately after landing. This is "just another" test we have, and you can leave a comment saying that this is testing the sdot lowering, but there's no need to call it a "new" lowering since that's stale text once this PR merges.

Contributor Author

Okay 👍🏾. Noted for next time. I'll make the change.

Contributor Author

Resolved.

With FEAT_DotProd, lower the wasm `i32x4.relaxed_dot_i8x16_i7x16_add_s` op to a
single `sdot` rather than the smull/smull2/addp/saddlp/add widening fallback
(~1.8x on Apple M1). Gated on a new `has_dotprod` setting; without it the
existing fallback is unchanged.
darmie force-pushed the aarch64-sdot-relaxed-dot branch from 3a067cc to 532bec1 on June 15, 2026 15:46
darmie requested a review from alexcrichton on June 15, 2026 15:49
alexcrichton added this pull request to the merge queue on Jun 15, 2026
Merged via the queue into bytecodealliance:main with commit 8cb28bc Jun 15, 2026
78 checks passed