cranelift(aarch64): lower the i8 relaxed dot product to NEON SDOT by darmie · Pull Request #13640 · bytecodealliance/wasmtime

darmie · 2026-06-15T13:51:05Z

On AArch64, i32x4.relaxed_dot_i8x16_i7x16_add_s currently lowers to a
smull/smull2/addp/saddlp/add widening chain, even when FEAT_DotProd is
available. This adds a has_dotprod setting (auto-detected via
cranelift-native) and, when it is enabled, lowers the op to a single sdot,
the instruction this relaxed op was designed to map to. Without the
feature, the existing fallback is unchanged.
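
For readers unfamiliar with runtime feature detection, here is a minimal sketch of how a host could be probed for FEAT_DotProd using the standard library. The function name `host_has_dotprod` is hypothetical and this is not the cranelift-native code from this PR; it only illustrates the kind of check that could feed a setting like has_dotprod.

```rust
/// Reports whether the host CPU advertises FEAT_DotProd (the `sdot`/`udot`
/// instructions). Hypothetical helper, not part of cranelift-native.
#[cfg(target_arch = "aarch64")]
fn host_has_dotprod() -> bool {
    std::arch::is_aarch64_feature_detected!("dotprod")
}

/// On any other architecture the feature is never available.
#[cfg(not(target_arch = "aarch64"))]
fn host_has_dotprod() -> bool {
    false
}

fn main() {
    println!("has_dotprod = {}", host_has_dotprod());
}
```

The result depends on the machine it runs on, so no expected output is shown.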

There's no dedicated dot CLIF opcode (removed in #5889), so the wasm→CLIF
translator expands the op into a swiden/imul/iadd_pairwise/iadd tree; a
new ISLE rule contracts that tree back into sdot c, a, b.
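
As a rough illustration of what the contraction relies on, the following scalar sketch models one i32 output lane both ways: the expanded widen/multiply/pairwise-add tree and a single sdot-style accumulate. The function names are hypothetical, and the model assumes the first pairwise add happens in i16 (as the description's mention of i16 pair sums suggests); it is not the ISLE rule or the translator code.

```rust
/// Scalar model of the expanded tree for one i32 output lane:
/// widen i8 -> i16, multiply, pairwise add in i16, widen to i32,
/// pairwise add again, then add the accumulator `c`.
fn expanded_lane(c: i32, a: [i8; 4], b: [i8; 4]) -> i32 {
    let prod = |i: usize| (a[i] as i16).wrapping_mul(b[i] as i16);
    let lo = prod(0).wrapping_add(prod(1));
    let hi = prod(2).wrapping_add(prod(3));
    c.wrapping_add((lo as i32).wrapping_add(hi as i32))
}

/// Scalar model of one sdot lane: four signed i8*i8 products
/// accumulated directly into the 32-bit accumulator.
fn sdot_lane(c: i32, a: [i8; 4], b: [i8; 4]) -> i32 {
    let mut acc = c;
    for i in 0..4 {
        acc = acc.wrapping_add((a[i] as i32) * (b[i] as i32));
    }
    acc
}

fn main() {
    let (c, a, b) = (10, [1, -2, 3, -4], [5, 6, 7, 8]);
    // 10 + (5 - 12 + 21 - 32) = -8 in both models.
    assert_eq!(expanded_lane(c, a, b), sdot_lane(c, a, b));
    println!("{}", sdot_lane(c, a, b)); // prints -8
}
```

A full vector op applies this to each of the four i32 lanes, consuming four consecutive i8 elements of each input per lane.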

Correctness

sdot is the signed, full-width, no-saturation dot product — exactly the
deterministic CLIF the translator already emits on AArch64 (the x86
pmaddubsw path is gated off here), so this is instruction selection, not a
semantic change. For the in-i7 range the op guarantees, the i16 pair sums
can't overflow, so sdot is bit-identical to the fallback; behavior differs
only in the spec's already-implementation-defined out-of-range zone, and the
relaxed-simd deterministic profile picks the same signed dot. Mirrors the
existing relaxed_madd → FMLA pattern of gating a native instruction behind a
feature.
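
The overflow argument can be checked with a little arithmetic. The sketch below assumes the i7 range means the second operand's elements lie in 0..=127 and that the first operand spans all of i8; the helper names are hypothetical. It computes the extreme values a two-product pair sum can take and tests whether they fit in i16.

```rust
/// Smallest and largest value of a0*b0 + a1*b1, where each `a` is any i8
/// and each `b` lies in b_lo..=b_hi. The extremes occur when both products
/// sit at the same extreme, so it is enough to double a single product.
fn pair_sum_range(b_lo: i32, b_hi: i32) -> (i32, i32) {
    let (mut min_p, mut max_p) = (i32::MAX, i32::MIN);
    for a in -128..=127 {
        for b in b_lo..=b_hi {
            let p = a * b;
            min_p = min_p.min(p);
            max_p = max_p.max(p);
        }
    }
    (2 * min_p, 2 * max_p)
}

/// True when the whole range is representable in a signed 16-bit lane.
fn fits_i16(range: (i32, i32)) -> bool {
    range.0 >= i16::MIN as i32 && range.1 <= i16::MAX as i32
}

fn main() {
    // In range: b in 0..=127 gives [-32512, 32258], which fits in i16.
    let in_range = pair_sum_range(0, 127);
    assert_eq!(in_range, (-32512, 32258));
    assert!(fits_i16(in_range));

    // Out of range: b = -128 allows (-128 * -128) * 2 = 32768 > i16::MAX.
    let full = pair_sum_range(-128, 127);
    assert_eq!(full, (-32512, 32768));
    assert!(!fits_i16(full));
}
```

Under these assumptions, a wrapping i16 pair sum and a 32-bit accumulate can only disagree when the second operand has its top bit set, which is the out-of-range zone described above.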

Performance

cargo bench --bench relaxed_dot, Apple M1 Pro, wasmtime 47.0.0. The same
module is compiled twice toggling has_dotprod; the kernel is 8 independent
relaxed-dot accumulator chains (40M i8 4-way dots per call), criterion, 100
samples.

| `has_dotprod` | Lowering | Time / call | Speedup |
| --- | --- | --- | --- |
| `false` | `smull`/`smull2`/`addp`/`saddlp`/`add` | 17.12 ms | 1.00× |
| `true` | `sdot` | 9.32 ms | 1.84× |

Tests / bench

cranelift/filetests/filetests/isa/aarch64/simd-sdot.clif: precise-output lowering test.
cranelift/filetests/filetests/runtests/simd-sdot.clif: executes the dot via the sdot path and the fallback across ISAs; results must agree.
tests/disas/aarch64-relaxed-simd-dotprod.wat: golden disassembly (compare with the fallback in aarch64-relaxed-simd.wat).
benches/relaxed_dot.rs: the benchmark above.

alexcrichton

Thanks for this!

alexcrichton · 2026-06-15T14:44:50Z

We don't typically host micro-benchmarks like this for individual instruction/architecture pairs, so it's ok to omit changes here in the benches directory

alexcrichton · 2026-06-15T14:51:37Z

Could this include a `test interpret` directive as well, to verify that it also runs against the interpreter?

Done! It passes.

alexcrichton · 2026-06-15T15:01:30Z

+;; Signed i8 4-way dot product with accumulate (the shape wasm's
+;; `i32x4.relaxed_dot_i8x16_i7x16_add_s` lowers to). On aarch64 `has_dotprod`
+;; this runs through the new `sdot` lowering; the plain `aarch64` target (and
+;; the other ISAs) run the smull/saddlp widening fallback. All must agree.


It's ok to drop comments like this. LLMs generate text explaining what a test is, but that text becomes irrelevant immediately after landing. This is "just another" test we have; you can leave a comment saying that it tests the sdot lowering, but there's no need to call it a "new" lowering, since that text is stale once this PR merges.

Okay 👍🏾. Noted for next time. I'll make the change.

With FEAT_DotProd, lower the wasm `i32x4.relaxed_dot_i8x16_i7x16_add_s` op to a single `sdot` rather than the smull/smull2/addp/saddlp/add widening fallback (~1.8x on Apple M1). Gated on a new `has_dotprod` setting; without it the existing fallback is unchanged.

darmie requested review from a team as code owners June 15, 2026 13:51

darmie requested review from alexcrichton and removed request for a team June 15, 2026 13:51

alexcrichton reviewed Jun 15, 2026

darmie force-pushed the aarch64-sdot-relaxed-dot branch from 3a067cc to 532bec1 Compare June 15, 2026 15:46

darmie requested a review from alexcrichton June 15, 2026 15:49

alexcrichton approved these changes Jun 15, 2026

alexcrichton added this pull request to the merge queue Jun 15, 2026

Merged via the queue into bytecodealliance:main with commit 8cb28bc Jun 15, 2026
78 checks passed

alexcrichton merged 1 commit into bytecodealliance:main from darmie:aarch64-sdot-relaxed-dot
