cranelift(aarch64): lower the i8 relaxed dot product to NEON SDOT #13640
We don't typically host micro-benchmarks like this for individual instruction/architecture pairs, so it's ok to omit changes here in the benches directory
Could this include a `test interpret` as well, to verify it runs against the interpreter?
;; Signed i8 4-way dot product with accumulate (the shape wasm's
;; `i32x4.relaxed_dot_i8x16_i7x16_add_s` lowers to). On aarch64 `has_dotprod`
;; this runs through the new `sdot` lowering; the plain `aarch64` target (and
;; the other ISAs) run the smull/saddlp widening fallback. All must agree.
It's ok to drop comments like this. LLMs generate text explaining what a test is, but that text becomes irrelevant immediately after landing. This is "just another" test we have; you can leave a comment saying that it tests the sdot lowering, but there's no need to call it a "new" lowering, since that text is stale once this PR merges.
Okay 👍🏾. Noted for next time. I'll make the change.
With FEAT_DotProd, lower the wasm `i32x4.relaxed_dot_i8x16_i7x16_add_s` op to a single `sdot` rather than the smull/smull2/addp/saddlp/add widening fallback (~1.8x on Apple M1). Gated on a new `has_dotprod` setting; without it the existing fallback is unchanged.
On AArch64 with FEAT_DotProd, `i32x4.relaxed_dot_i8x16_i7x16_add_s` currently lowers to a `smull/smull2/addp/saddlp/add` widening chain. This adds a `has_dotprod` setting (auto-detected via cranelift-native) and lowers the op to a single `sdot`, the instruction this relaxed op was designed to map to. Without the feature, the existing fallback is unchanged.
There's no dedicated dot CLIF opcode (removed in #5889), so the wasm→CLIF translator expands the op into a `swiden`/`imul`/`iadd_pairwise`/`iadd` tree; a new ISLE rule contracts that tree back into `sdot c, a, b`.

Correctness
`sdot` is the signed, full-width, no-saturation dot product, exactly the deterministic CLIF the translator already emits on AArch64 (the x86 `pmaddubsw` path is gated off here), so this is instruction selection, not a semantic change. For the in-i7 range the op guarantees, the i16 pair sums can't overflow, so `sdot` is bit-identical to the fallback; behavior differs only in the spec's already-implementation-defined out-of-range zone, and the relaxed-simd deterministic profile picks the same signed dot. This mirrors the existing `relaxed_madd` → FMLA pattern of gating a native instruction behind a feature.
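To make the overflow argument concrete, here is a rough scalar model of one 32-bit output lane. It is only a sketch and not code from this PR: `sdot_lane` and `fallback_lane` are hypothetical names, and the fallback model assumes the pair sums are computed with 16-bit wrapping arithmetic, as the "i16 pair sums" wording above suggests.

```rust
/// Scalar model of one 32-bit lane of `sdot`: four signed i8 products are
/// summed at full width and added to the accumulator.
fn sdot_lane(a: [i8; 4], b: [i8; 4], acc: i32) -> i32 {
    let mut sum = acc;
    for i in 0..4 {
        sum = sum.wrapping_add(i32::from(a[i]) * i32::from(b[i]));
    }
    sum
}

/// Scalar model of the widening fallback (assumption: pair sums wrap at
/// 16 bits before being widened to 32 bits).
fn fallback_lane(a: [i8; 4], b: [i8; 4], acc: i32) -> i32 {
    // An i8 * i8 product always fits in an i16 (at most 128 * 128 = 16384).
    let prod = |i: usize| i16::from(a[i]).wrapping_mul(i16::from(b[i]));
    // Adding two such products can exceed i16::MAX, so this add may wrap.
    let lo = prod(0).wrapping_add(prod(1));
    let hi = prod(2).wrapping_add(prod(3));
    i32::from(lo).wrapping_add(i32::from(hi)).wrapping_add(acc)
}

fn main() {
    // With `b` restricted to the i7 range 0..=127, the two models agree.
    for a in i8::MIN..=i8::MAX {
        for b in 0..=127i8 {
            let (va, vb) = ([a; 4], [b; 4]);
            assert_eq!(sdot_lane(va, vb, 7), fallback_lane(va, vb, 7));
        }
    }
    // Outside that range (-128 * -128 in every position) they diverge.
    let m = [i8::MIN; 4];
    println!("sdot: {}, fallback: {}", sdot_lane(m, m, 0), fallback_lane(m, m, 0));
}
```

The loop only covers inputs where all four positions hold the same value, so it illustrates the claim rather than proving it; the runtests remain the real check.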
Performance
`cargo bench --bench relaxed_dot`, Apple M1 Pro, wasmtime 47.0.0. The same module is compiled twice toggling `has_dotprod`; the kernel is 8 independent relaxed-dot accumulator chains (40M i8 4-way dots per call), criterion, 100 samples.
| `has_dotprod` | lowering |
| --- | --- |
| false | `smull/smull2/addp/saddlp/add` |
| true | `sdot` |

Tests / bench
- `cranelift/filetests/filetests/isa/aarch64/simd-sdot.clif` (precise-output lowering).
- `cranelift/filetests/filetests/runtests/simd-sdot.clif` (executes the dot via the `sdot` path and the fallback across ISAs; results must agree).
- `tests/disas/aarch64-relaxed-simd-dotprod.wat`: golden disas (vs the fallback in `aarch64-relaxed-simd.wat`).
- `benches/relaxed_dot.rs` (the benchmark above)