Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]#973
Conversation
|
Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #973 +/- ##
==========================================
+ Coverage 97.09% 97.11% +0.02%
==========================================
Files 141 141
Lines 8110 8110
==========================================
+ Hits 7874 7876 +2
+ Misses 236 234 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
6f6ef26 to
4ac05ac
Compare
Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing ARM TU compile flags. Scalar fallback already on main serves as reference. Bakes in PR #970 review lessons (assert(dim>=16), 4-accumulator ILP, formula anchor, load_unaligned<float> metadata, dispatcher-routed tier-walk tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md. Each task ends in a commit; assistant runs tests/ASan/benchmarks after the user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…OD-14972] The 9 ARM tier blocks (L2/IP/Cosine × SVE2/SVE/NEON_HP) were missing ASSERT_EQ(alignment, 0) after each ASSERT_NEAR, unlike the SQ8_FP32 sister blocks which assert it. Adds the assertions to lock the contract that ARM tiers leave the caller's alignment value untouched.
…D-14972] svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] ← FP16[2e]. The step function loaded chunk consecutive FP16 values into positions 0..chunk-1, then passed them directly to svcvt_f32_f16_x, which picked positions 0,2,4,... and silently skipped positions 1,3,5,... For chunk=4 (128-bit SVE), only 2 of 4 FP16 values per step were used, producing wrong dot products. Fix: svzip1_f16(q_h, zeros) spreads values to even positions [v0,0,v1,0,...] so FCVT correctly reads v[0],v[1],v[2],... Applied to both the full step helper and the partial-chunk path. Discovered and fixed during ARM host verification (Task 14, MOD-14972).
…D-14972] SVE hot loop: replace svzip1_f16+svdup_f16+svwhilelt_b16 (4 ops) with svld1uh_u32 (1 op) — zero-extends each FP16 halfword into a 32-bit lane so svcvt_f32_f16_x reads the correct bits directly. Same fix applied to the partial-chunk path, which also drops the now-redundant pg16_partial predicate. Accumulator combine changed from svadd_f32_x to svadd_f32_z to match the SQ8_FP32 SVE sister. NEON residual: replace the single 8-lane block + up-to-7 software-scalar iterations with three independent 4-lane sub-steps (r>=4, r>=8, r>=12), leaving at most 3 elements for scalar — mirrors the SQ8_FP32 NEON sister exactly. Eliminates expensive vecsim_types::FP16_to_FP32 calls for residuals 4..15 (previously up to 7 software conversions per call). Both IP headers: remove assert()+<cassert> (no sister kernel uses them). Both L2 headers: drop redundant float16.h include and using declarations (arrive transitively through the included IP header).
…MOD-14972] - Remove docs/superpowers/ design and plan files (~1550 lines); sister PR #970 removed its equivalent doc before merge. - Drop 5-line "No alignment write" prose comment from the three AArch64 NEON_HP dispatcher blocks; the sister SQ8_FP32 ARM dispatchers carry no such comment — the absent alignment write already encodes the intent. - Trim GetDistFuncSQ8FP16Asymmetric to a 7-line template-mapping check at dim=15, matching the shape of GetDistFuncSQ8Asymmetric (SQ8_FP32 sister). The scalar-fallback assertion it previously duplicated is already covered by the trailing block of SQ8_FP16_SpacesOptimizationTest.
4ac05ac to
e1647dc
Compare
SQ8↔FP16 ARM kernels: proposed optimizations + benchmark resultsI profiled the ARM SQ8↔FP16 distance kernels from this PR and prototyped two optimizations, benchmarked head-to-head against this PR's code on a Graviton-4 (Neoverse V2) arm64 runner via To keep the comparison clean, both runs are identical except for the kernel changes, and both register
Speedup = baseline CPU-time ÷ treatment CPU-time (>1 = faster). 36 dimension points per tier.
What changed
Why this is trustworthyThe untouched base SVE tier reads 1.00×, confirming the comparison is clean (identical code → identical perf) — so the gains are attributable to the kernels, not run-to-run noise. No regressions anywhere (min speedup ≥ 1.00 in every tier). Happy to fold these into this PR if you'd like — the diffs are small and isolated to the IP/L2 SVE2 headers + the NEON_HP step + the SVE2 chooser. |
Describe the changes in the pull request
Add asymmetric SQ8↔FP16 SIMD distance kernels (IP, L2, Cosine) for ARM tiers: NEON_HP, SVE, SVE2. Stacked on PR #970 (MOD-14954), which delivers the x86 equivalents.
The SVE hot loop uses
svld1uh_u32to zero-extend each FP16 halfword into a 32-bit lane, allowingsvcvt_f32_f16_xto read the correct bits directly. The NEON residual mirrors the SQ8_FP32 NEON sister: three independent 4-lane sub-steps (r>=4/8/12) leaving at most 3 elements for scalar, replacing the previous single 8-lane block + up-to-7 software conversions.Which issues this PR fixes
Main objects this PR modified
src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h— new NEON_HP IP kernelsrc/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h— new SVE/SVE2 IP kernelsrc/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h— new NEON_HP L2 kernelsrc/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h— new SVE/SVE2 L2 kernelsrc/VecSim/spaces/functions/NEON_HP.{h,cpp},SVE.{h,cpp},SVE2.{h,cpp}— chooser symbolssrc/VecSim/spaces/IP_space.cpp,L2_space.cpp— AArch64 dispatcher blockstests/unit/test_spaces.cpp— tier-walk tests for NEON_HP / SVE / SVE2tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp— ARM microbench registrationsMark if applicable
Note
Medium Risk
Changes hot-path distance math used for vector search; incorrect kernels would skew rankings, though parity tests compare each ARM tier to the scalar baseline within tolerance.
Overview
Adds ARM SIMD paths for asymmetric SQ8 storage ↔ FP16 query inner product, L2², and cosine, complementing the existing x86 SQ8↔FP16 work.
New NEON_HP (
asimdhp) kernels process 16-byte SQ8 chunks against FP16 queries widened to FP32, with residual handling aligned to the SQ8↔FP32 NEON pattern. SVE/SVE2 kernels reuse the same math viaCHOOSE_SVE_IMPLEMENTATION, loading FP16 withsvld1uh_u32beforesvcvt_f32_f16. L2 uses the IP core plus precomputed sum-of-squares metadata.IP_space.cpp/L2_space.cpp(and cosine via IP) now dispatch on AArch64 whendim >= 16, preferring SVE2 → SVE → NEON_HP. Chooser wiring lives inNEON_HP,SVE, andSVE2function modules. Unit tests tier-walk NEON_HP/SVE/SVE2 against the scalar baseline; benchmarks register the ARM variants when the matchingOPT_*build flags and CPU features are present.Reviewed by Cursor Bugbot for commit e1647dc. Bugbot is set up for automated code reviews on this repo. Configure here.