
Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]#973

Open
dor-forer wants to merge 19 commits into main from dor-forer-sq8-fp16-arm-kernels-mod-14972

Conversation

@dor-forer
Collaborator

@dor-forer dor-forer commented May 31, 2026

Describe the changes in the pull request

Add asymmetric SQ8↔FP16 SIMD distance kernels (IP, L2, Cosine) for ARM tiers: NEON_HP, SVE, SVE2. Stacked on PR #970 (MOD-14954), which delivers the x86 equivalents.

The SVE hot loop uses svld1uh_u32 to zero-extend each FP16 halfword into a 32-bit lane, allowing svcvt_f32_f16_x to read the correct bits directly. The NEON residual mirrors the SQ8_FP32 NEON sister: three independent 4-lane sub-steps (r>=4/8/12) leaving at most 3 elements for scalar, replacing the previous single 8-lane block + up-to-7 software conversions.

Which issues this PR fixes

  1. MOD-14972

Main objects this PR modified

  1. src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h — new NEON_HP IP kernel
  2. src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h — new SVE/SVE2 IP kernel
  3. src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h — new NEON_HP L2 kernel
  4. src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h — new SVE/SVE2 L2 kernel
  5. src/VecSim/spaces/functions/NEON_HP.{h,cpp}, SVE.{h,cpp}, SVE2.{h,cpp} — chooser symbols
  6. src/VecSim/spaces/IP_space.cpp, L2_space.cpp — AArch64 dispatcher blocks
  7. tests/unit/test_spaces.cpp — tier-walk tests for NEON_HP / SVE / SVE2
  8. tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp — ARM microbench registrations

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Note

Medium Risk
Changes hot-path distance math used for vector search; incorrect kernels would skew rankings, though parity tests compare each ARM tier to the scalar baseline within tolerance.

Overview
Adds ARM SIMD paths for asymmetric SQ8 storage ↔ FP16 query inner product, L2², and cosine, complementing the existing x86 SQ8↔FP16 work.

New NEON_HP (asimdhp) kernels process 16-byte SQ8 chunks against FP16 queries widened to FP32, with residual handling aligned to the SQ8↔FP32 NEON pattern. SVE/SVE2 kernels reuse the same math via CHOOSE_SVE_IMPLEMENTATION, loading FP16 with svld1uh_u32 before svcvt_f32_f16_x. L2 uses the IP core plus precomputed sum-of-squares metadata.
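The "IP core plus sum-of-squares" remark rests on the identity ‖x − y‖² = ‖x‖² + ‖y‖² − 2⟨x, y⟩. The following is a minimal portable sketch of that identity only. The function names are hypothetical, the inputs are plain floats rather than SQ8/FP16 data, and where the PR actually stores its metadata is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the IP core: a plain inner product over
// already-dequantized values.
inline float inner_product(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) acc += x[i] * y[i];
    return acc;
}

// Sum of squares of one vector; in the scheme described above this would be
// precomputed once and stored as metadata instead of recomputed per call.
inline float sum_of_squares(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float e : v) acc += e * e;
    return acc;
}

// L2^2 derived from the inner product and the two sums of squares.
inline float l2_sq_from_ip(float ip, float x_sum_sq, float y_sum_sq) {
    return x_sum_sq + y_sum_sq - 2.0f * ip;
}

// Direct reference: sum of squared differences.
inline float l2_sq_direct(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float d = x[i] - y[i];
        acc += d * d;
    }
    return acc;
}
```

With the sums of squares available up front, the per-call work reduces to one inner product, which is why an L2 kernel can share the IP hot loop.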

IP_space.cpp / L2_space.cpp (and cosine via IP) now dispatch on AArch64 when dim >= 16, preferring SVE2 → SVE → NEON_HP. Chooser wiring lives in NEON_HP, SVE, and SVE2 function modules. Unit tests tier-walk NEON_HP/SVE/SVE2 against the scalar baseline; benchmarks register the ARM variants when the matching OPT_* build flags and CPU features are present.
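The dispatch order described above can be sketched as a small selection function. This is an illustration under assumptions, not the code in IP_space.cpp or L2_space.cpp: `CpuFeatures`, `Tier`, and `choose_tier` are hypothetical names, and compile-time build flags are not modeled.

```cpp
#include <cstddef>

// Hypothetical runtime feature flags; the real dispatcher's feature
// structure and build-flag checks are not reproduced here.
struct CpuFeatures {
    bool sve2;
    bool sve;
    bool asimdhp;  // NEON with half-precision support
};

enum class Tier { Scalar, NEON_HP, SVE, SVE2 };

// Pick the widest available tier, falling back to scalar for small
// dimensions or when no SIMD tier is available.
inline Tier choose_tier(std::size_t dim, const CpuFeatures& f) {
    if (dim < 16) return Tier::Scalar;  // SIMD paths require dim >= 16
    if (f.sve2) return Tier::SVE2;
    if (f.sve) return Tier::SVE;
    if (f.asimdhp) return Tier::NEON_HP;
    return Tier::Scalar;
}
```

A tier-walk test in this style would clear the flags one at a time, highest tier first, and check that each lower tier is selected in turn.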

Reviewed by Cursor Bugbot for commit e1647dc. Bugbot is set up for automated code reviews on this repo.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ dor-forer
❌ Ubuntu


Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

@jit-ci

jit-ci Bot commented May 31, 2026

🛡️ Jit Security Scan Results

✅ No security findings were detected in this PR


Security scan by Jit

@codecov

codecov Bot commented May 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.11%. Comparing base (daf391f) to head (e1647dc).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #973      +/-   ##
==========================================
+ Coverage   97.09%   97.11%   +0.02%     
==========================================
  Files         141      141              
  Lines        8110     8110              
==========================================
+ Hits         7874     7876       +2     
+ Misses        236      234       -2     


@dor-forer dor-forer force-pushed the dor-forer-sq8-fp16-arm-kernels-mod-14972 branch from 6f6ef26 to 4ac05ac Compare June 1, 2026 15:01
Base automatically changed from dor-forer-sq8-fp16-x86-kernels-mod-14954 to main June 1, 2026 15:10
dor-forer and others added 19 commits June 1, 2026 15:12
Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure
onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing
ARM TU compile flags. Scalar fallback already on main serves as
reference. Bakes in PR #970 review lessons (assert(dim>=16),
4-accumulator ILP, formula anchor, load_unaligned<float> metadata,
dispatcher-routed tier-walk tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md.
Each task ends in a commit; assistant runs tests/ASan/benchmarks after the
user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…OD-14972]

The 9 ARM tier blocks (L2/IP/Cosine × SVE2/SVE/NEON_HP) were missing
ASSERT_EQ(alignment, 0) after each ASSERT_NEAR, unlike the SQ8_FP32
sister blocks which assert it. Adds the assertions to lock the contract
that ARM tiers leave the caller's alignment value untouched.
…D-14972]

svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] ← FP16[2e].
The step function loaded chunk consecutive FP16 values into positions 0..chunk-1,
then passed them directly to svcvt_f32_f16_x, which picked positions 0,2,4,...
and silently skipped positions 1,3,5,...  For chunk=4 (128-bit SVE), only 2 of
4 FP16 values per step were used, producing wrong dot products.

Fix: svzip1_f16(q_h, zeros) spreads values to even positions [v0,0,v1,0,...] so
FCVT correctly reads v[0],v[1],v[2],...  Applied to both the full step helper
and the partial-chunk path.

Discovered and fixed during ARM host verification (Task 14, MOD-14972).
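The lane-selection bug described in this commit can be reproduced without SVE hardware by tracking which halfword lanes a widening convert reads. The sketch below is a portable model only: it assumes a 128-bit vector (8 halfword lanes, 4 word lanes), all names are hypothetical, and it moves raw 16-bit values around without performing any FP16 to FP32 conversion.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kHalfLanes = 8;  // 128-bit vector viewed as 8 x 16-bit lanes
constexpr std::size_t kWordLanes = 4;  // same vector viewed as 4 x 32-bit lanes

using HalfVec = std::array<std::uint16_t, kHalfLanes>;
using WordSel = std::array<std::uint16_t, kWordLanes>;

// Model of a predicated contiguous halfword load:
// lane i <- src[i] for i < count, remaining lanes are zero.
inline HalfVec load_contiguous(const std::uint16_t* src, std::size_t count) {
    HalfVec v{};
    for (std::size_t i = 0; i < count && i < kHalfLanes; ++i) v[i] = src[i];
    return v;
}

// Model of a zero-extending load in the style of svld1uh_u32:
// 32-bit lane e <- zero_extend(src[e]), so halfword lane 2e holds src[e]
// and halfword lane 2e+1 is zero.
inline HalfVec load_zero_extended(const std::uint16_t* src, std::size_t count) {
    HalfVec v{};
    for (std::size_t e = 0; e < count && e < kWordLanes; ++e) v[2 * e] = src[e];
    return v;
}

// Lane selection of the widening convert as described in the commit message:
// output element e reads halfword lane 2e (the even-indexed lanes).
inline WordSel widen_selected_lanes(const HalfVec& v) {
    WordSel out{};
    for (std::size_t e = 0; e < kWordLanes; ++e) out[e] = v[2 * e];
    return out;
}
```

In this model, loading four consecutive values contiguously and converting yields source elements 0 and 2 followed by zeros, while the zero-extending load yields elements 0, 1, 2, 3 in order.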
…D-14972]

SVE hot loop: replace svzip1_f16+svdup_f16+svwhilelt_b16 (4 ops) with
svld1uh_u32 (1 op) — zero-extends each FP16 halfword into a 32-bit lane
so svcvt_f32_f16_x reads the correct bits directly. Same fix applied to
the partial-chunk path, which also drops the now-redundant pg16_partial
predicate. Accumulator combine changed from svadd_f32_x to svadd_f32_z
to match the SQ8_FP32 SVE sister.

NEON residual: replace the single 8-lane block + up-to-7 software-scalar
iterations with three independent 4-lane sub-steps (r>=4, r>=8, r>=12),
leaving at most 3 elements for scalar — mirrors the SQ8_FP32 NEON sister
exactly. Eliminates expensive vecsim_types::FP16_to_FP32 calls for
residuals 4..15 (previously up to 7 software conversions per call).
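The residual scheme described above can be sketched in portable C++, with a 4-element inner loop standing in for one 4-lane NEON step. This is an assumption-laden illustration: `step4`, `ResidualResult`, and `residual_dot` are hypothetical names, the inputs are plain floats, and the real kernel's ordering relative to the main loop is not shown.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for one 4-lane SIMD multiply-accumulate step.
inline float step4(const float* a, const float* b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < 4; ++i) acc += a[i] * b[i];
    return acc;
}

struct ResidualResult {
    float sum;                 // dot product over the r residual elements
    std::size_t scalar_iters;  // elements that fell through to the scalar tail
};

// Handle a residual of r < 16 elements with up to three independent
// 4-lane sub-steps, leaving at most 3 elements for the scalar loop.
inline ResidualResult residual_dot(const float* a, const float* b, std::size_t r) {
    assert(r < 16);
    float sum = 0.0f;
    std::size_t done = 0;
    if (r >= 4)  { sum += step4(a + done, b + done); done += 4; }
    if (r >= 8)  { sum += step4(a + done, b + done); done += 4; }
    if (r >= 12) { sum += step4(a + done, b + done); done += 4; }
    std::size_t scalar_iters = 0;
    for (; done < r; ++done, ++scalar_iters) sum += a[done] * b[done];
    return {sum, scalar_iters};
}
```

For every r in 0..15 the scalar tail runs r % 4 times, so never more than 3, compared with up to 7 under a single 8-lane block.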

Both IP headers: remove assert()+<cassert> (no sister kernel uses them).
Both L2 headers: drop redundant float16.h include and using declarations
(arrive transitively through the included IP header).
…MOD-14972]

- Remove docs/superpowers/ design and plan files (~1550 lines); sister PR #970
  removed its equivalent doc before merge.
- Drop 5-line "No alignment write" prose comment from the three AArch64
  NEON_HP dispatcher blocks; the sister SQ8_FP32 ARM dispatchers carry no
  such comment — the absent alignment write already encodes the intent.
- Trim GetDistFuncSQ8FP16Asymmetric to a 7-line template-mapping check at
  dim=15, matching the shape of GetDistFuncSQ8Asymmetric (SQ8_FP32 sister).
  The scalar-fallback assertion it previously duplicated is already covered
  by the trailing block of SQ8_FP16_SpacesOptimizationTest.
@dor-forer dor-forer force-pushed the dor-forer-sq8-fp16-arm-kernels-mod-14972 branch from 4ac05ac to e1647dc Compare June 1, 2026 15:23
@lerman25 lerman25 mentioned this pull request Jun 1, 2026
2 tasks
@lerman25
Collaborator

lerman25 commented Jun 1, 2026

SQ8↔FP16 ARM kernels: proposed optimizations + benchmark results

I profiled the ARM SQ8↔FP16 distance kernels from this PR and prototyped two optimizations, benchmarked head-to-head against this PR's code on a Graviton-4 (Neoverse V2) arm64 runner via bm-spaces.

To keep the comparison clean, both runs are identical except for the kernel changes, and both register spaces_sq8_fp16 in benchmarks.sh (it was built but never emitted by any CI setup, so the kernels had no benchmark coverage).

Speedup = baseline CPU-time ÷ treatment CPU-time (>1 = faster). 36 dimension points per tier.

| Tier | Metric | Median speedup | Range | Median ns (base → treat) |
|---|---|---|---|---|
| SVE2 | L2 | 1.83× | 1.02–1.88 | 93.6 → 51.0 |
| SVE2 | IP | 1.84× | 1.00–1.89 | 93.6 → 50.9 |
| SVE2 | Cosine | 1.84× | 1.00–1.89 | 93.6 → 50.8 |
| NEON_HP | L2 | 1.05× | 1.02–1.15 | 103.0 → 97.7 |
| NEON_HP | IP | 1.05× | 1.02–1.09 | 103.0 → 97.8 |
| NEON_HP | Cosine | 1.05× | 1.02–1.10 | 103.0 → 97.7 |
| SVE (untouched control) | all | 1.00× | 1.00–1.01 | unchanged |
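For readers checking the figures, the speedup definition above reduces to a ratio plus a median. The helpers below are a hypothetical sketch, not part of the bm-spaces tooling. Note that a median of per-dimension ratios need not equal the ratio of the two median timings, which may explain small differences between the speedup and ns columns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Speedup as defined in the comment: baseline CPU time / treatment CPU time.
inline double speedup(double baseline_ns, double treatment_ns) {
    assert(treatment_ns > 0.0);
    return baseline_ns / treatment_ns;
}

// Median of a non-empty sample (taken by value so the caller's data
// is left unsorted).
inline double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}
```

Applied to the SVE2 L2 median timings, `speedup(93.6, 51.0)` comes out just above 1.83, in line with the reported median speedup.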

What changed

  • SVE2 (~1.83–1.84×, near the theoretical 2×): a dedicated kernel built on the FMLALB/FMLALT widening multiply-accumulate pair (svmlalb_f32/svmlalt_f32). Storage and query both stay at 16 bits, so each step processes svcnth lanes instead of the base kernel's svcntw, which doubles per-step throughput and removes the explicit query widening. By bucket: high_dim 1.86×, residual 1.85×, low_dim 1.55–1.58× (per-call overhead dilutes the gain at small dims), and 512-bit chunks a median of 1.66× with a minimum of ~1.00× at Dimension:16 (a single chunk leaves nothing to amortize, so no regression).
  • NEON_HP (steady ~1.05×): widen SQ8 storage uint8 → fp16 → fp32 via vcvtq_f16_u16 (values 0..255 are exact in FP16), dropping two integer-widening ops per 16-element chunk with identical FP32 lane values.
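The exactness claim for 0..255 follows from the IEEE 754 binary16 format, whose 11-bit significand (10 stored bits plus the implicit leading 1) represents every integer up to 2048 exactly. The sketch below checks this in portable C++ by encoding each 8-bit value as binary16 bits and decoding it again. Both helper names are hypothetical, and the decoder deliberately handles only zero and positive normal values, which is all this range needs.

```cpp
#include <cmath>
#include <cstdint>

// Encode an unsigned 8-bit integer as IEEE 754 binary16 bits.
// No rounding is needed: the highest set bit is at position p <= 7, so the
// remaining bits always fit in the 10-bit mantissa field.
inline std::uint16_t u8_to_fp16_bits(std::uint8_t n) {
    if (n == 0) return 0;
    int p = 7;
    while (((n >> p) & 1u) == 0u) --p;  // position of the highest set bit
    const unsigned exponent = static_cast<unsigned>(p) + 15u;  // bias 15
    const unsigned mantissa = (static_cast<unsigned>(n) - (1u << p)) << (10 - p);
    return static_cast<std::uint16_t>((exponent << 10) | mantissa);
}

// Decode binary16 bits to float. Only zero and positive normal numbers are
// handled; subnormals, infinities, NaNs, and the sign bit are out of scope.
inline float fp16_bits_to_float(std::uint16_t h) {
    const unsigned e = (h >> 10) & 0x1Fu;
    const unsigned m = h & 0x3FFu;
    if (e == 0u && m == 0u) return 0.0f;
    return std::ldexp(1.0f + static_cast<float>(m) / 1024.0f,
                      static_cast<int>(e) - 15);
}
```

Round-tripping every value from 0 to 255 through these two helpers returns the original integer, which is the property that lets a uint8 → fp16 → fp32 widening produce the same FP32 lane values as a direct integer widening.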

Why this is trustworthy

The untouched base SVE tier reads 1.00×, confirming the comparison is clean (identical code → identical perf) — so the gains are attributable to the kernels, not run-to-run noise. No regressions anywhere (min speedup ≥ 1.00 in every tier).

Happy to fold these into this PR if you'd like — the diffs are small and isolated to the IP/L2 SVE2 headers + the NEON_HP step + the SVE2 chooser.
