
SVE_MATCH FIX - add fallback path if nkeys > VL #3345

Merged
jasonrandrews merged 1 commit into ArmDeveloperEcosystem:main from kieranhejmadi01:improvement/sve_match
Jun 2, 2026


@kieranhejmadi01
Contributor

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

@kieranhejmadi01
Contributor Author

As per issue #3302.

Fixes Investigated:

1. Using Intrinsics to test if nkeys > VL and create chunks.

However, the overhead seriously degraded performance, and the effect was worse when the haystack was short. The implementation is in the linked issue.

2. Use template specialization

Provide one implementation when nkeys < VL and a separate implementation when nkeys > VL.
The implementation is in C, not C++, so this would require refactoring and, based on experiment 1, would likely yield a negligible performance improvement.

3. Fallback to scalar implementation

The performance impact of the additional check was negligible and well within the margin of error. As such, I opted not to modify the published results, since any difference was likely within the margin of error on the same Graviton4 machine.
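The fallback in option 3 can be pictured roughly as follows. This is a minimal portable sketch, not the code merged in this PR: `match_any_u8`, `match_any_scalar_u8`, the `match_fn_u8` kernel type, and the `max_vector_keys` parameter are all hypothetical names, and the real SVE2 kernel is represented only by a function pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference search: true if any haystack element equals any key. */
static bool match_any_scalar_u8(const uint8_t *hay, size_t n,
                                const uint8_t *keys, size_t nkeys)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < nkeys; ++k) {
            if (hay[i] == keys[k]) {
                return true;
            }
        }
    }
    return false;
}

/* Hypothetical signature for a vector kernel (for example an SVE2 one). */
typedef bool (*match_fn_u8)(const uint8_t *hay, size_t n,
                            const uint8_t *keys, size_t nkeys);

/*
 * Dispatch wrapper (assumption: the vector kernel can only handle up to
 * max_vector_keys keys). One branch per call, outside the hot loop, decides
 * whether to use the vector kernel or fall back to the scalar search.
 */
static bool match_any_u8(const uint8_t *hay, size_t n,
                         const uint8_t *keys, size_t nkeys,
                         size_t max_vector_keys, match_fn_u8 vector_kernel)
{
    if (vector_kernel == NULL || nkeys > max_vector_keys) {
        return match_any_scalar_u8(hay, n, keys, nkeys);
    }
    return vector_kernel(hay, n, keys, nkeys);
}
```

Because the check runs once per call rather than once per haystack element, its cost does not grow with the haystack length, which is consistent with the negligible impact reported above.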

Haystack length : 65536 elements
Iterations      : 3
Hit probability : 0.000010 (0.0010 % )

Average latency over 3 iterations (ns):
  generic_u8       : 77618.67
  sve2_u8          : 1053.67
  sve2_u8_unrolled : 864.00
  speed‑up (orig)  : 73.67x
  speed‑up (unroll): 89.84x

  generic_u16      : 84841.00
  sve2_u16         : 3953.00
  sve2_u16_unrolled: 3052.33
  speed‑up (orig)  : 21.46x
  speed‑up (unroll): 27.80x

Throughput (million items/second):
  generic_u8       : 844.33 Mi/s
  sve2_u8          : 62198.04 Mi/s
  sve2_u8_unrolled : 75851.85 Mi/s
  speed‑up (orig)  : 73.67x
  speed‑up (unroll): 89.84x

  generic_u16      : 772.46 Mi/s
  sve2_u16         : 16578.80 Mi/s
  sve2_u16_unrolled: 21470.79 Mi/s
  speed‑up (orig)  : 21.46x
  speed‑up (unroll): 27.80x
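
For readers checking the figures, the throughput and speed-up values above appear consistent with dividing the haystack length by the average latency and dividing the generic latency by the SVE2 latency. The helpers below are a hypothetical sketch of that arithmetic, not code from the benchmark; small differences in the last digit can arise because the printed latencies are already rounded.

```c
#include <stddef.h>

/* Throughput in million items per second from an item count and a latency in ns. */
static double throughput_mitems_per_s(size_t items, double latency_ns)
{
    /* 1 item/ns equals 1e9 items/s, which is 1e3 million items/s. */
    return (double)items / latency_ns * 1e3;
}

/* Speed-up of an optimized variant relative to a baseline latency. */
static double speedup(double baseline_ns, double optimized_ns)
{
    return baseline_ns / optimized_ns;
}
```

For example, `throughput_mitems_per_s(65536, 77618.67)` gives roughly 844.33, matching the generic_u8 row, and `speedup(77618.67, 864.00)` gives roughly 89.84, matching the unrolled u8 speed-up.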

Occam's razor held true: the simple scalar fallback was the most performant option.

@jasonrandrews jasonrandrews merged commit 8fedf1b into ArmDeveloperEcosystem:main Jun 2, 2026
1 check passed