@JulianJuelg

This PR adds a Java Vector API implementation for dense codegen primitives in the following groups:

  • Aggregation
  • Division
  • Comparison
  • Multiply-add (remaining)
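As an illustration of the pattern for the aggregation group, here is a minimal sketch of a max reduction written once as a scalar loop and once with the Vector API. The class name `VectMaxSketch` and both method signatures are hypothetical and are not taken from this PR's code. On JDK 21 the Vector API is still an incubator module, so the sketch has to be compiled and run with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch, not the PR's implementation.
final class VectMaxSketch {
    // Widest species the current hardware supports (2 double lanes at 128 bits).
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Scalar baseline: max of a[ai .. ai+len-1].
    static double maxScalar(double[] a, int ai, int len) {
        double m = Double.NEGATIVE_INFINITY;
        for (int i = ai; i < ai + len; i++) {
            m = Math.max(m, a[i]);
        }
        return m;
    }

    // Vectorized variant: lane-wise max over full vectors, then a scalar tail.
    static double maxVector(double[] a, int ai, int len) {
        int end = ai + len;
        // Last index (exclusive) that can be covered by full vectors.
        int upper = ai + SPECIES.loopBound(len);
        DoubleVector acc = DoubleVector.broadcast(SPECIES, Double.NEGATIVE_INFINITY);
        int i = ai;
        for (; i < upper; i += SPECIES.length()) {
            acc = acc.max(DoubleVector.fromArray(SPECIES, a, i));
        }
        // Collapse the lanes into a single value.
        double m = acc.reduceLanes(VectorOperators.MAX);
        // Handle the remaining elements that do not fill a whole vector.
        for (; i < end; i++) {
            m = Math.max(m, a[i]);
        }
        return m;
    }

    public static void main(String[] args) {
        double[] data = {3.0, -1.0, 7.5, 2.0, 7.0};
        System.out.println(maxScalar(data, 0, data.length));
        System.out.println(maxVector(data, 0, data.length));
    }
}
```

The scalar tail loop keeps the result correct for lengths that are not a multiple of the lane count, which matters on any vector width.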

The new vectorized implementations were benchmarked against the previous scalar-loop versions (see results below) with JMH microbenchmarks and a standalone Java benchmark suite included in this PR. In most cases, both harnesses show the same trend. Where they differ, JMH is used as the primary signal because its results are less volatile.

For each primitive, I compared the Vector API version to the existing scalar loop:

  • If performance was equal or better, I replaced the scalar loop with the vectorized implementation.
  • If the Vector API version was slower, I kept the scalar implementation as the default and left the vectorized version in the codebase for reference.
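The rule above can be sketched in a few lines. This assumes that speedup means scalar time divided by vectorized time and that the cutoff for "equal or better" is exactly 1.0 on the JMH figure; the PR states neither explicitly, and the class and method names are hypothetical. The sample values are JMH speedups from the results table.

```java
// Hypothetical sketch of the replacement rule, not code from this PR.
final class ReplacementRule {
    // Assumed definition: values above 1.0 mean the vectorized version is faster.
    static double speedup(double scalarNsPerOp, double vectorNsPerOp) {
        return scalarNsPerOp / vectorNsPerOp;
    }

    // Replace the scalar loop when the JMH speedup is equal or better (assumed cutoff 1.0).
    static boolean replaceScalar(double jmhSpeedup) {
        return jmhSpeedup >= 1.0;
    }

    public static void main(String[] args) {
        System.out.println("vectMax: " + replaceScalar(2.002));
        System.out.println("vectDivWrite: " + replaceScalar(0.687));
        // JMH is the primary signal, so the Java-test speedup of 0.801 is not consulted here.
        System.out.println("vectEqualWrite2: " + replaceScalar(1.183));
    }
}
```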

Benchmark setup
JDK version: 21
JMH version: 1.37
OS: macOS
Machine: Apple M2/M, 16 GB RAM, 128-bit SIMD vector width
Input size (double arrays): 1,000,000 elements
Warmup time: 1 s per primitive
Measurement: 1 iteration
JMH params: 2 forks

Note: These benchmarks were run with a 128-bit SIMD vector width, which is only 2 lanes for doubles. On production deployments with wider SIMD (e.g., 256-bit or 512-bit where available), the vectorized implementations are expected to provide equal or better speedups due to increased lane-level parallelism.
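The lane counts behind this note follow from dividing the vector width by the 64-bit size of a double. A small sketch of that arithmetic, with a hypothetical class and method name:

```java
// Hypothetical helper illustrating the lane arithmetic, not code from this PR.
final class LaneCount {
    // Number of double lanes that fit into a SIMD register of the given width.
    static int doubleLanes(int vectorBits) {
        return vectorBits / Double.SIZE; // Double.SIZE is 64
    }

    public static void main(String[] args) {
        for (int bits : new int[] {128, 256, 512}) {
            System.out.println(bits + "-bit: " + doubleLanes(bits) + " double lanes");
        }
    }
}
```

So a 128-bit machine processes 2 doubles per vector instruction, while 256-bit and 512-bit hardware would process 4 and 8.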

| Primitive Function | ns/op (JMH) | JMH Test: Speedup with Vector API | Java Test: Speedup with Vector API | Replaced |
|---|---|---|---|---|
| vectDivAdd | 231671 | 1.066 | 1.887 | Yes |
| vectDivAdd2 | 218818 | 1.066 | 1.686 | Yes |
| vectDivWrite | 359339 | 0.687 | 1.489 | No |
| vectDivWrite2 | 343183 | 0.7215 | 0.717 | No |
| vectDivWrite3 | 535898 | 0.7821 | 0.603 | No |
| rowMaxsVectMult | 298328 | 1.006 | 1.346 | Yes |
| rowMaxsVectMult_aix | 738767 | 0.115 | 0.077 | No |
| vectSum | 142065 | 0.322 | 0.565 | No |
| vectMax | 596046 | 2.002 | 1.933 | Yes |
| vectCountnnz | 297805 | 1.594 | 1.538 | Yes |
| vectEqualAdd | 427437 | 1.959 | 2.077 | Yes |
| vectEqualWrite2 | 414717 | 1.183 | 0.801 | Yes |
| vectEqualWrite | 415329 | 1.189 | 1.402 | Yes |
| vectGreaterAdd | 427981 | 1.936 | 2.114 | Yes |
| vectGreaterWrite2 | 552023 | 0.588 | 0.919 | No |
| vectGreaterWrite | 458332 | 1.309 | 0.927 | Yes |
| vectLessAdd | 531844 | 2.433 | 2.052 | Yes |
| vectLessWrite2 | 545457 | 1.011 | 0.951 | Yes |
| vectLessWrite | 414025 | 1.203 | 1.039 | Yes |
| vectLessequalAdd | 426307 | 1.960 | 2.052 | Yes |
| vectLessequalWrite2 | 540476 | 1.014 | 0.962 | Yes |
| vectLessequalWrite | 414514 | 1.181 | 0.953 | Yes |
| vectMin | 589668 | 2.000 | 1.996 | Yes |
| vectMult2Add | 228636 | 1.052 | 1.284 | Yes |
| vectMult2Write | 377074 | 2.136 | 1.375 | Yes |
| vectNotequalAdd | 424749 | 1.945 | 1.643 | Yes |
| vectNotequalWrite2 | 566433 | 0.714 | 0.821 | No |
| vectNotequalWrite | 417206 | 1.203 | 0.941 | Yes |

@github-project-automation github-project-automation bot moved this to In Progress in SystemDS PR Queue Jan 30, 2026
@JulianJuelg JulianJuelg changed the title Vector API Implementation for dense codegen primitives (Divisions, Aggregations, Comparisons, MultiplyAdd) + benchmarks [SYSTEMDS-3920] Vector API Implementation for dense codegen primitives (Divisions, Aggregations, Comparisons, MultiplyAdd) + benchmarks Jan 30, 2026