Conversation
@copilot the CPU parts could also benefit from parallelism? Coarse-grained for the vectorised version and fine-grained for fused? Could you implement and re-run the benchmark?
…os/atan to NDArray[Float]
Agent-Logs-Url: https://github.com/Quafadas/vecxt/sessions/bd1a5bcd-cc53-43b5-9871-31f8ad10aed7
Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>
Done in fb6b204. Both parallel variants are now implemented and wired into the benchmark:

- Coarse-grained parallel (unfused)
- Fine-grained parallel (fused)

The benchmark now shows five timings per size. Also fixed a pre-existing compile error.
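To make the two strategies concrete, here is a minimal sketch of the distinction, not vecxt's actual implementation: the object name `ParallelSketch`, the helper `parMap`, and the pipeline body are all invented for illustration. Coarse-grained means each (unfused) op makes its own parallel pass over chunks of the array; fine-grained means the whole pipeline is fused into one `Float => Float` and applied in a single parallel pass, so intermediates never touch memory.

```scala
import java.util.concurrent.{Callable, Executors}
import scala.jdk.CollectionConverters._

// Hypothetical sketch: names and pipeline are illustrative, not vecxt's API.
object ParallelSketch {
  private val pool =
    Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())

  // One parallel pass over the array: split into chunks, tight loop per chunk.
  // Coarse-grained = call this once per unfused op (allocating an output each time).
  def parMap(src: Array[Float], f: Float => Float): Array[Float] = {
    val out = new Array[Float](src.length)
    val nChunks = Runtime.getRuntime.availableProcessors()
    val step = math.max(1, (src.length + nChunks - 1) / nChunks)
    val tasks = (0 until src.length by step).map { start =>
      new Callable[Unit] {
        def call(): Unit = {
          val end = math.min(start + step, src.length)
          var i = start
          while (i < end) { out(i) = f(src(i)); i += 1 }
        }
      }
    }
    pool.invokeAll(tasks.asJava) // blocks until every chunk is done
    out
  }

  // Fine-grained = fuse the whole pipeline into one function and make a
  // single parallel pass; no intermediate arrays are allocated.
  def fusedPipeline(src: Array[Float]): Array[Float] =
    parMap(src, x => math.sin(x.toDouble).toFloat * 2f + 1f)

  def shutdown(): Unit = pool.shutdown()
}
```

The trade-off mirrors the benchmark columns: the coarse path keeps each op's inner loop simple (and SIMD-friendly) but pays for intermediate allocations; the fused path avoids them at the cost of a more opaque per-element function.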
Full write up in site/notes/gpu.md
Here are some results locally:
Completed: Run 3-column benchmark (3/3)
Fascinating results! The fused CPU interpreter is actually slower than the unfused vectorised path at larger sizes:
We're seeing three regimes:
The fused CPU interpreter wins at small sizes (1M light), where avoiding allocations matters, but at scale it's 2× slower than unfused. The reason:
`evalElement` does a recursive AST walk per element: roughly 15 virtual dispatches and call frames for every single float in the heavy pipeline. The JVM can't SIMD-vectorise a recursive tree walk, so you're paying per-element function-call overhead instead of getting the vectorised `Array[Float].exp` loops that Java's Vector API accelerates. The ideal CPU path would be to fuse the AST into a flat bytecode that a tight `while` loop interprets without recursion; then the JVM could vectorise it. But that's a much bigger project.
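The flat-bytecode idea above can be sketched in a few lines. This is a hypothetical illustration, not vecxt code: the opcodes (`LOAD`, `EXP`, `MULC`, `ADDC`) and the `FlatInterp.run` signature are invented. The point is that once the AST is compiled into a linear opcode array, the per-element work is a branch-predictable `switch` inside a tight `while` loop, with no recursion or virtual dispatch.

```scala
// Hypothetical sketch of a flat bytecode interpreter for element-wise pipelines.
object FlatInterp {
  // Invented opcodes; a real compiler would emit these from the AST.
  final val LOAD = 0 // acc = src(i)
  final val EXP  = 1 // acc = exp(acc)
  final val MULC = 2 // acc *= consts(pc)
  final val ADDC = 3 // acc += consts(pc)

  def run(ops: Array[Int], consts: Array[Float], src: Array[Float]): Array[Float] = {
    val out = new Array[Float](src.length)
    var i = 0
    while (i < src.length) {
      var acc = 0f
      var pc = 0
      while (pc < ops.length) {
        // @switch asks the compiler to emit a tableswitch, not a chain of ifs
        (ops(pc): @annotation.switch) match {
          case LOAD => acc = src(i)
          case EXP  => acc = math.exp(acc.toDouble).toFloat
          case MULC => acc *= consts(pc)
          case ADDC => acc += consts(pc)
        }
        pc += 1
      }
      out(i) = acc
      i += 1
    }
    out
  }
}
```

Whether HotSpot actually auto-vectorises the outer loop depends on the program length and opcode mix, but even without SIMD this removes the ~15 call frames per element that the recursive walk pays.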