Skip to content

make sort32 fast#327

Merged
dmarcos merged 4 commits into
sparkjsdev:mainfrom
39ali:sort32-fast
Jun 8, 2026
Merged

make sort32 fast#327
dmarcos merged 4 commits into
sparkjsdev:mainfrom
39ali:sort32-fast

Conversation

@39ali

@39ali 39ali commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

try to improve the performance of sort32, on avg it's 30-40% faster .

things that changed :

  • pass 2 no longer re-reads keys[] , scratch stores a packed u64 of (inverted_key << 32 | original_index). pass 2 reads the high 16 bits directly from scratch with kv >> 48 making it a sequential scan

  • histogram and scatter are now branchless to help llvm vectorize the loop

  • manually unrolled histogram and both scatter passes to 8-wide

Comment thread rust/spark-worker-rs/src/sort.rs Outdated
/// Two‑pass radix sort (base 2¹⁶) of 32‑bit float bit‑patterns,
/// descending order (largest keys first). Mirrors the JS `sort32Splats`.
#[inline(always)]
unsafe fn prefix_sum_exclusive(buckets: &mut [u32]) -> u32 {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason this is marked unsafe? It compiles just fine without.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i had many experiments with simd, which didn't make it marginally faster so i removed it for simplicity sake but forgot to remove the unsafe, will clean it up

@mrxz

mrxz commented Apr 29, 2026

Copy link
Copy Markdown
Collaborator

Awesome work, gave it a try and can confirm that it improves sorting performance. In my limited testing I saw ~20% reduction in sorting time (~25% faster).

manually unrolled histogram and both scatter passes to 8-wide

Without this change the performance gain seems to be roughly the same, or at least I didn't observe any significant difference. The majority of the benefit seems to come from making it branchless.

@39ali

39ali commented Apr 29, 2026

Copy link
Copy Markdown
Contributor Author

@mrxz i squeezed a bit more performance ~<=1ms by removing more branches from hot loops, and what you noticed seems about right, it will differ from one wasm engine to another, and arch to another(specially cache sizes and arch) so it's hard to give a solid number but it'll still be a pump in performance

@39ali 39ali force-pushed the sort32-fast branch 2 times, most recently from 8c1efb4 to 77253d1 Compare April 29, 2026 17:42
@dmarcos

dmarcos commented May 14, 2026

Copy link
Copy Markdown
Contributor

can you remove the changes in the dist directory?

@asundqui

Copy link
Copy Markdown
Contributor

@39ali great work! This looks like a cool win indeed, thanks for the work. Could remove the build in the dist/ folder from your branch, then we can merge it in?

@mrxz

mrxz commented May 15, 2026

Copy link
Copy Markdown
Collaborator

Could the macros used for unrolling the loops also be used for the body of remainder loops? Both should be identical, so if we could avoid the duplication we avoid the risk of it ever getting out of sync.

@dmarcos

dmarcos commented May 15, 2026

Copy link
Copy Markdown
Contributor

@39ali any chance for you to implement @mrxz suggestions? Thanks so much for the contribution

@39ali

39ali commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

I'll implement the changes

@39ali

39ali commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@mrxz @dmarcos @asundqui done !

…the essential optimizations. Removed second branchless optimization. Added comments on why `unsafe` accesses are okay.
@asundqui

asundqui commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

@39ali really great work here! I've done some benchmarking on your method, and I'm actually getting 2x - 4x speedups in sorting from this. I'm truly shocked that this was possible! This will have a great impact on Spark's sorting performance. On a 10M splat scene on my M3 it goes from 250 ms to 60 ms or so. It's possible that the speedup is not as great on other environments, such as @mrxz was reporting 25% speedups on his system.

I went through and carefully separated the optimizations and measured them:

  • I found that approx 65% of the gain could be explained by storing (key, index) as a packed u64 in scratch, which turned the next loop from a random gather into a sequential read. So much better cache performance as a result.

  • The next 20% came from doing unsafe { unchecked... } array accesses where the compiler couldn't be sure that it would always be in bounds, so it has to check it every iteration. I went through the logic and it looks like it should always be in bounds.

  • The final 15% or so came from unrolling the loops. I had thought the unrolling couldn't possibly help because of branch prediction + cpu instruction reordering but it all helps!

It did seem like there was one error though: the second branchless loop seems problematic... I think writing to the array and only advancing the pointer if it's "valid" could overwrite things. So I removed it. I don't think it does very much for the performance anyway.

Finally I reverted some unnecessary changes to make it closer to @mrxz 's original formulation. I think we should merge this in @dmarcos , @mrxz ! WDYT? This should really help with #225 .

Interestingly, because the sorting is so much faster, it sort of exposes the next bottleneck more: uploading the ordering frequently to the GPU can cause stuttering sometimes when the counts get large. Now this happens more often!

@mrxz

mrxz commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

On a 10M splat scene on my M3 it goes from 250 ms to 60 ms or so. It's possible that the speedup is not as great on other environments, such as @mrxz was reporting 25% speedups on his system.

There is bound to be some variability between setups, but that's quite the discrepancy between the measurements. My guess would be that besides the gains of being branchless and more cache friendly it somehow avoids a slow-path on the M3? Regardless, it's a net positive, even the comparatively modest speed-up I've measured is a very welcome improvement.

It did seem like there was one error though: the second branchless loop seems problematic... I think writing to the array and only advancing the pointer if it's "valid" could overwrite things. So I removed it. I don't think it does very much for the performance anyway.

The overwriting was intentional AFAICT. Since the 'invalid' entries aren't tallied you don't want to count them when scattering either. The corollary is that by only advancing after writing a valid entry, you guarantee that the final write to the array at a given position is a valid entry. The only way this goes wrong is once a bucket has been fully scattered, as then it'll point into the start of the next bucket. The inverted Infinity value, however, has 0xFFFF as its low bytes, always placing it in the final bucket which can safely overflow up to max_splats.

That said, I do think the explicit check is more readable and some quick testing doesn't show a meaningful performance difference on my end between the two.

I think we should merge this in @dmarcos , @mrxz ! WDYT? This should really help with #225 .

I don't see any reason not to merge this.

Regarding #225 it will most definitely reduce the time a stale ordering is visible on screen. Whether or not this will be enough is the question. Similar to the pesky FOUC in web-design, no matter how short it will remain an "issue". Not sure what mkkellogg did differently, but it at the very least it degenerated more gracefully. It does have logic to detect large camera changes and queues partial sorts, though even before a sort update arrives, the rendered frame somehow looks less bad (subjectively).

@dmarcos

dmarcos commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Thanks everyone. Good stuff!

@dmarcos dmarcos merged commit 6d24120 into sparkjsdev:main Jun 8, 2026
2 checks passed
seacloud9 added a commit to seacloud9/spark that referenced this pull request Jun 12, 2026
Surveys current render hot paths, captures the upstream-fork
divergence (90 ahead / 5 behind, three perf-relevant commits
missing), and proposes a 5-phase plan to instrument, measure,
test, and optimize.

Headline finding from the divergence audit: upstream commit
`6d24120 make sort32 fast (sparkjsdev#327)` is a Rust-side branchless
WASM hot-loop optimization in `rust/spark-rs/src/sort.rs`
(+119 / -40 lines) that our fork does not have. Highest-priority
cherry-pick candidate before any net-new perf work.

WHY:

Two simultaneous gaps that the multi-backend rollout surfaced:

1. **No perf measurement infrastructure exists today.** The repo
   has correctness gates (27-scene parity matrix, 36-example
   smoke loop, render-cube-depth interaction smoke) but zero
   benchmark, FPS budget, memory-growth, or regression-detection
   tests. Every claim about "this is faster / slower" is
   inferential.

2. **The fork has diverged 90 commits from upstream**
   `sparkjsdev/spark` without a structured way to compare hot
   paths. Some of those 90 commits added per-frame work
   (Babylon texture-bridge readback, dual-SparkRenderer in
   aframe mode); without telemetry we cannot tell where the
   cost lands.

This plan is fork-local. Upstream is configured fetch-only
(`upstream  push = no_push`) so a stray `git push upstream`
errors before doing anything. We may cherry-pick FROM upstream
into the fork; we never push TO upstream.

WHAT:

`docs/RENDER-PERF-PLAN.md` is a new doc covering:

- **§1 Upstream-fork relationship.** The two remotes
  (`origin` = our fork, `upstream` = canonical, fetch-only),
  the 90 / 5 divergence numbers, and the per-commit breakdown
  of the five upstream commits we're behind on:
  - `6d24120 make sort32 fast` — perf, cherry-pick.
  - `78bc65e SplatPager dataReady fix` — possibly perf, evaluate.
  - `69688c0 Infer encodeLinear from render target` — refactor.
  - `2e7d9e0` Rust warning cleanup — skip unless cosmetic
    cleanup is wanted.
  - `63c6d6a` Rust API rename
    (`set_max_sh_degree` → `clamp_sh_degree`) — semantics
    change, skip unless we want the new clamping behavior.

- **§2 Known hot paths.** Five identified, each with file +
  line citation and the cost characteristic:
  - Babylon texture-bridge per-frame GPU readback (~5 MB/frame
    on hello-world spz, scales linearly with target dims).
  - Three-pass scene traversal in `collectThreeSparkScene`
    (`traverse + traverseVisible + traverseVisible` every
    accumulator update).
  - Splat-sort timing measured but commented out at
    SparkRenderer.ts:1100.
  - `lastTraverseTime` / `lastLodRaycastTime` stored as state
    but never exposed through an API.
  - Native-mode Babylon's manual `onBeforeRender` invocation
    in `SparkBabylonMesh.syncOnce` (required for correctness,
    extra per-frame work that texture-mode avoids).

- **§3 Known fork-local source modifications.** The two
  uncommitted-modification files (`SplatAccumulator.ts` +
  `utils.ts`) and what the multi-backend rollout did to them.

- **§4 Phased plan** — five phases, each with a time estimate:
  - **Phase 0 (DONE):** upstream wired, divergence measured.
  - **Phase 0.5 (≤ 1h):** cherry-pick `6d24120` +
    `78bc65e` + `69688c0` into a local branch, rebuild WASM,
    run parity matrix, merge if green. Free perf win.
  - **Phase 1 (≤ 2h):** instrument the existing timers
    through a single `SparkRenderer.perfMetrics` getter
    (`lastSortMs`, `lastTraverseMs`, `lastLodRaycastMs`,
    `lastAccumulateMs`, `lastReadbackMs`, `lastFrameMs`).
    Zero overhead when unread.
  - **Phase 2 (≤ 4h):** add a new `test/perf/` directory with:
    - `render-fps.spec.ts` — Playwright-driven page that runs
      600 frames per fixture + backend pair, records
      `perfMetrics`, asserts a budget (Three / native Babylon
      ≤ 16.67 ms p50, Babylon texture ≤ 18 ms p50, A-Frame
      same as Three).
    - `memory-growth.spec.ts` — heap-size delta from frame
      60 to 600.
    - `regression-baseline.json` — committed last-green
      numbers; CI re-runs assert within ±15% time / ±20%
      memory.
    - `pnpm run test:perf` script wrapping
      `playwright test test/perf/`.
  - **Phase 3 (post-instrumentation):** targeted optimizations
    in priority order, each gated on a Phase 2 measurement:
    1. Shared GL context for Babylon texture mode (eliminates
       readback; texture-bridge file line 95 already specs it).
    2. Fold three-pass scene traversal into one pass.
    3. Reuse scratch buffers across resize boundaries in
       the texture bridge.
    4. Profile `SplatAccumulator.update()` body.
    5. Worker reuse + transfer-only ArrayBuffer messaging for
       sort.
  - **Phase 4:** CI gating — wire `test:perf` into
    `.github/workflows/` (or PR-label-gated until baselines
    stabilize).

- **§5 Out of scope** (for now): WebGPU, worker pool, custom
  shader optimization, mobile tuning.

VERIFICATION:

- Doc-only commit; no code changes.
- `git remote -v` confirms upstream is fetch-only:
    `upstream  https://github.com/sparkjsdev/spark.git  (fetch)`
    `upstream  no_push                                   (push)`

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
seacloud9 pushed a commit to seacloud9/spark that referenced this pull request Jun 12, 2026
* make sort32 faster

* remove more branches in wasm hot loops

* cleanup

* Minimize changes on branch to original `sort.rs` while retaining all the essential optimizations. Removed second branchless optimization. Added comments on why `unsafe` accesses are okay.

---------

Co-authored-by: Andreas Sundquist <asundqui@worldlabs.ai>
seacloud9 added a commit to seacloud9/spark that referenced this pull request Jun 12, 2026
…erf section

Second targeted-interaction smoke following the
render-cube-depth / raycasting template. Gates the pointer →
raycast → uniform-update pipeline on the interactive-ripples
example across Three / A-Frame / Babylon. Also documents the
two upstream perf commits absorbed in this session
(`b4804c8 make sort32 fast`, `2f51875 Set dataReady false for
textures arrays in SplatPager`) plus the aframe
dual-SparkRenderer fix from `d713d90` in CHANGELOG.

WHY:

The interaction-smoke template established by `a3db749`
(render-cube-depth) and `d713d90` (raycasting) pays off here.
interactive-ripples is the next-most-fundamental Tier 7 example
— it exercises the SAME pipeline (env.canvas pointer event →
raycaster.setFromCamera → intersectObject → uniform write)
on a SINGLE static SplatMesh (valley.spz), so any cross-engine
divergence shows up immediately. The valley fills most of the
viewport at the default camera, which makes hits deterministic
unlike raycasting's orbital robots.

Without the CHANGELOG perf entry, the two cherry-picks plus the
dual-spark fix would not appear in the next release notes.

WHAT:

`examples/interactive-ripples/main.js`:
- Body dataset flags:
  - `data-ripple-ready="true"` after `valley.initialized`
    resolves.
  - `data-ripple-clicks=N` on every `pointerdown` event.
  - `data-ripple-hits=N` on every successful SplatMesh hit.
  - `data-ripple-last-hitpoint="x,y,z"` (3-decimal precision)
    captures the local-space hit point fed into
    `hitpointUniform.value`.
- The pre-existing behavior (raycast → localPoint →
  hitpointUniform.value.copy → timeCounter = 0 → shader runs
  the shockwave) is unchanged. Counters bump alongside.

`test/e2e/multibackend-smoke.spec.ts`:
- New per-engine test
  `interactive-ripples click delivers ripples on
  engine=<engine>`.
- Test flow mirrors the raycasting template:
  1. goto + wait for engine-switcher.
  2. Wait for `data-ripple-ready="true"`.
  3. Click 3 times around viewport center (the valley fills
     most of the screen — 3 clicks reliably hit on every
     engine).
  4. Hard-assert `data-ripple-clicks="3"`.
  5. Soft-observe `data-ripple-hits` + `data-ripple-last-hitpoint`
     via console.log.
- Test budget: 360s (same as raycasting, leaves headroom for
  aframe even though it should be fast now that the dual-spark
  fix has shipped).
- Same Vite HMR WebSocket reconnect filter as the raycasting
  test.

`CHANGELOG.md`:
- NEW "### Performance" subsection under `## Unreleased`,
  covering:
  - `make sort32 fast (sparkjsdev#327)` cherry-picked from
    `sparkjsdev/spark` upstream (commit `6d24120`, author Ali
    Milhim, co-authored by Andreas Sundquist).
  - `Set dataReady false for textures arrays in SplatPager
    (sparkjsdev#358)` cherry-picked from upstream (commit `78bc65e`,
    author Noeri Huisman).
  - The aframe dual-SparkRenderer fix in
    `examples/js/spark-engine.js setupAframeBackend`.
  - `docs/RENDER-PERF-PLAN.md` and the fetch-only upstream
    remote configuration (`upstream push = no_push`).

VERIFICATION:

- `pnpm exec playwright test test/e2e/snapshot.spec.ts
  --grep "scene: helloWorld"` — 8/8 PASS in 6.3m after the
  sort32-fast + SplatPager dataReady cherry-picks + WASM
  rebuild. All four backend pairs (three / aframe / babylon /
  babylon-native) bit-perfect at 0 / 786432 pixels differ. The
  aframe capture wall time dropped from ~2m to 1.1m post
  dual-spark fix — separately measurable perf win that
  validates the fix beyond just "test no longer hangs".
- `pnpm exec playwright test test/e2e/multibackend-smoke.spec.ts
  --grep "interactive-ripples click delivers ripples"` —
  3-engine smoke running; result captured in a follow-up if
  the test surfaces any per-engine divergence.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants