perf(bb/msm): Jacobian (S, W) tree bucket reduction#23526
Draft
AztecBot wants to merge 7 commits into
Draft
Conversation
bucket_result stores (0,0) for empty buckets; the JBR formulas were feeding those non-points into mmadd/madd/add-2007-bl and propagating garbage through the tree. Adds per-node `meta = is_present | unitp<<1`: empty leaves fall through case (0,0); case (1,0)/(0,1) at round 0 lift the single bucket to Jacobian with unitp = 1/2. Round r >= 1 case (0,1) needed extra care — when the right child is a single-bucket subtree with R.unitp == h (= 2^r), `h*R.S` and `R.W` are both the Jacobian form of the SAME 2^r·B[k] and the standard jacAdd hits the doubling case. Detect via meta and shortcut to jacDouble(R.W); the standard formula remains safe everywhere else (multi-bucket R mixes distinct generators; unit R with R.unitp != h gives distinct group elements). Host-side hostWindowCombine now uses jacAddSafe / jacDoubleSafe + a JAC_INF sentinel so all-empty windows propagate as point-at-infinity through the Horner. `jbr_reference.test.mjs` extended with empty-bucket cases (top-empty, first-empty, sparse-top, sparse-mid, alternating, all-empty). sparse-mid exercises the unitp==h collision chain that the BrowserStack run originally exposed.
#2 from the audit: manually inline jac_add / jac_double in the AA->J and JJ->J shaders, break the case-(1,1) hot path into scoped stages, and defer loads to first use. WGSL doesn't guarantee that function calls with array<u32,8> by-value parameters inline cleanly; the previous shader passed 6 such arrays per jac_add call and three jacAdd calls per merge, easily exceeding the per-thread vector-register budget on Adreno (and squeezing it on M2). The JJ hot path now has four scoped stages — S, doublings, W_tmp, W_new — with sl*/sr*/wl*/wr* loaded inside the stage that consumes them. Stage outputs that bridge stages (dx/dy/dz, wtx/wty/wtz) are outer-scope vars rewritten in place; stage-internal intermediates fall out of the live set at the closing brace. #1 from the audit: pickReduceWg is now a flat 128 regardless of c. The old c-tiered table (32/64/128) was tuned for the batch-affine kernel where workgroup size capped at 2^(c-1); the new flat-tree dispatch doesn't have that constraint, and 128-thread WGs occupy a core fully without leaving simdgroups idle on late, sparse rounds. Reference test (jbr_reference.test.mjs) unchanged and still passes. GPU correctness was last verified on Apple M2; an S25 (Adreno 750) bench follows. Also brings the autorun=msm-gpu-bench page mode in dev/msm-webgpu so the next sweep can be driven from a single BrowserStack URL.
…iler limit The previous commit inlined the Jacobian formulas straight into the AA and JJ merge kernels. On the Apple M2 this compiles fine; on a Samsung Galaxy S25 (Adreno 750 / Snapdragon 8 Elite) the Vulkan driver returns VK_ERROR_UNKNOWN from CreateComputePipelines — the post-mustache shader exceeds Adreno's per-kernel size/code-cache budget. Reverting both shaders to the function-call form; pickReduceWg flat 128 stays. S25 with the functional form runs the bench end-to-end at logN=14, c=8: wall_min 21.4 ms / wall_mean 29.3 ms (thermal throttling visible across the 20-sample run). redLevel = 2.08 ms (min), redInit = 0.07 ms (min). Will attack register pressure via the field-element representation instead — vec4<u32>×2 in place of array<u32,8> doesn't grow the shader text, and it gives the compiler 2 vector-register slots per field instead of 8 scalar slots.
…loads
Two pieces aimed at reducing per-thread register pressure and the
per-round kernel-dispatch overhead:
1. jbr_window_coop kernel — for c ≤ 8 the per-window (S, W) tree fits
N/2 ≤ 64 nodes × 196 B per node = 12.5 KiB inside a workgroup's
threadgroup memory (under the 16 KiB WebGPU spec minimum). One
workgroup owns one window: thread tid does the AA→J pair-merge,
workgroupBarrier, then each subsequent JJ→J sub-round halves the
active thread count with another barrier in between. Replaces the
~7 individual dispatches the c=8 redLevel previously needed with a
single dispatch (workgroup_size = N/2, dispatch_count = NW).
MsmV2.create gates on c ≤ 8; larger c stays on the multi-dispatch
JJ path, which would breach the 16 KiB TG budget.
2. JJ shader — defer wl* / wr* loads to the scoped block that
consumes them. With every load up-front the wl* and wr* fields
stayed live throughout the S-stage jac_add, inflating the
per-thread peak live-set. Loading inside `{}` so the compiler can
drop the temporaries at scope close. (Inlining failed Adreno's
shader compiler last attempt — staying functional here.)
Both changes preserve the existing case-split for empty buckets and
single-bucket subtree positions; the jbr_reference.test.mjs cases
(sparse-top/mid, alternating, all-empty) still pass.
Apple M2 (chrome 148) noble check at logN=14 still matches.
Samsung S25 (Adreno 750) bench pending.
S25 spotted: 'active' is reserved in current WGSL. Trivial rename of the loop counter in jbr_window_coop.
S25 bench (logN=14, c=8) ran the WG-coop kernel and hit wall_min 293 ms (redInit alone 257 ms) vs 21.5 ms baseline. The combined single-dispatch shader with 6 sub-rounds + 6 case-splits compiles cleanly but Adreno's register allocator clearly spills almost everything to global memory. Keep the shader + plumbing; just gate the host pipeline compilation off (). The multi-dispatch JJ path remains the active code path for every c. The deferred wl* / wr* loads from the previous commit benefit that path directly and weren't being exercised because c=8 was taking the WG-coop branch. EOF
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The previous bucket-reduction stage runs one workgroup per window with
batch-affine Montgomery-trick adds. At small
c(e.g.c=8) eachwindow has only
2^(c-1)=128buckets to drain and the workgroup ismostly idle, while batch-affine forces each thread to gather
S=8points before any work happens.
This PR replaces that stage with a fully parallel Jacobian tree
reduction that treats all
NW * 2^(c-1)weighted buckets as one blockof work and dispatches every merge in parallel. Each thread does a
small constant-time merge with no batched inversion and no
S-point gather, so GPU saturation no longer depends on per-windowfan-out.
Algorithm
For each window
w, each tree node summarises a contiguous range ofbuckets as a pair
(S, W):S = Σ B[k]over the range,W = Σ (pos · B[k])withpos = 1..hrelative to the range start.Merging two adjacent (S_L, W_L) and (S_R, W_R) of size
heach intoone (S, W) of size
2h:At the root,
WisL_w = Σ k · B_w[k]fork = 1..N. Per window thetotal is ≈
5N/2adds + ≈Ndoublings — fewer rounds than the old4-phase reduction and with full inter-window parallelism throughout.
Round 0 (
AA -> J, leaf) and rounds1..c-2(JJ -> J) areimplemented as separate shaders, because the first level can use the
cheaper mmadd/madd formulas (Z = 1 both inputs) and skip Jacobian
loads entirely. As required by the design, neither shader checks the
x1 == x2case — SRS bases are randomly-independent generators andcoordinate collisions in the reduction are vanishingly improbable.
Layout
(S, W)pairs live in a 6-plane SoA buffer (S.X, S.Y, S.Z, W.X, W.Y, W.Z).dedicated
jbrFinalBufwhose plane stride equalsNUM_WINDOWS, sothe per-window L_w gather is a contiguous 3-plane copy into
redStaging.NWJacobian points and Horner-combines them inJacobian (one inversion at the very end). The C++ bridge consumer
still sees per-window affine sums via
windowSums(one inversionper window) for
combineOnHost = false.Files
wgsl/cuzk/jbr_aa_to_jj.template.wgsl— round 0 (AA -> J leaf merge)wgsl/cuzk/jbr_jj_to_jj.template.wgsl— rounds 1..c-2 (JJ -> J merge)cuzk/shader_manager.ts— two newgen_jbr_*shader buildersmsm_v2.ts— replacesreduceInit+reduceLevel*dispatch withthe AA->J + (c-2)×JJ->J schedule, rewires the L_w gather to the
Jacobian buffer, and converts
hostWindowCombineto take Jacobianinputs (the existing Horner already ran in Jacobian internally).
jbr_reference.test.mjs— standalone bigint reference that verifiesthe (S, W) merge formula matches
Σ k · B[k]forc = 2..10.dev/msm-webgpu/main.ts+scripts/run-browserstack.mjs— adds the?autorun=msm-gpu-noblemode (and an--autorunflag on thedriver) so a WebGPU-only correctness check against noble can run on
BrowserStack without paying for the bb.js WASM bootstrap.
Test plan
node src/msm_webgpu/jbr_reference.test.mjspasses for c=2..10(bigint reference matches a naive
Σ k · B[k]MSM).?autorun=msm-gpu-noble&logn=14on aBrowserStack macOS Sequoia / Chrome target (M2 Mini).
?autorun=msm-cross-checkonce a slot frees up to measurec=8andc=13against the existing batch-affine reduction.Created by claudebox · group:
slackbot