bench(tableau): branch-coalesce scaling — sort-merge vs FxHashMap A/B#156
Merged
Follow-up study for #154, which replaced the FxHashMap coalesce in the T-gate hot path (branch_with_coefficients) with a sort-merge. This bench answers the open question: does the win persist as the branch count m grows, and where does the hash coalesce win again?

#154 deleted the hash path from the default build, so the bench reimplements both coalesce routines as faithful free functions (the sort-merge keeps both the u64-packed fast path and the generic fallback; the hashmap mirrors branch_coefficients_seq), asserts them equivalent on real input at start-up, then drives both with identical coefficient vectors grown to m = 2^k on an 80-qubit u128 tableau.

Two collision regimes: doubling (fresh bit, output 2m; the per-T-gate cost) and merge (closed set, output m; the measurement case-a flavour).

Findings (medians, 80q/u128):
- doubling: sort-merge wins throughout, gap widens 1.1x (m=4) -> 3.8x (m=2^20) as the hash table outgrows L3 and goes cache-miss-bound.
- merge: the hash coalesce overtakes for m >~ 2K (up to ~1.75x at m=2^16), where dense collisions make the O(m log m) sort pure overhead.

Adds benchmarks/plot_branch_coalesce.py, which renders the scaling plot (time vs m, and sort-merge speedup with the crossover band) straight from criterion's estimates.json. Rendered PNG stays untracked per convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
david-pl approved these changes on Jun 24, 2026
…-scaling # Conflicts: # crates/ppvm-tableau/Cargo.toml
…nto bench/branch-coalesce-scaling
Summary
Follow-up study for #154, which replaced the FxHashMap coalesce in the T-gate hot path (GeneralizedTableau::branch_with_coefficients) with a sort-merge and measured ~10× on cultivation_d5. That win was found on one circuit. This bench answers the open questions head-on: does the win persist as the branch count m grows, and where does the hash coalesce win again?
Because #154 deleted the hash path from the default build (it survives only behind rayon), there is no way to A/B the two through the public gate API, so the bench reimplements both coalesce routines as faithful free functions:
- coalesce_sortmerge: verbatim port of the sequential sort-merge in branch_with_coefficients, keeping both the u64-packed fast path and the generic (I, u32) fallback.
- coalesce_hashmap: the pre-#154 FxHashMap coalesce (mirrors branch_coefficients_seq).

Both consume identical real input: a coefficient vector grown to exactly m = 2^k by k branching T gates on an 80-qubit u128 tableau, plus the genuine decomposition of the next T gate. verify_equivalence asserts the two produce the same coefficient set before any timing, so a drifted port fails loudly.
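To make the A/B concrete, here is a minimal sketch of the two coalesce strategies and the equivalence check. It is not the code in branch-coalesce-scaling.rs: the function names carry a _sketch suffix and are hypothetical, keys are plain u64 with f64 coefficients (the real key and coefficient types are assumptions here), and std's HashMap stands in for FxHashMap so the block needs no external crate.

```rust
use std::collections::HashMap;

/// Sort-merge coalesce (sketch): sort terms by key, then sum each run of
/// equal keys in a single linear pass.
fn coalesce_sortmerge_sketch(mut terms: Vec<(u64, f64)>) -> Vec<(u64, f64)> {
    terms.sort_unstable_by_key(|&(key, _)| key);
    let mut out: Vec<(u64, f64)> = Vec::with_capacity(terms.len());
    for (key, coeff) in terms {
        if let Some(last) = out.last_mut() {
            if last.0 == key {
                // Same key as the previous output entry: accumulate.
                last.1 += coeff;
                continue;
            }
        }
        out.push((key, coeff));
    }
    out
}

/// Hash coalesce (sketch): accumulate into a table via the entry API.
/// Assumption: std HashMap replaces FxHashMap to stay dependency-free.
fn coalesce_hashmap_sketch(terms: &[(u64, f64)]) -> Vec<(u64, f64)> {
    let mut table: HashMap<u64, f64> = HashMap::with_capacity(terms.len());
    for &(key, coeff) in terms {
        *table.entry(key).or_insert(0.0) += coeff;
    }
    let mut out: Vec<(u64, f64)> = table.into_iter().collect();
    // Sorted only so the two outputs can be compared; a timed hash path
    // would not need this step.
    out.sort_unstable_by_key(|&(key, _)| key);
    out
}

/// Hypothetical stand-in for the bench's verify_equivalence: returns true
/// when both routines yield the same coefficient set.
fn verify_equivalence_sketch(terms: &[(u64, f64)]) -> bool {
    coalesce_sortmerge_sketch(terms.to_vec()) == coalesce_hashmap_sketch(terms)
}
```

The exact comparison in verify_equivalence_sketch is only safe when the sums are exact in floating point (for example dyadic values such as 0.5 and 0.25); with arbitrary coefficients the two routines may add in different orders, so a tolerance-based comparison would be the safer choice.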
k branching T gates → exactly m = 2^k branches (T gates touch only the coefficient vector, never the tableau), so the swept axis is the T-gate count. 40 untruncated branching T gates would be 2^40 ≈ 10^12 branches, out of reach for any coalesce, so the honest variable is m. Two collision regimes are measured:
- doubling (fresh bit, output 2m, zero merges); the canonical per-T-gate cost.
- merge (closed set, output m, all collisions); the flavour of the measurement case-a path.
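The two regimes can be illustrated with a toy model. The sketch below assumes that one branching step maps every key k to the pair (k, k XOR mask); this is a simplification for illustration, not the tableau's actual key encoding, and all function names are hypothetical. With a mask bit that no key uses, the 2m raw terms are all distinct (doubling); with a key set that is closed under the mask, every output key is hit twice and only m survive (merge).

```rust
use std::collections::HashSet;

/// Toy branching step (assumption): each key k produces k and k ^ mask,
/// so m input keys always yield 2m raw terms before coalescing.
fn branch_keys(keys: &[u64], mask: u64) -> Vec<u64> {
    let mut out = Vec::with_capacity(2 * keys.len());
    for &k in keys {
        out.push(k);
        out.push(k ^ mask);
    }
    out
}

/// Number of distinct keys, i.e. the size of the coalesced output.
fn distinct_count(keys: &[u64]) -> usize {
    keys.iter().copied().collect::<HashSet<u64>>().len()
}

/// Doubling regime: keys 0..m use only the low bits, and the mask is the
/// fresh bit `m` (m must be a power of two). Returns (raw terms, distinct).
fn doubling_case(m: u64) -> (usize, usize) {
    let keys: Vec<u64> = (0..m).collect();
    let raw = branch_keys(&keys, m);
    (raw.len(), distinct_count(&raw))
}

/// Merge regime: keys 0..m (m an even count) are closed under XOR with 1,
/// so every branched key collides with an existing one.
/// Returns (raw terms, distinct).
fn merge_case(m: u64) -> (usize, usize) {
    let keys: Vec<u64> = (0..m).collect();
    let raw = branch_keys(&keys, 1);
    (raw.len(), distinct_count(&raw))
}
```

For m = 8, doubling_case returns (16, 16) and merge_case returns (16, 8): the coalesce input has the same length in both regimes, and only the number of distinct keys differs, which is the quantity that separates the hash table's behaviour in the two cases.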
Result
The #154 win persists and grows at scale in the doubling regime; the hash
coalesce wins back the collision-heavy regime.
speedup = t_hashmap / t_sortmerge (>1 sort-merge wins, <1 hash wins). Medians, 80q / u128 index.
Why. In doubling the 2m output keys are all distinct: the hashmap does 2m random probes into a 2m-entry table and hits a cache cliff once it outgrows L3 (8.4× slower for 4× more work between m=64K→256K), while sort-merge stays bandwidth-bound and scales linearly. This is exactly the "gap widens with scale" claim in #154, confirmed to 3.8×. In merge only m keys are distinct: the table stays half-size and hot, entry() coalesces-on-insert for free, and sort-merge's O(m log m) sort becomes pure overhead for an m-size output, so hash wins for m ≳ 2K.
Actionable follow-up
#154 also applied sort-merge to the measurement case-a coalesce in measure.rs, which is collision-heavy (projection roughly halves the set), i.e. the merge regime, where this bench shows the hash coalesce is up to ~1.75× faster at large m. Worth checking whether case-a should keep (or revert to) the hash coalesce; the harness extends naturally to model that path directly.
Reproduce
cargo bench -p ppvm-tableau --bench branch-coalesce-scaling   # PPVM_BRANCH_MAX_EXP=22 to push higher
uv run --with matplotlib python benchmarks/plot_branch_coalesce.py \
    --out benchmarks/branch_coalesce_scaling.png
Files
- crates/ppvm-tableau/benches/branch-coalesce-scaling.rs: the A/B bench.
- benchmarks/plot_branch_coalesce.py: renders the plot straight from criterion's estimates.json.
- benchmarks/README.md, Cargo.toml: doc section + bench registration.

🤖 Generated with Claude Code