bench(tableau): branch-coalesce scaling — sort-merge vs FxHashMap A/B #156

Merged: Roger-luo merged 4 commits into main from bench/branch-coalesce-scaling on Jun 24, 2026

@Roger-luo Roger-luo commented Jun 24, 2026

[Image: branch_coalesce_scaling]

Summary

Follow-up study for #154, which replaced the FxHashMap coalesce in the T-gate
hot path (GeneralizedTableau::branch_with_coefficients) with a sort-merge and
measured a ~10× speedup on cultivation_d5. That win was measured on a single
circuit. This bench answers the open question head-on:

Does the sort-merge advantage persist as the branch count m grows, and is
there a regime where the hash coalesce wins again?

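Because #154 deleted the hash path from the default build (it survives only behind
rayon), there is no way to A/B the two through the public gate API. The bench
therefore reimplements both coalesce routines as faithful free functions: the
sort-merge keeps both the u64-packed fast path and the generic fallback, and the
hashmap mirrors branch_coefficients_seq.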

Both consume identical real input: a coefficient vector grown to exactly
m = 2^k by k branching T gates on an 80-qubit u128 tableau, plus the genuine
decomposition of the next T gate. verify_equivalence asserts the two produce the
same coefficient set before any timing, so a drifted port fails loudly.
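
To make the A/B concrete, here is a minimal sketch of the two coalesce strategies and an equivalence check. It is not the code from the bench: it uses `std::collections::HashMap` instead of FxHashMap, real `f64` coefficients instead of whatever coefficient type the crate uses, and the names `expand_branches`, `coalesce_hash`, `coalesce_sort_merge`, and `check_equivalence` are hypothetical stand-ins.

```rust
use std::collections::HashMap;

/// One branch: a bit-packed index plus its coefficient.
/// (f64 is an assumption made for brevity.)
type Branch = (u128, f64);

/// Expand every branch into two: one keeps its index, the other has
/// `flip` XORed in. `a` and `b` stand in for the T-gate decomposition weights.
fn expand_branches(branches: &[Branch], flip: u128, a: f64, b: f64) -> Vec<Branch> {
    let mut out = Vec::with_capacity(branches.len() * 2);
    for &(idx, c) in branches {
        out.push((idx, c * a));
        out.push((idx ^ flip, c * b));
    }
    out
}

/// Hash coalesce: sum coefficients per index through the entry API.
fn coalesce_hash(expanded: &[Branch]) -> Vec<Branch> {
    let mut map: HashMap<u128, f64> = HashMap::with_capacity(expanded.len());
    for &(idx, c) in expanded {
        *map.entry(idx).or_insert(0.0) += c;
    }
    let mut out: Vec<Branch> = map.into_iter().collect();
    // Sorted only so the result can be compared with the sort-merge output.
    out.sort_unstable_by_key(|&(idx, _)| idx);
    out
}

/// Sort-merge coalesce: sort by index, then fold runs of equal indices.
fn coalesce_sort_merge(expanded: &[Branch]) -> Vec<Branch> {
    let mut sorted = expanded.to_vec();
    sorted.sort_unstable_by_key(|&(idx, _)| idx);
    let mut out: Vec<Branch> = Vec::with_capacity(sorted.len());
    for (idx, c) in sorted {
        let merged = match out.last_mut() {
            Some(last) if last.0 == idx => {
                last.1 += c;
                true
            }
            _ => false,
        };
        if !merged {
            out.push((idx, c));
        }
    }
    out
}

/// Check that both routines agree on indices and (within tolerance) coefficients.
fn check_equivalence(expanded: &[Branch]) -> bool {
    let h = coalesce_hash(expanded);
    let s = coalesce_sort_merge(expanded);
    h.len() == s.len()
        && h.iter()
            .zip(&s)
            .all(|(x, y)| x.0 == y.0 && (x.1 - y.1).abs() < 1e-12)
}
```

Running a check like `check_equivalence` before any timing mirrors the role the description gives to verify_equivalence: a port that drifts from the reference fails before it can produce misleading numbers.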

k branching T gates → exactly m = 2^k branches (T gates touch only the
coefficient vector, never the tableau), so the swept axis is the T-gate count.
40 untruncated branching T gates would be 2^40 ≈ 10^12 branches, out of reach for
any coalesce — so the honest variable is m. Two collision regimes are measured:

  • doubling — the next T flips a fresh index bit (output 2m, zero merges);
    the canonical per-T-gate cost.
  • merge — the next T flips a bit the set is already closed under (output m,
    all collisions); the flavour of the measurement case-a path.
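
The two regimes differ only in whether the flipped bit is new to the index set. The sketch below counts distinct output keys for each case; it assumes, purely for illustration, that k branching T gates produce the full set of 2^k indices over the low k bits. `distinct_after_flip` and `full_index_set` are hypothetical helpers, not part of the bench.

```rust
use std::collections::HashSet;

/// Number of distinct keys once the next T gate emits both `i` and `i ^ flip`
/// for every existing index `i`.
fn distinct_after_flip(indices: &[u128], flip: u128) -> usize {
    let mut keys: HashSet<u128> = HashSet::with_capacity(indices.len() * 2);
    for &i in indices {
        keys.insert(i);
        keys.insert(i ^ flip);
    }
    keys.len()
}

/// All 2^k indices over the low k bits (an illustrative stand-in for the
/// coefficient vector grown by k branching T gates).
fn full_index_set(k: u32) -> Vec<u128> {
    (0..(1u128 << k)).collect()
}
```

With m = 8 (k = 3), flipping bit 3 is the doubling regime and yields 16 distinct keys, while flipping bit 1 is the merge regime and yields 8, because the set is already closed under that flip.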

Result

The #154 win persists and grows at scale in the doubling regime; the hash
coalesce wins back the collision-heavy regime.

| m | doubling: sort-merge speedup | merge: sort-merge speedup |
|---|---|---|
| 4 | 1.15× | 1.00× |
| 256 | 1.94× | 1.18× |
| 2 048 | 1.41× | 0.82× (hash wins) |
| 16 384 | 1.49× | 0.90× |
| 65 536 | 1.63× | 0.57× (hash wins) |
| 262 144 | 3.41× | 0.65× |
| 1 048 576 | 3.83× | 0.83× |

speedup = t_hashmap / t_sortmerge (>1 sort-merge wins, <1 hash wins). Medians, 80q / u128 index.
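
The legend's ratio can be read in either direction. As a small sketch (the helper names `sort_merge_speedup` and `verdict` are hypothetical), a value below 1 inverts into the factor by which the hash coalesce wins, which is where the ~1.75× figure for the 0.57× row comes from:

```rust
/// Sort-merge speedup as defined in the legend: t_hashmap / t_sortmerge.
fn sort_merge_speedup(t_hashmap: f64, t_sortmerge: f64) -> f64 {
    t_hashmap / t_sortmerge
}

/// Name the winner and the factor by which it wins.
/// A speedup below 1 means the hash coalesce wins by 1 / speedup.
fn verdict(speedup: f64) -> (&'static str, f64) {
    if speedup >= 1.0 {
        ("sort-merge", speedup)
    } else {
        ("hash", 1.0 / speedup)
    }
}
```

For the m = 65 536 merge row, `verdict(0.57)` reports the hash coalesce ahead by roughly 1.75×.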

Why. In doubling the 2m output keys are all distinct: the hashmap does 2m
random probes into a 2m-entry table and hits a cache cliff once it outgrows L3
(8.4× slower for 4× more work between m=64K→256K), while sort-merge stays
bandwidth-bound and scales linearly — exactly the "gap widens with scale" claim in
#154, confirmed to 3.8×. In merge only m keys are distinct: the table stays
half-size and hot, entry() coalesces-on-insert for free, and sort-merge's
O(m log m) sort becomes pure overhead for an m-size output, so hash wins for
m ≳ 2K.

📈 Scaling plot (left: time vs m, log-log; right: sort-merge speedup with the
crossover line and "hash wins" band) is at benchmarks/branch_coalesce_scaling.png
— rendered locally and attached below; not checked in, per the benchmarks/
convention.

Actionable follow-up

#154 also applied sort-merge to the measurement case-a coalesce in
measure.rs, which is collision-heavy (projection roughly halves the set) — i.e.
the merge regime, where this bench shows the hash coalesce is up to ~1.75× faster
at large m. Worth checking whether case-a should keep (or revert to) the hash
coalesce; the harness extends naturally to model that path directly.

Reproduce

cargo bench -p ppvm-tableau --bench branch-coalesce-scaling   # PPVM_BRANCH_MAX_EXP=22 to push higher
uv run --with matplotlib python benchmarks/plot_branch_coalesce.py \
  --out benchmarks/branch_coalesce_scaling.png

Files

  • crates/ppvm-tableau/benches/branch-coalesce-scaling.rs — the A/B bench.
  • benchmarks/plot_branch_coalesce.py — renders the plot straight from criterion's estimates.json.
  • benchmarks/README.md, Cargo.toml — doc section + bench registration.

🤖 Generated with Claude Code

Follow-up study for #154, which replaced the FxHashMap coalesce in the
T-gate hot path (branch_with_coefficients) with a sort-merge. This bench
answers the open question: does the win persist as the branch count m
grows, and where does the hash coalesce win again?

#154 deleted the hash path from the default build, so the bench
reimplements both coalesce routines as faithful free functions (the
sort-merge keeps both the u64-packed fast path and the generic fallback;
the hashmap mirrors branch_coefficients_seq), asserts them equivalent on
real input at start-up, then drives both with identical coefficient
vectors grown to m = 2^k on an 80-qubit u128 tableau. Two collision
regimes: doubling (fresh bit, output 2m — the per-T-gate cost) and merge
(closed set, output m — the measurement case-a flavour).

Findings (medians, 80q/u128):
- doubling: sort-merge wins throughout, gap widens 1.1x (m=4) -> 3.8x
  (m=2^20) as the hash table outgrows L3 and goes cache-miss-bound.
- merge: the hash coalesce overtakes for m >~ 2K (up to ~1.75x at
  m=2^16), where dense collisions make the O(m log m) sort pure overhead.

Adds benchmarks/plot_branch_coalesce.py, which renders the scaling plot
(time vs m, and sort-merge speedup with the crossover band) straight from
criterion's estimates.json. Rendered PNG stays untracked per convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Roger-luo Roger-luo requested a review from david-pl June 24, 2026 05:00

github-actions Bot commented Jun 24, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-24 15:52 UTC

@Roger-luo Roger-luo enabled auto-merge (squash) June 24, 2026 15:45
@Roger-luo Roger-luo merged commit f036796 into main Jun 24, 2026
10 of 11 checks passed
@Roger-luo Roger-luo deleted the bench/branch-coalesce-scaling branch June 24, 2026 15:52