Skip to content

feat(cuda): keep LDE and Merkle trees resident on the GPU#748

Open
ColoCarletti wants to merge 24 commits into
mainfrom
gpu_integration
Open

feat(cuda): keep LDE and Merkle trees resident on the GPU#748
ColoCarletti wants to merge 24 commits into
mainfrom
gpu_integration

Conversation

@ColoCarletti

Copy link
Copy Markdown
Collaborator

Summary

Keeps the STARK prover's LDE data and Merkle trees resident on the GPU across proving rounds instead of copying each full tree back to host. The prover now copies only the 32 byte root to host and gathers Merkle opening paths directly on device. Output is byte identical to the CPU path; on an ethrex 5 tx block this is about 7.6% faster end to end (17.30s -> 15.98s on an RTX 5090).

Why

The old GPU path built each Merkle tree on device, then copied the whole node array back to host, rebuilt a host side tree, and gathered proofs on CPU. For one prove that is about 1.11 GiB of device to host copies plus the host side tree work. Keeping the tree resident removes all of that. The win scales with trace size (the copy is proportional to tree size).

What changed

GPU residency

  • LDE columns and Merkle trees stay on device from round 1 to round 4, held by a per table GpuTableSession (main trace, aux trace, composition parts, plus the bound stream).
  • New GpuMerkleTree { nodes: Arc<CudaSlice>, leaves_len, root } holds the resident tree; only the root is copied to host.
  • from_root builds a root only host tree so the host proof object commitment without the full node array.

Device side proof gathering

  • merkle_gather_paths CUDA kernel plus gather_proofs_dev gather paths on device. A parity test asserts they match host get_proof_by_pos byte for byte.
  • try_fri_query_phase_gpu runs the FRI query openings against resident layer trees; composition parts fold on device and the composition tree to round 4.

Scheduling

  • VRAM budgeted chunking (estimate_table_vram_bytes, `plan_table_ bounds co resident tables. With no budget (non cuda, or budget not binding) it falls back to the previous fixed size scheme, so the CPU path

Rebased onto main's commitment rework

  • Adapts to row major LDE and row pair Merkle leaves (ROWS_PER_LEAF = 2): one proof per query authenticates both rows, no proof_sym.

Instrumentation

  • Optional wall clock span timeline behind the instruments feature. Zero overhead and no behavior change when the feature is off.

Correctness and validation

  • Byte identical proofs on GPU and CPU paths (the only difference Merkle proof comes from, and that source is parity tested).
  • End to end GPU prove and verify pass (cuda_path_integration), row pair commitment verification test.
  • All math-cuda kernel parity tests and the 137 stark tests pass.
  • Measured on an RTX 5090: ethrex 5 tx prove about 7.6% faster, about 1.11 GiB of tree device to host copies removed per prove.

Notes

  • Non cuda builds are behaviorally unchanged.
  • A transient GPU error during opening aborts the proof rather than silently falling back, since the host tree is root only.

@ColoCarletti

Copy link
Copy Markdown
Collaborator Author

/ai-review
/bench

@github-actions

Copy link
Copy Markdown

Codex Code Review

No issues found in the PR diff.

I reviewed the changed CUDA Merkle residency/proof gathering, FRI query path, GPU session handoff, VRAM chunking, and instrumentation changes statically. I did not build or run tests per the review constraints.

Comment thread crypto/math-cuda/src/device.rs
Comment thread crypto/stark/src/gpu_lde.rs
@claude

claude Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review: keep LDE and Merkle trees resident on the GPU

Reviewed for safety, correctness, performance, and readability. This is a solid, well-structured change — the residency model, root-only host trees, and device-side path gathering are coherent and the parity tests are the right gate. I verified the correctness-sensitive parts directly:

  • FRI query gather (try_fri_query_phase_gpu): opened position (iota>>l)>>1, symmetric eval evaluation[index^1], and per-layer auth-path ordering all match host query_phase exactly. Empty/CPU-commit layers correctly return None and fall back to the host walk.
  • Trace/composition gather: device gather position index matches host get_proof_by_pos(index) (row-pair leaves, one proof authenticates both rows), and open_*_with_proof builds evaluations identically to the host versions.
  • merkle_gather_paths kernel: sibling/parent index math mirrors CPU build_merkle_path; max sibling stays within 2*leaves_len-1, and the positions < leaves_len assert prevents OOB device reads. The u64 reinterpret is 32-byte aligned. Parity-tested byte-for-byte.
  • Stream safety: every resident tree is built on a stream that is synchronized before the handle is stored, so later gathers on a different (bound or pool) stream are race-free.
  • Scheduling: plan_table_chunks correctly admits at least one (oversized→solo) table per chunk and falls back to fixed-size k when budgeting is disabled; main/peak chunk plans each produce-and-consume their own results in order.

Two non-blocking notes left inline:

  1. Medium — default mempool release threshold is u64::MAX (retains all freed device blocks indefinitely); on a long-lived multi-prove process with varying trace sizes this can grow VRAM via fragmentation. The 80% budget bounds concurrent working set, not retained-pool growth. Mitigated by LAMBDA_VM_MEMPOOL_RELEASE_MB.
  2. Low — R4 opening / FRI query .expect() on transient GPU errors panics instead of returning ProvingError. Documented and intentional (no host fallback once keep path runs).

The instruments-gated timeline is zero-overhead when the feature is off and doesn't affect proof output. No correctness or memory-safety issues found in the hot paths.

🤖 Generated with Claude Code

@github-actions

Copy link
Copy Markdown

AI Review

PR #748 · 19 changed files

Findings

Status Sev Location Finding Found by
candidate medium crypto/crypto/src/merkle_tree/merkle.rs:260 Debug-only assertion on root-only MerkleTree in get_proof_by_pos leaks to release builds minimax
minimax/MiniMax-M3
candidate low crypto/math-cuda/src/merkle.rs:381 Stale doc-comment orphaned on build_comp_poly_tree_nodes_dev after removing build_comp_poly_tree_from_evals_ext3 minimax
minimax/MiniMax-M3
candidate low crypto/stark/src/gpu_lde.rs:1449 positions: &[usize] → &[u32] truncation in gather_proofs_dev is silent minimax
minimax/MiniMax-M3
candidate low crypto/stark/src/gpu_lde.rs:1702 try_fri_query_phase_gpu only checks the first layer's gpu_tree before falling back minimax
minimax/MiniMax-M3
candidate low crypto/stark/src/gpu_lde.rs:1705 GPU FRI gather runs on a pool stream different from the one that built the tree minimax
minimax/MiniMax-M3
candidate low crypto/stark/src/instruments.rs:10 Instrument span tree mixes wall-clock and per-thread CPU time without documentation in the public API minimax
minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: Debug-only assertion on root-only MerkleTree in get_proof_by_pos leaks to release builds
  • Status: candidate
  • Severity: medium
  • Location: crypto/crypto/src/merkle_tree/merkle.rs:260
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

The new from_root constructor returns a MerkleTree with empty nodes; calling get_proof_by_pos on it in release builds silently returns Some(Proof{ merkle_path: vec![] }) because the new debug_assert!(!self.nodes.is_empty(), …) is compiled out. If any code path falls through to the host tree while gpu_composition_tree/gpu_main.tree/gpu_aux.tree was already moved/dropped, the proof bundle ships with empty merkle_paths and verification will fail (or, worse, succeed silently if the verifier accepts short proofs).

Evidence

Lines 256-269 of merkle.rs show the new debug_assert! is the only guard against misuse; in release this assertion is a no-op, and build_merkle_path walks while pos != ROOT which terminates immediately when node_count()==0 (the tree_depth = (0+1).ilog2() == 0 and the loop runs zero times), so an empty merkle_path is returned and wrapped in Some. The doc-comment on from_root even states "get_proof_by_pos must NOT be called on it", but release builds do not enforce that.

Suggested fix

Change debug_assert!(!self.nodes.is_empty(), ...) to assert!(!self.nodes.is_empty(), "get_proof_by_pos called on a root-only MerkleTree") so release builds also panic, OR have get_proof_by_pos return None on root-only trees and update callers (open_composition_poly, open_polys_with) to surface that as a hard error instead of unwrap(). Either fix prevents a silent empty-path proof.

AI-002: Stale doc-comment orphaned on build_comp_poly_tree_nodes_dev after removing build_comp_poly_tree_from_evals_ext3
  • Status: candidate
  • Severity: low
  • Location: crypto/math-cuda/src/merkle.rs:381
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

The block-comment describing build_comp_poly_tree_from_evals_ext3 (the now-removed function returning Vec&lt;u8&gt;) is attached to build_comp_poly_tree_nodes_dev (the new internal helper). The two functions have different signatures and returns — readers expect the leading comment to describe the function it precedes.

Evidence

Lines 381-390 of merkle.rs read /// Row-pair Keccak leaf + Merkle tree build for R2 composition-polynomial commit. … /// Build the composition Merkle tree on device. Leaves hash row pairs, so … — the first paragraph is the old pub doc, the second is the new fn doc, both stuck on fn build_comp_poly_tree_nodes_dev(...). The function returns (CudaSlice&lt;u8&gt;, usize, Arc&lt;CudaStream&gt;); the orphaned comment claims it returns (2*(lde_size/2) - 1) * 32 bytes of tree nodes.

Suggested fix

Replace the orphaned block with a single doc-comment that describes build_comp_poly_tree_nodes_dev and its (nodes_dev, num_leaves, stream) return shape. The orphan adds reader confusion in an already dense file.

AI-003: positions: &[usize] → &[u32] truncation in gather_proofs_dev is silent
  • Status: candidate
  • Severity: low
  • Location: crypto/stark/src/gpu_lde.rs:1449
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

gather_proofs_dev casts positions: &amp;[usize] to Vec&lt;u32&gt; via as u32 truncation. For an LDE size > 2^32 (currently impossible because of the assert_u32_domain check on lde_size), iotas would silently wrap and gather proofs for wrong leaves. The cast is silent; no debug_assert catches the truncation in release.

Evidence

Line 1449: let positions_u32: Vec&lt;u32&gt; = positions.iter().map(|&amp;p| p as u32).collect();. The kernel reads uint32_t num_queries and positions[q] as u32. The upstream assert_u32_domain makes the LDE fit in u32 today, but the cast hides the dependency — a future relaxation of assert_u32_domain (or a different caller path) would corrupt proofs silently.

Suggested fix

Add debug_assert!(positions.iter().all(|&amp;p| p &lt;= u32::MAX as usize), "gather_proofs_dev: position exceeds u32 range") so any future regression is loud in tests.

AI-004: try_fri_query_phase_gpu only checks the first layer's gpu_tree before falling back
  • Status: candidate
  • Severity: low
  • Location: crypto/stark/src/gpu_lde.rs:1702
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

The dispatch only inspects fri_layers[0].gpu_tree.is_none() to decide whether the layers are device-resident. While the current producers are all-or-nothing, the check is brittle: any future code path that builds some layers on CPU and others on GPU (e.g. a partial-failure mode) would silently route the whole query phase through the CPU walk — and the CPU walk would then call get_proof_by_pos on from_root placeholders, producing empty proofs.

Evidence

Line 1702: if fri_layers[0].gpu_tree.is_none() { return None; }. The body then unconditionally indexes layer.gpu_tree.as_ref().expect(...) (line 1714-1716) for every other layer. No check verifies consistency across layers, and the comment "FRI layers are device-resident as a group" documents the assumption rather than enforcing it.

Suggested fix

Validate every layer has gpu_tree.is_some() before taking the GPU path (or make the producer invariant explicit and add an assert!(layers.iter().all(|l| l.gpu_tree.is_some() || l.gpu_tree.is_none())) invariant). At minimum document that mixed residency is unsupported.

AI-005: GPU FRI gather runs on a pool stream different from the one that built the tree
  • Status: candidate
  • Severity: low
  • Location: crypto/stark/src/gpu_lde.rs:1705
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

try_fri_query_phase_gpu allocates a new pool stream (backend().next_stream()) for the FRI layer gather, but the FRI tree's nodes were allocated and written on FriCommitState.stream (a different pool stream chosen during FRI commit). While fold_and_commit_layer synchronizes that stream before returning, the gather kernel then reads from a cuMemAllocAsync allocation across two unrelated streams without explicit cross-stream synchronization.

Evidence

FriCommitState::new (fri.rs:72) calls be.next_stream() to pick the FRI stream; fold_and_commit_layer allocates nodes_dev on self.stream and synchronizes before returning (fri.rs:206). try_fri_query_phase_gpu (gpu_lde.rs:1705-1707) calls backend().next_stream() to pick an unrelated pool stream and dispatches the gather there. CUDA's default memory pool is globally visible, so the data is readable; however, the launch itself races the prior stream's destructor path and bypasses any per-stream dependency tracking. This is correct under CUDA's stream-ordered allocator but is fragile and easy to break.

Suggested fix

Either plumb the FRI stream through FriLayer so the gather runs on the same stream that wrote the tree, or call bound_stream() style on the layer to reuse the table's session stream. At minimum document the cross-stream assumption so future edits don't drop the only remaining safety net (the synchronize() at fri.rs:206).

AI-006: Instrument span tree mixes wall-clock and per-thread CPU time without documentation in the public API
  • Status: candidate
  • Severity: low
  • Location: crypto/stark/src/instruments.rs:10
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

The module comment at line 10 asserts that "Spans open and close on the main thread at phase boundaries. They do not overlap and sum to their parent", but the code path in prove_rounds_2_to_4 opens many spans from rayon worker threads via let __sp = stark::instruments::span("…"). Multiple worker threads can open the same logical label simultaneously, and their Instant::now() reads cross thread boundaries, breaking the "wall-clock tree sums to parent" invariant the comment promises.

Evidence

prover.rs:2556, 2577, 2590, 2613, 2631, 2649, … and trace_builder.rs:2778, 2861, … call instruments::span(...) inside par_iter, par_map_collect, and chunk_and_generate closures which run on worker threads. The span guard stores wall-clock elapsed in TIMELINE, but spans from different threads may overlap (they don't nest on a single thread). The SPAN_DEPTH thread-local is documented as tracking depth, but two worker threads at "r1_main_commit" both see the same depth and the total wall is the union, not a sum-to-parent. The comment promises a true latency breakdown that the implementation does not provide.

Suggested fix

Either document that parallel-region spans are an approximation, or restrict span(...) to the main thread (assert inside SpanGuard::new) and replace the in-parallel calls with no-ops or single-thread timing around the rayon scope.

Reviewer Lanes

Lane Model Prompt Status Findings
glm openrouter/z-ai/glm-5.2 general error: opencode failed (provider/auth/runtime error) and no findings were submitted 0
kimi openrouter/moonshotai/kimi-k2.7-code general error: opencode failed (provider/auth/runtime error) and no findings were submitted 0
minimax minimax/MiniMax-M3 general success 6
moonmath zro/minimax-m3 general error: agentic lane timed out after 1800s 0
nemotron openrouter/nvidia/nemotron-3-ultra-550b-a55b general error: opencode failed (provider/auth/runtime error) and no findings were submitted 0

Verification Lanes

Lane Model Status Confirmed Rejected Uncertain
deepseek-verifier openrouter/deepseek/deepseek-v4-pro error: opencode failed (provider/auth/runtime error) and no verifications were submitted 0 0 0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu 6

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown

GPU Benchmark (ABBA) — b76600e4e7 vs main (14 pairs)

RTX 5090 · Vast.ai datacenter @ $0.8680555555555556/hr · prover/cuda · drift-free A/B/B/A

=== ABBA paired result  (improvement: + = PR faster) ===
  pairs: 14   mean A (PR): 22.473s   mean B (base): 24.315s

  [parametric] paired-t   mean +7.56%   sd 1.25%   se 0.34%
               95% CI: [+6.83%, +8.28%]   (t df=13 = 2.16)
  [robust]     median +7.30%   Wilcoxon W+=105 W-=0  p(exact)=0.0001  (z=+3.26)

  --- server stability (this run; compare across servers) ---
  run-to-run jitter:    A CV 1.04%   B CV 1.54%        (lower = steadier)
  within-session drift: -1.09% over the run, 1st->2nd half -0.57%
    (jitter -> Tier-1 cached gate floor; drift -> whether the cached baseline can be trusted)

  VERDICT: REAL IMPROVEMENT - PR faster by ~7.56% (t-CI and Wilcoxon agree)

  raw pairs: /tmp/abba_run/pairs.csv

+ = PR faster. Trust the verdict when paired-t and Wilcoxon agree.

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu 20

@yetanotherco yetanotherco deleted a comment from ColoCarletti Jun 30, 2026
@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu

@MauroToscano

Copy link
Copy Markdown
Contributor

It seems to be faster, I'm going to re schedule the GPU machine to see if it's consistent across machines, since we are still experiment with that
image

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu

2 similar comments
@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu

@MauroToscano

Copy link
Copy Markdown
Contributor

Still showing speedup in another machine, will do one last try to confirm the benchmarking is stable now

image

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu

1 similar comment
@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants