feat(cuda): keep LDE and Merkle trees resident on the GPU by ColoCarletti · Pull Request #748 · yetanotherco/lambda_vm

ColoCarletti · 2026-06-30T16:31:52Z

Summary

Keeps the STARK prover's LDE data and Merkle trees resident on the GPU across proving rounds instead of copying each full tree back to host. The prover now copies only the 32 byte root to host and gathers Merkle opening paths directly on device. Output is byte identical to the CPU path; on an ethrex 5 tx block this is about 7.6% faster end to end (17.30s -> 15.98s on an RTX 5090).

Why

The old GPU path built each Merkle tree on device, then copied the whole node array back to host, rebuilt a host side tree, and gathered proofs on CPU. For one prove that is about 1.11 GiB of device to host copies plus the host side tree work. Keeping the tree resident removes all of that. The win scales with trace size (the copy is proportional to tree size).

What changed

GPU residency

LDE columns and Merkle trees stay on device from round 1 to round 4, held by a per table GpuTableSession (main trace, aux trace, composition parts, plus the bound stream).
New GpuMerkleTree { nodes: Arc<CudaSlice>, leaves_len, root } holds the resident tree; only the root is copied to host.
from_root builds a root only host tree so the host proof object commitment without the full node array.

Device side proof gathering

merkle_gather_paths CUDA kernel plus gather_proofs_dev gather paths on device. A parity test asserts they match host get_proof_by_pos byte for byte.
try_fri_query_phase_gpu runs the FRI query openings against resident layer trees; composition parts fold on device and the composition tree to round 4.

Scheduling

VRAM budgeted chunking (estimate_table_vram_bytes, `plan_table_ bounds co resident tables. With no budget (non cuda, or budget not binding) it falls back to the previous fixed size scheme, so the CPU path

Rebased onto main's commitment rework

Adapts to row major LDE and row pair Merkle leaves (ROWS_PER_LEAF = 2): one proof per query authenticates both rows, no proof_sym.

Instrumentation

Optional wall clock span timeline behind the instruments feature. Zero overhead and no behavior change when the feature is off.

Correctness and validation

Byte identical proofs on GPU and CPU paths (the only difference Merkle proof comes from, and that source is parity tested).
End to end GPU prove and verify pass (cuda_path_integration), row pair commitment verification test.
All math-cuda kernel parity tests and the 137 stark tests pass.
Measured on an RTX 5090: ethrex 5 tx prove about 7.6% faster, about 1.11 GiB of tree device to host copies removed per prove.

Notes

Non cuda builds are behaviorally unchanged.
A transient GPU error during opening aborts the proof rather than silently falling back, since the host tree is root only.

ColoCarletti · 2026-06-30T16:48:24Z

/ai-review
/bench

github-actions · 2026-06-30T16:51:28Z

Codex Code Review

No issues found in the PR diff.

I reviewed the changed CUDA Merkle residency/proof gathering, FRI query path, GPU session handoff, VRAM chunking, and instrumentation changes statically. I did not build or run tests per the review constraints.

claude · 2026-06-30T16:54:59Z

Review: keep LDE and Merkle trees resident on the GPU

Reviewed for safety, correctness, performance, and readability. This is a solid, well-structured change — the residency model, root-only host trees, and device-side path gathering are coherent and the parity tests are the right gate. I verified the correctness-sensitive parts directly:

FRI query gather (try_fri_query_phase_gpu): opened position (iota>>l)>>1, symmetric eval evaluation[index^1], and per-layer auth-path ordering all match host query_phase exactly. Empty/CPU-commit layers correctly return None and fall back to the host walk.
Trace/composition gather: device gather position index matches host get_proof_by_pos(index) (row-pair leaves, one proof authenticates both rows), and open_*_with_proof builds evaluations identically to the host versions.
merkle_gather_paths kernel: sibling/parent index math mirrors CPU build_merkle_path; max sibling stays within 2*leaves_len-1, and the positions < leaves_len assert prevents OOB device reads. The u64 reinterpret is 32-byte aligned. Parity-tested byte-for-byte.
Stream safety: every resident tree is built on a stream that is synchronized before the handle is stored, so later gathers on a different (bound or pool) stream are race-free.
Scheduling: plan_table_chunks correctly admits at least one (oversized→solo) table per chunk and falls back to fixed-size k when budgeting is disabled; main/peak chunk plans each produce-and-consume their own results in order.

Two non-blocking notes left inline:

Medium — default mempool release threshold is u64::MAX (retains all freed device blocks indefinitely); on a long-lived multi-prove process with varying trace sizes this can grow VRAM via fragmentation. The 80% budget bounds concurrent working set, not retained-pool growth. Mitigated by LAMBDA_VM_MEMPOOL_RELEASE_MB.
Low — R4 opening / FRI query .expect() on transient GPU errors panics instead of returning ProvingError. Documented and intentional (no host fallback once keep path runs).

The instruments-gated timeline is zero-overhead when the feature is off and doesn't affect proof output. No correctness or memory-safety issues found in the hot paths.

🤖 Generated with Claude Code

github-actions · 2026-06-30T17:19:54Z

AI Review

PR #748 · 19 changed files

Findings

Status	Sev	Location	Finding	Found by
candidate	medium	crypto/crypto/src/merkle_tree/merkle.rs:260	Debug-only assertion on root-only MerkleTree in get_proof_by_pos leaks to release builds	minimax minimax/MiniMax-M3
candidate	low	crypto/math-cuda/src/merkle.rs:381	Stale doc-comment orphaned on build_comp_poly_tree_nodes_dev after removing build_comp_poly_tree_from_evals_ext3	minimax minimax/MiniMax-M3
candidate	low	crypto/stark/src/gpu_lde.rs:1449	positions: &[usize] → &[u32] truncation in gather_proofs_dev is silent	minimax minimax/MiniMax-M3
candidate	low	crypto/stark/src/gpu_lde.rs:1702	try_fri_query_phase_gpu only checks the first layer's gpu_tree before falling back	minimax minimax/MiniMax-M3
candidate	low	crypto/stark/src/gpu_lde.rs:1705	GPU FRI gather runs on a pool stream different from the one that built the tree	minimax minimax/MiniMax-M3
candidate	low	crypto/stark/src/instruments.rs:10	Instrument span tree mixes wall-clock and per-thread CPU time without documentation in the public API	minimax minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: Debug-only assertion on root-only MerkleTree in get_proof_by_pos leaks to release builds

Status: candidate
Severity: medium
Location: crypto/crypto/src/merkle_tree/merkle.rs:260
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

The new from_root constructor returns a MerkleTree with empty nodes; calling get_proof_by_pos on it in release builds silently returns Some(Proof{ merkle_path: vec![] }) because the new debug_assert!(!self.nodes.is_empty(), …) is compiled out. If any code path falls through to the host tree while gpu_composition_tree/gpu_main.tree/gpu_aux.tree was already moved/dropped, the proof bundle ships with empty merkle_paths and verification will fail (or, worse, succeed silently if the verifier accepts short proofs).

Evidence

Lines 256-269 of merkle.rs show the new debug_assert! is the only guard against misuse; in release this assertion is a no-op, and build_merkle_path walks while pos != ROOT which terminates immediately when node_count()==0 (the tree_depth = (0+1).ilog2() == 0 and the loop runs zero times), so an empty merkle_path is returned and wrapped in Some. The doc-comment on from_root even states "get_proof_by_pos must NOT be called on it", but release builds do not enforce that.

Suggested fix

Change debug_assert!(!self.nodes.is_empty(), ...) to assert!(!self.nodes.is_empty(), "get_proof_by_pos called on a root-only MerkleTree") so release builds also panic, OR have get_proof_by_pos return None on root-only trees and update callers (open_composition_poly, open_polys_with) to surface that as a hard error instead of unwrap(). Either fix prevents a silent empty-path proof.

AI-002: Stale doc-comment orphaned on build_comp_poly_tree_nodes_dev after removing build_comp_poly_tree_from_evals_ext3

Status: candidate
Severity: low
Location: crypto/math-cuda/src/merkle.rs:381
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

The block-comment describing build_comp_poly_tree_from_evals_ext3 (the now-removed function returning Vec<u8>) is attached to build_comp_poly_tree_nodes_dev (the new internal helper). The two functions have different signatures and returns — readers expect the leading comment to describe the function it precedes.

Evidence

Lines 381-390 of merkle.rs read /// Row-pair Keccak leaf + Merkle tree build for R2 composition-polynomial commit. … /// Build the composition Merkle tree on device. Leaves hash row pairs, so … — the first paragraph is the old pub doc, the second is the new fn doc, both stuck on fn build_comp_poly_tree_nodes_dev(...). The function returns (CudaSlice<u8>, usize, Arc<CudaStream>); the orphaned comment claims it returns (2*(lde_size/2) - 1) * 32 bytes of tree nodes.

Suggested fix

Replace the orphaned block with a single doc-comment that describes build_comp_poly_tree_nodes_dev and its (nodes_dev, num_leaves, stream) return shape. The orphan adds reader confusion in an already dense file.

AI-003: positions: &[usize] → &[u32] truncation in gather_proofs_dev is silent

Status: candidate
Severity: low
Location: crypto/stark/src/gpu_lde.rs:1449
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

gather_proofs_dev casts positions: &[usize] to Vec<u32> via as u32 truncation. For an LDE size > 2^32 (currently impossible because of the assert_u32_domain check on lde_size), iotas would silently wrap and gather proofs for wrong leaves. The cast is silent; no debug_assert catches the truncation in release.

Evidence

Line 1449: let positions_u32: Vec<u32> = positions.iter().map(|&p| p as u32).collect();. The kernel reads uint32_t num_queries and positions[q] as u32. The upstream assert_u32_domain makes the LDE fit in u32 today, but the cast hides the dependency — a future relaxation of assert_u32_domain (or a different caller path) would corrupt proofs silently.

Suggested fix

Add debug_assert!(positions.iter().all(|&p| p <= u32::MAX as usize), "gather_proofs_dev: position exceeds u32 range") so any future regression is loud in tests.

AI-004: try_fri_query_phase_gpu only checks the first layer's gpu_tree before falling back

Status: candidate
Severity: low
Location: crypto/stark/src/gpu_lde.rs:1702
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

The dispatch only inspects fri_layers[0].gpu_tree.is_none() to decide whether the layers are device-resident. While the current producers are all-or-nothing, the check is brittle: any future code path that builds some layers on CPU and others on GPU (e.g. a partial-failure mode) would silently route the whole query phase through the CPU walk — and the CPU walk would then call get_proof_by_pos on from_root placeholders, producing empty proofs.

Evidence

Line 1702: if fri_layers[0].gpu_tree.is_none() { return None; }. The body then unconditionally indexes layer.gpu_tree.as_ref().expect(...) (line 1714-1716) for every other layer. No check verifies consistency across layers, and the comment "FRI layers are device-resident as a group" documents the assumption rather than enforcing it.

Suggested fix

Validate every layer has gpu_tree.is_some() before taking the GPU path (or make the producer invariant explicit and add an assert!(layers.iter().all(|l| l.gpu_tree.is_some() || l.gpu_tree.is_none())) invariant). At minimum document that mixed residency is unsupported.

AI-005: GPU FRI gather runs on a pool stream different from the one that built the tree

Status: candidate
Severity: low
Location: crypto/stark/src/gpu_lde.rs:1705
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

try_fri_query_phase_gpu allocates a new pool stream (backend().next_stream()) for the FRI layer gather, but the FRI tree's nodes were allocated and written on FriCommitState.stream (a different pool stream chosen during FRI commit). While fold_and_commit_layer synchronizes that stream before returning, the gather kernel then reads from a cuMemAllocAsync allocation across two unrelated streams without explicit cross-stream synchronization.

Evidence

FriCommitState::new (fri.rs:72) calls be.next_stream() to pick the FRI stream; fold_and_commit_layer allocates nodes_dev on self.stream and synchronizes before returning (fri.rs:206). try_fri_query_phase_gpu (gpu_lde.rs:1705-1707) calls backend().next_stream() to pick an unrelated pool stream and dispatches the gather there. CUDA's default memory pool is globally visible, so the data is readable; however, the launch itself races the prior stream's destructor path and bypasses any per-stream dependency tracking. This is correct under CUDA's stream-ordered allocator but is fragile and easy to break.

Suggested fix

Either plumb the FRI stream through FriLayer so the gather runs on the same stream that wrote the tree, or call bound_stream() style on the layer to reuse the table's session stream. At minimum document the cross-stream assumption so future edits don't drop the only remaining safety net (the synchronize() at fri.rs:206).

AI-006: Instrument span tree mixes wall-clock and per-thread CPU time without documentation in the public API

Status: candidate
Severity: low
Location: crypto/stark/src/instruments.rs:10
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

The module comment at line 10 asserts that "Spans open and close on the main thread at phase boundaries. They do not overlap and sum to their parent", but the code path in prove_rounds_2_to_4 opens many spans from rayon worker threads via let __sp = stark::instruments::span("…"). Multiple worker threads can open the same logical label simultaneously, and their Instant::now() reads cross thread boundaries, breaking the "wall-clock tree sums to parent" invariant the comment promises.

Evidence

prover.rs:2556, 2577, 2590, 2613, 2631, 2649, … and trace_builder.rs:2778, 2861, … call instruments::span(...) inside par_iter, par_map_collect, and chunk_and_generate closures which run on worker threads. The span guard stores wall-clock elapsed in TIMELINE, but spans from different threads may overlap (they don't nest on a single thread). The SPAN_DEPTH thread-local is documented as tracking depth, but two worker threads at "r1_main_commit" both see the same depth and the total wall is the union, not a sum-to-parent. The comment promises a true latency breakdown that the implementation does not provide.

Suggested fix

Either document that parallel-region spans are an approximation, or restrict span(...) to the main thread (assert inside SpanGuard::new) and replace the in-parallel calls with no-ops or single-thread timing around the rayon scope.

Reviewer Lanes

Lane	Model	Prompt	Status	Findings
glm	openrouter/z-ai/glm-5.2	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0
kimi	openrouter/moonshotai/kimi-k2.7-code	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0
minimax	minimax/MiniMax-M3	general	success	6
moonmath	zro/minimax-m3	general	error: agentic lane timed out after 1800s	0
nemotron	openrouter/nvidia/nemotron-3-ultra-550b-a55b	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0

Verification Lanes

Lane	Model	Status	Confirmed	Rejected	Uncertain
deepseek-verifier	openrouter/deepseek/deepseek-v4-pro	error: opencode failed (provider/auth/runtime error) and no verifications were submitted	0	0	0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

MauroToscano · 2026-06-30T17:22:59Z

/bench-gpu 6

github-actions · 2026-06-30T17:23:09Z

GPU Benchmark (ABBA) — `b76600e4e7` vs `main` (14 pairs)

_{RTX 5090 · Vast.ai datacenter @ $0.8680555555555556/hr · prover/cuda · drift-free A/B/B/A}

=== ABBA paired result  (improvement: + = PR faster) ===
  pairs: 14   mean A (PR): 22.473s   mean B (base): 24.315s

  [parametric] paired-t   mean +7.56%   sd 1.25%   se 0.34%
               95% CI: [+6.83%, +8.28%]   (t df=13 = 2.16)
  [robust]     median +7.30%   Wilcoxon W+=105 W-=0  p(exact)=0.0001  (z=+3.26)

  --- server stability (this run; compare across servers) ---
  run-to-run jitter:    A CV 1.04%   B CV 1.54%        (lower = steadier)
  within-session drift: -1.09% over the run, 1st->2nd half -0.57%
    (jitter -> Tier-1 cached gate floor; drift -> whether the cached baseline can be trusted)

  VERDICT: REAL IMPROVEMENT - PR faster by ~7.56% (t-CI and Wilcoxon agree)

  raw pairs: /tmp/abba_run/pairs.csv

_{+ = PR faster. Trust the verdict when paired-t and Wilcoxon agree.}

MauroToscano · 2026-06-30T19:13:56Z

/bench-gpu 20

…nto gpu_integration

MauroToscano · 2026-06-30T19:49:23Z

/bench-gpu

MauroToscano · 2026-06-30T20:32:34Z

/bench-gpu

MauroToscano · 2026-06-30T20:59:12Z

/bench-gpu

MauroToscano · 2026-06-30T21:13:50Z

It seems to be faster, I'm going to re schedule the GPU machine to see if it's consistent across machines, since we are still experiment with that

MauroToscano · 2026-06-30T21:14:10Z

/bench-gpu

MauroToscano · 2026-06-30T21:14:23Z

/bench-gpu

MauroToscano · 2026-06-30T21:15:43Z

/bench-gpu

MauroToscano · 2026-06-30T21:37:52Z

Still showing speedup in another machine, will do one last try to confirm the benchmarking is stable now

MauroToscano · 2026-06-30T21:37:56Z

/bench-gpu

MauroToscano · 2026-06-30T21:48:39Z

/bench-gpu

ColoCarletti added 17 commits June 29, 2026 15:08

spike

5f369f0

resident-data scaffolding

0fa7f32

part 2

bdea00e

composition parts fold

e9dcda8

fix

b08ef81

fix_doc

8ea8a0a

move merkle tree to gpu

caa7efd

merkle tree

b811337

finish merkle

819ad01

fix

0fb8df6

rm unused functions

8ab633f

cleanup

fd31fc9

Merge branch 'main' into gpu_integration

0f53e65

fix

b5fbdef

fix

b37b649

Merge branch 'main' into gpu_integration

b81e63f

fix clippy

873491e

claude Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/device.rs

claude Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs

ColoCarletti and others added 2 commits June 30, 2026 15:02

fix comments

e4ada64

Merge branch 'main' into gpu_integration

31d5524

ColoCarletti added 2 commits June 30, 2026 16:29

fix

4ca240c

Merge branch 'gpu_integration' of github.com:yetanotherco/lambda_vm i…

45b2cfe

…nto gpu_integration

yetanotherco deleted a comment from ColoCarletti Jun 30, 2026

Merge branch 'main' into gpu_integration

265089e

Merge branch 'main' into gpu_integration

916f39b

Merge branch 'main' into gpu_integration

b76600e

Uh oh!

Conversation

ColoCarletti commented Jun 30, 2026

Summary

Why

What changed

Correctness and validation

Notes

Uh oh!

ColoCarletti commented Jun 30, 2026

Uh oh!

github-actions Bot commented Jun 30, 2026

Codex Code Review

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Jun 30, 2026

Review: keep LDE and Merkle trees resident on the GPU

Uh oh!

github-actions Bot commented Jun 30, 2026

AI Review

Findings

Reviewer Lanes

Verification Lanes

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

github-actions Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

GPU Benchmark (ABBA) — b76600e4e7 vs main (14 pairs)

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 30, 2026 •

edited

Loading

GPU Benchmark (ABBA) — `b76600e4e7` vs `main` (14 pairs)