feat(cuda): keep LDE and Merkle trees resident on the GPU #748
ColoCarletti wants to merge 24 commits into
Conversation
|
/ai-review |
Codex Code Review
No issues found in the PR diff. I reviewed the changed CUDA Merkle residency/proof gathering, FRI query path, GPU session handoff, VRAM chunking, and instrumentation changes statically. I did not build or run tests per the review constraints. |
Review: keep LDE and Merkle trees resident on the GPU
Reviewed for safety, correctness, performance, and readability. This is a solid, well-structured change: the residency model, root-only host trees, and device-side path gathering are coherent and the parity tests are the right gate. I verified the correctness-sensitive parts directly:
Two non-blocking notes left inline:
The 🤖 Generated with Claude Code |
AI Review
PR #748 · 19 changed files
Findings
Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).
AI-001: Debug-only assertion on root-only MerkleTree in get_proof_by_pos leaks to release builds
Claim: The new
Evidence: Lines 256-269 of merkle.rs show the new
Suggested fix: Change
AI-002: Stale doc-comment orphaned on build_comp_poly_tree_nodes_dev after removing build_comp_poly_tree_from_evals_ext3
Claim: The block-comment describing
Evidence: Lines 381-390 of merkle.rs read
Suggested fix: Replace the orphaned block with a single doc-comment that describes
AI-003: positions: &[usize] → &[u32] truncation in gather_proofs_dev is silent
Claim:
Evidence: Line 1449:
Suggested fix: Add
AI-004: try_fri_query_phase_gpu only checks the first layer's gpu_tree before falling back
Claim: The dispatch only inspects
Evidence: Line 1702:
Suggested fix: Validate every layer has
AI-005: GPU FRI gather runs on a pool stream different from the one that built the tree
Claim:
Evidence:
Suggested fix: Either plumb the FRI stream through
AI-006: Instrument span tree mixes wall-clock and per-thread CPU time without documentation in the public API
Claim: The module comment at line 10 asserts that "Spans open and close on the main thread at phase boundaries. They do not overlap and sum to their parent", but the code path in
Evidence:
Suggested fix: Either document that parallel-region spans are an approximation, or restrict
Reviewer Lanes
Verification Lanes
Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report. Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts. |
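The narrowing that AI-003 flags (`&[usize]` positions becoming `&[u32]`) can be made explicit with a checked conversion. The sketch below is only an illustration under assumptions: `positions_to_u32` is a hypothetical helper, not a function from this PR, and it is not necessarily the fix the reviewer proposed. It uses `u32::try_from`, which returns an error instead of truncating.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not part of the PR): convert query positions to u32,
/// returning the first offending position instead of silently truncating it.
fn positions_to_u32(positions: &[usize]) -> Result<Vec<u32>, usize> {
    positions
        .iter()
        // try_from fails when the value does not fit in 32 bits.
        .map(|&p| u32::try_from(p).map_err(|_| p))
        .collect()
}
```

A caller could propagate the `Err` to fall back to the host path, or treat it as a hard error, depending on whether oversized positions are considered reachable.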
|
/bench-gpu 6 |
GPU Benchmark (ABBA) —
|
|
/bench-gpu 20 |
/bench-gpu |


Summary
Keeps the STARK prover's LDE data and Merkle trees resident on the GPU across proving rounds instead of copying each full tree back to host. The prover now copies only the 32 byte root to host and gathers Merkle opening paths directly on device. Output is byte identical to the CPU path; on an ethrex 5 tx block this is about 7.6% faster end to end (17.30s -> 15.98s on an RTX 5090).
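To make the path gathering concrete, here is a minimal host-side sketch in Rust of what gathering an opening path from a flat node array involves. Everything specific in it is an assumption for illustration: the heap-style layout (root at index 0, children of node `i` at `2i+1` and `2i+2`), the names `build_nodes`, `gather_path`, and `verify_path`, and the toy hash. It is not the PR's `merkle_gather_paths` kernel or its actual node layout.

```rust
type Node = [u8; 32];

/// Toy, NON-cryptographic stand-in for the real hash (illustration only).
fn toy_hash(left: &Node, right: &Node) -> Node {
    let mut state: u64 = 0xcbf29ce484222325; // FNV-1a offset basis
    let mut out = [0u8; 32];
    for (i, byte) in left.iter().chain(right.iter()).enumerate() {
        state ^= *byte as u64;
        state = state.wrapping_mul(0x100000001b3); // FNV-1a prime
        out[i % 32] ^= (state >> 24) as u8;
    }
    out
}

/// Build a flat node array: root at index 0, leaves at n-1..2n-1.
/// Assumes the number of leaves is a power of two.
fn build_nodes(leaves: &[Node]) -> Vec<Node> {
    let n = leaves.len();
    assert!(n.is_power_of_two());
    let mut nodes = vec![[0u8; 32]; 2 * n - 1];
    nodes[n - 1..].copy_from_slice(leaves);
    // Fill internal nodes bottom-up so children exist before their parent.
    for i in (0..n - 1).rev() {
        nodes[i] = toy_hash(&nodes[2 * i + 1], &nodes[2 * i + 2]);
    }
    nodes
}

/// Collect the sibling hashes from leaf `pos` up to (not including) the root.
fn gather_path(nodes: &[Node], leaves_len: usize, pos: usize) -> Vec<Node> {
    let mut idx = leaves_len - 1 + pos;
    let mut path = Vec::new();
    while idx > 0 {
        // Odd indices are left children, even indices are right children.
        let sibling = if idx % 2 == 1 { idx + 1 } else { idx - 1 };
        path.push(nodes[sibling]);
        idx = (idx - 1) / 2;
    }
    path
}

/// Recompute the root from a leaf and its path, then compare.
fn verify_path(root: &Node, leaf: &Node, pos: usize, path: &[Node]) -> bool {
    let mut acc = *leaf;
    let mut p = pos;
    for sib in path {
        acc = if p % 2 == 0 { toy_hash(&acc, sib) } else { toy_hash(sib, &acc) };
        p /= 2;
    }
    acc == *root
}
```

Each query only reads one sibling per tree level, so a gather of this shape can run where the node array already lives; the verifier side needs nothing but the root, the leaf, and the path.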
Why
The old GPU path built each Merkle tree on device, then copied the whole node array back to host, rebuilt a host side tree, and gathered proofs on CPU. For one prove that is about 1.11 GiB of device to host copies plus the host side tree work. Keeping the tree resident removes all of that. The win scales with trace size (the copy is proportional to tree size).
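The proportionality between copy size and tree size can be sketched with simple arithmetic. The snippet below assumes a binary tree that stores every node as 32 bytes, so a tree over `n` leaves has `2n - 1` nodes; the function names are hypothetical and the numbers are not a derivation of the 1.11 GiB figure above, which depends on the actual tables in the prove.

```rust
/// Assumed node size: one 32-byte hash per node.
const NODE_BYTES: u64 = 32;

/// Bytes copied device-to-host if the whole node array is copied back.
/// Assumes a full binary tree storing all 2n - 1 nodes.
fn full_tree_copy_bytes(leaves_len: u64) -> u64 {
    assert!(leaves_len >= 1);
    (2 * leaves_len - 1) * NODE_BYTES
}

/// Bytes copied device-to-host if only the root is copied back.
fn root_only_copy_bytes() -> u64 {
    NODE_BYTES
}
```

Under these assumptions a single tree over 2^20 leaves costs 67,108,832 bytes (just under 64 MiB) to copy in full, against 32 bytes for the root alone.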
What changed
GPU residency
- `GpuTableSession` (main trace, aux trace, composition parts, plus the bound stream).
- `GpuMerkleTree { nodes: Arc<CudaSlice>, leaves_len, root }` holds the resident tree; only the root is copied to host.
- `from_root` builds a root-only host tree, so the host proof object carries the commitment without the full node array.
Device side proof gathering
- The `merkle_gather_paths` CUDA kernel plus `gather_proofs_dev` gather paths on device. A parity test asserts they match host `get_proof_by_pos` byte for byte.
- `try_fri_query_phase_gpu` runs the FRI query openings against resident layer trees; composition parts fold on device and the composition tree to round 4.
Scheduling
- `estimate_table_vram_bytes`, `plan_table_ bounds co-resident tables. With no budget (non-cuda, or budget not binding) it falls back to the previous fixed-size scheme, so the CPU path
Rebased onto main's commitment rework
- ROWS_PER_LEAF = 2): one proof per query authenticates both rows, no `proof_sym`.
Instrumentation
- `instruments` feature. Zero overhead and no behavior change when the feature is off.
Correctness and validation
- `cuda_path_integration`), row pair commitment verification test.
Notes