ci: run tests on GPU server#747
Conversation
Codex Code ReviewNo findings. I reviewed the PR diff only: new GPU merge-queue workflow, GPU test script, Make targets, README additions, and the Verification was static only per instructions; I did not build or run tests. |
| # 4. prover/stark/crypto/ecsm suite (make test-prover-cuda) — CPU CI's prover tests on GPU | ||
| # 5. comprehensive all-instructions (make test-prover-comprehensive-cuda) | ||
| # | ||
| # Runs on the rented Vast box from the gpu-tests.yml merge-queue workflow. All three groups |
There was a problem hiding this comment.
Stale comment: this says "All three groups run even if one fails" but the script now runs five groups (1–5, listed just above). Update the count to avoid confusion.
| # crypto/math-cuda/Cargo.toml uses `cuda-version-from-build-system` + `fallback-latest`; | ||
| # when detection falls back to "latest", cudarc requests symbols some boxes' driver doesn't | ||
| # export (e.g. cuDevSmResourceSplit / cuCtxGetDevice_v2) -> runtime panic. Pinning to a fixed | ||
| # CUDA version (12.8, matching the cuda_max_good>=12.8 offer floor) avoids that. |
There was a problem hiding this comment.
This parenthetical ("matching the cuda_max_good>=12.8 offer floor") is stale relative to this PR. The new gpu-tests.yml uses an offer floor of cuda_max_good>=13.1, not 12.8. The 12.8 cudarc pin is still correct (it's the host-side driver-symbol floor, independent of the PTX/driver-13.1 requirement), but the comment now references a floor that doesn't exist in this workflow. Reword to reflect that the cudarc pin floor (12.8) and the offer's PTX-driven floor (13.1) are separate, so a future reader doesn't try to "reconcile" them.
Review: GPU test suite in merge queueThis is an infra/CI PR (workflow + Makefile targets + scripts + docs). I verified the Makefile targets reference real features and test files ( No safety, correctness, or performance issues found. Two minor readability nits (posted inline):
Both are cosmetic; the workflow is sound. Note the PR's own caveat stands: as a required merge gate, a sustained Vast.ai outage (no offer after the retry loop) would block merges until it clears — an accepted tradeoff per the description. |
|
/bench-growth |
What
Adds a GPU test suite that runs in the merge queue and blocks the merge if it fails. It rents an RTX 5090 on Vast.ai, runs the CUDA tests there, and always destroys the box — reusing the rent → provision → run → always-destroy orchestration from
benchmark-gpu.yml.Why
CPU CI (
pr_main.yaml) runs on GitHububuntu-latestrunners, which have no GPU, so the CUDA backend is never exercised before merge. This suite runs the GPU-relevant tests on real hardware.Test groups (run by
scripts/gpu_test.shvia Makefile targets)test-math-cudatest-cuda-integrationtest-cuda-fallbacktest-prover-cudalambda-vm-prover/stark/crypto/ecsmsuite on the GPU path (CPU CI's sharded prover tests, run single-threaded with--features cuda)test-prover-comprehensive-cudatest_prove_elfs_all_instructions_64_full) on the GPU pathGroups 1–3 are GPU-only (no CPU equivalent); 4–5 mirror CPU CI's prover jobs but with the CUDA path enabled. All groups run even if one fails; the job fails if any group does.
How it works
merge_group(one GPU rental per merge, not per push) +workflow_dispatch. A failure ejects the PR from the queue.scripts/gpu_test.sh, which:CUDARC_PIN=cuda-12080) and verifies the GPU toolchain,compile-programs-asm+compile-programs-rust),cuda_max_good>=13.1, driver ≥ 580, most-expensive ≤$1/hr, with a retry loop for transient scarcity.--yes, 3× retry, plus a label-based sweep so a cancelled-mid-rent box can't leak).CUDA version requirement
An NVIDIA driver supporting CUDA ≥ 13.1 is required: the kernels are compiled with the toolkit's
nvcc(CUDA 13.1) into PTX that the driver JIT-compiles at load — an older driver rejects it withCUDA_ERROR_UNSUPPORTED_PTX_VERSION. Enforced by thecuda_max_good>=13.1offer filter and documented in the README "GPU Tests" section +crypto/math-cuda/build.rs.Files
.github/workflows/gpu-tests.yml— new workflow (jobgpu-tests).scripts/gpu_test.sh— new on-box runner (cudarc pin + compile + 5 groups, aggregate exit).Makefile— addstest-cuda-fallback,test-prover-cuda,test-prover-comprehensive-cuda(groups 1 & 2 targets already existed).README.md— GPU Tests section + targets table.crypto/math-cuda/build.rs— comment clarifying the PTX/driver version coupling.Before this gates merges (manual)
gpu-teststomain's required status checks (Settings → Branches/Rulesets) — same pattern as the merge-queue-onlytest-prover-comprehensive.VAST_API_KEYandVAST_TEMPLATE_HASHare already set.Status
All 5 groups pass end-to-end on a rented RTX 5090. One real test-harness bug was fixed along the way:
test-cuda-fallbackran its two tests in parallel, racing on the process-global GPU dispatch counters (assert_eq!count mismatches) — fixed by adding--test-threads=1to that Makefile target.Notes