Skip to content

ci: run tests on GPU server#747

Open
JuArce wants to merge 9 commits into
mainfrom
ci_run_tests_gpu
Open

ci: run tests on GPU server#747
JuArce wants to merge 9 commits into
mainfrom
ci_run_tests_gpu

Conversation

@JuArce

@JuArce JuArce commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

What

Adds a GPU test suite that runs in the merge queue and blocks the merge if it fails. It rents an RTX 5090 on Vast.ai, runs the CUDA tests there, and always destroys the box — reusing the rent → provision → run → always-destroy orchestration from benchmark-gpu.yml.

Why

CPU CI (pr_main.yaml) runs on GitHub ubuntu-latest runners, which have no GPU, so the CUDA backend is never exercised before merge. This suite runs the GPU-relevant tests on real hardware.

Test groups (run by scripts/gpu_test.sh via Makefile targets)

# Target Covers
1 test-math-cuda GPU↔CPU kernel parity (NTT, LDE, barycentric, FRI, merkle, ext3, keccak, …)
2 test-cuda-integration End-to-end GPU prove: every dispatch fires + the proof verifies
3 test-cuda-fallback GPU dispatch error → CPU fallback still produces a verifying proof
4 test-prover-cuda The lambda-vm-prover/stark/crypto/ecsm suite on the GPU path (CPU CI's sharded prover tests, run single-threaded with --features cuda)
5 test-prover-comprehensive-cuda The comprehensive all-instructions prove (test_prove_elfs_all_instructions_64_full) on the GPU path

Groups 1–3 are GPU-only (no CPU equivalent); 4–5 mirror CPU CI's prover jobs but with the CUDA path enabled. All groups run even if one fails; the job fails if any group does.

How it works

  • Trigger: merge_group (one GPU rental per merge, not per push) + workflow_dispatch. A failure ejects the PR from the queue.
  • The box checks out the merge ref and runs scripts/gpu_test.sh, which:
    • pins cudarc (CUDARC_PIN=cuda-12080) and verifies the GPU toolchain,
    • builds the guest ELFs (compile-programs-asm + compile-programs-rust),
    • runs the 5 groups, aggregating failures.
  • Offer selection: RTX 5090, ≥16 cores, ≥96 GB RAM, ≥64 GB disk, cuda_max_good>=13.1, driver ≥ 580, most-expensive ≤ $1/hr, with a retry loop for transient scarcity.
  • Teardown always destroys the instance (--yes, 3× retry, plus a label-based sweep so a cancelled-mid-rent box can't leak).
  • Reporting: full stdout+stderr in the step log; the job summary lists failed tests grouped by suite and the panic/assertion message for each.

CUDA version requirement

An NVIDIA driver supporting CUDA ≥ 13.1 is required: the kernels are compiled with the toolkit's nvcc (CUDA 13.1) into PTX that the driver JIT-compiles at load — an older driver rejects it with CUDA_ERROR_UNSUPPORTED_PTX_VERSION. Enforced by the cuda_max_good>=13.1 offer filter and documented in the README "GPU Tests" section + crypto/math-cuda/build.rs.

Files

  • .github/workflows/gpu-tests.yml — new workflow (job gpu-tests).
  • scripts/gpu_test.sh — new on-box runner (cudarc pin + compile + 5 groups, aggregate exit).
  • Makefile — adds test-cuda-fallback, test-prover-cuda, test-prover-comprehensive-cuda (groups 1 & 2 targets already existed).
  • README.md — GPU Tests section + targets table.
  • crypto/math-cuda/build.rs — comment clarifying the PTX/driver version coupling.

Before this gates merges (manual)

  • Add gpu-tests to main's required status checks (Settings → Branches/Rulesets) — same pattern as the merge-queue-only test-prover-comprehensive.
  • Repo secrets VAST_API_KEY and VAST_TEMPLATE_HASH are already set.

Status

All 5 groups pass end-to-end on a rented RTX 5090. One real test-harness bug was fixed along the way: test-cuda-fallback ran its two tests in parallel, racing on the process-global GPU dispatch counters (assert_eq! count mismatches) — fixed by adding --test-threads=1 to that Makefile target.

Notes

  • Tradeoff: as a merge gate, a persistent Vast outage (no offer after retries) would block merges until it clears; the retry loop + generous waits mitigate it.

@JuArce JuArce self-assigned this Jun 30, 2026
@JuArce JuArce marked this pull request as ready for review June 30, 2026 19:17
@JuArce JuArce added the ai-review Trigger the AI review label Jun 30, 2026
@github-actions

Copy link
Copy Markdown

Codex Code Review

No findings.

I reviewed the PR diff only: new GPU merge-queue workflow, GPU test script, Make targets, README additions, and the math-cuda build comment. I did not identify concrete safety/security, VM semantics, GPU resource, correctness, or significant performance issues in the changed code.

Verification was static only per instructions; I did not build or run tests.

Comment thread scripts/gpu_test.sh
# 4. prover/stark/crypto/ecsm suite (make test-prover-cuda) — CPU CI's prover tests on GPU
# 5. comprehensive all-instructions (make test-prover-comprehensive-cuda)
#
# Runs on the rented Vast box from the gpu-tests.yml merge-queue workflow. All three groups

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment: this says "All three groups run even if one fails" but the script now runs five groups (1–5, listed just above). Update the count to avoid confusion.

Comment thread scripts/gpu_test.sh
# crypto/math-cuda/Cargo.toml uses `cuda-version-from-build-system` + `fallback-latest`;
# when detection falls back to "latest", cudarc requests symbols some boxes' driver doesn't
# export (e.g. cuDevSmResourceSplit / cuCtxGetDevice_v2) -> runtime panic. Pinning to a fixed
# CUDA version (12.8, matching the cuda_max_good>=12.8 offer floor) avoids that.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parenthetical ("matching the cuda_max_good>=12.8 offer floor") is stale relative to this PR. The new gpu-tests.yml uses an offer floor of cuda_max_good>=13.1, not 12.8. The 12.8 cudarc pin is still correct (it's the host-side driver-symbol floor, independent of the PTX/driver-13.1 requirement), but the comment now references a floor that doesn't exist in this workflow. Reword to reflect that the cudarc pin floor (12.8) and the offer's PTX-driven floor (13.1) are separate, so a future reader doesn't try to "reconcile" them.

@claude

claude Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review: GPU test suite in merge queue

This is an infra/CI PR (workflow + Makefile targets + scripts + docs). I verified the Makefile targets reference real features and test files (test-cuda-faults, cuda_fallback_tests, cuda_path_integration all exist), and the orchestration is well-built. Notable strengths: secrets kept out of the pip-install step, pinned Vast CLI commit, ephemeral per-instance SSH key, ref validation before remote bash -lc, pipefail so test failures aren't masked by tee, and a robust always-destroy teardown with a label-based leak sweep.

No safety, correctness, or performance issues found. Two minor readability nits (posted inline):

  • Low — stale comment: scripts/gpu_test.sh:12 says "All three groups run" but five groups now run.
  • Low — stale comment: scripts/gpu_test.sh:42 references a cuda_max_good>=12.8 offer floor; the new workflow's floor is 13.1. The 12.8 cudarc pin itself is correct — it's the host-side symbol floor and is independent of the PTX/driver-13.1 requirement — but the comment conflates the two.

Both are cosmetic; the workflow is sound. Note the PR's own caveat stands: as a required merge gate, a sustained Vast.ai outage (no offer after the retry loop) would block merges until it clears — an accepted tradeoff per the description.

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-growth

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-review Trigger the AI review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants