GPU TEE attestation: verify GPU confidential mode and bind it to the CVM (currently unverified) #751

## Problem

dstack attaches NVIDIA GPUs to CVMs but never verifies them. The CPU-side stack has no GPU/NRAS logic, and the guest OS flips the GPU to "ready" **unconditionally** at boot, before the workload runs. Consequences:

- A **non-confidential GPU** (CC off in host BIOS, or an ordinary non-CC GPU) is exposed to the workload in **plaintext**, and no error stops the workload from using it. No attacker is required.
- Even a genuine CC GPU is **not bound to the CVM**: nothing lets a relying party (or the KMS) trust that a specific attested CVM is backed by a genuine GPU in confidential mode.

"Dual attestation" in `docs/security/security-model.md` is a documented *expectation*, not something enforced anywhere.

Full design write-up: **`docs/gpu-attestation-design.md`** (branch `gpu-tee-nras-verification`).

## Current gaps (evidence)

- No CPU-side GPU verification — `AttestationQuote` is CPU-TEE only; `VerificationDetails` has no GPU fields (`verifier/src/verification.rs`).
- GPU attached as plain `vfio-pci`, no CC-mode/attestation (`vmm/src/app/qemu.rs:840`).
- `nvidia-smi conf-compute -srs 1` run unconditionally at boot, no `cc_mode` check (meta-dstack `nvidia-persistenced.service`).
- `app-compose.service` has **no dependency** on GPU bring-up — the workload starts even if it fails.
- No app-facing RPC to query/enable GPU confidentiality.

## Threat model (what this closes / doesn't)

| Attack | Closed? |
|---|---|
| Forge GPU evidence (no NVIDIA key) | ✅ cert chain + RIM, verifier in measured guest |
| Non-CC / CC-off GPU silently used | ✅ fail-fast, `cc_mode==ON` |
| **Copy attack** — GPU-less instance B copies a GPU-attested value from A | ✅ only if verdict lives in append-only measured pre-app state; ❌ if in `report_data` (app-forgeable via `guest-agent/src/rpc_service.rs:327`) |
| Stale/replayed GPU evidence | ✅ fresh boot nonce |
| **Live relay / cuckoo** to a genuine remote CC GPU | ❌ residual — no shipping system defeats it; needs TEE-IO/TDISP hardware. Document, don't imply "GPU proven local." |
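
The copy-attack row depends on the verdict living in state the app cannot rewrite. A minimal sketch of why a measured, append-only log resists copying, assuming a TDX-style register extended as `SHA384(old ‖ digest)`; the event labels and the all-zero starting value are simplifications, and the helper names are hypothetical rather than dstack APIs:

```python
import hashlib

RTMR_LEN = 48  # SHA-384 digest size, as used by TDX RTMRs


def digest(label: bytes) -> bytes:
    """Stand-in for a real event digest (illustrative only)."""
    return hashlib.sha384(label).digest()


def replay_rtmr(event_digests):
    """Fold event digests into a register: new = SHA384(old || digest)."""
    register = bytes(RTMR_LEN)
    for d in event_digests:
        register = hashlib.sha384(register + d).digest()
    return register


def log_matches_quote(event_digests, quoted_rtmr3: bytes) -> bool:
    """A verifier accepts an event log only if replaying it yields the quoted register."""
    return replay_rtmr(event_digests) == quoted_rtmr3


# Instance A attested a GPU before system-ready; instance B has no GPU.
log_a = [digest(b"compose-hash"), digest(b"gpu-attestation"), digest(b"system-ready")]
log_b = [digest(b"compose-hash"), digest(b"system-ready")]
quote_a = replay_rtmr(log_a)  # what A's hardware would sign
quote_b = replay_rtmr(log_b)  # what B's hardware would sign

# B presenting A's log alongside its own quote does not verify.
copied_log_accepted = log_matches_quote(log_a, quote_b)
```

A value carried in `report_data` has no such property: the app chooses it, so instance B can simply copy whatever instance A reported.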

## Workstreams

### 1. Offline local verifier (NVAT)
- [ ] Adopt NVIDIA C++ Attestation SDK (`libnvat`, [NVIDIA/attestation-sdk](https://github.com/NVIDIA/attestation-sdk)); Python `nvtrust` is EOL 2026-09-15.
- [ ] Bake `libnvat` + pre-provisioned filesystem RIM store into the NVIDIA image (re-provision on driver/VBIOS upgrade).
- [ ] Handle OCSP (the one online dep, `ocsp.ndis.nvidia.com`): in-CVM caching/replay proxy at `--ocsp-url`, and/or a Rego policy tolerating `x-nvidia-cert-ocsp-status`.
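
For the caching/replay proxy, one possible acceptance rule for a cached OCSP response is sketched below. Whether NVAT accepts a replayed response at all is still on the hardware checklist, so this only illustrates the validity-window logic. `CachedOcsp`, `usable`, and the `max_age` cap are hypothetical names and policy, not part of `libnvat`:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CachedOcsp:
    cert_status: str       # "good", "revoked", or "unknown" (OCSP status values)
    this_update: datetime  # when the responder produced this status
    next_update: datetime  # when newer status is expected to be available


def usable(entry: CachedOcsp, now: datetime,
           max_age: timedelta = timedelta(days=7)) -> bool:
    """Accept a cached response only if it is good, inside its validity
    window, and not older than a locally chosen cap (assumed policy)."""
    if entry.cert_status != "good":
        return False
    if not (entry.this_update <= now < entry.next_update):
        return False
    return now - entry.this_update <= max_age


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
entry = CachedOcsp("good", issued, issued + timedelta(days=3))
warm = usable(entry, issued + timedelta(days=1))     # inside the window
expired = usable(entry, issued + timedelta(days=4))  # past next_update
```

The `nextUpdate` window measured on real hardware decides how long a warm cache can ride through an outage, so the 7-day cap is a placeholder.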

### 2. Binding via measured append-only state (before the app boundary)
- [ ] `dstack-util` emits a `gpu-attestation` event committing `H(nvat_eat‖cert_chain‖claims)` **before** `system-ready`.
- [ ] Verifier trusts it via the existing `find_event` boundary (breaks at `system-ready`, `dstack-attest/src/attestation.rs:176`) → RTMR3-bound on TDX.
- [ ] `report_data` stays for freshness + RA-TLS key binding only — **not** the GPU verdict.
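
A rough sketch of the commitment and the boundary lookup follows. The issue writes the commitment as plain concatenation. The length prefixes and the choice of SHA-384 are assumptions added here so that field boundaries are unambiguous, and this `find_event` only mirrors the described behaviour (stop at `system-ready`), not the actual Rust implementation:

```python
import hashlib
import struct


def commit(nvat_eat: bytes, cert_chain: bytes, claims: bytes) -> bytes:
    """H(nvat_eat || cert_chain || claims), each field prefixed with its
    length (an assumption, so that b"ab"+b"c" differs from b"a"+b"bc")."""
    h = hashlib.sha384()
    for field in (nvat_eat, cert_chain, claims):
        h.update(struct.pack(">Q", len(field)))  # 8-byte big-endian length
        h.update(field)
    return h.digest()


def find_event(events, name):
    """Return the payload of the first `name` event logged before
    `system-ready`; anything after the boundary is app-controlled."""
    for event_name, payload in events:
        if event_name == "system-ready":
            break
        if event_name == name:
            return payload
    return None


verdict = commit(b"eat-token", b"cert-chain", b"claims-json")
events = [
    ("compose-hash", b"..."),
    ("gpu-attestation", verdict),   # emitted by the measured guest
    ("system-ready", b""),
    ("gpu-attestation", b"forged"), # appended by the app: ignored
]
trusted = find_event(events, "gpu-attestation")
```

An app can still append a `gpu-attestation` event after boot, but the lookup never reaches it, which is the property the copy-attack row relies on.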

### 3. Fail-fast enforcement & app gate
- [ ] `dstack-gpu-attest.service`: verify every configured GPU (`num_gpus>0` is measured), require `cc_mode==ON` (reject `OFF` **and** `DEVTOOLS`), set `-srs 1` only on pass.
- [ ] `app-compose.service` `Requires=`+`After=dstack-gpu-attest.service`; fail closed (no `system-ready`, no workload).
- [ ] Guest-agent RPC `GetGpuAttestation()` / `EnsureGpuReady()` so apps can confirm/enable before use.
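
The pass/fail decision inside `dstack-gpu-attest.service` could look roughly like this. `GpuResult` and `gate` are hypothetical names, and the sketch assumes evidence verification and the `cc_mode` query have already run and produced one result per GPU:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuResult:
    index: int
    cc_mode: str       # "ON", "OFF", or "DEVTOOLS"
    evidence_ok: bool  # local verifier accepted the cert chain and RIM


def gate(expected_gpus: int, results):
    """Fail closed: every configured GPU must be present, verified,
    and in full CC mode. Returns (passed, reasons)."""
    reasons = []
    if len(results) != expected_gpus:
        reasons.append(f"expected {expected_gpus} GPUs, saw {len(results)}")
    for r in results:
        if not r.evidence_ok:
            reasons.append(f"gpu{r.index}: evidence rejected")
        if r.cc_mode != "ON":  # rejects both OFF and DEVTOOLS
            reasons.append(f"gpu{r.index}: cc_mode={r.cc_mode}")
    return (not reasons, reasons)


passed, reasons = gate(2, [GpuResult(0, "ON", True),
                           GpuResult(1, "DEVTOOLS", True)])
# Only when `passed` is true would the service set the ready state
# (`-srs 1`) and allow system-ready to be emitted.
```

Comparing against `"ON"` instead of excluding `"OFF"` means any mode string the service does not recognize is rejected as well.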

### 4. KMS gating, verifier surfacing, docs
- [ ] KMS gates key release on the `gpu-attestation` event (like `compose-hash`).
- [ ] Verifier parses NVAT EAT/claims into `VerificationDetails`.
- [ ] `security-model.md`: state the guarantee (*measured-guest-vouches + channel-bound*) and the co-location residual.
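
One way the KMS gate might be expressed is sketched below, assuming the KMS has already replayed the event log against the quote and extracted the events logged before `system-ready`. `may_release_key`, the `gpu_required` policy flag, and the idea of recomputing the commitment from submitted evidence are all hypothetical:

```python
import hmac
from typing import Mapping, Optional


def may_release_key(pre_app_events: Mapping[str, bytes],
                    gpu_required: bool,
                    recomputed_commitment: Optional[bytes]) -> bool:
    """Release keys only if the measured pre-app state carries the
    expected events. `pre_app_events` must already be verified against
    the quote; this function only applies policy."""
    if "compose-hash" not in pre_app_events:
        return False
    if not gpu_required:
        return True
    logged = pre_app_events.get("gpu-attestation")
    if logged is None or recomputed_commitment is None:
        return False
    # Constant-time comparison of the logged and recomputed digests.
    return hmac.compare_digest(logged, recomputed_commitment)


commitment = b"\x01" * 48
ok = may_release_key({"compose-hash": b"h", "gpu-attestation": commitment},
                     gpu_required=True, recomputed_commitment=commitment)
denied = may_release_key({"compose-hash": b"h"},
                         gpu_required=True, recomputed_commitment=commitment)
```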

## Caveats

- **SEV-SNP is blocked**: no runtime measurement register (`tpm_runtime_pcr()=None`, `has_tdx()=false`); `decode_app_info_sev_snp` reads identity from launch-time `HOST_DATA`/`MrConfigV3` and ignores the runtime log. SNP needs an **SVSM/coconut-vTPM** (PCR channel) before this works — until then SNP GPU attestation is strictly weaker than TDX. (See #713.)
- **DEVTOOLS** CC mode enables CC APIs *without* memory encryption — must be rejected.
- **Co-location residual** stands until Blackwell TEE-IO/TDISP.

## Hardware validation checklist (unconfirmed in docs)

- [ ] NVAT behavior on OCSP connection failure (fail-open vs closed); can Rego neutralize revocation?
- [ ] OCSP `nextUpdate` window; warm-cache ride-through.
- [ ] Exact `libnvat` env-var spellings; trust-root store overridability.
- [ ] Does `-srs 1` error or no-op on a CC-off GPU?
- [ ] Does the driver expose SPDM `KEY_EXCHANGE` key (optional session-binding stretch)?
- [ ] SNP SVSM vTPM feasibility in meta-dstack.

