Problem
dstack attaches NVIDIA GPUs to CVMs but never verifies them. The CPU-side stack has no GPU/NRAS logic, and the guest OS flips the GPU to "ready" unconditionally at boot, before the workload runs. Consequences:
A non-confidential GPU (CC off in host BIOS, or a normal GPU) is exposed to the workload in plaintext, and nothing errors out to stop it. No attacker is required.
Even a genuine CC GPU is not bound to the CVM: nothing lets a relying party (or the KMS) trust that a specific attested CVM is backed by a genuine GPU in confidential mode.
"Dual attestation" in docs/security/security-model.md is a documented expectation, not something enforced anywhere.
Full design write-up: docs/gpu-attestation-design.md (branch gpu-tee-nras-verification).
Current gaps (evidence)
No CPU-side GPU verification — AttestationQuote is CPU-TEE only; VerificationDetails has no GPU fields (verifier/src/verification.rs).
GPU attached as plain vfio-pci, no CC-mode/attestation (vmm/src/app/qemu.rs:840).
nvidia-smi conf-compute -srs 1 run unconditionally at boot, no cc_mode check (meta-dstack nvidia-persistenced.service).
app-compose.service has no dependency on GPU bring-up — the workload starts even if it fails.
No app-facing RPC to query/enable GPU confidentiality.
Threat model (what this closes / doesn't)
| Attack | Closed? |
| --- | --- |
| Forge GPU evidence (no NVIDIA key) | ✅ cert chain + RIM, verifier in measured guest |
| Non-CC / CC-off GPU silently used | ✅ fail-fast, cc_mode==ON |
| Copy attack: GPU-less instance B copies a GPU-attested value from A | ✅ only if verdict lives in append-only measured pre-app state; ❌ if in report_data (app-forgeable via guest-agent/src/rpc_service.rs:327) |
| Stale/replayed GPU evidence | ✅ fresh boot nonce |
| Live relay / cuckoo to a genuine remote CC GPU | ❌ residual: no shipping system defeats it; needs TEE-IO/TDISP hardware. Document, don't imply "GPU proven local." |
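A minimal sketch of the "fresh boot nonce" check from the table, assuming the verifier generates a 32-byte nonce at boot and the GPU echoes it inside its signed evidence. `GpuEvidence`, `FreshnessError` and `check_freshness` are hypothetical names for illustration, not existing dstack or NVAT APIs, and real evidence carries far more than a nonce.

```rust
/// Hypothetical, heavily simplified view of GPU evidence (assumption:
/// the signed report echoes a 32-byte nonce chosen by the verifier).
#[derive(Debug, Clone)]
pub struct GpuEvidence {
    pub nonce: [u8; 32],
}

#[derive(Debug, PartialEq, Eq)]
pub enum FreshnessError {
    NonceMismatch,
}

/// Compare two 32-byte values without returning early on the first
/// differing byte.
fn nonce_eq(a: &[u8; 32], b: &[u8; 32]) -> bool {
    a.iter().zip(b.iter()).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// Reject evidence whose nonce is not the one generated for this boot.
pub fn check_freshness(
    boot_nonce: &[u8; 32],
    evidence: &GpuEvidence,
) -> Result<(), FreshnessError> {
    if nonce_eq(boot_nonce, &evidence.nonce) {
        Ok(())
    } else {
        Err(FreshnessError::NonceMismatch)
    }
}
```

This only addresses replay of old evidence. It does nothing against the live relay row, since a remote genuine GPU can answer a fresh nonce just as well.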
Workstreams
1. Offline local verifier (NVAT)
Adopt NVIDIA C++ Attestation SDK (libnvat, NVIDIA/attestation-sdk); Python nvtrust is EOL 2026-09-15.
Bake libnvat + pre-provisioned filesystem RIM store into the NVIDIA image (re-provision on driver/VBIOS upgrade).
Handle OCSP (the one online dep, ocsp.ndis.nvidia.com): in-CVM caching/replay proxy at --ocsp-url, and/or a Rego policy tolerating x-nvidia-cert-ocsp-status.
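One possible shape for the in-CVM OCSP cache decision, sketched under assumptions: the cache stores a parsed status plus thisUpdate/nextUpdate as Unix seconds, and the policy is fail-closed. All type and function names here are hypothetical, and whether NVAT itself fails open or closed is still an open item in the validation checklist.

```rust
/// Hypothetical parsed OCSP status.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CertStatus {
    Good,
    Revoked,
    Unknown,
}

/// Hypothetical cache entry; times are Unix seconds.
#[derive(Debug, Clone, Copy)]
pub struct CachedOcsp {
    pub status: CertStatus,
    pub this_update: u64,
    pub next_update: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum OcspDecision {
    Accept,
    Reject(&'static str),
}

/// Fail-closed policy: accept only a cached "good" response that is
/// inside its [this_update, next_update) validity window.
pub fn decide(cache: Option<CachedOcsp>, now: u64) -> OcspDecision {
    match cache {
        None => OcspDecision::Reject("no cached OCSP response"),
        Some(c) if c.status != CertStatus::Good => {
            OcspDecision::Reject("certificate not reported good")
        }
        Some(c) if now < c.this_update => OcspDecision::Reject("response not yet valid"),
        Some(c) if now >= c.next_update => OcspDecision::Reject("cached response expired"),
        Some(_) => OcspDecision::Accept,
    }
}
```

The ride-through window is whatever remains until `next_update`, so the warm cache only helps if it is refreshed before that point.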
2. Binding via measured append-only state (before the app boundary)
dstack-util emits a gpu-attestation event committing H(nvat_eat‖cert_chain‖claims) before system-ready.
Verifier trusts it via the existing find_event boundary (breaks at system-ready, dstack-attest/src/attestation.rs:176) → RTMR3-bound on TDX.
report_data stays for freshness + RA-TLS key binding only — not the GPU verdict.
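The boundary rule can be sketched as below. This is an illustration only: `Event`, `find_pre_app_event` and `commitment_preimage` are hypothetical and do not reproduce the real `find_event` in dstack-attest. The length-prefixed encoding is an added assumption, since a bare `‖` concatenation would let bytes shift between fields without changing the hash. The hash itself is left out because the Rust standard library has no SHA-2.

```rust
/// Hypothetical runtime event: a name plus an opaque payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub name: String,
    pub payload: Vec<u8>,
}

/// Return the first event called `name` that appears before
/// `system-ready`. Anything at or after the boundary is ignored,
/// because the app can emit events from that point on.
pub fn find_pre_app_event<'a>(log: &'a [Event], name: &str) -> Option<&'a Event> {
    log.iter()
        .take_while(|e| e.name != "system-ready")
        .find(|e| e.name == name)
}

/// Build an unambiguous preimage for H(nvat_eat‖cert_chain‖claims):
/// each part is prefixed with its length as a big-endian u64.
pub fn commitment_preimage(parts: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for part in parts {
        out.extend_from_slice(&(part.len() as u64).to_be_bytes());
        out.extend_from_slice(part);
    }
    out
}
```

With this rule, a GPU-less instance that appends its own gpu-attestation event after system-ready gains nothing, which is the copy-attack case from the threat model.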
3. Fail-fast enforcement & app gate
dstack-gpu-attest.service: verify every configured GPU (num_gpus>0 is measured), require cc_mode==ON (reject OFF and DEVTOOLS), set -srs 1 only on pass.
app-compose.service: Requires= + After= dstack-gpu-attest.service; fail closed (no system-ready, no workload).
Guest-agent RPC GetGpuAttestation() / EnsureGpuReady() so apps can confirm/enable before use.
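A sketch of the gate logic that dstack-gpu-attest.service could apply, assuming each GPU yields a CC mode and an attestation verdict. `CcMode`, `GpuReport`, `GateError` and `gate` are hypothetical names; how the mode and verdict are actually obtained (driver query, libnvat) is out of scope here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CcMode {
    On,
    Off,
    DevTools,
}

/// Hypothetical per-GPU result collected before the gate runs.
#[derive(Debug, Clone)]
pub struct GpuReport {
    pub index: usize,
    pub cc_mode: CcMode,
    pub attestation_ok: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GateError {
    CountMismatch { expected: usize, found: usize },
    CcModeNotOn { index: usize, mode: CcMode },
    AttestationFailed { index: usize },
}

/// Pass only if every configured GPU is present, is in CC mode ON
/// (OFF and DEVTOOLS are both rejected), and passed attestation.
pub fn gate(expected_gpus: usize, reports: &[GpuReport]) -> Result<(), GateError> {
    if reports.len() != expected_gpus {
        return Err(GateError::CountMismatch {
            expected: expected_gpus,
            found: reports.len(),
        });
    }
    for r in reports {
        if r.cc_mode != CcMode::On {
            return Err(GateError::CcModeNotOn { index: r.index, mode: r.cc_mode });
        }
        if !r.attestation_ok {
            return Err(GateError::AttestationFailed { index: r.index });
        }
    }
    Ok(())
}
```

The caller would set the ready state (-srs 1) only on `Ok(())` and exit non-zero otherwise, so that the `Requires=` dependency keeps app-compose.service from starting.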
4. KMS gating, verifier surfacing, docs
KMS gates key release on the gpu-attestation event (like compose-hash).
Verifier parses NVAT EAT/claims into VerificationDetails.
security-model.md: state the guarantee (measured-guest-vouches + channel-bound) and the co-location residual.
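For the KMS side, a rough sketch of how the key-release decision might look once the gpu-attestation event is surfaced. `BootInfo`, `AppPolicy` and `decide_key_release` are hypothetical and do not reflect the current KMS code; the 48-byte digest width is an assumption, and the compose-hash check is reduced to a boolean.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum KeyDecision {
    Release,
    Deny(&'static str),
}

/// Hypothetical summary of what the verifier replayed from the event log.
pub struct BootInfo {
    /// Result of the existing compose-hash allowlist check.
    pub compose_hash_allowed: bool,
    /// Digest from a gpu-attestation event found before system-ready.
    pub gpu_attestation_digest: Option<[u8; 48]>,
}

/// Hypothetical per-app policy held by the KMS.
pub struct AppPolicy {
    pub requires_gpu: bool,
}

/// Release keys only if the existing compose-hash check passes and,
/// for GPU apps, a pre-app gpu-attestation event is present.
pub fn decide_key_release(policy: &AppPolicy, info: &BootInfo) -> KeyDecision {
    if !info.compose_hash_allowed {
        return KeyDecision::Deny("compose-hash not allowed");
    }
    if policy.requires_gpu && info.gpu_attestation_digest.is_none() {
        return KeyDecision::Deny("missing gpu-attestation event");
    }
    KeyDecision::Release
}
```

A stricter variant could also have the KMS check the parsed claims themselves rather than only the presence of the event.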
Caveats
SEV-SNP is blocked: no runtime measurement register (tpm_runtime_pcr()=None, has_tdx()=false); decode_app_info_sev_snp reads identity from launch-time HOST_DATA/MrConfigV3 and ignores the runtime log. SNP needs an SVSM/coconut-vTPM (PCR channel) before this works — until then SNP GPU attestation is strictly weaker than TDX. (See AMD SEV-SNP support (tracking) #713.)
DEVTOOLS CC mode enables CC APIs without memory encryption — must be rejected.
Co-location residual stands until Blackwell TEE-IO/TDISP.
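The SEV-SNP caveat amounts to a capability check on where the GPU verdict can be bound. A hypothetical sketch follows: the fields mirror `has_tdx()` and `tpm_runtime_pcr()` from the caveat above, but the struct, the enum and the `Option<u32>` type are assumptions, not the real signatures.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BindingChannel {
    /// TDX runtime measurement register.
    Rtmr3,
    /// vTPM PCR, e.g. once an SVSM-backed vTPM is available on SNP.
    VtpmPcr(u32),
    /// No runtime measurement channel: the verdict cannot be bound.
    Unavailable,
}

/// Hypothetical snapshot of platform capabilities.
#[derive(Debug, Clone, Copy)]
pub struct PlatformCaps {
    pub has_tdx: bool,
    pub runtime_pcr: Option<u32>,
}

/// Pick the channel the gpu-attestation event would be measured into.
pub fn binding_channel(caps: PlatformCaps) -> BindingChannel {
    if caps.has_tdx {
        BindingChannel::Rtmr3
    } else if let Some(pcr) = caps.runtime_pcr {
        BindingChannel::VtpmPcr(pcr)
    } else {
        BindingChannel::Unavailable
    }
}
```

On SNP as described above, both capabilities are absent, so the result is `Unavailable` and the GPU gate can still fail fast locally but cannot be proven to a relying party.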
Hardware validation checklist (unconfirmed in docs)
NVAT behavior on OCSP connection failure (fail-open vs closed); can Rego neutralize revocation?
OCSP nextUpdate window; warm-cache ride-through.
Exact libnvat env-var spellings; trust-root store overridability.
Does -srs 1 error or no-op on a CC-off GPU?
Does the driver expose SPDM KEY_EXCHANGE key (optional session-binding stretch)?