RTX 5080 (GB203): GSP-RM heartbeat timeout -> Xid 154 under sustained LLM inference — non-thermal, concurrency-correlated; reproduced on vLLM, llama.cpp and TensorRT-LLM; 4 working mitigations

## Environment
- GPU: NVIDIA GeForce RTX 5080 16GB (GB203), VBIOS 98.03.3B.00.B4
- Driver: **595.71.05** (open kernel modules), Ubuntu, kernel 6.8.0-101-generic
- Workload: 24/7 batch LLM inference (RAG enrichment), Qwen3-14B

## Symptom
After roughly **1 hour of sustained load**, the GSP firmware heartbeat times out:

```
NVRM: _kgspRpcRecvPoll: GSP RM heartbeat timed out
NVRM: _kgspIsHeartbeatTimedOut: Heartbeat timed out, currentTimeMs 3175895033 heartbeat 19834604 heartbeatWithOffsetMs 3175815351 diff 79682 timeout 5200
NVRM: _kgspRpcRecvPoll: LibOS heartbeat timed out
NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
```
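To make the numbers in that log easier to read, here is a minimal Python sketch that pulls the fields out of the `_kgspIsHeartbeatTimedOut` line and the Xid line. The regexes and the names `parse_heartbeat` / `parse_xid` are illustrative, not part of any NVIDIA tooling, and the field meanings are inferred from the field names only.

```python
import re

# Matches the numeric fields of the _kgspIsHeartbeatTimedOut message.
HEARTBEAT_RE = re.compile(
    r"currentTimeMs (?P<current>\d+) heartbeat (?P<heartbeat>\d+) "
    r"heartbeatWithOffsetMs (?P<with_offset>\d+) "
    r"diff (?P<diff>\d+) timeout (?P<timeout>\d+)"
)
# Matches "NVRM: Xid (PCI:0000:01:00): 154, ..." and captures the Xid number.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")

# The four log lines from this report, copied verbatim.
SAMPLE_LINES = [
    "NVRM: _kgspRpcRecvPoll: GSP RM heartbeat timed out",
    "NVRM: _kgspIsHeartbeatTimedOut: Heartbeat timed out, currentTimeMs 3175895033 "
    "heartbeat 19834604 heartbeatWithOffsetMs 3175815351 diff 79682 timeout 5200",
    "NVRM: _kgspRpcRecvPoll: LibOS heartbeat timed out",
    "NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) "
    "to 0x1 (GPU Reset Required)",
]


def parse_heartbeat(line):
    """Return the heartbeat fields as ints, or None if the line does not match."""
    m = HEARTBEAT_RE.search(line)
    if m is None:
        return None
    return {key: int(value) for key, value in m.groupdict().items()}


def parse_xid(line):
    """Return the Xid number as an int, or None if the line is not an Xid line."""
    m = XID_RE.search(line)
    return int(m.group("xid")) if m else None
```

Applied to the log above, `diff` (79682) equals `currentTimeMs - heartbeatWithOffsetMs`, and it is roughly 15 times the 5200 ms `timeout`, so the heartbeat appears to have been stale for about 80 s by the time the driver reported it.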

After the Xid, telemetry reads N/A for utilization, clocks, and power (temperature still reads), VRAM stays allocated with no owning process, and `nvidia-smi --gpu-reset` is unsupported on consumer RTX. If left alone, the host eventually hard-freezes (SSH dead, `nvidia_uvm` deadlock). The only recovery is a reboot. Shutdown can take ~10 min because of D-state processes; if the host is fully wedged, SysRq (`echo b > /proc/sysrq-trigger`) is the fallback.

## It is NOT thermal
The GPU crashed repeatedly at **38-57 C** with fans at 96% and additional external cooling, nowhere near throttle. What correlates with the crashes is **concurrent request count / FP4 GEMM burst size** and **CUDA graphs**.

## Engines tested (same GPU, same driver)
| Engine | Config | Result |
|---|---|---|
| **TensorRT-LLM 1.2.1** | NVFP4 (modelopt checkpoint) | **worst** — GSP crash within minutes of serving FP4 |
| **vLLM 0.22.1** | NVFP4 (compressed-tensors) + FlashInfer + kv-fp8, CUDA graphs ON | Xid 154 after **~60 min** at 16 concurrent requests; raising to 24 concurrent crashed it in **10 min** |
| **vLLM 0.22.1** | same, `--enforce-eager` | NVFP4 + eager crashed at ~23 min in an earlier run; AWQ-marlin INT4 + eager was stable |
| **llama.cpp** (llama-server) | GGUF Q4, CUDA graphs ON (default) | Xid 154 after ~9 h of accumulated load; **stable for hours with `GGML_CUDA_DISABLE_GRAPHS=1`** |
| **Ollama** | GGUF Q4 | **days** of uptime without a crash |

Pattern: the heavier/burstier the FP4 tensor-core usage and the more CUDA-graph replays, the faster GSP loses its heartbeat. INT4 weight-only (dequant-to-FP16 compute) configurations are far more stable than native FP4 GEMM paths.

## Mitigations that hold it in production (combined)
1. **Cap concurrent requests at ~12-14** — decode throughput saturates there anyway (~730 tok/s for a 14B FP4 model); higher concurrency gives zero extra throughput, only crashes.
2. **Preventive engine restart every 55 min** (systemd timer): stays under the observed ~60 min time to failure; costs ~2-3 min of downtime per hour.
3. **Disable CUDA graphs** (vLLM `--enforce-eager`, llama.cpp `GGML_CUDA_DISABLE_GRAPHS=1`) where the 5-10% perf hit is acceptable.
4. **Watchdog** that polls engine metrics and dmesg for Xid events and force-reboots the host automatically, plus a CSV logger at 10 s intervals (temp/clocks/power/pstate/concurrency/Xid) for forensics. Happy to share data from the next crash if useful.
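For reference, the concurrency cap in item 1 can be enforced on the client side. The sketch below is one way to do it with `asyncio.Semaphore`; `run_capped`, `send_request` and `MAX_IN_FLIGHT` are hypothetical names for illustration, and `send_request` stands in for whatever coroutine actually calls the engine. It is not the exact code running in production here.

```python
import asyncio

# Lower end of the ~12-14 cap from item 1; tune per model and engine.
MAX_IN_FLIGHT = 12


async def run_capped(jobs, send_request, limit=MAX_IN_FLIGHT):
    """Run send_request(job) for every job with at most `limit` in flight.

    send_request is a caller-supplied coroutine function (hypothetical here);
    results come back in the same order as `jobs`.
    """
    sem = asyncio.Semaphore(limit)

    async def one(job):
        # Each task waits here until one of the `limit` slots is free.
        async with sem:
            return await send_request(job)

    return await asyncio.gather(*(one(job) for job in jobs))
```

A typical call would be `asyncio.run(run_capped(prompts, send_request))`. A server-side limit such as vLLM's `--max-num-seqs` may be a simpler alternative where only one engine is involved.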

Power limit at 400W and `NVreg_EnableGpuFirmware=0` were also tried; the latter is not a viable long-term option on this stack.
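The forensic logger from item 4 could look roughly like the sketch below, which samples `nvidia-smi --query-gpu` and appends rows to a CSV file. It also flags the post-Xid signature described under Symptom (temperature still reads, clocks/power/utilization do not). Function names are illustrative, and treating any non-numeric value (for example `[N/A]`) as missing is an assumption about how `nvidia-smi` prints unavailable fields. Concurrency and Xid columns are left out because they come from the engine and dmesg, not from `nvidia-smi`.

```python
import csv
import subprocess
import time

FIELDS = ["timestamp", "temperature.gpu", "clocks.sm", "clocks.mem",
          "power.draw", "pstate", "utilization.gpu"]
NUMERIC = {"temperature.gpu", "clocks.sm", "clocks.mem",
           "power.draw", "utilization.gpu"}


def parse_row(line):
    """Parse one csv,noheader,nounits line; non-numeric values become None."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = {}
    for name, value in zip(FIELDS, values):
        if name in NUMERIC:
            try:
                row[name] = float(value)
            except ValueError:  # e.g. "[N/A]" once telemetry is gone
                row[name] = None
        else:
            row[name] = value
    return row


def telemetry_lost(row):
    """True when temperature reads but clocks, power and utilization do not."""
    gone = all(row[k] is None
               for k in ("clocks.sm", "power.draw", "utilization.gpu"))
    return row["temperature.gpu"] is not None and gone


def sample():
    """Query nvidia-smi once; raises on non-zero exit or after a 5 s timeout."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, timeout=5, check=True,
    ).stdout
    return parse_row(out.splitlines()[0])


def log_forever(path, interval_s=10.0):
    """Append one row every interval_s seconds until interrupted."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        while True:
            writer.writerow(sample())
            f.flush()  # keep rows on disk in case the host freezes
            time.sleep(interval_s)
```

The `timeout=5` on the subprocess call is deliberate: after the Xid, `nvidia-smi` itself may hang, and a `subprocess.TimeoutExpired` is a useful signal for the watchdog in its own right.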

## Related
Looks adjacent to #1045 (RTX 5080 Xid 119 -> 154 chain) and #1111 (GSP halt on sm_120 under sustained llama.cpp): same family of GSP firmware losing heartbeat under sustained tensor-core-heavy inference on Blackwell consumer parts.

**Question:** is there a GSP firmware fix for this in newer driver branches, or any supported `NVreg` knob to stabilize FP4-heavy workloads on GB203?

