Environment
- GPU: NVIDIA GeForce RTX 5080 16GB (GB203), VBIOS 98.03.3B.00.B4
- Driver: 595.71.05 (open kernel modules), Ubuntu, kernel 6.8.0-101-generic
- Workload: 24/7 batch LLM inference (RAG enrichment), Qwen3-14B
Symptom
After roughly 1 hour of sustained load, GSP firmware heartbeat dies:
NVRM: _kgspRpcRecvPoll: GSP RM heartbeat timed out
NVRM: _kgspIsHeartbeatTimedOut: Heartbeat timed out, currentTimeMs 3175895033 heartbeat 19834604 heartbeatWithOffsetMs 3175815351 diff 79682 timeout 5200
NVRM: _kgspRpcRecvPoll: LibOS heartbeat timed out
NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
After the Xid: telemetry goes N/A (util/clocks/power; temperature still reads), VRAM becomes phantom (allocated, no owner), nvidia-smi --gpu-reset is unsupported on consumer RTX, and if left alone the host eventually hard-freezes (SSH dead, nvidia_uvm deadlock). Only recovery is a reboot — note that shutdown can take ~10 min due to D-state processes; if fully wedged, SysRq (echo b > /proc/sysrq-trigger) is the fallback.
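For anyone building a similar watchdog, here is a minimal sketch of parsing the two log lines above. The regexes are written against the exact text quoted in this report and are an assumption about the format; other driver versions may word these lines differently. The function names `parse_xid` and `parse_heartbeat` are hypothetical. In the quoted line, `diff` equals `currentTimeMs - heartbeatWithOffsetMs`, so the sketch also checks that.

```python
import re

# Patterns derived from the log lines quoted in this report (assumed format).
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")
HEARTBEAT_RE = re.compile(
    r"currentTimeMs (\d+) heartbeat (\d+) "
    r"heartbeatWithOffsetMs (\d+) diff (\d+) timeout (\d+)"
)


def parse_xid(line):
    """Return (pci_address, xid_number) for an Xid line, else None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


def parse_heartbeat(line):
    """Return the numeric fields of a heartbeat-timeout line, else None."""
    m = HEARTBEAT_RE.search(line)
    if m is None:
        return None
    current, heartbeat, with_offset, diff, timeout = map(int, m.groups())
    return {
        "current_ms": current,
        "heartbeat": heartbeat,
        "heartbeat_with_offset_ms": with_offset,
        "diff_ms": diff,
        "timeout_ms": timeout,
        # True when the logged diff matches current - heartbeatWithOffset.
        "diff_consistent": current - with_offset == diff,
        # How far past the timeout the heartbeat was when it was logged.
        "overdue_ms": diff - timeout,
    }
```

A watchdog could feed each new kernel log line through `parse_xid` and trigger its reboot path when the returned Xid number is 154.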
It is NOT thermal
Crashed repeatedly at 38-57 C with fans at 96% and additional external cooling — nowhere near throttle. What correlates is concurrent request count / FP4 GEMM burst size and CUDA graphs.
Engines tested (same GPU, same driver)
| Engine | Config | Result |
| --- | --- | --- |
| TensorRT-LLM 1.2.1 | NVFP4 (modelopt checkpoint) | worst: GSP crash within minutes of serving FP4 |
| vLLM 0.22.1 | NVFP4 (compressed-tensors) + FlashInfer + kv-fp8, CUDA graphs ON | Xid 154 after ~60 min at 16 concurrent requests; raising to 24 concurrent crashed it in 10 min |
| vLLM 0.22.1 | same, --enforce-eager | previously crashed at ~23 min under NVFP4; AWQ-marlin INT4 + eager was stable |
| llama.cpp (llama-server) | GGUF Q4, CUDA graphs ON (default) | Xid 154 after ~9 h of accumulated load; stable for hours with GGML_CUDA_DISABLE_GRAPHS=1 |
| Ollama | GGUF Q4 | days of uptime without a crash |
Pattern: the heavier/burstier the FP4 tensor-core usage and the more CUDA-graph replays, the faster GSP loses its heartbeat. INT4 weight-only (dequant-to-FP16 compute) configurations are far more stable than native FP4 GEMM paths.
Mitigations that hold it in production (combined)
- Cap concurrent requests at ~12-14 — decode throughput saturates there anyway (~730 tok/s for a 14B FP4 model); higher concurrency gives zero extra throughput, only crashes.
- Preventive engine restart every 55 min (systemd timer) — stays under the observed ~60 min MTBF; costs ~2-3 min/h.
- Disable CUDA graphs (vLLM --enforce-eager, llama.cpp GGML_CUDA_DISABLE_GRAPHS=1) where the 5-10% perf hit is acceptable.
- Watchdog polling engine metrics + dmesg for Xid that force-reboots automatically, plus a 10s-interval CSV logger (temp/clocks/power/pstate/concurrency/Xid) for forensics — happy to share data from the next crash if useful.
Power limit at 400W and NVreg_EnableGpuFirmware=0 were also tried; the latter is not a viable long-term option on this stack.
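The concurrency cap from the first mitigation can also be enforced on the client side. The following is a minimal sketch, assuming an asyncio-based batch client; `run_capped` and the `worker` coroutine are hypothetical names, and the default limit of 12 is taken from the low end of the ~12-14 range above. It is not the setup used in this report, only one way to keep the number of in-flight requests bounded.

```python
import asyncio

# Low end of the ~12-14 cap described above.
MAX_IN_FLIGHT = 12


async def run_capped(jobs, worker, limit=MAX_IN_FLIGHT):
    """Run worker(job) for every job with at most `limit` running at once.

    `worker` is a caller-supplied coroutine function (hypothetical here),
    for example one that sends a single request to the inference server.
    Results are returned in the same order as `jobs`.
    """
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        # Wait for a free slot before starting the request.
        async with sem:
            return await worker(job)

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Usage would look like `asyncio.run(run_capped(prompts, send_request))`, where `send_request` is whatever coroutine the batch client already uses to call the engine.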
Related
Looks adjacent to #1045 (RTX 5080 Xid 119 -> 154 chain) and #1111 (GSP halt on sm_120 under sustained llama.cpp): same family of GSP firmware losing heartbeat under sustained tensor-core-heavy inference on Blackwell consumer parts.
Question: is there a GSP firmware fix for this in newer driver branches, or any supported NVreg knob to stabilize FP4-heavy workloads on GB203?