
RTX 5080 (GB203): GSP-RM heartbeat timeout -> Xid 154 under sustained LLM inference — non-thermal, concurrency-correlated; reproduced on vLLM, llama.cpp and TensorRT-LLM; 4 working mitigations #1200

@rogerioheringer

Environment

  • GPU: NVIDIA GeForce RTX 5080 16GB (GB203), VBIOS 98.03.3B.00.B4
  • Driver: 595.71.05 (open kernel modules), Ubuntu, kernel 6.8.0-101-generic
  • Workload: 24/7 batch LLM inference (RAG enrichment), Qwen3-14B

Symptom

After roughly 1 hour of sustained load, the GSP firmware heartbeat dies:

NVRM: _kgspRpcRecvPoll: GSP RM heartbeat timed out
NVRM: _kgspIsHeartbeatTimedOut: Heartbeat timed out, currentTimeMs 3175895033 heartbeat 19834604 heartbeatWithOffsetMs 3175815351 diff 79682 timeout 5200
NVRM: _kgspRpcRecvPoll: LibOS heartbeat timed out
NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
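For anyone scripting around these messages, here is a minimal Python sketch that pulls the numeric fields out of the heartbeat line and the Xid number out of the Xid line. The regexes are assumptions derived only from the sample lines above, not from the driver source, so other driver versions may format these messages differently. In this sample, diff happens to equal currentTimeMs minus heartbeatWithOffsetMs (3175895033 - 3175815351 = 79682), i.e. the heartbeat was roughly 15x past the 5200 ms timeout.

```python
import re

# Field names are copied from the log line above; the regexes themselves are
# an assumption based on that single sample, not on the driver's source.
HEARTBEAT_RE = re.compile(
    r"_kgspIsHeartbeatTimedOut: Heartbeat timed out, "
    r"currentTimeMs (?P<currentTimeMs>\d+) "
    r"heartbeat (?P<heartbeat>\d+) "
    r"heartbeatWithOffsetMs (?P<heartbeatWithOffsetMs>\d+) "
    r"diff (?P<diff>\d+) "
    r"timeout (?P<timeout>\d+)"
)
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def parse_heartbeat(line):
    """Return the numeric fields of a heartbeat-timeout line, or None."""
    m = HEARTBEAT_RE.search(line)
    return {k: int(v) for k, v in m.groupdict().items()} if m else None


def parse_xid(line):
    """Return (pci_address, xid_number) for an Xid line, or None."""
    m = XID_RE.search(line)
    return (m.group("pci"), int(m.group("xid"))) if m else None


sample_heartbeat = (
    "NVRM: _kgspIsHeartbeatTimedOut: Heartbeat timed out, currentTimeMs 3175895033 "
    "heartbeat 19834604 heartbeatWithOffsetMs 3175815351 diff 79682 timeout 5200"
)
sample_xid = (
    "NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed "
    "from 0x0 (None) to 0x1 (GPU Reset Required)"
)

fields = parse_heartbeat(sample_heartbeat)
# In this sample, diff equals currentTimeMs - heartbeatWithOffsetMs.
assert fields["currentTimeMs"] - fields["heartbeatWithOffsetMs"] == fields["diff"]
print(f"heartbeat overdue by {fields['diff']} ms against a {fields['timeout']} ms timeout")
print(parse_xid(sample_xid))
```

A watchdog could feed each new kernel log line through parse_xid and act when it returns 154; how the lines are obtained (dmesg, journal) is left out here on purpose.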

After the Xid: telemetry goes N/A (util/clocks/power; temperature still reads), VRAM becomes phantom (allocated, no owner), nvidia-smi --gpu-reset is unsupported on consumer RTX, and if left alone the host eventually hard-freezes (SSH dead, nvidia_uvm deadlock). The only recovery is a reboot. Note that shutdown can take ~10 min because of processes stuck in D state; if the host is fully wedged, SysRq (echo b > /proc/sysrq-trigger) is the fallback.

It is NOT thermal

Crashed repeatedly at 38-57 C with fans at 96% and additional external cooling — nowhere near throttle. What correlates is concurrent request count / FP4 GEMM burst size and CUDA graphs.

Engines tested (same GPU, same driver)

| Engine | Config | Result |
|---|---|---|
| TensorRT-LLM 1.2.1 | NVFP4 (modelopt checkpoint) | worst: GSP crash within minutes of serving FP4 |
| vLLM 0.22.1 | NVFP4 (compressed-tensors) + FlashInfer + kv-fp8, CUDA graphs ON | Xid 154 after ~60 min at 16 concurrent requests; raising to 24 concurrent crashed it in 10 min |
| vLLM 0.22.1 | same, --enforce-eager | previously crashed at ~23 min under NVFP4; AWQ-marlin INT4 + eager was stable |
| llama.cpp (llama-server) | GGUF Q4, CUDA graphs ON (default) | Xid 154 after ~9 h of accumulated load; stable for hours with GGML_CUDA_DISABLE_GRAPHS=1 |
| Ollama | GGUF Q4 | days of uptime without a crash |

Pattern: the heavier/burstier the FP4 tensor-core usage and the more CUDA-graph replays, the faster GSP loses its heartbeat. INT4 weight-only (dequant-to-FP16 compute) configurations are far more stable than native FP4 GEMM paths.

Mitigations that hold it in production (combined)

  1. Cap concurrent requests at ~12-14 — decode throughput saturates there anyway (~730 tok/s for a 14B FP4 model); higher concurrency gives zero extra throughput, only crashes.
  2. Preventive engine restart every 55 min (systemd timer) — stays under the observed ~60 min MTBF; costs ~2-3 min/h.
  3. Disable CUDA graphs (vLLM --enforce-eager, llama.cpp GGML_CUDA_DISABLE_GRAPHS=1) where the 5-10% perf hit is acceptable.
  4. Watchdog polling engine metrics + dmesg for Xid that force-reboots automatically, plus a 10s-interval CSV logger (temp/clocks/power/pstate/concurrency/Xid) for forensics — happy to share data from the next crash if useful.
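For mitigation 1, the cap can also be enforced on the client side of the batch job rather than only in the engine config. Below is a minimal sketch using asyncio.Semaphore; fake_request is a hypothetical stand-in for the real HTTP call to the inference server, and MAX_IN_FLIGHT = 12 is simply the lower end of the range quoted above, not a tuned value.

```python
import asyncio

MAX_IN_FLIGHT = 12  # lower end of the ~12-14 cap from mitigation 1


async def run_capped(jobs, limit=MAX_IN_FLIGHT):
    """Run coroutine factories with at most `limit` in flight.

    Returns (results in submission order, peak observed concurrency).
    """
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with sem:  # blocks here once `limit` jobs are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(j) for j in jobs))
    return results, peak


async def fake_request(i):
    # Hypothetical stand-in for an HTTP call to the inference server.
    await asyncio.sleep(0.01)
    return i


def demo(n=40):
    # Bind i as a default argument so each factory keeps its own index.
    jobs = [lambda i=i: fake_request(i) for i in range(n)]
    return asyncio.run(run_capped(jobs))


results, peak = demo()
print(f"completed {len(results)} requests, peak concurrency {peak}")
```

Capping in the client keeps the queue outside the engine, so a preventive restart (mitigation 2) only interrupts the requests that are actually in flight.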

Power limit at 400W and NVreg_EnableGpuFirmware=0 were also tried; the latter is not a viable long-term option on this stack.

Related

Looks adjacent to #1045 (RTX 5080 Xid 119 -> 154 chain) and #1111 (GSP halt on sm_120 under sustained llama.cpp): same family of GSP firmware losing heartbeat under sustained tensor-core-heavy inference on Blackwell consumer parts.

Question: is there a GSP firmware fix for this in newer driver branches, or any supported NVreg knob to stabilize FP4-heavy workloads on GB203?
