Hi team, great work on InferenceX. I couldn't find detailed hardware configurations in the docs and need clarification on a few critical points to interpret the benchmarks correctly:
- GPU Details:
  - Form factor: Are the H200s SXM5 or PCIe? (The docs mention "sxm" but give no specifics.)
  - VRAM: Is it the standard 141GB per GPU?
  - Host: Which CPU models and how much system RAM were used?
- Networking:
  - For multi-node tests, what is the interconnect? Specifically, are you using ConnectX-7 NICs, and at what bandwidth (400G/800G)?
- Model Quantization (GLM-4/5.1):
  - Native precision for GLM models is typically BF16, but the results only show FP4/FP8.
  - Are these benchmarks running quantized versions of the models?
  - Were any native BF16 baselines tested for comparison?
- DeepSeek-V4 Pro Memory Fit:
  - DeepSeek-V4 Pro reportedly requires ~1.6TB of VRAM, even with aggressive FP8 quantization.
  - How does this fit on an 8x H200 (141GB) node with ~1.1TB of total VRAM?
  - Is heavy offloading to CPU/RAM involved, or does a specific sparse/MoE loading strategy bring the active memory footprint significantly below 1.6TB?
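For reference, here is the back-of-the-envelope arithmetic behind the memory-fit question, as a minimal Python sketch. It only uses the figures quoted above (8 GPUs, 141GB each, a reported ~1.6TB footprint) and assumes 1TB = 1000GB. The function names are my own, and the calculation ignores KV cache, activations, and framework overhead, so the real gap could be larger.

```python
import math

# Figures quoted in this issue. The ~1.6 TB footprint is a reported number,
# not something I measured.
GPUS_PER_NODE = 8
VRAM_PER_GPU_GB = 141
REPORTED_FOOTPRINT_GB = 1600  # ~1.6 TB, assuming 1 TB = 1000 GB


def node_vram_gb(gpus: int, vram_per_gpu_gb: int) -> int:
    """Total VRAM on one node, in GB."""
    return gpus * vram_per_gpu_gb


def min_gpus_needed(footprint_gb: int, vram_per_gpu_gb: int) -> int:
    """Smallest GPU count whose combined VRAM covers the footprint."""
    return math.ceil(footprint_gb / vram_per_gpu_gb)


total_gb = node_vram_gb(GPUS_PER_NODE, VRAM_PER_GPU_GB)
shortfall_gb = REPORTED_FOOTPRINT_GB - total_gb
gpus_needed = min_gpus_needed(REPORTED_FOOTPRINT_GB, VRAM_PER_GPU_GB)

print(f"Node VRAM: {total_gb} GB")
print(f"Shortfall vs. reported footprint: {shortfall_gb} GB")
print(f"GPUs needed at {VRAM_PER_GPU_GB} GB each: {gpus_needed}")
```

If the reported footprint is accurate, weights alone would not fit on a single node without offloading or a smaller effective footprint, which is why I am asking.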
A quick update to the README with a hardware spec table and quantization details would be very helpful. Thanks!
