[Question] Request for detailed hardware specs and quantization details

Hi team, great work on InferenceX. I couldn't find detailed hardware configurations in the docs and need clarification on a few critical points to interpret the benchmarks correctly:
1. GPU Details:
   - Form factor: Are the H200s SXM5 or PCIe? (The docs mention "sxm" but give no specifics.)
   - VRAM: Is it the standard 141GB per GPU?
   - Host: What are the CPU models and System RAM sizes used?
2. Networking:
   - For multi-node tests, what is the interconnect? Specifically, are you using ConnectX-7 NICs, and at what bandwidth (400G/800G)?
3. Model Quantization (GLM-4/5.1):
   - Native precision for GLM models is typically BF16. The results only show FP4/FP8.
   - Are these benchmarks running quantized versions of the models?
   - Were any BF16 (native) baselines tested for comparison?
4. DeepSeek-V4 Pro Memory Fit:
   - DeepSeek-V4 Pro (even with aggressive FP8) reportedly requires ~1.6TB of VRAM.
   - How does this fit on an 8x H200 (141GB) node (Total ~1.1TB)?
   - Is there heavy offloading to CPU/RAM involved, or is a specific sparse/MoE loading strategy used that reduces the active memory footprint significantly below 1.6TB?
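
For reference, here is the arithmetic behind the DeepSeek-V4 Pro memory-fit question as a minimal Python sketch. The 141GB per GPU and 8 GPUs per node come from the points above; the ~1.6TB requirement is the reported figure I'm asking about, not something I have verified, and the function names are my own. It compares capacity only and ignores KV cache, activations, and framework overhead.

```python
import math

# Figures taken from the questions above. REQUIRED_GB is the *reported*
# ~1.6TB requirement (unverified), in decimal units (1 TB = 1000 GB).
GPUS_PER_NODE = 8
VRAM_PER_GPU_GB = 141
REQUIRED_GB = 1600


def aggregate_vram_gb(num_gpus: int, vram_per_gpu_gb: int) -> int:
    """Total VRAM across all GPUs in one node, in GB."""
    return num_gpus * vram_per_gpu_gb


def shortfall_gb(required_gb: int, available_gb: int) -> int:
    """How much the requirement exceeds available VRAM (0 if it fits)."""
    return max(0, required_gb - available_gb)


node_vram_gb = aggregate_vram_gb(GPUS_PER_NODE, VRAM_PER_GPU_GB)
gap_gb = shortfall_gb(REQUIRED_GB, node_vram_gb)
# Smallest node count whose combined VRAM covers the reported requirement.
min_nodes = math.ceil(REQUIRED_GB / node_vram_gb)

print(f"Node VRAM: {node_vram_gb} GB")
print(f"Shortfall on one node: {gap_gb} GB")
print(f"Nodes needed by capacity alone: {min_nodes}")
```

If the reported figure is accurate, a single node is roughly 470GB short, which is why I'm asking whether the runs are multi-node, use offloading, or rely on a smaller effective footprint.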

A quick update to the README with a hardware spec table and quantization details would be very helpful. Thanks!


<img width="2400" height="668" alt="Image" src="https://github.com/user-attachments/assets/e0915e3c-7667-4fba-afee-351d13709982" />



[Question] Request for detailed hardware specs and quantization details #1805
