
[Question] Request for detailed hardware specs and quantization details #1805

Description

@baigongli

Hi team, great work on InferenceX. I couldn't find detailed hardware configurations in the docs and need clarification on a few critical points to interpret the benchmarks correctly:

  1. GPU Details:
    • Form factor: Are the H200 GPUs SXM5 or PCIe? (Docs mention "sxm" but lack specifics.)
    • VRAM: Is it the standard 141GB per GPU?
    • Host: What are the CPU models and System RAM sizes used?
  2. Networking:
    • For multi-node tests, what is the interconnect? Specifically, are you using ConnectX-7 NICs and what is the bandwidth (400G/800G)?
  3. Model Quantization (GLM-4/5.1):
    • Native precision for GLM models is typically BF16. The results only show FP4/FP8.
    • Are these benchmarks running quantized versions of the models?
    • Were any BF16 (native) baselines tested for comparison?
  4. DeepSeek-V4 Pro Memory Fit:
    • DeepSeek-V4 Pro (even with aggressive FP8 quantization) reportedly requires ~1.6TB of VRAM.
    • How does this fit on an 8x H200 (141GB) node (Total ~1.1TB)?
    • Is there heavy offloading to CPU/RAM involved, or is a specific sparse/MoE loading strategy used that reduces the active memory footprint significantly below 1.6TB?
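
For reference, the arithmetic behind item 4 can be sketched as below. The inputs are only the figures quoted in this issue (8 GPUs, 141GB each, a reported ~1.6TB footprint treated as 1600GB); the function names are hypothetical, and the estimate ignores KV cache, activations, and runtime overhead, so the real gap could be larger.

```python
import math

# Figures quoted in this issue, not measurements.
GPUS_PER_NODE = 8
VRAM_PER_GPU_GB = 141          # H200 capacity as stated above
REPORTED_FOOTPRINT_GB = 1600   # "~1.6TB", assumed to mean decimal TB


def aggregate_vram_gb(gpus: int, vram_per_gpu_gb: float) -> float:
    """Total VRAM across all GPUs in one node."""
    return gpus * vram_per_gpu_gb


def shortfall_gb(required_gb: float, available_gb: float) -> float:
    """How much memory is missing; 0 if the model fits."""
    return max(0.0, required_gb - available_gb)


def min_nodes(required_gb: float, node_vram_gb: float) -> int:
    """Smallest node count whose combined VRAM covers the requirement."""
    return math.ceil(required_gb / node_vram_gb)


node_total = aggregate_vram_gb(GPUS_PER_NODE, VRAM_PER_GPU_GB)
gap = shortfall_gb(REPORTED_FOOTPRINT_GB, node_total)
nodes = min_nodes(REPORTED_FOOTPRINT_GB, node_total)

print(f"Node VRAM: {node_total} GB")
print(f"Shortfall on one node: {gap} GB")
print(f"Nodes needed without offloading: {nodes}")
```

If the ~1.6TB figure is accurate, a single node is roughly 470GB short, which is why I'm asking whether the results rely on offloading, multi-node sharding, or a smaller footprint than reported.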

A quick update to the README with a hardware spec table and quantization details would be very helpful. Thanks!
