From 08adc84193195ca8fddd2f61b0b56d056331f804 Mon Sep 17 00:00:00 2001 From: Liang Juhao Date: Fri, 15 May 2026 11:10:15 +0800 Subject: [PATCH 1/3] =?UTF-8?q?feat:=20add=20NVIDIA=20vLLM=200.20.x=20runn?= =?UTF-8?q?er=20=E2=80=94=20nvidia=5Fvllm=5F0c1710bd?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds the AccelMark runner for the upcoming vLLM 0.20.x major release on NVIDIA GPUs. This is the supersessor of nvidia_vllm_47f5d58e (vLLM 0.7.3) and tracks the 2026 stack: torch 2.11, CUDA 13.0 (12.8 still supported), HuggingFace Transformers v5, FlashAttention 4 MLA prefill, Model Runner V2, and TurboQuant 2-bit KV cache. What is included: * runners/nvidia_vllm_0c1710bd/ — runner.py, meta.json (with supersedes_chain pointing at the 0.7.3 runner and suite_support self-declaration), requirements.txt, README.md * configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example The README platforms matrix updates automatically from meta.json — no shared file is touched. The 0.7.3 runner remains in the matrix until a successful smoke result on 0.20 lands, at which point its meta.json will gain deprecated_by in a follow-up PR. Capability flags: * SUPPORTED_QUANTIZATION_BACKENDS adds 'turboquant' on top of fp8 / compressed-tensors / gptq_marlin. * Framework version string now reports vllm + transformers together so v4 vs v5 transformers transitions are visible in result.json. * Reuses EngineArgs-field filtering to absorb new 0.20 engine kwargs without breaking old configs. Initial commit, not yet validated on hardware; all suites are marked "pending" in suite_support. Co-authored-by: Cursor --- README.md | 1 + .../runner_nvidia_vllm_0c1710bd.yaml.example | 75 +++ runners/nvidia_vllm_0c1710bd/README.md | 104 ++++ runners/nvidia_vllm_0c1710bd/meta.json | 21 + runners/nvidia_vllm_0c1710bd/requirements.txt | 33 ++ runners/nvidia_vllm_0c1710bd/runner.py | 484 ++++++++++++++++++ 6 files changed, 718 insertions(+) create mode 100644 configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example create mode 100644 runners/nvidia_vllm_0c1710bd/README.md create mode 100644 runners/nvidia_vllm_0c1710bd/meta.json create mode 100644 runners/nvidia_vllm_0c1710bd/requirements.txt create mode 100644 runners/nvidia_vllm_0c1710bd/runner.py diff --git a/README.md b/README.md index 922c479..5e29038 100644 --- a/README.md +++ b/README.md @@ -88,6 +88,7 @@ Reference runners live under `runners/` (see each folder’s `meta.json`). The t | Hardware | Runner folder | Framework | A | B | C | D | E | F | G | |---|---|---|:-:|:-:|:-:|:-:|:-:|:-:|:-:| | NVIDIA GPU | `nvidia_sglang_c43a8309` | SGLang | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | +| NVIDIA GPU | `nvidia_vllm_0c1710bd` | vLLM | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | | NVIDIA GPU | `nvidia_vllm_47f5d58e` | vLLM | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | NVIDIA V100 (SM70) | `nvidia_onecat_vllm_12a253c2` | 1Cat-vLLM | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | — | ⋯ | | AMD GPU | `amd_vllm_rocm_6c18cd8f` | vLLM-ROCm | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | diff --git a/configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example b/configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example new file mode 100644 index 0000000..dec3cbe --- /dev/null +++ b/configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example @@ -0,0 +1,75 @@ +# AccelMark runner config — nvidia_vllm_0c1710bd (vLLM 0.20 on NVIDIA) +# +# Copy this file to runner_nvidia_vllm_0c1710bd.yaml (remove .example suffix) +# and edit as needed for your hardware. The actual .yaml is gitignored. +# +# These settings adapt the runner to your hardware environment. +# They are recorded in result.json task.extra_config for transparency +# but are NOT part of the benchmark identity (not hashed into run_id). +# +# Merge priority: CLI flags > suite-specific > global defaults > runner defaults + +# ── Global defaults (apply to all suites) ───────────────────────────────────── + +# Tensor parallel size — number of GPUs to use (default: 1) +tensor_parallel_size: 1 + +# Disable CUDAGraph/compilation. Required for pre-Ampere GPUs (V100, T4). +# Set to true if you encounter CUDA graph errors on older hardware. +enforce_eager: false + +# Maximum number of sequences in a batch (default: 512). +# Reduce on low-memory GPUs: 128 for 16 GB, 64 for 12 GB or less. +max_num_seqs: 512 + +# Fraction of GPU memory reserved for the KV cache (default: 0.90). +# Reduce if you get OOM errors: try 0.80 for tighter memory budgets. +gpu_memory_utilization: 0.90 + +# Pass-through kwargs forwarded directly to vLLM LLM() / AsyncEngineArgs(). +# Use for any vLLM setting not listed above. See vLLM docs for valid keys: +# https://docs.vllm.ai/en/latest/api/vllm/engine/arg_utils.html +# +# 0.20-specific knobs you may want to set (uncomment as needed): +# engine_kwargs: +# # FlashAttention 4 is the 0.20 default for MLA prefill; uncomment to pin +# # for reproducibility or to force back to FA3 / Triton fallback. +# # attention_backend: FLASH_ATTN_4 +# +# # Model Runner V2 + new CUDA-graph paths: +# # compilation_config: +# # cudagraph_mode: full_and_piecewise +# +# # TurboQuant 2-bit KV cache (suite_C, --precision turboquant): +# # kv_cache_dtype: turboquant +# +# swap_space: 8 +# max_seq_len_to_capture: 4096 + +# ── Suite-specific overrides ─────────────────────────────────────────────────── +# Keys here override the global defaults above for a specific suite only. +# Only the section matching the current suite is used — other suite sections +# are never loaded or recorded. + +suites: + suite_D: + # Long-context suite — reduce batch size and reserve more memory. + max_num_seqs: 64 + gpu_memory_utilization: 0.85 + + suite_F: + # Consumer/edge GPU — enforce_eager often needed for pre-Ampere chips + # enforce_eager: true + max_num_seqs: 128 + +# ── Speculative decoding (suite_A extra scenario) ───────────────────────────── +# Uncomment this section to run the speculative scenario. +# The draft model runs on the same GPU as the target model. +# speculative decoding is configured via vLLM engine_kwargs. +# +# suites: +# suite_A: +# engine_kwargs: +# speculative_model: "meta-llama/Llama-3.2-1B-Instruct" +# num_speculative_tokens: 4 +# speculative_draft_tensor_parallel_size: 1 diff --git a/runners/nvidia_vllm_0c1710bd/README.md b/runners/nvidia_vllm_0c1710bd/README.md new file mode 100644 index 0000000..780596f --- /dev/null +++ b/runners/nvidia_vllm_0c1710bd/README.md @@ -0,0 +1,104 @@ +# nvidia_vllm_0c1710bd — NVIDIA vLLM Runner (0.20.x line) + +AccelMark reference runner for NVIDIA GPUs running **vLLM 0.20.x** — +the 2026 major release. + +This runner supersedes [`nvidia_vllm_47f5d58e`](../nvidia_vllm_47f5d58e/) +(vLLM 0.7.3). The predecessor remains runnable; this folder is what new +results on Ampere / Hopper / Blackwell hosts should reference going forward. + +## What changed vs nvidia_vllm_47f5d58e + +| Area | 0.7.3 (predecessor) | 0.20.x (this runner) | +|---|---|---| +| Default CUDA | 12.1 | **13.0** (12.8 still supported via the PyTorch cu128 index) | +| PyTorch | 2.5.1 | **2.11.0** | +| Python | 3.10+ | 3.10+ (3.14 newly supported) | +| HuggingFace Transformers | v4.57 | **v5.x** | +| FlashAttention | FA2 | **FA4** (MLA prefill default) | +| Quantization backends declared | fp8, compressed-tensors, gptq_marlin | + **turboquant** (2-bit KV cache, 4x KV capacity) | +| Model Runner | V1 | **V2** (Eagle prefill full-CUDA-graph, fused probabilistic rejection sampling) | +| DeepSeek V4 | — | ✅ | +| Result version string | `vllm 0.7.3` | `vllm 0.20.1+transformers-5.1.0` | + +Detailed release notes: +[vLLM v0.20.0](https://github.com/vllm-project/vllm/releases/tag/v0.20.0) +· [vLLM v0.20.1 patch](https://github.com/vllm-project/vllm/releases/tag/v0.20.1). + +## Supported suites + +Same coverage as the predecessor runner — **all suites A–G**. See +[`runners/nvidia_vllm_47f5d58e/README.md`](../nvidia_vllm_47f5d58e/README.md) +for the per-GPU hardware compatibility matrix; the same rows apply here +because the runner code is a structural clone. + +## Installation + +```bash +# 1. Standard install — CUDA 13.0 stack +pip install -r runners/nvidia_vllm_0c1710bd/requirements.txt + +# 2. CUDA 12.8 stack (for hosts still on the cu128 driver): +pip install -r runners/nvidia_vllm_0c1710bd/requirements.txt \ + --extra-index-url https://download.pytorch.org/whl/cu128 +``` + +> Older runners pinned `nvidia-cublas-cu12`; on 0.20 + CUDA 13.0 use +> `nvidia-cublas-cu13` if you encounter the cuBLAS SIGFPE on large-memory +> GPUs (same fix philosophy as the predecessor's README — only the package +> name changes). + +## Basic usage + +Identical to the predecessor: + +```bash +python run.py --runner nvidia_vllm_0c1710bd --suite suite_A +python run.py --runner nvidia_vllm_0c1710bd --suite suite_B \ + --tensor-parallel-size 4 +``` + +## 0.20-specific knobs you may want to enable + +`engine_kwargs` in the runner config are passed straight to +`AsyncEngineArgs` / `LLM`. The runner already filters unknown fields, so +adding 0.20-only keys is safe even if you downgrade vLLM later — they will +be dropped with a warning rather than blowing up at startup. + +```yaml +# configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml +engine_kwargs: + # FlashAttention 4 (default on 0.20 — listed here only if you need + # to pin it for reproducibility): + attention_backend: FLASH_ATTN_4 + # CUDA graph improvements added in 0.20: + compilation_config: + cudagraph_mode: full_and_piecewise + # TurboQuant 2-bit KV cache (suite C with --precision turboquant): + # kv_cache_dtype: turboquant +``` + +## Runner config + +Copy the example: + +```bash +cp configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example \ + configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml +``` + +Field names and defaults are identical to the predecessor — see +[`runner_nvidia_vllm_47f5d58e.yaml.example`](../../configs/runner_configs/runner_nvidia_vllm_47f5d58e.yaml.example) +for the field reference. + +## Status + +- **Code:** structurally identical to the predecessor + the small additions + documented above. The change is principally a dependency bump. +- **Validation:** not yet run end-to-end on a 0.20 install at the time of + commit. The predecessor's test_smoke.py path applies once the test file is + ported over. +- **Predecessor:** `nvidia_vllm_47f5d58e/meta.json` will receive a + `deprecated_by` pointer in a follow-up PR once a smoke result against this + runner has been verified. Until then the predecessor remains the + recommended runner for production result submissions. diff --git a/runners/nvidia_vllm_0c1710bd/meta.json b/runners/nvidia_vllm_0c1710bd/meta.json new file mode 100644 index 0000000..876d403 --- /dev/null +++ b/runners/nvidia_vllm_0c1710bd/meta.json @@ -0,0 +1,21 @@ +{ + "id": "nvidia_vllm_0c1710bd", + "platform": "nvidia", + "name": "vLLM 0.20 on NVIDIA", + "framework": "vLLM", + "submitted_by": "JuhaoLiang1997", + "description": "AccelMark reference runner for NVIDIA GPUs running vLLM 0.20.x. Updates the predecessor (nvidia_vllm_47f5d58e, vLLM 0.7.3) to the 2026 vLLM major release: torch 2.11, CUDA 13.0 default (12.8 still supported), HuggingFace Transformers v5, FlashAttention 4 MLA prefill, Model Runner V2, and TurboQuant 2-bit KV cache. Supports all suites A–G.", + "supersedes_chain": ["nvidia_vllm_47f5d58e"], + "notes": "Initial 0.20.x runner. SUPPORTED_QUANTIZATION_BACKENDS adds 'turboquant' on top of fp8 / compressed-tensors / gptq_marlin. Framework version string now reports vllm + transformers together so v4 vs v5 transformers transitions are visible in result.json. Reuses EngineArgs-field filtering to absorb new 0.20 engine kwargs without breaking old configs. Predecessor is still runnable and remains the recommended choice for CUDA 11.8 / pre-Ampere hosts; deprecated_by will be set on nvidia_vllm_47f5d58e in a follow-up PR once a smoke result on this runner exists.", + "created": "2026-05-15", + "hardware_label": null, + "suite_support": { + "A": "pending", + "B": "pending", + "C": "pending", + "D": "pending", + "E": "pending", + "F": "pending", + "G": "pending" + } +} diff --git a/runners/nvidia_vllm_0c1710bd/requirements.txt b/runners/nvidia_vllm_0c1710bd/requirements.txt new file mode 100644 index 0000000..50b1db7 --- /dev/null +++ b/runners/nvidia_vllm_0c1710bd/requirements.txt @@ -0,0 +1,33 @@ +# AccelMark -- NVIDIA platform dependencies (vLLM 0.20.x line) +# Reference tested combination: torch 2.11 + vLLM 0.20.1 + CUDA 13.0 +# (CUDA 12.8 still supported via --extra-index-url; see README.md) +# +# Core +torch==2.11.0 +torchvision==0.26.0 +torchaudio==2.11.0 + +# LLM inference +vllm==0.20.1 + +# Transformers v5 (introduced as required by vLLM 0.20.0) +transformers==5.1.0 +tokenizers==0.23.0 +huggingface-hub==0.36.0 +accelerate==1.10.1 +safetensors==0.7.0 + +# AccelMark dependencies +numpy==1.26.4 +jsonschema==4.25.1 +psutil==7.1.0 +tqdm==4.67.1 + +# NVIDIA monitoring (for power and GPU stats) +nvidia-ml-py==13.580.82 + +# Async support +aiohttp==3.12.15 + +# Config file parsing +PyYAML==6.0.2 diff --git a/runners/nvidia_vllm_0c1710bd/runner.py b/runners/nvidia_vllm_0c1710bd/runner.py new file mode 100644 index 0000000..4321d96 --- /dev/null +++ b/runners/nvidia_vllm_0c1710bd/runner.py @@ -0,0 +1,484 @@ +""" +AccelMark — NVIDIA vLLM benchmark script (vLLM 0.20.x line). + +Implements BenchmarkRunner for vLLM 0.20.x on NVIDIA GPUs. This runner +supersedes ``nvidia_vllm_47f5d58e`` (the 0.7.3 line) and updates the +reference stack to the 2026 vLLM major release. + +What changed relative to the predecessor runner: + + - **Dependencies bumped** to the vLLM 0.20.x reference: torch 2.11, + CUDA 13.0 (or 12.8 via opt-in extra-index), HuggingFace Transformers v5, + Python 3.14 compatible. See ``requirements.txt`` for the pinned list. + - **TurboQuant 2-bit KV cache** declared as a quantization backend + (``turboquant``) — new in 0.20.0 and not available on older vLLM lines. + Other backends (FP8, compressed-tensors, gptq_marlin) are preserved. + - **Framework version string** now reports both ``vllm`` and + ``transformers`` versions so result.json captures the v5 transition. + +Everything else is byte-identical in structure to the previous runner — +0.20 keeps the ``LLM`` / ``AsyncLLMEngine`` / ``SamplingParams`` public API. +The EngineArgs-field filter already handles unknown 0.20 kwargs gracefully, +so existing runner-config YAMLs continue to work after upgrade. + +All orchestration logic lives in runners/benchmark_runner.py. +""" + +import asyncio +import sys +import time +from pathlib import Path +from typing import Optional + +# Add repo root to path +_REPO_ROOT = Path(__file__).resolve().parent.parent.parent +sys.path.insert(0, str(_REPO_ROOT)) + +import torch +from vllm import LLM, AsyncLLMEngine, SamplingParams +from vllm.engine.arg_utils import AsyncEngineArgs +from transformers import AutoTokenizer + +from runners.benchmark_runner import BenchmarkRunner, InferenceRequest +from loadgen.types import InferenceResult + + + +# Suppress per-request vLLM logs by default +import logging +logging.getLogger("vllm.engine.async_llm_engine").setLevel(logging.WARNING) +logging.getLogger("vllm.engine.llm_engine").setLevel(logging.WARNING) + + +class VLLMRunner(BenchmarkRunner): + """AccelMark benchmark runner using vLLM on NVIDIA GPUs.""" + + SUPPORTS_STREAMING = True + SUPPORTS_BATCHING = True + SUPPORTS_ONLINE = True + SUPPORTS_MULTI_CHIP = True + + # vLLM on NVIDIA supports all precisions — hardware detection in BenchmarkRunner + # will automatically restrict to FP16 on V100/T4 + SUPPORTED_PRECISIONS = ["bf16", "fp16", "fp32"] + # 0.20.0 added the TurboQuant 2-bit KV cache backend (4x KV capacity vs FP16). + # FP8 / compressed-tensors / gptq_marlin remain from the 0.7.x baseline. + SUPPORTED_QUANTIZATION_BACKENDS = [ + "fp8", + "compressed-tensors", + "gptq_marlin", + "turboquant", + ] + + def __init__(self): + self.llm: LLM = None + self.engine: AsyncLLMEngine = None + self.tokenizer: AutoTokenizer = None + self.sampling_params: SamplingParams = None + self._loop: asyncio.AbstractEventLoop = None + + def _get_chip_count(self) -> int: + """Return the number of available CUDA GPUs.""" + try: + import torch + n = torch.cuda.device_count() + return n if n > 0 else 1 + except Exception: + return 1 + + def _get_framework_name(self) -> str: + return "vLLM" + + def _get_framework_version(self) -> str: + """Report vllm + transformers versions. + + vLLM 0.20 ships with Transformers v5 support; including the + transformers version in result.json makes it explicit when a result + was generated against the v4 vs v5 line. + """ + vllm_v = "unknown" + try: + import vllm + vllm_v = vllm.__version__ + except Exception: + pass + + tfm_v = None + try: + import transformers + tfm_v = transformers.__version__ + except Exception: + pass + + if tfm_v: + return f"{vllm_v}+transformers-{tfm_v}" + return vllm_v + + def load_model(self, model_path: str, parallelism: dict) -> None: + """Load model — sync LLM for offline/accuracy, async engine for streaming.""" + tp_size = parallelism["tensor_parallel_size"] + pp_size = parallelism["pipeline_parallel_size"] + ep_size = parallelism.get("expert_parallel_size", 1) + assert pp_size <= 1, "Pipeline parallelism is not supported in VLLMRunner" + + max_tokens = parallelism["max_tokens"] + max_model_len = parallelism["max_model_len"] + use_async = parallelism["use_async"] + enforce_eager = getattr(self, "_enforce_eager", False) + + cfg = getattr(self, "_runner_config", {}) + max_num_seqs = cfg.get("max_num_seqs", 512) + gpu_memory_util = cfg.get("gpu_memory_utilization", 0.90) + extra_kwargs = dict(cfg.get("engine_kwargs") or {}) + + # ── Filter engine_kwargs to only fields this vLLM version accepts ───── + # Avoids TypeError when the runner config YAML references a field that + # doesn't exist in the installed vLLM version (EngineArgs is a strict + # dataclass — unknown keyword arguments raise TypeError immediately). + try: + import dataclasses + from vllm.engine.arg_utils import EngineArgs as _EngineArgs + _valid = {f.name for f in dataclasses.fields(_EngineArgs)} + _dropped = {k: v for k, v in extra_kwargs.items() if k not in _valid} + if _dropped: + print(f" Warning: engine_kwargs keys not supported by this " + f"vLLM version and will be ignored: {list(_dropped)}") + extra_kwargs = {k: v for k, v in extra_kwargs.items() if k in _valid} + except Exception: + pass # If introspection fails, pass kwargs as-is and let vLLM report the error + + # Use precision resolved by BenchmarkRunner._resolve_precision() + effective_precision = getattr(self, "_effective_precision", "BF16").upper() + precision = getattr(self, "_precision", None) or effective_precision + + # dtype_override and quantization may be injected by benchmark_runner from + # precision_model_map entry fields (dtype_override, engine_kwargs.quantization). + # These take priority over the runner's own precision→dtype mapping below. + _dtype_override = getattr(self, "_precision_dtype_override", None) + _prec_eng_kwargs = dict(getattr(self, "_precision_engine_kwargs", None) or {}) + + quantization = _prec_eng_kwargs.pop("quantization", None) + + # Map native precision names to explicit dtypes. + # Quantized formats (anything not in this map) use dtype="auto" — vLLM reads + # the storage dtype from the checkpoint's config.json, and the quantization + # kernel is set explicitly via the `quantization` kwarg already populated above + # from precision_model_map engine_kwargs. No fallback guessing needed here. + _NATIVE_DTYPE_MAP = { + "BF16": "bfloat16", + "FP16": "float16", + "FP32": "float32", + } + dtype = _NATIVE_DTYPE_MAP.get(precision, "auto") + self._quantization_method = quantization # None for native, explicit str for quantized + + # dtype_override from precision_model_map wins over the mapping above. + # Used for e.g. FP16 baseline on pre-Ampere hardware (V100/T4). + if _dtype_override: + dtype = _dtype_override + + # Merge remaining precision_engine_kwargs (after popping quantization) into + # extra_kwargs so they reach LLM() / AsyncEngineArgs. Runner YAML engine_kwargs + # still take final precedence via the **extra_kwargs spread at the end. + if _prec_eng_kwargs: + _prec_eng_kwargs.update(extra_kwargs) # runner YAML wins on conflict + extra_kwargs = _prec_eng_kwargs + + print(f"Loading model: precision={precision}, dtype={dtype}" + + (f", quantization_method={self._quantization_method}" + if self._quantization_method else "")) + + self.tokenizer = AutoTokenizer.from_pretrained( + model_path, trust_remote_code=False + ) + + self.sampling_params = SamplingParams( + max_tokens=max_tokens, + temperature=0.0, + ) + + if not use_async: + llm_kwargs = dict( + model=model_path, + dtype=dtype, + tensor_parallel_size=tp_size, + trust_remote_code=False, + enforce_eager=enforce_eager, + max_num_seqs=max_num_seqs, + gpu_memory_utilization=gpu_memory_util, + **extra_kwargs, + ) + if ep_size > 1: + llm_kwargs["enable_expert_parallel"] = True + llm_kwargs["tensor_parallel_size"] = tp_size + if quantization: + llm_kwargs["quantization"] = quantization + if max_model_len: + llm_kwargs["max_model_len"] = max_model_len + self.llm = LLM(**llm_kwargs) + else: + self._loop = asyncio.new_event_loop() + asyncio.set_event_loop(self._loop) + engine_kwargs = dict( + model=model_path, + dtype=dtype, + tensor_parallel_size=tp_size, + trust_remote_code=False, + enforce_eager=enforce_eager, + gpu_memory_utilization=gpu_memory_util, + # engine_kwargs values override named fields above if the same key appears in both. + # This is intentional — engine_kwargs is the power-user escape hatch. + **extra_kwargs, + ) + if ep_size > 1: + engine_kwargs["enable_expert_parallel"] = True + if max_model_len: + engine_kwargs["max_model_len"] = max_model_len + engine_args = AsyncEngineArgs(**engine_kwargs) + self.engine = AsyncLLMEngine.from_engine_args(engine_args) + + def get_effective_dtype(self) -> Optional[str]: + """ + Report the actual compute dtype vLLM used after model loading. + + vLLM exposes the resolved dtype via model_config after initialization. + This captures cases like FP8 weights on A100 computing in BF16. + """ + try: + if self.llm is not None: + # Sync LLM path + dtype = self.llm.llm_engine.model_config.dtype + return str(dtype).replace("torch.", "") + elif self.engine is not None: + # Async engine path + dtype = self.engine.engine.model_config.dtype + return str(dtype).replace("torch.", "") + except Exception: + pass + # Fall back to declared dtype if introspection fails + return getattr(self, "_effective_dtype", None) + + def inference_fn_offline(self, requests: list[InferenceRequest]) -> list[InferenceResult]: + """Send all requests to vLLM at once. vLLM handles internal batching. + + total_time_ms in each returned InferenceResult is set to the wall-clock + elapsed time of the entire batch — NOT an individual per-request latency. + vLLM's sync LLM.generate() blocks until all requests finish, so there is + no per-request completion timestamp available. All results share the same + total_time_ms value, which is the correct denominator for throughput: + throughput = total_tokens / (elapsed_ms / 1000) + """ + formatted = [self._format_prompt(r.prompt) for r in requests] + t_start = time.perf_counter() + outputs = self.llm.generate(formatted, self.sampling_params) + elapsed = time.perf_counter() - t_start + + # Store output text for _run_accuracy_integrated() + self._last_accuracy_outputs = [o.outputs[0].text for o in outputs] + + results = [] + for output in outputs: + results.append(InferenceResult( + first_token_time_ms=None, + total_time_ms=elapsed * 1000, + output_tokens=len(output.outputs[0].token_ids), + input_tokens=len(output.prompt_token_ids), + success=True, + output_text=output.outputs[0].text, + )) + return results + + async def inference_fn_streaming(self, request: InferenceRequest) -> InferenceResult: + """Stream a single request, measuring TTFT.""" + from vllm.utils import random_uuid + + formatted = self._format_prompt(request.prompt) + request_id = random_uuid() + t_start = time.perf_counter() + first_token_time_ms = None + output_tokens = 0 + output_text = "" + + async for output in self.engine.generate( + formatted, self.sampling_params, request_id + ): + if ( + first_token_time_ms is None + and len(output.outputs[0].token_ids) > 0 + ): + first_token_time_ms = (time.perf_counter() - t_start) * 1000 + output_tokens = len(output.outputs[0].token_ids) + output_text = output.outputs[0].text + + total_time_ms = (time.perf_counter() - t_start) * 1000 + return InferenceResult( + first_token_time_ms=first_token_time_ms, + total_time_ms=total_time_ms, + output_tokens=output_tokens, + input_tokens=0, + success=True, + output_text=output_text, + ) + + async def inference_fn_token_stream(self, request: InferenceRequest): + """ + Async generator yielding decoded text deltas for the serve layer. + + Each yield is the delta text since the last output — new characters + only, not the full accumulated string. + + vLLM's engine.generate() yields cumulative outputs, so we track the + previous text length and slice off only the new portion each step. + """ + from vllm.utils import random_uuid + + formatted = self._format_prompt(request.prompt) + request_id = random_uuid() + prev_length = 0 + + async for output in self.engine.generate( + formatted, self.sampling_params, request_id + ): + current_text = output.outputs[0].text + delta = current_text[prev_length:] + if delta: + yield delta + prev_length = len(current_text) + + def get_peak_memory_gb(self) -> float: + try: + return torch.cuda.max_memory_allocated() / (1024 ** 3) + except Exception: + return None + + def release_resources(self) -> None: + """Release vLLM engines and distributed state.""" + if self.llm is not None: + try: + del self.llm + except Exception: + pass + self.llm = None + + if self.engine is not None: + try: + if self._loop and not self._loop.is_closed(): + self._loop.run_until_complete(self.engine.shutdown()) + except Exception: + pass + try: + del self.engine + except Exception: + pass + self.engine = None + + # Destroy vLLM's distributed state so the next engine initialisation + # creates a fresh TCPStore server. Must call destroy_model_parallel() + # first to clear vLLM's cached group references; only then is it safe + # to destroy the underlying torch process group. Skipping this step + # leaves torch.distributed.is_initialized()==True, which causes + # init_distributed_environment() to skip creating the new TCPStore + # server, so spawned worker processes can never connect (→ 600 s timeout). + try: + from vllm.distributed.parallel_state import cleanup_dist_env_and_memory + cleanup_dist_env_and_memory(shutdown_ray=False) + except Exception: + # Fallback for older vLLM builds that lack cleanup_dist_env_and_memory + try: + from vllm.distributed.parallel_state import ( + destroy_model_parallel, destroy_distributed_environment, + ) + destroy_model_parallel() + destroy_distributed_environment() + except Exception: + pass + + # Final guard: if torch.distributed is still initialized after the cleanup + # attempts above, destroy the default process group here. Without this, + # vLLM's init_distributed_environment() skips TCPStore server creation on + # the next LLM() init, so new worker processes can never join the barrier + # (→ 1800 s Gloo timeout) because the main driver calls barrier() on the + # stale old group while workers wait on a fresh one that never reaches quorum. + try: + if torch.distributed.is_initialized(): + torch.distributed.destroy_process_group() + except Exception: + pass + + def parse_args(self): + """Add vLLM/NVIDIA-specific CLI flags. Base class pre-loads runner config.""" + args = super().parse_args() + cfg = self._runner_config + + # ── Runner-specific CLI flags ───────────────────────────────────────── + # Defined here (not in benchmark_runner) — vLLM/NVIDIA-specific concepts. + import argparse + parser = argparse.ArgumentParser(add_help=False) + parser.add_argument("--tensor-parallel-size", type=int, default=None, + dest="tensor_parallel_size") + parser.add_argument("--pipeline-parallel-size", type=int, default=None, + dest="pipeline_parallel_size") + parser.add_argument("--expert-parallel-size", type=int, default=None, + dest="expert_parallel_size") + parser.add_argument("--enforce-eager", action="store_true", default=False, + dest="enforce_eager") + extra, _ = parser.parse_known_args() + + # Priority: CLI flag > yaml config > required_chips > auto-detected > default 1 + # Fully resolved by base class. + tp_size, _tp_source = self._resolve_tensor_parallel_size( + extra.tensor_parallel_size + ) + + pp_size = (extra.pipeline_parallel_size + if extra.pipeline_parallel_size is not None + else cfg.get("pipeline_parallel_size", 1)) + ep_size = (extra.expert_parallel_size + if extra.expert_parallel_size is not None + else cfg.get("expert_parallel_size", 1)) + # enforce_eager: CLI flag OR yaml setting (either activates it) + self._enforce_eager = extra.enforce_eager or cfg.get("enforce_eager", False) + + print(f" tensor_parallel_size = {tp_size} [{_tp_source}]") + if ep_size > 1: + print(f" expert_parallel_size = {ep_size} [cli/yaml]") + + if not self.SUPPORTS_MULTI_CHIP and tp_size * pp_size > 1: + print(f"Warning: {self.__class__.__name__} does not support multi-chip. " + f"Ignoring tensor_parallel_size={tp_size}, using 1.") + tp_size = 1 + pp_size = 1 + ep_size = 1 + + # Report to base class — used by _compute_run_id(), _build_result_json(), etc. + # Note: for MoE with expert parallelism, chips are shared between TP and EP + # dimensions — ep_size does not add to chip count independently. + self._parallelism = { + "tensor_parallel_size": tp_size, + "pipeline_parallel_size": pp_size, + "expert_parallel_size": ep_size, + "data_parallel_size": 1, + } + self._chip_count = tp_size * pp_size + self._precision = getattr(args, "precision", None) + return args + + def get_extra_subprocess_args(self, args) -> list[str]: + """Forward vLLM/NVIDIA-specific flags to subprocess invocations.""" + extra = [ + "--tensor-parallel-size", + str(self._parallelism.get("tensor_parallel_size", 1)), + ] + if self._parallelism.get("pipeline_parallel_size", 1) > 1: + extra += ["--pipeline-parallel-size", + str(self._parallelism["pipeline_parallel_size"])] + if self._parallelism.get("expert_parallel_size", 1) > 1: + extra += ["--expert-parallel-size", + str(self._parallelism["expert_parallel_size"])] + if self._enforce_eager: + extra += ["--enforce-eager"] + return extra + + +if __name__ == "__main__": + VLLMRunner().main() \ No newline at end of file From 3c124bdcd1ecfacba89d2f08a22821ba4b7619d2 Mon Sep 17 00:00:00 2001 From: Liang Juhao Date: Mon, 18 May 2026 10:35:56 +0000 Subject: [PATCH 2/3] update --- README.md | 2 +- ...nner_nvidia_vllm020_0f6c56e4.yaml.example} | 11 +- run.py | 5 +- runners/nvidia_vllm020_0f6c56e4/README.md | 165 ++++++++++++++++++ runners/nvidia_vllm020_0f6c56e4/install.sh | 27 +++ runners/nvidia_vllm020_0f6c56e4/meta.json | 21 +++ .../nvidia_vllm020_0f6c56e4/requirements.txt | 19 ++ .../runner.py | 118 +------------ runners/nvidia_vllm_0c1710bd/README.md | 104 ----------- runners/nvidia_vllm_0c1710bd/meta.json | 21 --- runners/nvidia_vllm_0c1710bd/requirements.txt | 33 ---- 11 files changed, 252 insertions(+), 274 deletions(-) rename configs/runner_configs/{runner_nvidia_vllm_0c1710bd.yaml.example => runner_nvidia_vllm020_0f6c56e4.yaml.example} (86%) create mode 100644 runners/nvidia_vllm020_0f6c56e4/README.md create mode 100644 runners/nvidia_vllm020_0f6c56e4/install.sh create mode 100644 runners/nvidia_vllm020_0f6c56e4/meta.json create mode 100644 runners/nvidia_vllm020_0f6c56e4/requirements.txt rename runners/{nvidia_vllm_0c1710bd => nvidia_vllm020_0f6c56e4}/runner.py (65%) delete mode 100644 runners/nvidia_vllm_0c1710bd/README.md delete mode 100644 runners/nvidia_vllm_0c1710bd/meta.json delete mode 100644 runners/nvidia_vllm_0c1710bd/requirements.txt diff --git a/README.md b/README.md index 5e29038..7fac664 100644 --- a/README.md +++ b/README.md @@ -88,7 +88,7 @@ Reference runners live under `runners/` (see each folder’s `meta.json`). The t | Hardware | Runner folder | Framework | A | B | C | D | E | F | G | |---|---|---|:-:|:-:|:-:|:-:|:-:|:-:|:-:| | NVIDIA GPU | `nvidia_sglang_c43a8309` | SGLang | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -| NVIDIA GPU | `nvidia_vllm_0c1710bd` | vLLM | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | +| NVIDIA GPU | `nvidia_vllm020_0f6c56e4` | vLLM | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | | NVIDIA GPU | `nvidia_vllm_47f5d58e` | vLLM | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | NVIDIA V100 (SM70) | `nvidia_onecat_vllm_12a253c2` | 1Cat-vLLM | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | — | ⋯ | | AMD GPU | `amd_vllm_rocm_6c18cd8f` | vLLM-ROCm | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | diff --git a/configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example b/configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml.example similarity index 86% rename from configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example rename to configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml.example index dec3cbe..0802d4d 100644 --- a/configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example +++ b/configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml.example @@ -1,6 +1,6 @@ -# AccelMark runner config — nvidia_vllm_0c1710bd (vLLM 0.20 on NVIDIA) +# AccelMark runner config — nvidia_vllm020_0f6c56e4 (vLLM 0.20 on NVIDIA) # -# Copy this file to runner_nvidia_vllm_0c1710bd.yaml (remove .example suffix) +# Copy this file to runner_nvidia_vllm020_0f6c56e4.yaml (remove .example suffix) # and edit as needed for your hardware. The actual .yaml is gitignored. # # These settings adapt the runner to your hardware environment. @@ -52,6 +52,13 @@ gpu_memory_utilization: 0.90 # are never loaded or recorded. suites: + suite_C: + # Quantization suite (FP8/W8A8/W8A16 via compressed-tensors). + # vLLM 0.20 + CUDA graphs can produce repetitive garbage on quantized + # checkpoints (accuracy ~0 while offline throughput looks normal). + # enforce_eager disables CUDAGraph — required for correct Suite C accuracy. + enforce_eager: true + suite_D: # Long-context suite — reduce batch size and reserve more memory. max_num_seqs: 64 diff --git a/run.py b/run.py index 85849c3..d7b9dba 100644 --- a/run.py +++ b/run.py @@ -128,8 +128,11 @@ def cmd_list(args) -> int: print(f" {meta.get('description', '')}") if supersedes_chain: print(f" Replaces: {supersedes_chain[0]}") + install_sh = RUNNERS_DIR / rid / "install.sh" req_path = RUNNERS_DIR / rid / "requirements.txt" - if req_path.exists(): + if install_sh.exists(): + print(f" Install: bash runners/{rid}/install.sh") + elif req_path.exists(): print(f" Install: pip install -r runners/{rid}/requirements.txt") print() diff --git a/runners/nvidia_vllm020_0f6c56e4/README.md b/runners/nvidia_vllm020_0f6c56e4/README.md new file mode 100644 index 0000000..581e54f --- /dev/null +++ b/runners/nvidia_vllm020_0f6c56e4/README.md @@ -0,0 +1,165 @@ +# nvidia_vllm020_0f6c56e4 — NVIDIA vLLM Runner (0.20.x) + +AccelMark reference runner for NVIDIA GPUs running **vLLM 0.20.x**. + +Supersedes [`nvidia_vllm_47f5d58e`](../nvidia_vllm_47f5d58e/) (vLLM 0.7.3). Use the predecessor for CUDA 11.8 / legacy stacks; use this runner for Ampere+ datacenter GPUs with CUDA 12.8 or 13.0. + +## Supported suites + +| Suite | Description | Notes | +|-------|-------------|-------| +| Suite A | Single-chip, Llama-3-8B | Speculative and burst extra scenarios | +| Suite B | Multi-chip, Llama-3-70B | Requires 4× A100/H100 or equivalent | +| Suite C | Quantization, Llama-3.1-8B | **Requires `enforce_eager: true` in runner config** — see below | +| Suite D | Long context ~28K input | `max_model_len` 30,208 | +| Suite E | Multi-chip scaling, Llama-3-8B | NVLink recommended | +| Suite F | Consumer/edge, Qwen2.5-0.5B | Pre-Ampere: use predecessor + `--enforce-eager` | +| Suite G | MoE multi-chip, Mixtral-8x7B | ≥2× A100-80GB | + +## What changed vs nvidia_vllm_47f5d58e + +| Area | 0.7.3 (predecessor) | 0.20.x (this runner) | +|---|---|---| +| Default CUDA | 12.1 | **13.0** (12.8 via `PYTORCH_INDEX`) | +| PyTorch | 2.5.1 | **2.11** (pulled by vLLM) | +| Python | 3.10+ | **3.10–3.12** | +| Transformers | v4.57 | vLLM-pinned (see `result.json` version string) | +| FlashAttention | FA2 | FA4 (MLA prefill default on supported models) | +| Quantization | fp8, compressed-tensors, gptq_marlin | + **turboquant** | +| Model runner | V1 | V2 | + +Release notes: [v0.20.0](https://github.com/vllm-project/vllm/releases/tag/v0.20.0) · [v0.20.1](https://github.com/vllm-project/vllm/releases/tag/v0.20.1). + +## Installation + +### Prerequisites + +- NVIDIA GPU, compute capability ≥ 7.0 (Volta+; Ampere+ recommended) +- **CUDA 13.0** driver/runtime (default for this stack), or **CUDA 12.8** via PyTorch index below +- **Python 3.10, 3.11, or 3.12** (not 3.13+ until vLLM supports it) +- A clean virtualenv/conda env if upgrading from `vllm==0.7.3` (mixed installs break imports) + +### Recommended: `install.sh` + +From the AccelMark repo root: + +```bash +# Create and activate a fresh env (example) +conda create -n accel python=3.12 -y +conda activate accel + +# Default install (CUDA 13.0 wheels from vLLM) +bash runners/nvidia_vllm020_0f6c56e4/install.sh +``` + +CUDA **12.8** hosts must point pip at the cu128 PyTorch index: + +```bash +PYTORCH_INDEX=https://download.pytorch.org/whl/cu128 \ + bash runners/nvidia_vllm020_0f6c56e4/install.sh +``` + +`install.sh` reads versions from `requirements.txt` and installs in three stages (pip cannot resolve `vllm` and `mistral-common[image]` in one pass). **Do not** run `pip install -r requirements.txt` directly. + +### Verify + +```bash +python -c "import vllm, torch; print('vllm', vllm.__version__, 'torch', torch.__version__, 'cuda', torch.cuda.is_available())" +``` + +### Manual install (equivalent to `install.sh`) + +```bash +pip install mistral-common==1.11.2 +pip install vllm==0.20.1 # add --extra-index-url if using PYTORCH_INDEX above +pip install "numpy>=1.26.0,<2.0" jsonschema psutil tqdm nvidia-ml-py PyYAML +``` + +### Submitter profile and local models + +```bash +cp configs/submitter.yaml.example configs/submitter.yaml # set submitted_by +cp configs/models_local.yaml.example configs/models_local.yaml # optional local paths +``` + +## Usage + +```bash +python run.py --runner nvidia_vllm020_0f6c56e4 --suite suite_A +python run.py --runner nvidia_vllm020_0f6c56e4 --suite suite_B --tensor-parallel-size 4 +python run.py --runner nvidia_vllm020_0f6c56e4 --suite suite_C +``` + +Or invoke the runner directly: + +```bash +python runners/nvidia_vllm020_0f6c56e4/runner.py --suite suite_F --scenario offline +``` + +## Runner config + +```bash +cp configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml.example \ + configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml +``` + +Merge priority: CLI flags > suite-specific section > global defaults. + +### Suite C — quantization (`enforce_eager` required) + +vLLM 0.20 enables CUDA graphs by default. With `compressed-tensors` checkpoints (FP8, W8A8, W8A16), graphs can produce **repetitive garbage output**: offline throughput looks normal but MMLU accuracy drops to ~0. + +The example config sets this only for Suite C so other suites keep CUDA graphs: + +```yaml +suites: + suite_C: + enforce_eager: true +``` + +CLI override: `--enforce-eager`. Without it, Suite C accuracy results are invalid even if throughput is high. + +### Optional `engine_kwargs` (0.20) + +```yaml +engine_kwargs: + attention_backend: FLASH_ATTN_4 + # compilation_config: + # cudagraph_mode: full_and_piecewise + # kv_cache_dtype: turboquant # experimental; suite C +``` + +See [vLLM EngineArgs](https://docs.vllm.ai/en/latest/api/vllm/engine/arg_utils.html). + +## Troubleshooting + +### Large-memory GPUs (H20, A100 80GB) — SIGFPE / silent crash + +Symptom: subprocess exits with `SIGFPE (return code -8)` after model load or on first batch. + +```bash +pip install --upgrade nvidia-cublas-cu13 +``` + +On CUDA 12.8 stacks use `nvidia-cublas-cu12` instead. Details: [predecessor README](../nvidia_vllm_47f5d58e/README.md#large-memory-gpus-h20-a100-80-gb-etc). + +### Pre-Ampere (V100, T4, RTX 20xx) + +This runner targets Ampere+ with CUDA 12.8/13.0. For Volta/Turing, use [`nvidia_vllm_47f5d58e`](../nvidia_vllm_47f5d58e/) with `--enforce-eager` (BF16→FP16 fallback, no CUDA graphs). See the predecessor README for Suite F / Suite A on V100. + +### Suite C accuracy ~0 but offline OK + +Enable `enforce_eager` for `suite_C` in the runner config (see above) and re-run the accuracy scenario. + +## Hardware matrix + +Full GPU compatibility table: [`nvidia_vllm_47f5d58e/README.md`](../nvidia_vllm_47f5d58e/README.md#hardware-compatibility). + +## Files + +| File | Purpose | +|------|---------| +| `runner.py` | Runner implementation | +| `meta.json` | Runner metadata and suite support | +| `requirements.txt` | Pinned dependency list (source of truth) | +| `install.sh` | Staged pip install | diff --git a/runners/nvidia_vllm020_0f6c56e4/install.sh b/runners/nvidia_vllm020_0f6c56e4/install.sh new file mode 100644 index 0000000..e4e8292 --- /dev/null +++ b/runners/nvidia_vllm020_0f6c56e4/install.sh @@ -0,0 +1,27 @@ +#!/usr/bin/env bash +# Install dependencies from requirements.txt in three stages. +# pip cannot resolve vllm and mistral-common[image] in a single install pass. +set -euo pipefail + +RUNNER_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REQ="${RUNNER_DIR}/requirements.txt" +EXTRA=() +if [[ -n "${PYTORCH_INDEX:-}" ]]; then + EXTRA=(--extra-index-url "${PYTORCH_INDEX}") +fi + +line() { awk -v p="$1" '$0 ~ "^" p "[=<>]" { print; exit }' "${REQ}"; } + +echo "==> $(line mistral-common)" +pip install "$(line mistral-common)" + +echo "==> $(line vllm)" +pip install "$(line vllm)" "${EXTRA[@]}" + +TMP="$(mktemp)" +trap 'rm -f "${TMP}"' EXIT +awk '!/^#/ && NF && $0 !~ /^mistral-common/ && $0 !~ /^vllm/' "${REQ}" > "${TMP}" +echo "==> AccelMark utilities" +pip install -r "${TMP}" + +python -c "import vllm; print('OK — vllm', vllm.__version__)" diff --git a/runners/nvidia_vllm020_0f6c56e4/meta.json b/runners/nvidia_vllm020_0f6c56e4/meta.json new file mode 100644 index 0000000..fcf9f0c --- /dev/null +++ b/runners/nvidia_vllm020_0f6c56e4/meta.json @@ -0,0 +1,21 @@ +{ + "id": "nvidia_vllm020_0f6c56e4", + "platform": "nvidia", + "name": "vLLM 0.20 on NVIDIA", + "framework": "vLLM", + "submitted_by": "JuhaoLiang1997", + "description": "AccelMark reference runner for NVIDIA GPUs using vLLM 0.20.x. Supersedes nvidia_vllm_47f5d58e (vLLM 0.7.3). Supports suites A–G.", + "supersedes_chain": [], + "notes": "vLLM 0.20.x line: torch 2.11, CUDA 13.0 default. Adds turboquant backend. Suite C requires enforce_eager in runner config (see README).", + "created": "2026-05-15", + "hardware_label": null, + "suite_support": { + "A": "pending", + "B": "pending", + "C": "pending", + "D": "pending", + "E": "pending", + "F": "pending", + "G": "pending" + } +} diff --git a/runners/nvidia_vllm020_0f6c56e4/requirements.txt b/runners/nvidia_vllm020_0f6c56e4/requirements.txt new file mode 100644 index 0000000..b09fcee --- /dev/null +++ b/runners/nvidia_vllm020_0f6c56e4/requirements.txt @@ -0,0 +1,19 @@ +# AccelMark — NVIDIA vLLM 0.20.x dependencies +# +# Install: bash install.sh +# Do not: pip install -r requirements.txt (pip mistral-common[image] resolver bug) +# +# Python 3.10–3.12. Reference stack: torch 2.11 + vllm 0.20.1 + CUDA 13.0 +# CUDA 12.8: PYTORCH_INDEX=https://download.pytorch.org/whl/cu128 bash install.sh + +# --- vLLM stack (install.sh stages these; torch/transformers pulled by vllm) --- +mistral-common==1.11.2 +vllm==0.20.1 + +# --- AccelMark utilities --- +numpy>=1.26.0,<2.0 +jsonschema>=4.20.0 +psutil>=7.0.0 +tqdm>=4.66.0 +nvidia-ml-py>=13.0 +PyYAML>=6.0 diff --git a/runners/nvidia_vllm_0c1710bd/runner.py b/runners/nvidia_vllm020_0f6c56e4/runner.py similarity index 65% rename from runners/nvidia_vllm_0c1710bd/runner.py rename to runners/nvidia_vllm020_0f6c56e4/runner.py index 4321d96..8383dfa 100644 --- a/runners/nvidia_vllm_0c1710bd/runner.py +++ b/runners/nvidia_vllm020_0f6c56e4/runner.py @@ -1,26 +1,7 @@ """ -AccelMark — NVIDIA vLLM benchmark script (vLLM 0.20.x line). - -Implements BenchmarkRunner for vLLM 0.20.x on NVIDIA GPUs. This runner -supersedes ``nvidia_vllm_47f5d58e`` (the 0.7.3 line) and updates the -reference stack to the 2026 vLLM major release. - -What changed relative to the predecessor runner: - - - **Dependencies bumped** to the vLLM 0.20.x reference: torch 2.11, - CUDA 13.0 (or 12.8 via opt-in extra-index), HuggingFace Transformers v5, - Python 3.14 compatible. See ``requirements.txt`` for the pinned list. - - **TurboQuant 2-bit KV cache** declared as a quantization backend - (``turboquant``) — new in 0.20.0 and not available on older vLLM lines. - Other backends (FP8, compressed-tensors, gptq_marlin) are preserved. - - **Framework version string** now reports both ``vllm`` and - ``transformers`` versions so result.json captures the v5 transition. - -Everything else is byte-identical in structure to the previous runner — -0.20 keeps the ``LLM`` / ``AsyncLLMEngine`` / ``SamplingParams`` public API. -The EngineArgs-field filter already handles unknown 0.20 kwargs gracefully, -so existing runner-config YAMLs continue to work after upgrade. +AccelMark — NVIDIA vLLM benchmark script (vLLM 0.20.x). +Implements BenchmarkRunner for vLLM 0.20.x on NVIDIA GPUs. All orchestration logic lives in runners/benchmark_runner.py. """ @@ -30,7 +11,6 @@ from pathlib import Path from typing import Optional -# Add repo root to path _REPO_ROOT = Path(__file__).resolve().parent.parent.parent sys.path.insert(0, str(_REPO_ROOT)) @@ -43,8 +23,6 @@ from loadgen.types import InferenceResult - -# Suppress per-request vLLM logs by default import logging logging.getLogger("vllm.engine.async_llm_engine").setLevel(logging.WARNING) logging.getLogger("vllm.engine.llm_engine").setLevel(logging.WARNING) @@ -58,11 +36,7 @@ class VLLMRunner(BenchmarkRunner): SUPPORTS_ONLINE = True SUPPORTS_MULTI_CHIP = True - # vLLM on NVIDIA supports all precisions — hardware detection in BenchmarkRunner - # will automatically restrict to FP16 on V100/T4 SUPPORTED_PRECISIONS = ["bf16", "fp16", "fp32"] - # 0.20.0 added the TurboQuant 2-bit KV cache backend (4x KV capacity vs FP16). - # FP8 / compressed-tensors / gptq_marlin remain from the 0.7.x baseline. SUPPORTED_QUANTIZATION_BACKENDS = [ "fp8", "compressed-tensors", @@ -90,12 +64,6 @@ def _get_framework_name(self) -> str: return "vLLM" def _get_framework_version(self) -> str: - """Report vllm + transformers versions. - - vLLM 0.20 ships with Transformers v5 support; including the - transformers version in result.json makes it explicit when a result - was generated against the v4 vs v5 line. - """ vllm_v = "unknown" try: import vllm @@ -131,10 +99,6 @@ def load_model(self, model_path: str, parallelism: dict) -> None: gpu_memory_util = cfg.get("gpu_memory_utilization", 0.90) extra_kwargs = dict(cfg.get("engine_kwargs") or {}) - # ── Filter engine_kwargs to only fields this vLLM version accepts ───── - # Avoids TypeError when the runner config YAML references a field that - # doesn't exist in the installed vLLM version (EngineArgs is a strict - # dataclass — unknown keyword arguments raise TypeError immediately). try: import dataclasses from vllm.engine.arg_utils import EngineArgs as _EngineArgs @@ -145,43 +109,29 @@ def load_model(self, model_path: str, parallelism: dict) -> None: f"vLLM version and will be ignored: {list(_dropped)}") extra_kwargs = {k: v for k, v in extra_kwargs.items() if k in _valid} except Exception: - pass # If introspection fails, pass kwargs as-is and let vLLM report the error + pass - # Use precision resolved by BenchmarkRunner._resolve_precision() effective_precision = getattr(self, "_effective_precision", "BF16").upper() precision = getattr(self, "_precision", None) or effective_precision - # dtype_override and quantization may be injected by benchmark_runner from - # precision_model_map entry fields (dtype_override, engine_kwargs.quantization). - # These take priority over the runner's own precision→dtype mapping below. _dtype_override = getattr(self, "_precision_dtype_override", None) _prec_eng_kwargs = dict(getattr(self, "_precision_engine_kwargs", None) or {}) quantization = _prec_eng_kwargs.pop("quantization", None) - # Map native precision names to explicit dtypes. - # Quantized formats (anything not in this map) use dtype="auto" — vLLM reads - # the storage dtype from the checkpoint's config.json, and the quantization - # kernel is set explicitly via the `quantization` kwarg already populated above - # from precision_model_map engine_kwargs. No fallback guessing needed here. _NATIVE_DTYPE_MAP = { "BF16": "bfloat16", "FP16": "float16", "FP32": "float32", } dtype = _NATIVE_DTYPE_MAP.get(precision, "auto") - self._quantization_method = quantization # None for native, explicit str for quantized + self._quantization_method = quantization - # dtype_override from precision_model_map wins over the mapping above. - # Used for e.g. FP16 baseline on pre-Ampere hardware (V100/T4). if _dtype_override: dtype = _dtype_override - # Merge remaining precision_engine_kwargs (after popping quantization) into - # extra_kwargs so they reach LLM() / AsyncEngineArgs. Runner YAML engine_kwargs - # still take final precedence via the **extra_kwargs spread at the end. if _prec_eng_kwargs: - _prec_eng_kwargs.update(extra_kwargs) # runner YAML wins on conflict + _prec_eng_kwargs.update(extra_kwargs) extra_kwargs = _prec_eng_kwargs print(f"Loading model: precision={precision}, dtype={dtype}" @@ -226,8 +176,6 @@ def load_model(self, model_path: str, parallelism: dict) -> None: trust_remote_code=False, enforce_eager=enforce_eager, gpu_memory_utilization=gpu_memory_util, - # engine_kwargs values override named fields above if the same key appears in both. - # This is intentional — engine_kwargs is the power-user escape hatch. **extra_kwargs, ) if ep_size > 1: @@ -238,42 +186,23 @@ def load_model(self, model_path: str, parallelism: dict) -> None: self.engine = AsyncLLMEngine.from_engine_args(engine_args) def get_effective_dtype(self) -> Optional[str]: - """ - Report the actual compute dtype vLLM used after model loading. - - vLLM exposes the resolved dtype via model_config after initialization. - This captures cases like FP8 weights on A100 computing in BF16. - """ try: if self.llm is not None: - # Sync LLM path dtype = self.llm.llm_engine.model_config.dtype return str(dtype).replace("torch.", "") elif self.engine is not None: - # Async engine path dtype = self.engine.engine.model_config.dtype return str(dtype).replace("torch.", "") except Exception: pass - # Fall back to declared dtype if introspection fails return getattr(self, "_effective_dtype", None) def inference_fn_offline(self, requests: list[InferenceRequest]) -> list[InferenceResult]: - """Send all requests to vLLM at once. vLLM handles internal batching. - - total_time_ms in each returned InferenceResult is set to the wall-clock - elapsed time of the entire batch — NOT an individual per-request latency. - vLLM's sync LLM.generate() blocks until all requests finish, so there is - no per-request completion timestamp available. All results share the same - total_time_ms value, which is the correct denominator for throughput: - throughput = total_tokens / (elapsed_ms / 1000) - """ formatted = [self._format_prompt(r.prompt) for r in requests] t_start = time.perf_counter() outputs = self.llm.generate(formatted, self.sampling_params) elapsed = time.perf_counter() - t_start - # Store output text for _run_accuracy_integrated() self._last_accuracy_outputs = [o.outputs[0].text for o in outputs] results = [] @@ -289,7 +218,6 @@ def inference_fn_offline(self, requests: list[InferenceRequest]) -> list[Inferen return results async def inference_fn_streaming(self, request: InferenceRequest) -> InferenceResult: - """Stream a single request, measuring TTFT.""" from vllm.utils import random_uuid formatted = self._format_prompt(request.prompt) @@ -321,15 +249,6 @@ async def inference_fn_streaming(self, request: InferenceRequest) -> InferenceRe ) async def inference_fn_token_stream(self, request: InferenceRequest): - """ - Async generator yielding decoded text deltas for the serve layer. - - Each yield is the delta text since the last output — new characters - only, not the full accumulated string. - - vLLM's engine.generate() yields cumulative outputs, so we track the - previous text length and slice off only the new portion each step. - """ from vllm.utils import random_uuid formatted = self._format_prompt(request.prompt) @@ -352,7 +271,6 @@ def get_peak_memory_gb(self) -> float: return None def release_resources(self) -> None: - """Release vLLM engines and distributed state.""" if self.llm is not None: try: del self.llm @@ -372,18 +290,10 @@ def release_resources(self) -> None: pass self.engine = None - # Destroy vLLM's distributed state so the next engine initialisation - # creates a fresh TCPStore server. Must call destroy_model_parallel() - # first to clear vLLM's cached group references; only then is it safe - # to destroy the underlying torch process group. Skipping this step - # leaves torch.distributed.is_initialized()==True, which causes - # init_distributed_environment() to skip creating the new TCPStore - # server, so spawned worker processes can never connect (→ 600 s timeout). try: from vllm.distributed.parallel_state import cleanup_dist_env_and_memory cleanup_dist_env_and_memory(shutdown_ray=False) except Exception: - # Fallback for older vLLM builds that lack cleanup_dist_env_and_memory try: from vllm.distributed.parallel_state import ( destroy_model_parallel, destroy_distributed_environment, @@ -393,12 +303,6 @@ def release_resources(self) -> None: except Exception: pass - # Final guard: if torch.distributed is still initialized after the cleanup - # attempts above, destroy the default process group here. Without this, - # vLLM's init_distributed_environment() skips TCPStore server creation on - # the next LLM() init, so new worker processes can never join the barrier - # (→ 1800 s Gloo timeout) because the main driver calls barrier() on the - # stale old group while workers wait on a fresh one that never reaches quorum. try: if torch.distributed.is_initialized(): torch.distributed.destroy_process_group() @@ -406,12 +310,9 @@ def release_resources(self) -> None: pass def parse_args(self): - """Add vLLM/NVIDIA-specific CLI flags. Base class pre-loads runner config.""" args = super().parse_args() cfg = self._runner_config - # ── Runner-specific CLI flags ───────────────────────────────────────── - # Defined here (not in benchmark_runner) — vLLM/NVIDIA-specific concepts. import argparse parser = argparse.ArgumentParser(add_help=False) parser.add_argument("--tensor-parallel-size", type=int, default=None, @@ -424,8 +325,6 @@ def parse_args(self): dest="enforce_eager") extra, _ = parser.parse_known_args() - # Priority: CLI flag > yaml config > required_chips > auto-detected > default 1 - # Fully resolved by base class. tp_size, _tp_source = self._resolve_tensor_parallel_size( extra.tensor_parallel_size ) @@ -436,7 +335,6 @@ def parse_args(self): ep_size = (extra.expert_parallel_size if extra.expert_parallel_size is not None else cfg.get("expert_parallel_size", 1)) - # enforce_eager: CLI flag OR yaml setting (either activates it) self._enforce_eager = extra.enforce_eager or cfg.get("enforce_eager", False) print(f" tensor_parallel_size = {tp_size} [{_tp_source}]") @@ -450,9 +348,6 @@ def parse_args(self): pp_size = 1 ep_size = 1 - # Report to base class — used by _compute_run_id(), _build_result_json(), etc. - # Note: for MoE with expert parallelism, chips are shared between TP and EP - # dimensions — ep_size does not add to chip count independently. self._parallelism = { "tensor_parallel_size": tp_size, "pipeline_parallel_size": pp_size, @@ -464,7 +359,6 @@ def parse_args(self): return args def get_extra_subprocess_args(self, args) -> list[str]: - """Forward vLLM/NVIDIA-specific flags to subprocess invocations.""" extra = [ "--tensor-parallel-size", str(self._parallelism.get("tensor_parallel_size", 1)), @@ -481,4 +375,4 @@ def get_extra_subprocess_args(self, args) -> list[str]: if __name__ == "__main__": - VLLMRunner().main() \ No newline at end of file + VLLMRunner().main() diff --git a/runners/nvidia_vllm_0c1710bd/README.md b/runners/nvidia_vllm_0c1710bd/README.md deleted file mode 100644 index 780596f..0000000 --- a/runners/nvidia_vllm_0c1710bd/README.md +++ /dev/null @@ -1,104 +0,0 @@ -# nvidia_vllm_0c1710bd — NVIDIA vLLM Runner (0.20.x line) - -AccelMark reference runner for NVIDIA GPUs running **vLLM 0.20.x** — -the 2026 major release. - -This runner supersedes [`nvidia_vllm_47f5d58e`](../nvidia_vllm_47f5d58e/) -(vLLM 0.7.3). The predecessor remains runnable; this folder is what new -results on Ampere / Hopper / Blackwell hosts should reference going forward. - -## What changed vs nvidia_vllm_47f5d58e - -| Area | 0.7.3 (predecessor) | 0.20.x (this runner) | -|---|---|---| -| Default CUDA | 12.1 | **13.0** (12.8 still supported via the PyTorch cu128 index) | -| PyTorch | 2.5.1 | **2.11.0** | -| Python | 3.10+ | 3.10+ (3.14 newly supported) | -| HuggingFace Transformers | v4.57 | **v5.x** | -| FlashAttention | FA2 | **FA4** (MLA prefill default) | -| Quantization backends declared | fp8, compressed-tensors, gptq_marlin | + **turboquant** (2-bit KV cache, 4x KV capacity) | -| Model Runner | V1 | **V2** (Eagle prefill full-CUDA-graph, fused probabilistic rejection sampling) | -| DeepSeek V4 | — | ✅ | -| Result version string | `vllm 0.7.3` | `vllm 0.20.1+transformers-5.1.0` | - -Detailed release notes: -[vLLM v0.20.0](https://github.com/vllm-project/vllm/releases/tag/v0.20.0) -· [vLLM v0.20.1 patch](https://github.com/vllm-project/vllm/releases/tag/v0.20.1). - -## Supported suites - -Same coverage as the predecessor runner — **all suites A–G**. See -[`runners/nvidia_vllm_47f5d58e/README.md`](../nvidia_vllm_47f5d58e/README.md) -for the per-GPU hardware compatibility matrix; the same rows apply here -because the runner code is a structural clone. - -## Installation - -```bash -# 1. Standard install — CUDA 13.0 stack -pip install -r runners/nvidia_vllm_0c1710bd/requirements.txt - -# 2. CUDA 12.8 stack (for hosts still on the cu128 driver): -pip install -r runners/nvidia_vllm_0c1710bd/requirements.txt \ - --extra-index-url https://download.pytorch.org/whl/cu128 -``` - -> Older runners pinned `nvidia-cublas-cu12`; on 0.20 + CUDA 13.0 use -> `nvidia-cublas-cu13` if you encounter the cuBLAS SIGFPE on large-memory -> GPUs (same fix philosophy as the predecessor's README — only the package -> name changes). - -## Basic usage - -Identical to the predecessor: - -```bash -python run.py --runner nvidia_vllm_0c1710bd --suite suite_A -python run.py --runner nvidia_vllm_0c1710bd --suite suite_B \ - --tensor-parallel-size 4 -``` - -## 0.20-specific knobs you may want to enable - -`engine_kwargs` in the runner config are passed straight to -`AsyncEngineArgs` / `LLM`. The runner already filters unknown fields, so -adding 0.20-only keys is safe even if you downgrade vLLM later — they will -be dropped with a warning rather than blowing up at startup. - -```yaml -# configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml -engine_kwargs: - # FlashAttention 4 (default on 0.20 — listed here only if you need - # to pin it for reproducibility): - attention_backend: FLASH_ATTN_4 - # CUDA graph improvements added in 0.20: - compilation_config: - cudagraph_mode: full_and_piecewise - # TurboQuant 2-bit KV cache (suite C with --precision turboquant): - # kv_cache_dtype: turboquant -``` - -## Runner config - -Copy the example: - -```bash -cp configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml.example \ - configs/runner_configs/runner_nvidia_vllm_0c1710bd.yaml -``` - -Field names and defaults are identical to the predecessor — see -[`runner_nvidia_vllm_47f5d58e.yaml.example`](../../configs/runner_configs/runner_nvidia_vllm_47f5d58e.yaml.example) -for the field reference. - -## Status - -- **Code:** structurally identical to the predecessor + the small additions - documented above. The change is principally a dependency bump. -- **Validation:** not yet run end-to-end on a 0.20 install at the time of - commit. The predecessor's test_smoke.py path applies once the test file is - ported over. -- **Predecessor:** `nvidia_vllm_47f5d58e/meta.json` will receive a - `deprecated_by` pointer in a follow-up PR once a smoke result against this - runner has been verified. Until then the predecessor remains the - recommended runner for production result submissions. diff --git a/runners/nvidia_vllm_0c1710bd/meta.json b/runners/nvidia_vllm_0c1710bd/meta.json deleted file mode 100644 index 876d403..0000000 --- a/runners/nvidia_vllm_0c1710bd/meta.json +++ /dev/null @@ -1,21 +0,0 @@ -{ - "id": "nvidia_vllm_0c1710bd", - "platform": "nvidia", - "name": "vLLM 0.20 on NVIDIA", - "framework": "vLLM", - "submitted_by": "JuhaoLiang1997", - "description": "AccelMark reference runner for NVIDIA GPUs running vLLM 0.20.x. Updates the predecessor (nvidia_vllm_47f5d58e, vLLM 0.7.3) to the 2026 vLLM major release: torch 2.11, CUDA 13.0 default (12.8 still supported), HuggingFace Transformers v5, FlashAttention 4 MLA prefill, Model Runner V2, and TurboQuant 2-bit KV cache. Supports all suites A–G.", - "supersedes_chain": ["nvidia_vllm_47f5d58e"], - "notes": "Initial 0.20.x runner. SUPPORTED_QUANTIZATION_BACKENDS adds 'turboquant' on top of fp8 / compressed-tensors / gptq_marlin. Framework version string now reports vllm + transformers together so v4 vs v5 transformers transitions are visible in result.json. Reuses EngineArgs-field filtering to absorb new 0.20 engine kwargs without breaking old configs. Predecessor is still runnable and remains the recommended choice for CUDA 11.8 / pre-Ampere hosts; deprecated_by will be set on nvidia_vllm_47f5d58e in a follow-up PR once a smoke result on this runner exists.", - "created": "2026-05-15", - "hardware_label": null, - "suite_support": { - "A": "pending", - "B": "pending", - "C": "pending", - "D": "pending", - "E": "pending", - "F": "pending", - "G": "pending" - } -} diff --git a/runners/nvidia_vllm_0c1710bd/requirements.txt b/runners/nvidia_vllm_0c1710bd/requirements.txt deleted file mode 100644 index 50b1db7..0000000 --- a/runners/nvidia_vllm_0c1710bd/requirements.txt +++ /dev/null @@ -1,33 +0,0 @@ -# AccelMark -- NVIDIA platform dependencies (vLLM 0.20.x line) -# Reference tested combination: torch 2.11 + vLLM 0.20.1 + CUDA 13.0 -# (CUDA 12.8 still supported via --extra-index-url; see README.md) -# -# Core -torch==2.11.0 -torchvision==0.26.0 -torchaudio==2.11.0 - -# LLM inference -vllm==0.20.1 - -# Transformers v5 (introduced as required by vLLM 0.20.0) -transformers==5.1.0 -tokenizers==0.23.0 -huggingface-hub==0.36.0 -accelerate==1.10.1 -safetensors==0.7.0 - -# AccelMark dependencies -numpy==1.26.4 -jsonschema==4.25.1 -psutil==7.1.0 -tqdm==4.67.1 - -# NVIDIA monitoring (for power and GPU stats) -nvidia-ml-py==13.580.82 - -# Async support -aiohttp==3.12.15 - -# Config file parsing -PyYAML==6.0.2 From 91c450b20afe55ced6b3842d882afefaff4687e7 Mon Sep 17 00:00:00 2001 From: Liang Juhao Date: Tue, 19 May 2026 00:51:48 +0000 Subject: [PATCH 3/3] upload vllm 0.20 A100 results --- ...unner_nvidia_vllm020_0f6c56e4.yaml.example | 6 +- .../accuracy/accuracy.json | 8 + .../burst/result.json | 160 ++ .../env_info.json | 49 + .../interactive/result.json | 132 ++ .../offline/result.json | 165 ++ .../online/result.json | 164 ++ .../result.json | 572 +++++++ .../sustained/result.json | 424 +++++ .../bf16/accuracy/accuracy.json | 8 + .../bf16/offline/result.json | 178 ++ .../bf16/online/result.json | 176 ++ .../bf16/result.json | 395 +++++ .../bf16/sustained/result.json | 274 +++ .../env_info.json | 49 + .../result.json | 1519 +++++++++++++++++ .../w4a16/accuracy/accuracy.json | 8 + .../w4a16/offline/result.json | 183 ++ .../w4a16/online/result.json | 181 ++ .../w4a16/result.json | 400 +++++ .../w4a16/sustained/result.json | 279 +++ .../w8a16/accuracy/accuracy.json | 8 + .../w8a16/offline/result.json | 183 ++ .../w8a16/online/result.json | 181 ++ .../w8a16/result.json | 400 +++++ .../w8a16/sustained/result.json | 279 +++ .../w8a8/accuracy/accuracy.json | 8 + .../w8a8/offline/result.json | 183 ++ .../w8a8/online/result.json | 181 ++ .../w8a8/result.json | 400 +++++ .../w8a8/sustained/result.json | 279 +++ .../accuracy/accuracy.json | 8 + .../env_info.json | 49 + .../interactive/result.json | 132 ++ .../offline/result.json | 152 ++ .../online/result.json | 169 ++ .../result.json | 519 ++++++ .../sustained/result.json | 424 +++++ .../accuracy/accuracy.json | 8 + .../env_info.json | 49 + .../interactive/result.json | 137 ++ .../offline/result.json | 170 ++ .../online/result.json | 157 ++ .../result.json | 375 ++++ .../sustained/result.json | 279 +++ runners/benchmark_runner.py | 15 +- runners/nvidia_vllm020_0f6c56e4/README.md | 14 +- 47 files changed, 10078 insertions(+), 11 deletions(-) create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/accuracy/accuracy.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/burst/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/env_info.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/interactive/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/offline/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/online/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/sustained/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/accuracy/accuracy.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/offline/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/online/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/sustained/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/env_info.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/accuracy/accuracy.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/offline/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/online/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/sustained/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/accuracy/accuracy.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/offline/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/online/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/sustained/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/accuracy/accuracy.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/offline/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/online/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/sustained/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/accuracy/accuracy.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/env_info.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/interactive/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/offline/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/online/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/sustained/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/accuracy/accuracy.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/env_info.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/interactive/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/offline/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/online/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/result.json create mode 100644 results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/sustained/result.json diff --git a/configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml.example b/configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml.example index 0802d4d..906f7d7 100644 --- a/configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml.example +++ b/configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml.example @@ -54,9 +54,9 @@ gpu_memory_utilization: 0.90 suites: suite_C: # Quantization suite (FP8/W8A8/W8A16 via compressed-tensors). - # vLLM 0.20 + CUDA graphs can produce repetitive garbage on quantized - # checkpoints (accuracy ~0 while offline throughput looks normal). - # enforce_eager disables CUDAGraph — required for correct Suite C accuracy. + # enforce_eager disables CUDA graphs — required for W8A8/W8A16 accuracy on vLLM 0.20. + # Note: FP8 still fails on Ampere (A100, sm < 8.9): vLLM 0.20 uses broken Marlin + # weight-only FP8 fallback. Use H100+ for Suite C FP8, or vLLM 0.7.3 runner on A100. enforce_eager: true suite_D: diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/accuracy/accuracy.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/accuracy/accuracy.json new file mode 100644 index 0000000..d837d1a --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/accuracy/accuracy.json @@ -0,0 +1,8 @@ +{ + "subset_score": 0.61, + "baseline_delta": 0.01, + "valid": true, + "framework": "vLLM", + "precision": "BF16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/burst/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/burst/result.json new file mode 100644 index 0000000..133a65b --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/burst/result.json @@ -0,0 +1,160 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_A", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T04:31:01.283634+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Meta-Llama-3-8B-Instruct", + "model_revision": "8afb486c1db24fe5011ec46dfbe5b5dccdb575c2", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "burst", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "burst": { + "sla_ttft_ms": 500, + "burst_steady_qps": 5, + "burst_peak_qps": 25, + "burst_duration_seconds": 30, + "burst_interval_seconds": 120, + "steady_requests_total": 1812, + "burst_requests_total": 2245, + "steady_ttft_p50_ms": 39.39, + "steady_ttft_p99_ms": 79.1, + "burst_ttft_p50_ms": 7082.87, + "burst_ttft_p99_ms": 17212.99, + "sla_met_during_burst": false, + "burst_degradation_ratio": 217.605, + "results_by_cycle": [ + { + "cycle": 1, + "steady_requests": 581, + "burst_requests": 760, + "steady_ttft_p99_ms": 89.81, + "burst_ttft_p99_ms": 17855.37 + }, + { + "cycle": 2, + "steady_requests": 595, + "burst_requests": 734, + "steady_ttft_p99_ms": 47.72, + "burst_ttft_p99_ms": 16592.12 + }, + { + "cycle": 3, + "steady_requests": 636, + "burst_requests": 751, + "steady_ttft_p99_ms": 48.05, + "burst_ttft_p99_ms": 16579.9 + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "05:55:58", + "run_id": "8f83bfab", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T05:46:54.960197+00:00", + "benchmark_end_time": "2026-05-18T05:55:58.450157+00:00", + "benchmark_elapsed_minutes": 9.1, + "model_load_seconds": 39.5 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/env_info.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/env_info.json new file mode 100644 index 0000000..ccee920 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/env_info.json @@ -0,0 +1,49 @@ +{ + "collected_at": "2026-05-18T04:31:01.283634+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/interactive/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/interactive/result.json new file mode 100644 index 0000000..5232cac --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/interactive/result.json @@ -0,0 +1,132 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_A", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T04:31:01.283634+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Meta-Llama-3-8B-Instruct", + "model_revision": "8afb486c1db24fe5011ec46dfbe5b5dccdb575c2", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "interactive", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "interactive": { + "ttft_ms_p50": 27.78, + "ttft_ms_p90": 42.95, + "ttft_ms_p99": 59.38, + "tpot_ms_p50": 10.77, + "tpot_ms_p90": 10.84, + "tpot_ms_p99": 10.86, + "peak_memory_gb": null, + "elapsed_seconds_median": 570.0 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "05:14:04", + "run_id": "8f83bfab", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T04:45:25.062974+00:00", + "benchmark_end_time": "2026-05-18T05:14:04.982045+00:00", + "benchmark_elapsed_minutes": 28.7, + "model_load_seconds": 40.4 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/offline/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/offline/result.json new file mode 100644 index 0000000..89283ef --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/offline/result.json @@ -0,0 +1,165 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_A", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T04:31:01.283634+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Meta-Llama-3-8B-Instruct", + "model_revision": "8afb486c1db24fe5011ec46dfbe5b5dccdb575c2", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": "bfloat16", + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "offline", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 8, + "throughput_tokens_per_sec": 3871.19, + "throughput_tokens_per_sec_per_chip": 3871.19, + "throughput_tokens_per_sec_total": 6746.91, + "elapsed_seconds_median": 8.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 32, + "throughput_tokens_per_sec": 3916.69, + "throughput_tokens_per_sec_per_chip": 3916.69, + "throughput_tokens_per_sec_total": 6785.8, + "elapsed_seconds_median": 8.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 128, + "throughput_tokens_per_sec": 3908.22, + "throughput_tokens_per_sec_per_chip": 3908.22, + "throughput_tokens_per_sec_total": 6779.65, + "elapsed_seconds_median": 8.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "04:35:38", + "run_id": "8f83bfab", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T04:33:49.881300+00:00", + "benchmark_end_time": "2026-05-18T04:35:38.648555+00:00", + "benchmark_elapsed_minutes": 1.8, + "model_load_seconds": 58.5 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/online/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/online/result.json new file mode 100644 index 0000000..8fe97d2 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/online/result.json @@ -0,0 +1,164 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_A", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T04:31:01.283634+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Meta-Llama-3-8B-Instruct", + "model_revision": "8afb486c1db24fe5011ec46dfbe5b5dccdb575c2", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "online", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 5, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 40.2, + "ttft_ms_p90": 60.2, + "ttft_ms_p99": 92.29, + "tpot_ms_p50": 13.21, + "tpot_ms_p90": 14.2, + "tpot_ms_p99": 14.71, + "elapsed_seconds_median": 69.1, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 75.24, + "ttft_ms_p90": 4335.91, + "ttft_ms_p99": 5305.98, + "tpot_ms_p50": 22.28, + "tpot_ms_p90": 24.58, + "tpot_ms_p99": 26.25, + "elapsed_seconds_median": 25.0, + "sla_met": false + }, + { + "target_qps": 100, + "achieved_qps": 100.0, + "ttft_ms_p50": 1710.17, + "ttft_ms_p90": 10195.6, + "ttft_ms_p99": 10706.9, + "tpot_ms_p50": 22.18, + "tpot_ms_p90": 24.52, + "tpot_ms_p99": 28.04, + "elapsed_seconds_median": 22.3, + "sla_met": false + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "04:42:53", + "run_id": "8f83bfab", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T04:37:07.775120+00:00", + "benchmark_end_time": "2026-05-18T04:42:53.648821+00:00", + "benchmark_elapsed_minutes": 5.8, + "model_load_seconds": 60.9 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/result.json new file mode 100644 index 0000000..ca26d93 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/result.json @@ -0,0 +1,572 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_A", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T04:31:01.283634+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Meta-Llama-3-8B-Instruct", + "model_revision": "8afb486c1db24fe5011ec46dfbe5b5dccdb575c2", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": "bfloat16", + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenarios_run": [ + "offline", + "online", + "interactive", + "sustained", + "burst" + ], + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "num_runs": 3, + "extra_config": null + }, + "metrics": { + "derived": {}, + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 8, + "throughput_tokens_per_sec": 3871.19, + "throughput_tokens_per_sec_per_chip": 3871.19, + "throughput_tokens_per_sec_total": 6746.91, + "elapsed_seconds_median": 8.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 32, + "throughput_tokens_per_sec": 3916.69, + "throughput_tokens_per_sec_per_chip": 3916.69, + "throughput_tokens_per_sec_total": 6785.8, + "elapsed_seconds_median": 8.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 128, + "throughput_tokens_per_sec": 3908.22, + "throughput_tokens_per_sec_per_chip": 3908.22, + "throughput_tokens_per_sec_total": 6779.65, + "elapsed_seconds_median": 8.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + }, + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 5, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 40.2, + "ttft_ms_p90": 60.2, + "ttft_ms_p99": 92.29, + "tpot_ms_p50": 13.21, + "tpot_ms_p90": 14.2, + "tpot_ms_p99": 14.71, + "elapsed_seconds_median": 69.1, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 75.24, + "ttft_ms_p90": 4335.91, + "ttft_ms_p99": 5305.98, + "tpot_ms_p50": 22.28, + "tpot_ms_p90": 24.58, + "tpot_ms_p99": 26.25, + "elapsed_seconds_median": 25.0, + "sla_met": false + }, + { + "target_qps": 100, + "achieved_qps": 100.0, + "ttft_ms_p50": 1710.17, + "ttft_ms_p90": 10195.6, + "ttft_ms_p99": 10706.9, + "tpot_ms_p50": 22.18, + "tpot_ms_p90": 24.52, + "tpot_ms_p99": 28.04, + "elapsed_seconds_median": 22.3, + "sla_met": false + } + ] + }, + "interactive": { + "ttft_ms_p50": 27.78, + "ttft_ms_p90": 42.95, + "ttft_ms_p99": 59.38, + "tpot_ms_p50": 10.77, + "tpot_ms_p90": 10.84, + "tpot_ms_p99": 10.86, + "peak_memory_gb": null, + "elapsed_seconds_median": 570.0 + }, + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 30, + "warmup_minutes": 2, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": true, + "throughput_tokens_per_sec": 666.3, + "tokens_out": 39988, + "tokens_in": 0, + "requests_completed": 115, + "ttft_ms_p50": 43.8, + "ttft_ms_p99": 345.6 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 707.4, + "tokens_out": 42459, + "tokens_in": 0, + "requests_completed": 126, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 40.9 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 710.2, + "tokens_out": 42613, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 35.9 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 728.3, + "tokens_out": 43675, + "tokens_in": 0, + "requests_completed": 126, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.1 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 707.4, + "tokens_out": 42459, + "tokens_in": 0, + "requests_completed": 121, + "ttft_ms_p50": 34.3, + "ttft_ms_p99": 41.4 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 711.2, + "tokens_out": 42670, + "tokens_in": 0, + "requests_completed": 128, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 40.3 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 708.9, + "tokens_out": 42534, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 35.8 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 701.6, + "tokens_out": 42083, + "tokens_in": 0, + "requests_completed": 121, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 40.0 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 723.0, + "tokens_out": 43388, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 39.8 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 702.0, + "tokens_out": 42108, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 39.2 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 716.4, + "tokens_out": 42995, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 39.5 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 718.5, + "tokens_out": 43125, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 40.7 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 716.5, + "tokens_out": 42976, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.0 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 708.7, + "tokens_out": 42543, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.1 + }, + { + "minute": 15.0, + "is_warmup": false, + "throughput_tokens_per_sec": 711.1, + "tokens_out": 42666, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 36.9 + }, + { + "minute": 16.0, + "is_warmup": false, + "throughput_tokens_per_sec": 715.6, + "tokens_out": 42902, + "tokens_in": 0, + "requests_completed": 120, + "ttft_ms_p50": 34.3, + "ttft_ms_p99": 36.2 + }, + { + "minute": 17.0, + "is_warmup": false, + "throughput_tokens_per_sec": 699.2, + "tokens_out": 41971, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 37.1 + }, + { + "minute": 18.0, + "is_warmup": false, + "throughput_tokens_per_sec": 721.0, + "tokens_out": 43276, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 39.7 + }, + { + "minute": 19.0, + "is_warmup": false, + "throughput_tokens_per_sec": 689.9, + "tokens_out": 41386, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 36.0 + }, + { + "minute": 20.0, + "is_warmup": false, + "throughput_tokens_per_sec": 718.4, + "tokens_out": 43086, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 36.1 + }, + { + "minute": 21.0, + "is_warmup": false, + "throughput_tokens_per_sec": 720.0, + "tokens_out": 43224, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 36.0 + }, + { + "minute": 22.0, + "is_warmup": false, + "throughput_tokens_per_sec": 726.1, + "tokens_out": 43543, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 40.6 + }, + { + "minute": 23.0, + "is_warmup": false, + "throughput_tokens_per_sec": 713.8, + "tokens_out": 42835, + "tokens_in": 0, + "requests_completed": 128, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 36.2 + }, + { + "minute": 24.0, + "is_warmup": false, + "throughput_tokens_per_sec": 694.3, + "tokens_out": 41650, + "tokens_in": 0, + "requests_completed": 119, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 38.5 + }, + { + "minute": 25.0, + "is_warmup": false, + "throughput_tokens_per_sec": 709.4, + "tokens_out": 42580, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 40.6 + }, + { + "minute": 26.0, + "is_warmup": false, + "throughput_tokens_per_sec": 720.1, + "tokens_out": 43188, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 36.1 + }, + { + "minute": 27.0, + "is_warmup": false, + "throughput_tokens_per_sec": 714.8, + "tokens_out": 42892, + "tokens_in": 0, + "requests_completed": 126, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 40.1 + }, + { + "minute": 28.0, + "is_warmup": false, + "throughput_tokens_per_sec": 705.6, + "tokens_out": 42347, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 40.7 + }, + { + "minute": 29.0, + "is_warmup": false, + "throughput_tokens_per_sec": 725.1, + "tokens_out": 43505, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 40.7 + } + ], + "sustained_throughput_tokens_per_sec": 712.3, + "throttle_ratio": 0.947, + "throttle_onset_minute": null, + "ttft_p99_drift_ms": -0.2 + }, + "burst": { + "sla_ttft_ms": 500, + "burst_steady_qps": 5, + "burst_peak_qps": 25, + "burst_duration_seconds": 30, + "burst_interval_seconds": 120, + "steady_requests_total": 1812, + "burst_requests_total": 2245, + "steady_ttft_p50_ms": 39.39, + "steady_ttft_p99_ms": 79.1, + "burst_ttft_p50_ms": 7082.87, + "burst_ttft_p99_ms": 17212.99, + "sla_met_during_burst": false, + "burst_degradation_ratio": 217.605, + "results_by_cycle": [ + { + "cycle": 1, + "steady_requests": 581, + "burst_requests": 760, + "steady_ttft_p99_ms": 89.81, + "burst_ttft_p99_ms": 17855.37 + }, + { + "cycle": 2, + "steady_requests": 595, + "burst_requests": 734, + "steady_ttft_p99_ms": 47.72, + "burst_ttft_p99_ms": 16592.12 + }, + { + "cycle": 3, + "steady_requests": 636, + "burst_requests": 751, + "steady_ttft_p99_ms": 48.05, + "burst_ttft_p99_ms": 16579.9 + } + ] + } + }, + "accuracy": { + "subset_score": 0.61, + "baseline_delta": 0.01, + "valid": true, + "framework": "vLLM", + "precision": "BF16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "04:35:38", + "run_id": "8f83bfab", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": "Partial run: ['offline', 'online', 'interactive', 'sustained', 'burst'] succeeded, ['speculative'] failed.", + "benchmark_start_time": "2026-05-18T04:33:49.881300+00:00", + "benchmark_end_time": "2026-05-18T04:35:38.648555+00:00", + "benchmark_elapsed_minutes": 75.5, + "model_load_seconds": 58.5, + "benchmark_elapsed_minutes_note": "Total across ['offline', 'online', 'interactive', 'sustained', 'burst'] scenarios.", + "scenario_dirs": { + "offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/offline", + "online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/online", + "interactive": "results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/interactive", + "sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/sustained", + "burst": "results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/burst" + } + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/sustained/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/sustained/result.json new file mode 100644 index 0000000..1fc95fb --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab/sustained/result.json @@ -0,0 +1,424 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_A", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T04:31:01.283634+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Meta-Llama-3-8B-Instruct", + "model_revision": "8afb486c1db24fe5011ec46dfbe5b5dccdb575c2", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "sustained", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 30, + "warmup_minutes": 2, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": true, + "throughput_tokens_per_sec": 666.3, + "tokens_out": 39988, + "tokens_in": 0, + "requests_completed": 115, + "ttft_ms_p50": 43.8, + "ttft_ms_p99": 345.6 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 707.4, + "tokens_out": 42459, + "tokens_in": 0, + "requests_completed": 126, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 40.9 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 710.2, + "tokens_out": 42613, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 35.9 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 728.3, + "tokens_out": 43675, + "tokens_in": 0, + "requests_completed": 126, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.1 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 707.4, + "tokens_out": 42459, + "tokens_in": 0, + "requests_completed": 121, + "ttft_ms_p50": 34.3, + "ttft_ms_p99": 41.4 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 711.2, + "tokens_out": 42670, + "tokens_in": 0, + "requests_completed": 128, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 40.3 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 708.9, + "tokens_out": 42534, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 35.8 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 701.6, + "tokens_out": 42083, + "tokens_in": 0, + "requests_completed": 121, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 40.0 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 723.0, + "tokens_out": 43388, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 39.8 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 702.0, + "tokens_out": 42108, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 39.2 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 716.4, + "tokens_out": 42995, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 39.5 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 718.5, + "tokens_out": 43125, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 40.7 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 716.5, + "tokens_out": 42976, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.0 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 708.7, + "tokens_out": 42543, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.1 + }, + { + "minute": 15.0, + "is_warmup": false, + "throughput_tokens_per_sec": 711.1, + "tokens_out": 42666, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 36.9 + }, + { + "minute": 16.0, + "is_warmup": false, + "throughput_tokens_per_sec": 715.6, + "tokens_out": 42902, + "tokens_in": 0, + "requests_completed": 120, + "ttft_ms_p50": 34.3, + "ttft_ms_p99": 36.2 + }, + { + "minute": 17.0, + "is_warmup": false, + "throughput_tokens_per_sec": 699.2, + "tokens_out": 41971, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 37.1 + }, + { + "minute": 18.0, + "is_warmup": false, + "throughput_tokens_per_sec": 721.0, + "tokens_out": 43276, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 39.7 + }, + { + "minute": 19.0, + "is_warmup": false, + "throughput_tokens_per_sec": 689.9, + "tokens_out": 41386, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 36.0 + }, + { + "minute": 20.0, + "is_warmup": false, + "throughput_tokens_per_sec": 718.4, + "tokens_out": 43086, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 36.1 + }, + { + "minute": 21.0, + "is_warmup": false, + "throughput_tokens_per_sec": 720.0, + "tokens_out": 43224, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 36.0 + }, + { + "minute": 22.0, + "is_warmup": false, + "throughput_tokens_per_sec": 726.1, + "tokens_out": 43543, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 40.6 + }, + { + "minute": 23.0, + "is_warmup": false, + "throughput_tokens_per_sec": 713.8, + "tokens_out": 42835, + "tokens_in": 0, + "requests_completed": 128, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 36.2 + }, + { + "minute": 24.0, + "is_warmup": false, + "throughput_tokens_per_sec": 694.3, + "tokens_out": 41650, + "tokens_in": 0, + "requests_completed": 119, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 38.5 + }, + { + "minute": 25.0, + "is_warmup": false, + "throughput_tokens_per_sec": 709.4, + "tokens_out": 42580, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.7, + "ttft_ms_p99": 40.6 + }, + { + "minute": 26.0, + "is_warmup": false, + "throughput_tokens_per_sec": 720.1, + "tokens_out": 43188, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 36.1 + }, + { + "minute": 27.0, + "is_warmup": false, + "throughput_tokens_per_sec": 714.8, + "tokens_out": 42892, + "tokens_in": 0, + "requests_completed": 126, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 40.1 + }, + { + "minute": 28.0, + "is_warmup": false, + "throughput_tokens_per_sec": 705.6, + "tokens_out": 42347, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 40.7 + }, + { + "minute": 29.0, + "is_warmup": false, + "throughput_tokens_per_sec": 725.1, + "tokens_out": 43505, + "tokens_in": 0, + "requests_completed": 125, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 40.7 + } + ], + "sustained_throughput_tokens_per_sec": 712.3, + "throttle_ratio": 0.947, + "throttle_onset_minute": null, + "ttft_p99_drift_ms": -0.2 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "05:45:22", + "run_id": "8f83bfab", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm020_0f6c56e4_8f83bfab", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T05:15:13.541588+00:00", + "benchmark_end_time": "2026-05-18T05:45:22.333860+00:00", + "benchmark_elapsed_minutes": 30.1, + "model_load_seconds": 41.7 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/accuracy/accuracy.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/accuracy/accuracy.json new file mode 100644 index 0000000..95fced5 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/accuracy/accuracy.json @@ -0,0 +1,8 @@ +{ + "subset_score": 0.56, + "baseline_delta": 0.0, + "valid": true, + "framework": "vLLM", + "precision": "BF16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/offline/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/offline/result.json new file mode 100644 index 0000000..b275976 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/offline/result.json @@ -0,0 +1,178 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": "bfloat16", + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "offline", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 3888.91, + "throughput_tokens_per_sec_per_chip": 3888.91, + "throughput_tokens_per_sec_total": 6956.79, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 3886.06, + "throughput_tokens_per_sec_per_chip": 3886.06, + "throughput_tokens_per_sec_total": 6943.56, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 3885.21, + "throughput_tokens_per_sec_per_chip": 3885.21, + "throughput_tokens_per_sec_total": 6935.62, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 3887.73, + "throughput_tokens_per_sec_per_chip": 3887.73, + "throughput_tokens_per_sec_total": 6949.35, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "06:01:31", + "run_id": "ffd81462", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T05:59:03.115858+00:00", + "benchmark_end_time": "2026-05-18T06:01:31.820089+00:00", + "benchmark_elapsed_minutes": 2.5, + "model_load_seconds": 35.2 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/online/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/online/result.json new file mode 100644 index 0000000..fcd1a85 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/online/result.json @@ -0,0 +1,176 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "online", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 41.15, + "ttft_ms_p90": 61.6, + "ttft_ms_p99": 96.42, + "tpot_ms_p50": 13.32, + "tpot_ms_p90": 14.41, + "tpot_ms_p99": 14.81, + "elapsed_seconds_median": 68.9, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 52.06, + "ttft_ms_p90": 63.35, + "ttft_ms_p99": 69.96, + "tpot_ms_p50": 17.62, + "tpot_ms_p90": 18.95, + "tpot_ms_p99": 19.52, + "elapsed_seconds_median": 36.5, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 86.32, + "ttft_ms_p90": 5056.12, + "ttft_ms_p99": 5979.32, + "tpot_ms_p50": 22.47, + "tpot_ms_p90": 25.06, + "tpot_ms_p99": 26.76, + "elapsed_seconds_median": 26.1, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 891.06, + "ttft_ms_p90": 8527.64, + "ttft_ms_p99": 9213.57, + "tpot_ms_p50": 22.37, + "tpot_ms_p90": 24.86, + "tpot_ms_p99": 30.84, + "elapsed_seconds_median": 23.9, + "sla_met": false + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "06:10:41", + "run_id": "ffd81462", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T06:02:58.126932+00:00", + "benchmark_end_time": "2026-05-18T06:10:41.342285+00:00", + "benchmark_elapsed_minutes": 7.7, + "model_load_seconds": 60.4 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/result.json new file mode 100644 index 0000000..75e68ff --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/result.json @@ -0,0 +1,395 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": "bfloat16", + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenarios_run": [ + "offline", + "online", + "sustained" + ], + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "num_runs": 3, + "extra_config": null + }, + "metrics": { + "derived": {}, + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 3888.91, + "throughput_tokens_per_sec_per_chip": 3888.91, + "throughput_tokens_per_sec_total": 6956.79, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 3886.06, + "throughput_tokens_per_sec_per_chip": 3886.06, + "throughput_tokens_per_sec_total": 6943.56, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 3885.21, + "throughput_tokens_per_sec_per_chip": 3885.21, + "throughput_tokens_per_sec_total": 6935.62, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 3887.73, + "throughput_tokens_per_sec_per_chip": 3887.73, + "throughput_tokens_per_sec_total": 6949.35, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + }, + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 41.15, + "ttft_ms_p90": 61.6, + "ttft_ms_p99": 96.42, + "tpot_ms_p50": 13.32, + "tpot_ms_p90": 14.41, + "tpot_ms_p99": 14.81, + "elapsed_seconds_median": 68.9, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 52.06, + "ttft_ms_p90": 63.35, + "ttft_ms_p99": 69.96, + "tpot_ms_p50": 17.62, + "tpot_ms_p90": 18.95, + "tpot_ms_p99": 19.52, + "elapsed_seconds_median": 36.5, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 86.32, + "ttft_ms_p90": 5056.12, + "ttft_ms_p99": 5979.32, + "tpot_ms_p50": 22.47, + "tpot_ms_p90": 25.06, + "tpot_ms_p99": 26.76, + "elapsed_seconds_median": 26.1, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 891.06, + "ttft_ms_p90": 8527.64, + "ttft_ms_p99": 9213.57, + "tpot_ms_p50": 22.37, + "tpot_ms_p90": 24.86, + "tpot_ms_p99": 30.84, + "elapsed_seconds_median": 23.9, + "sla_met": false + } + ] + }, + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 655.0, + "tokens_out": 39334, + "tokens_in": 0, + "requests_completed": 109, + "ttft_ms_p50": 44.4, + "ttft_ms_p99": 401.8 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 711.0, + "tokens_out": 42652, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 42.5 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 698.3, + "tokens_out": 41902, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.3 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 718.8, + "tokens_out": 43114, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 36.5 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 724.4, + "tokens_out": 43451, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 41.7 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 701.8, + "tokens_out": 42133, + "tokens_in": 0, + "requests_completed": 115, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.1 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 706.9, + "tokens_out": 42401, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 42.9 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 720.6, + "tokens_out": 43232, + "tokens_in": 0, + "requests_completed": 120, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 41.5 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 697.1, + "tokens_out": 41830, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 43.9 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 726.2, + "tokens_out": 43597, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 35.9 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 701.9, + "tokens_out": 42083, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.1 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 728.3, + "tokens_out": 43715, + "tokens_in": 0, + "requests_completed": 121, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 43.1 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 688.9, + "tokens_out": 41331, + "tokens_in": 0, + "requests_completed": 119, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.5 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 717.5, + "tokens_out": 43059, + "tokens_in": 0, + "requests_completed": 119, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.7 + } + ], + "sustained_throughput_tokens_per_sec": 706.9, + "throttle_ratio": 0.899, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -360.1 + } + }, + "accuracy": { + "subset_score": 0.56, + "baseline_delta": 0.0, + "valid": true, + "framework": "vLLM", + "precision": "BF16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "06:01:31", + "run_id": "ffd81462", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T05:59:03.115858+00:00", + "benchmark_end_time": "2026-05-18T06:01:31.820089+00:00", + "benchmark_elapsed_minutes": 25.3, + "model_load_seconds": 35.2, + "benchmark_elapsed_minutes_note": "Total across ['offline', 'online', 'sustained'] scenarios.", + "scenario_dirs": { + "offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/offline", + "online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/online", + "sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/sustained" + } + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/sustained/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/sustained/result.json new file mode 100644 index 0000000..ee4ebab --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/sustained/result.json @@ -0,0 +1,274 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "sustained", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 655.0, + "tokens_out": 39334, + "tokens_in": 0, + "requests_completed": 109, + "ttft_ms_p50": 44.4, + "ttft_ms_p99": 401.8 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 711.0, + "tokens_out": 42652, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 42.5 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 698.3, + "tokens_out": 41902, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.3 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 718.8, + "tokens_out": 43114, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 36.5 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 724.4, + "tokens_out": 43451, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 41.7 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 701.8, + "tokens_out": 42133, + "tokens_in": 0, + "requests_completed": 115, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.1 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 706.9, + "tokens_out": 42401, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 42.9 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 720.6, + "tokens_out": 43232, + "tokens_in": 0, + "requests_completed": 120, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 41.5 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 697.1, + "tokens_out": 41830, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 43.9 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 726.2, + "tokens_out": 43597, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 35.9 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 701.9, + "tokens_out": 42083, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.1 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 728.3, + "tokens_out": 43715, + "tokens_in": 0, + "requests_completed": 121, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 43.1 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 688.9, + "tokens_out": 41331, + "tokens_in": 0, + "requests_completed": 119, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.5 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 717.5, + "tokens_out": 43059, + "tokens_in": 0, + "requests_completed": 119, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.7 + } + ], + "sustained_throughput_tokens_per_sec": 706.9, + "throttle_ratio": 0.899, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -360.1 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "06:26:58", + "run_id": "ffd81462", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T06:11:50.049074+00:00", + "benchmark_end_time": "2026-05-18T06:26:58.575027+00:00", + "benchmark_elapsed_minutes": 15.1, + "model_load_seconds": 41.2 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/env_info.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/env_info.json new file mode 100644 index 0000000..a73d417 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/env_info.json @@ -0,0 +1,49 @@ +{ + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/result.json new file mode 100644 index 0000000..32d0a7b --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/result.json @@ -0,0 +1,1519 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": "bfloat16", + "quantization_method": null, + "model_format": "HuggingFace original", + "_note": "suite model_id. Each precision level uses its own quantized checkpoint." + }, + "task": { + "scenarios_run": [ + "accuracy", + "offline", + "online", + "sustained" + ], + "precision_levels_run": [ + "BF16", + "FP8", + "W8A8", + "W8A16", + "W4A16" + ], + "precision_levels_skipped": [ + "FP16" + ], + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "num_runs": 3, + "extra_config": null + }, + "metrics": { + "quantization": { + "results_by_precision": [ + { + "precision": "BF16", + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "best_throughput_tokens_per_sec": 3888.91, + "accuracy_score": 0.56, + "accuracy_baseline_delta": 0.0, + "accuracy_valid": true, + "quality_efficiency": 2177.8, + "speedup_vs_bf16": 1.0, + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 3888.91, + "throughput_tokens_per_sec_per_chip": 3888.91, + "throughput_tokens_per_sec_total": 6956.79, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 3886.06, + "throughput_tokens_per_sec_per_chip": 3886.06, + "throughput_tokens_per_sec_total": 6943.56, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 3885.21, + "throughput_tokens_per_sec_per_chip": 3885.21, + "throughput_tokens_per_sec_total": 6935.62, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 3887.73, + "throughput_tokens_per_sec_per_chip": 3887.73, + "throughput_tokens_per_sec_total": 6949.35, + "elapsed_seconds_median": 9.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ], + "result_dir": "bf16", + "effective_dtype": "bfloat16", + "quantization_method": null + }, + { + "precision": "FP8", + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8", + "best_throughput_tokens_per_sec": 4141.71, + "accuracy_score": 0.0, + "accuracy_baseline_delta": -0.58, + "accuracy_valid": false, + "quality_efficiency": null, + "speedup_vs_bf16": 1.065, + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 4141.71, + "throughput_tokens_per_sec_per_chip": 4141.71, + "throughput_tokens_per_sec_total": 6418.35, + "elapsed_seconds_median": 12.4, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 4130.72, + "throughput_tokens_per_sec_per_chip": 4130.72, + "throughput_tokens_per_sec_total": 6401.32, + "elapsed_seconds_median": 12.4, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 4124.42, + "throughput_tokens_per_sec_per_chip": 4124.42, + "throughput_tokens_per_sec_total": 6391.57, + "elapsed_seconds_median": 12.4, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 4131.44, + "throughput_tokens_per_sec_per_chip": 4131.44, + "throughput_tokens_per_sec_total": 6402.45, + "elapsed_seconds_median": 12.4, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ], + "result_dir": "fp8", + "effective_dtype": "bfloat16", + "quantization_method": "compressed-tensors" + }, + { + "precision": "W8A8", + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", + "best_throughput_tokens_per_sec": 3208.11, + "accuracy_score": 0.59, + "accuracy_baseline_delta": 0.0, + "accuracy_valid": true, + "quality_efficiency": 1892.8, + "speedup_vs_bf16": 0.825, + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 3208.11, + "throughput_tokens_per_sec_per_chip": 3208.11, + "throughput_tokens_per_sec_total": 5840.36, + "elapsed_seconds_median": 10.7, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 3140.16, + "throughput_tokens_per_sec_per_chip": 3140.16, + "throughput_tokens_per_sec_total": 5706.63, + "elapsed_seconds_median": 11.0, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 3193.23, + "throughput_tokens_per_sec_per_chip": 3193.23, + "throughput_tokens_per_sec_total": 5813.28, + "elapsed_seconds_median": 10.7, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 3175.58, + "throughput_tokens_per_sec_per_chip": 3175.58, + "throughput_tokens_per_sec_total": 5786.77, + "elapsed_seconds_median": 10.8, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ], + "result_dir": "w8a8", + "effective_dtype": "bfloat16", + "quantization_method": "compressed-tensors" + }, + { + "precision": "W8A16", + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16", + "best_throughput_tokens_per_sec": 3547.44, + "accuracy_score": 0.58, + "accuracy_baseline_delta": -0.01, + "accuracy_valid": true, + "quality_efficiency": 2057.5, + "speedup_vs_bf16": 0.912, + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 3533.68, + "throughput_tokens_per_sec_per_chip": 3533.68, + "throughput_tokens_per_sec_total": 6328.84, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 3510.7, + "throughput_tokens_per_sec_per_chip": 3510.7, + "throughput_tokens_per_sec_total": 6292.5, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 3535.13, + "throughput_tokens_per_sec_per_chip": 3535.13, + "throughput_tokens_per_sec_total": 6324.07, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 3547.44, + "throughput_tokens_per_sec_per_chip": 3547.44, + "throughput_tokens_per_sec_total": 6336.33, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ], + "result_dir": "w8a16", + "effective_dtype": "bfloat16", + "quantization_method": "compressed-tensors" + }, + { + "precision": "W4A16", + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16", + "best_throughput_tokens_per_sec": 1889.19, + "accuracy_score": 0.56, + "accuracy_baseline_delta": -0.01, + "accuracy_valid": true, + "quality_efficiency": 1057.9, + "speedup_vs_bf16": 0.486, + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 1889.19, + "throughput_tokens_per_sec_per_chip": 1889.19, + "throughput_tokens_per_sec_total": 3433.47, + "elapsed_seconds_median": 18.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 1862.45, + "throughput_tokens_per_sec_per_chip": 1862.45, + "throughput_tokens_per_sec_total": 3376.95, + "elapsed_seconds_median": 18.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 1861.34, + "throughput_tokens_per_sec_per_chip": 1861.34, + "throughput_tokens_per_sec_total": 3375.2, + "elapsed_seconds_median": 18.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 1851.04, + "throughput_tokens_per_sec_per_chip": 1851.04, + "throughput_tokens_per_sec_total": 3367.3, + "elapsed_seconds_median": 18.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ], + "result_dir": "w4a16", + "effective_dtype": "float16", + "quantization_method": "gptq" + } + ] + }, + "derived": {}, + "quantization_online": { + "results_by_precision": [ + { + "precision": "BF16", + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 41.15, + "ttft_ms_p90": 61.6, + "ttft_ms_p99": 96.42, + "tpot_ms_p50": 13.32, + "tpot_ms_p90": 14.41, + "tpot_ms_p99": 14.81, + "elapsed_seconds_median": 68.9, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 52.06, + "ttft_ms_p90": 63.35, + "ttft_ms_p99": 69.96, + "tpot_ms_p50": 17.62, + "tpot_ms_p90": 18.95, + "tpot_ms_p99": 19.52, + "elapsed_seconds_median": 36.5, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 86.32, + "ttft_ms_p90": 5056.12, + "ttft_ms_p99": 5979.32, + "tpot_ms_p50": 22.47, + "tpot_ms_p90": 25.06, + "tpot_ms_p99": 26.76, + "elapsed_seconds_median": 26.1, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 891.06, + "ttft_ms_p90": 8527.64, + "ttft_ms_p99": 9213.57, + "tpot_ms_p50": 22.37, + "tpot_ms_p90": 24.86, + "tpot_ms_p99": 30.84, + "elapsed_seconds_median": 23.9, + "sla_met": false + } + ] + }, + { + "precision": "FP8", + "max_valid_qps": 5, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 53.12, + "ttft_ms_p90": 73.69, + "ttft_ms_p99": 115.17, + "tpot_ms_p50": 18.68, + "tpot_ms_p90": 20.12, + "tpot_ms_p99": 20.68, + "elapsed_seconds_median": 72.2, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 84.99, + "ttft_ms_p90": 1892.71, + "ttft_ms_p99": 3191.04, + "tpot_ms_p50": 26.19, + "tpot_ms_p90": 27.78, + "tpot_ms_p99": 28.06, + "elapsed_seconds_median": 43.0, + "sla_met": false + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 6847.07, + "ttft_ms_p90": 15210.93, + "ttft_ms_p99": 16362.21, + "tpot_ms_p50": 25.97, + "tpot_ms_p90": 26.93, + "tpot_ms_p99": 27.06, + "elapsed_seconds_median": 38.8, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 10056.63, + "ttft_ms_p90": 20353.69, + "ttft_ms_p99": 21149.06, + "tpot_ms_p50": 25.43, + "tpot_ms_p90": 26.15, + "tpot_ms_p99": 26.2, + "elapsed_seconds_median": 37.3, + "sla_met": false + } + ] + }, + { + "precision": "W8A8", + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 55.34, + "ttft_ms_p90": 63.29, + "ttft_ms_p99": 69.75, + "tpot_ms_p50": 20.67, + "tpot_ms_p90": 20.88, + "tpot_ms_p99": 21.3, + "elapsed_seconds_median": 72.9, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 57.42, + "ttft_ms_p90": 66.6, + "ttft_ms_p99": 69.92, + "tpot_ms_p50": 21.28, + "tpot_ms_p90": 22.19, + "tpot_ms_p99": 22.28, + "elapsed_seconds_median": 39.6, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 74.55, + "ttft_ms_p90": 4438.81, + "ttft_ms_p99": 5421.82, + "tpot_ms_p50": 22.53, + "tpot_ms_p90": 23.69, + "tpot_ms_p99": 25.14, + "elapsed_seconds_median": 27.7, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 985.11, + "ttft_ms_p90": 8331.22, + "ttft_ms_p99": 8868.55, + "tpot_ms_p50": 23.38, + "tpot_ms_p90": 24.38, + "tpot_ms_p99": 26.79, + "elapsed_seconds_median": 25.6, + "sla_met": false + } + ] + }, + { + "precision": "W8A16", + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 46.32, + "ttft_ms_p90": 62.37, + "ttft_ms_p99": 104.49, + "tpot_ms_p50": 16.66, + "tpot_ms_p90": 17.7, + "tpot_ms_p99": 18.28, + "elapsed_seconds_median": 70.3, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 57.25, + "ttft_ms_p90": 72.45, + "ttft_ms_p99": 81.09, + "tpot_ms_p50": 20.79, + "tpot_ms_p90": 22.4, + "tpot_ms_p99": 23.2, + "elapsed_seconds_median": 37.9, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 93.4, + "ttft_ms_p90": 6429.1, + "ttft_ms_p99": 7429.34, + "tpot_ms_p50": 25.13, + "tpot_ms_p90": 28.01, + "tpot_ms_p99": 30.79, + "elapsed_seconds_median": 28.8, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 1306.21, + "ttft_ms_p90": 9937.77, + "ttft_ms_p99": 10640.43, + "tpot_ms_p50": 25.06, + "tpot_ms_p90": 27.42, + "tpot_ms_p99": 34.24, + "elapsed_seconds_median": 26.4, + "sla_met": false + } + ] + }, + { + "precision": "W4A16", + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 50.42, + "ttft_ms_p90": 64.62, + "ttft_ms_p99": 104.73, + "tpot_ms_p50": 18.15, + "tpot_ms_p90": 19.16, + "tpot_ms_p99": 19.62, + "elapsed_seconds_median": 71.1, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 57.89, + "ttft_ms_p90": 72.82, + "ttft_ms_p99": 84.96, + "tpot_ms_p50": 21.11, + "tpot_ms_p90": 23.03, + "tpot_ms_p99": 24.31, + "elapsed_seconds_median": 38.5, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 97.24, + "ttft_ms_p90": 6365.08, + "ttft_ms_p99": 7073.94, + "tpot_ms_p50": 25.46, + "tpot_ms_p90": 27.98, + "tpot_ms_p99": 31.21, + "elapsed_seconds_median": 29.2, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 918.12, + "ttft_ms_p90": 9805.4, + "ttft_ms_p99": 10437.5, + "tpot_ms_p50": 25.2, + "tpot_ms_p90": 27.67, + "tpot_ms_p99": 32.45, + "elapsed_seconds_median": 26.7, + "sla_met": false + } + ] + } + ] + }, + "quantization_sustained": { + "results_by_precision": [ + { + "precision": "BF16", + "sustained_throughput_tokens_per_sec": 706.9, + "throttle_ratio": 0.899, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -360.1, + "sustained_concurrency": 8, + "duration_minutes": 15, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 655.0, + "tokens_out": 39334, + "tokens_in": 0, + "requests_completed": 109, + "ttft_ms_p50": 44.4, + "ttft_ms_p99": 401.8 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 711.0, + "tokens_out": 42652, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 42.5 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 698.3, + "tokens_out": 41902, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.3 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 718.8, + "tokens_out": 43114, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 36.5 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 724.4, + "tokens_out": 43451, + "tokens_in": 0, + "requests_completed": 124, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 41.7 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 701.8, + "tokens_out": 42133, + "tokens_in": 0, + "requests_completed": 115, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.1 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 706.9, + "tokens_out": 42401, + "tokens_in": 0, + "requests_completed": 122, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 42.9 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 720.6, + "tokens_out": 43232, + "tokens_in": 0, + "requests_completed": 120, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 41.5 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 697.1, + "tokens_out": 41830, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 43.9 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 726.2, + "tokens_out": 43597, + "tokens_in": 0, + "requests_completed": 123, + "ttft_ms_p50": 34.4, + "ttft_ms_p99": 35.9 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 701.9, + "tokens_out": 42083, + "tokens_in": 0, + "requests_completed": 116, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 36.1 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 728.3, + "tokens_out": 43715, + "tokens_in": 0, + "requests_completed": 121, + "ttft_ms_p50": 34.5, + "ttft_ms_p99": 43.1 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 688.9, + "tokens_out": 41331, + "tokens_in": 0, + "requests_completed": 119, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.5 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 717.5, + "tokens_out": 43059, + "tokens_in": 0, + "requests_completed": 119, + "ttft_ms_p50": 34.6, + "ttft_ms_p99": 41.7 + } + ] + }, + { + "precision": "FP8", + "sustained_throughput_tokens_per_sec": 438.9, + "throttle_ratio": 0.856, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -644.7, + "sustained_concurrency": 8, + "duration_minutes": 15, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.6, + "tokens_out": 24576, + "tokens_in": 0, + "requests_completed": 48, + "ttft_ms_p50": 178.1, + "ttft_ms_p99": 701.6 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.6, + "tokens_out": 24576, + "tokens_in": 0, + "requests_completed": 48, + "ttft_ms_p50": 186.0, + "ttft_ms_p99": 236.5 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 477.8, + "tokens_out": 28672, + "tokens_in": 0, + "requests_completed": 56, + "ttft_ms_p50": 51.0, + "ttft_ms_p99": 125.3 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.5, + "tokens_out": 24576, + "tokens_in": 0, + "requests_completed": 48, + "ttft_ms_p50": 50.3, + "ttft_ms_p99": 55.6 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 478.0, + "tokens_out": 28672, + "tokens_in": 0, + "requests_completed": 56, + "ttft_ms_p50": 50.6, + "ttft_ms_p99": 54.6 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.3, + "tokens_out": 24576, + "tokens_in": 0, + "requests_completed": 48, + "ttft_ms_p50": 51.2, + "ttft_ms_p99": 57.3 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 478.2, + "tokens_out": 28672, + "tokens_in": 0, + "requests_completed": 56, + "ttft_ms_p50": 50.5, + "ttft_ms_p99": 56.4 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.4, + "tokens_out": 24576, + "tokens_in": 0, + "requests_completed": 48, + "ttft_ms_p50": 51.3, + "ttft_ms_p99": 62.9 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 478.0, + "tokens_out": 28672, + "tokens_in": 0, + "requests_completed": 56, + "ttft_ms_p50": 51.1, + "ttft_ms_p99": 59.5 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.4, + "tokens_out": 24576, + "tokens_in": 0, + "requests_completed": 48, + "ttft_ms_p50": 50.5, + "ttft_ms_p99": 55.7 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 478.2, + "tokens_out": 28672, + "tokens_in": 0, + "requests_completed": 56, + "ttft_ms_p50": 51.4, + "ttft_ms_p99": 58.1 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.6, + "tokens_out": 24576, + "tokens_in": 0, + "requests_completed": 48, + "ttft_ms_p50": 51.6, + "ttft_ms_p99": 59.1 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 477.6, + "tokens_out": 28672, + "tokens_in": 0, + "requests_completed": 56, + "ttft_ms_p50": 51.4, + "ttft_ms_p99": 58.2 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.8, + "tokens_out": 24576, + "tokens_in": 0, + "requests_completed": 48, + "ttft_ms_p50": 51.3, + "ttft_ms_p99": 56.9 + } + ] + }, + { + "precision": "W8A8", + "sustained_throughput_tokens_per_sec": 399.4, + "throttle_ratio": 0.879, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -331.5, + "sustained_concurrency": 8, + "duration_minutes": 15, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 366.9, + "tokens_out": 22031, + "tokens_in": 0, + "requests_completed": 63, + "ttft_ms_p50": 59.1, + "ttft_ms_p99": 396.4 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 402.7, + "tokens_out": 24148, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 58.8, + "ttft_ms_p99": 63.4 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 400.8, + "tokens_out": 24050, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.5, + "ttft_ms_p99": 61.0 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 402.1, + "tokens_out": 24127, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.0, + "ttft_ms_p99": 61.2 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 395.3, + "tokens_out": 23722, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 58.7, + "ttft_ms_p99": 64.5 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 398.4, + "tokens_out": 23893, + "tokens_in": 0, + "requests_completed": 64, + "ttft_ms_p50": 58.7, + "ttft_ms_p99": 63.0 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 410.0, + "tokens_out": 24605, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.3, + "ttft_ms_p99": 62.1 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 398.6, + "tokens_out": 23918, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 59.5, + "ttft_ms_p99": 65.7 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 390.3, + "tokens_out": 23418, + "tokens_in": 0, + "requests_completed": 64, + "ttft_ms_p50": 58.3, + "ttft_ms_p99": 62.8 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 417.2, + "tokens_out": 25045, + "tokens_in": 0, + "requests_completed": 69, + "ttft_ms_p50": 58.2, + "ttft_ms_p99": 61.0 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 391.1, + "tokens_out": 23462, + "tokens_in": 0, + "requests_completed": 70, + "ttft_ms_p50": 57.8, + "ttft_ms_p99": 62.4 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 408.7, + "tokens_out": 24514, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.5, + "ttft_ms_p99": 65.6 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 415.2, + "tokens_out": 24925, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 58.2, + "ttft_ms_p99": 61.3 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 395.0, + "tokens_out": 23687, + "tokens_in": 0, + "requests_completed": 67, + "ttft_ms_p50": 58.0, + "ttft_ms_p99": 64.9 + } + ] + }, + { + "precision": "W8A16", + "sustained_throughput_tokens_per_sec": 494.1, + "throttle_ratio": 0.905, + "throttle_onset_minute": null, + "ttft_p99_drift_ms": -320.5, + "sustained_concurrency": 8, + "duration_minutes": 15, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 456.8, + "tokens_out": 27416, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 50.0, + "ttft_ms_p99": 372.0 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 504.0, + "tokens_out": 30242, + "tokens_in": 0, + "requests_completed": 82, + "ttft_ms_p50": 47.1, + "ttft_ms_p99": 71.8 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 486.6, + "tokens_out": 29207, + "tokens_in": 0, + "requests_completed": 83, + "ttft_ms_p50": 47.1, + "ttft_ms_p99": 54.4 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 499.0, + "tokens_out": 29921, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 46.8, + "ttft_ms_p99": 52.6 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 496.0, + "tokens_out": 29768, + "tokens_in": 0, + "requests_completed": 79, + "ttft_ms_p50": 46.9, + "ttft_ms_p99": 49.5 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 498.3, + "tokens_out": 29901, + "tokens_in": 0, + "requests_completed": 84, + "ttft_ms_p50": 47.0, + "ttft_ms_p99": 52.3 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 495.4, + "tokens_out": 29715, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 46.7, + "ttft_ms_p99": 50.2 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 496.1, + "tokens_out": 29779, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 46.9, + "ttft_ms_p99": 53.9 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 503.2, + "tokens_out": 30195, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 47.3, + "ttft_ms_p99": 54.0 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 489.6, + "tokens_out": 29369, + "tokens_in": 0, + "requests_completed": 83, + "ttft_ms_p50": 46.9, + "ttft_ms_p99": 52.3 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 504.8, + "tokens_out": 30299, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 46.8, + "ttft_ms_p99": 52.2 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 500.4, + "tokens_out": 30017, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 46.7, + "ttft_ms_p99": 50.1 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 494.6, + "tokens_out": 29670, + "tokens_in": 0, + "requests_completed": 84, + "ttft_ms_p50": 46.8, + "ttft_ms_p99": 51.2 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 492.1, + "tokens_out": 29528, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 47.0, + "ttft_ms_p99": 51.5 + } + ] + }, + { + "precision": "W4A16", + "sustained_throughput_tokens_per_sec": 437.3, + "throttle_ratio": 0.897, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -632.2, + "sustained_concurrency": 8, + "duration_minutes": 15, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.4, + "tokens_out": 24574, + "tokens_in": 0, + "requests_completed": 73, + "ttft_ms_p50": 55.4, + "ttft_ms_p99": 690.1 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 431.5, + "tokens_out": 25899, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 74.1 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 456.2, + "tokens_out": 27364, + "tokens_in": 0, + "requests_completed": 77, + "ttft_ms_p50": 53.1, + "ttft_ms_p99": 56.8 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 439.7, + "tokens_out": 26374, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.2, + "ttft_ms_p99": 58.1 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 430.2, + "tokens_out": 25819, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.5, + "ttft_ms_p99": 56.1 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 432.7, + "tokens_out": 25962, + "tokens_in": 0, + "requests_completed": 74, + "ttft_ms_p50": 53.9, + "ttft_ms_p99": 57.4 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 452.2, + "tokens_out": 27140, + "tokens_in": 0, + "requests_completed": 77, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 57.2 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 436.1, + "tokens_out": 26169, + "tokens_in": 0, + "requests_completed": 74, + "ttft_ms_p50": 53.6, + "ttft_ms_p99": 57.7 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 432.3, + "tokens_out": 25934, + "tokens_in": 0, + "requests_completed": 76, + "ttft_ms_p50": 53.2, + "ttft_ms_p99": 57.8 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 431.9, + "tokens_out": 25908, + "tokens_in": 0, + "requests_completed": 73, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 56.8 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 449.0, + "tokens_out": 26936, + "tokens_in": 0, + "requests_completed": 77, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 57.9 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 445.3, + "tokens_out": 26739, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.1, + "ttft_ms_p99": 59.4 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 441.9, + "tokens_out": 26490, + "tokens_in": 0, + "requests_completed": 78, + "ttft_ms_p50": 53.3, + "ttft_ms_p99": 55.8 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 433.6, + "tokens_out": 26022, + "tokens_in": 0, + "requests_completed": 73, + "ttft_ms_p50": 53.0, + "ttft_ms_p99": 57.9 + } + ] + } + ] + } + }, + "accuracy": null, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "06:01:31", + "run_id": "ffd81462", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T05:59:03.115858+00:00", + "benchmark_end_time": "2026-05-18T06:01:31.820089+00:00", + "benchmark_elapsed_minutes": 134.3, + "model_load_seconds": 35.2, + "benchmark_elapsed_minutes_note": "Sum of per-precision benchmark_elapsed_minutes (excludes sleep gaps and orchestrator overhead).", + "scenario_dirs": { + "bf16/offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/offline", + "bf16/online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/online", + "bf16/sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/bf16/sustained", + "fp8/offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/fp8/offline", + "fp8/online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/fp8/online", + "fp8/sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/fp8/sustained", + "w8a8/offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/offline", + "w8a8/online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/online", + "w8a8/sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/sustained", + "w8a16/offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/offline", + "w8a16/online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/online", + "w8a16/sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/sustained", + "w4a16/offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/offline", + "w4a16/online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/online", + "w4a16/sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/sustained" + }, + "precision_dirs": { + "BF16": "bf16", + "FP8": "fp8", + "W8A8": "w8a8", + "W8A16": "w8a16", + "W4A16": "w4a16" + }, + "precision_model_map": { + "BF16": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "dtype_override": "bfloat16" + }, + "FP8": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8", + "model_revision": "12fd6884d2585dd4d020373e7f39f74507b31866", + "engine_kwargs": { + "quantization": "compressed-tensors" + }, + "_note": "Static per-tensor FP8 (weights + activations). Requires Ampere+ (A100, A800, H20). Skipped automatically on FP16-only hardware." + }, + "W8A8": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", + "model_revision": "e2bfb7d92784ad7d1b606c2f9644d3cefb2ec708", + "engine_kwargs": { + "quantization": "compressed-tensors" + }, + "_note": "INT8 weights + INT8 activations via compressed-tensors. Exercises native int8 tensor cores." + }, + "W8A16": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16", + "model_revision": "38e03ba250017bf8ed3eeecd3a744e21f6b994a9", + "engine_kwargs": { + "quantization": "compressed-tensors" + }, + "_note": "INT8 weights, FP16 activations. Weight-only quantization \u2014 reduces memory bandwidth, not compute dtype." + }, + "W4A16": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16", + "model_revision": "70371b1b0ea0d4eacfe1ee9056ee805629921c6e", + "engine_kwargs": { + "quantization": "gptq" + }, + "_note": "INT4 weights, FP16 activations via GPTQ Marlin kernels. Weight-only quantization \u2014 larger memory saving than W8A16." + } + } + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/accuracy/accuracy.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/accuracy/accuracy.json new file mode 100644 index 0000000..b3311bd --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/accuracy/accuracy.json @@ -0,0 +1,8 @@ +{ + "subset_score": 0.56, + "baseline_delta": -0.01, + "valid": true, + "framework": "vLLM", + "precision": "W4A16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/offline/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/offline/result.json new file mode 100644 index 0000000..cc0424b --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/offline/result.json @@ -0,0 +1,183 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16", + "model_revision": "70371b1b0ea0d4eacfe1ee9056ee805629921c6e", + "model_name": null, + "model_note": "INT4 weight-only quantization by RedHatAI using AWQ. Weights INT4, activations FP16.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W4A16", + "effective_dtype": "float16", + "quantization_method": "gptq", + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "offline", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 1889.19, + "throughput_tokens_per_sec_per_chip": 1889.19, + "throughput_tokens_per_sec_total": 3433.47, + "elapsed_seconds_median": 18.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 1862.45, + "throughput_tokens_per_sec_per_chip": 1862.45, + "throughput_tokens_per_sec_total": 3376.95, + "elapsed_seconds_median": 18.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 1861.34, + "throughput_tokens_per_sec_per_chip": 1861.34, + "throughput_tokens_per_sec_total": 3375.2, + "elapsed_seconds_median": 18.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 1851.04, + "throughput_tokens_per_sec_per_chip": 1851.04, + "throughput_tokens_per_sec_total": 3367.3, + "elapsed_seconds_median": 18.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "14:44:34", + "run_id": "b1eb2d96", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_b1eb2d96", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T14:39:39.920688+00:00", + "benchmark_end_time": "2026-05-18T14:44:34.781477+00:00", + "benchmark_elapsed_minutes": 4.9, + "model_load_seconds": 18.5 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/online/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/online/result.json new file mode 100644 index 0000000..7261934 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/online/result.json @@ -0,0 +1,181 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16", + "model_revision": "70371b1b0ea0d4eacfe1ee9056ee805629921c6e", + "model_name": null, + "model_note": "INT4 weight-only quantization by RedHatAI using AWQ. Weights INT4, activations FP16.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W4A16", + "effective_dtype": null, + "quantization_method": "gptq", + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "online", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 50.42, + "ttft_ms_p90": 64.62, + "ttft_ms_p99": 104.73, + "tpot_ms_p50": 18.15, + "tpot_ms_p90": 19.16, + "tpot_ms_p99": 19.62, + "elapsed_seconds_median": 71.1, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 57.89, + "ttft_ms_p90": 72.82, + "ttft_ms_p99": 84.96, + "tpot_ms_p50": 21.11, + "tpot_ms_p90": 23.03, + "tpot_ms_p99": 24.31, + "elapsed_seconds_median": 38.5, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 97.24, + "ttft_ms_p90": 6365.08, + "ttft_ms_p99": 7073.94, + "tpot_ms_p50": 25.46, + "tpot_ms_p90": 27.98, + "tpot_ms_p99": 31.21, + "elapsed_seconds_median": 29.2, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 918.12, + "ttft_ms_p90": 9805.4, + "ttft_ms_p99": 10437.5, + "tpot_ms_p50": 25.2, + "tpot_ms_p90": 27.67, + "tpot_ms_p99": 32.45, + "elapsed_seconds_median": 26.7, + "sla_met": false + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "14:53:48", + "run_id": "b1eb2d96", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_b1eb2d96", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T14:45:34.287656+00:00", + "benchmark_end_time": "2026-05-18T14:53:48.716951+00:00", + "benchmark_elapsed_minutes": 8.2, + "model_load_seconds": 29.3 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/result.json new file mode 100644 index 0000000..27e0744 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/result.json @@ -0,0 +1,400 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16", + "model_revision": "70371b1b0ea0d4eacfe1ee9056ee805629921c6e", + "model_name": null, + "model_note": "INT4 weight-only quantization by RedHatAI using AWQ. Weights INT4, activations FP16.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W4A16", + "effective_dtype": "float16", + "quantization_method": "gptq", + "model_format": "HuggingFace original" + }, + "task": { + "scenarios_run": [ + "offline", + "online", + "sustained" + ], + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "num_runs": 3, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + } + }, + "metrics": { + "derived": {}, + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 1889.19, + "throughput_tokens_per_sec_per_chip": 1889.19, + "throughput_tokens_per_sec_total": 3433.47, + "elapsed_seconds_median": 18.2, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 1862.45, + "throughput_tokens_per_sec_per_chip": 1862.45, + "throughput_tokens_per_sec_total": 3376.95, + "elapsed_seconds_median": 18.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 1861.34, + "throughput_tokens_per_sec_per_chip": 1861.34, + "throughput_tokens_per_sec_total": 3375.2, + "elapsed_seconds_median": 18.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 1851.04, + "throughput_tokens_per_sec_per_chip": 1851.04, + "throughput_tokens_per_sec_total": 3367.3, + "elapsed_seconds_median": 18.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + }, + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 50.42, + "ttft_ms_p90": 64.62, + "ttft_ms_p99": 104.73, + "tpot_ms_p50": 18.15, + "tpot_ms_p90": 19.16, + "tpot_ms_p99": 19.62, + "elapsed_seconds_median": 71.1, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 57.89, + "ttft_ms_p90": 72.82, + "ttft_ms_p99": 84.96, + "tpot_ms_p50": 21.11, + "tpot_ms_p90": 23.03, + "tpot_ms_p99": 24.31, + "elapsed_seconds_median": 38.5, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 97.24, + "ttft_ms_p90": 6365.08, + "ttft_ms_p99": 7073.94, + "tpot_ms_p50": 25.46, + "tpot_ms_p90": 27.98, + "tpot_ms_p99": 31.21, + "elapsed_seconds_median": 29.2, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 918.12, + "ttft_ms_p90": 9805.4, + "ttft_ms_p99": 10437.5, + "tpot_ms_p50": 25.2, + "tpot_ms_p90": 27.67, + "tpot_ms_p99": 32.45, + "elapsed_seconds_median": 26.7, + "sla_met": false + } + ] + }, + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.4, + "tokens_out": 24574, + "tokens_in": 0, + "requests_completed": 73, + "ttft_ms_p50": 55.4, + "ttft_ms_p99": 690.1 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 431.5, + "tokens_out": 25899, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 74.1 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 456.2, + "tokens_out": 27364, + "tokens_in": 0, + "requests_completed": 77, + "ttft_ms_p50": 53.1, + "ttft_ms_p99": 56.8 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 439.7, + "tokens_out": 26374, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.2, + "ttft_ms_p99": 58.1 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 430.2, + "tokens_out": 25819, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.5, + "ttft_ms_p99": 56.1 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 432.7, + "tokens_out": 25962, + "tokens_in": 0, + "requests_completed": 74, + "ttft_ms_p50": 53.9, + "ttft_ms_p99": 57.4 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 452.2, + "tokens_out": 27140, + "tokens_in": 0, + "requests_completed": 77, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 57.2 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 436.1, + "tokens_out": 26169, + "tokens_in": 0, + "requests_completed": 74, + "ttft_ms_p50": 53.6, + "ttft_ms_p99": 57.7 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 432.3, + "tokens_out": 25934, + "tokens_in": 0, + "requests_completed": 76, + "ttft_ms_p50": 53.2, + "ttft_ms_p99": 57.8 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 431.9, + "tokens_out": 25908, + "tokens_in": 0, + "requests_completed": 73, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 56.8 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 449.0, + "tokens_out": 26936, + "tokens_in": 0, + "requests_completed": 77, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 57.9 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 445.3, + "tokens_out": 26739, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.1, + "ttft_ms_p99": 59.4 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 441.9, + "tokens_out": 26490, + "tokens_in": 0, + "requests_completed": 78, + "ttft_ms_p50": 53.3, + "ttft_ms_p99": 55.8 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 433.6, + "tokens_out": 26022, + "tokens_in": 0, + "requests_completed": 73, + "ttft_ms_p50": 53.0, + "ttft_ms_p99": 57.9 + } + ], + "sustained_throughput_tokens_per_sec": 437.3, + "throttle_ratio": 0.897, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -632.2 + } + }, + "accuracy": { + "subset_score": 0.56, + "baseline_delta": -0.01, + "valid": true, + "framework": "vLLM", + "precision": "W4A16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "14:44:34", + "run_id": "b1eb2d96", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_b1eb2d96", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T14:39:39.920688+00:00", + "benchmark_end_time": "2026-05-18T14:44:34.781477+00:00", + "benchmark_elapsed_minutes": 28.3, + "model_load_seconds": 18.5, + "benchmark_elapsed_minutes_note": "Total across ['offline', 'online', 'sustained'] scenarios.", + "scenario_dirs": { + "offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/offline", + "online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/online", + "sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/sustained" + } + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/sustained/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/sustained/result.json new file mode 100644 index 0000000..9982bb0 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w4a16/sustained/result.json @@ -0,0 +1,279 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16", + "model_revision": "70371b1b0ea0d4eacfe1ee9056ee805629921c6e", + "model_name": null, + "model_note": "INT4 weight-only quantization by RedHatAI using AWQ. Weights INT4, activations FP16.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W4A16", + "effective_dtype": null, + "quantization_method": "gptq", + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "sustained", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 409.4, + "tokens_out": 24574, + "tokens_in": 0, + "requests_completed": 73, + "ttft_ms_p50": 55.4, + "ttft_ms_p99": 690.1 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 431.5, + "tokens_out": 25899, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 74.1 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 456.2, + "tokens_out": 27364, + "tokens_in": 0, + "requests_completed": 77, + "ttft_ms_p50": 53.1, + "ttft_ms_p99": 56.8 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 439.7, + "tokens_out": 26374, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.2, + "ttft_ms_p99": 58.1 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 430.2, + "tokens_out": 25819, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.5, + "ttft_ms_p99": 56.1 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 432.7, + "tokens_out": 25962, + "tokens_in": 0, + "requests_completed": 74, + "ttft_ms_p50": 53.9, + "ttft_ms_p99": 57.4 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 452.2, + "tokens_out": 27140, + "tokens_in": 0, + "requests_completed": 77, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 57.2 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 436.1, + "tokens_out": 26169, + "tokens_in": 0, + "requests_completed": 74, + "ttft_ms_p50": 53.6, + "ttft_ms_p99": 57.7 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 432.3, + "tokens_out": 25934, + "tokens_in": 0, + "requests_completed": 76, + "ttft_ms_p50": 53.2, + "ttft_ms_p99": 57.8 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 431.9, + "tokens_out": 25908, + "tokens_in": 0, + "requests_completed": 73, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 56.8 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 449.0, + "tokens_out": 26936, + "tokens_in": 0, + "requests_completed": 77, + "ttft_ms_p50": 53.4, + "ttft_ms_p99": 57.9 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 445.3, + "tokens_out": 26739, + "tokens_in": 0, + "requests_completed": 75, + "ttft_ms_p50": 53.1, + "ttft_ms_p99": 59.4 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 441.9, + "tokens_out": 26490, + "tokens_in": 0, + "requests_completed": 78, + "ttft_ms_p50": 53.3, + "ttft_ms_p99": 55.8 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 433.6, + "tokens_out": 26022, + "tokens_in": 0, + "requests_completed": 73, + "ttft_ms_p50": 53.0, + "ttft_ms_p99": 57.9 + } + ], + "sustained_throughput_tokens_per_sec": 437.3, + "throttle_ratio": 0.897, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -632.2 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "15:09:58", + "run_id": "b1eb2d96", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_b1eb2d96", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T14:54:46.184556+00:00", + "benchmark_end_time": "2026-05-18T15:09:58.042851+00:00", + "benchmark_elapsed_minutes": 15.2, + "model_load_seconds": 25.5 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/accuracy/accuracy.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/accuracy/accuracy.json new file mode 100644 index 0000000..a6505b1 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/accuracy/accuracy.json @@ -0,0 +1,8 @@ +{ + "subset_score": 0.58, + "baseline_delta": -0.01, + "valid": true, + "framework": "vLLM", + "precision": "W8A16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/offline/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/offline/result.json new file mode 100644 index 0000000..6cb0dda --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/offline/result.json @@ -0,0 +1,183 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16", + "model_revision": "38e03ba250017bf8ed3eeecd3a744e21f6b994a9", + "model_name": null, + "model_note": "INT8 weight-only quantization by RedHatAI using llm-compressor. Weights INT8, activations FP16.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W8A16", + "effective_dtype": "bfloat16", + "quantization_method": "compressed-tensors", + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "offline", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 3533.68, + "throughput_tokens_per_sec_per_chip": 3533.68, + "throughput_tokens_per_sec_total": 6328.84, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 3510.7, + "throughput_tokens_per_sec_per_chip": 3510.7, + "throughput_tokens_per_sec_total": 6292.5, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 3535.13, + "throughput_tokens_per_sec_per_chip": 3535.13, + "throughput_tokens_per_sec_total": 6324.07, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 3547.44, + "throughput_tokens_per_sec_per_chip": 3547.44, + "throughput_tokens_per_sec_total": 6336.33, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "14:10:48", + "run_id": "5b72ecb7", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_5b72ecb7", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T14:08:05.948198+00:00", + "benchmark_end_time": "2026-05-18T14:10:48.869711+00:00", + "benchmark_elapsed_minutes": 2.7, + "model_load_seconds": 25.3 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/online/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/online/result.json new file mode 100644 index 0000000..8f59b18 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/online/result.json @@ -0,0 +1,181 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16", + "model_revision": "38e03ba250017bf8ed3eeecd3a744e21f6b994a9", + "model_name": null, + "model_note": "INT8 weight-only quantization by RedHatAI using llm-compressor. Weights INT8, activations FP16.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W8A16", + "effective_dtype": null, + "quantization_method": "compressed-tensors", + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "online", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 46.32, + "ttft_ms_p90": 62.37, + "ttft_ms_p99": 104.49, + "tpot_ms_p50": 16.66, + "tpot_ms_p90": 17.7, + "tpot_ms_p99": 18.28, + "elapsed_seconds_median": 70.3, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 57.25, + "ttft_ms_p90": 72.45, + "ttft_ms_p99": 81.09, + "tpot_ms_p50": 20.79, + "tpot_ms_p90": 22.4, + "tpot_ms_p99": 23.2, + "elapsed_seconds_median": 37.9, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 93.4, + "ttft_ms_p90": 6429.1, + "ttft_ms_p99": 7429.34, + "tpot_ms_p50": 25.13, + "tpot_ms_p90": 28.01, + "tpot_ms_p99": 30.79, + "elapsed_seconds_median": 28.8, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 1306.21, + "ttft_ms_p90": 9937.77, + "ttft_ms_p99": 10640.43, + "tpot_ms_p50": 25.06, + "tpot_ms_p90": 27.42, + "tpot_ms_p99": 34.24, + "elapsed_seconds_median": 26.4, + "sla_met": false + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "14:20:00", + "run_id": "5b72ecb7", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_5b72ecb7", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T14:11:50.787859+00:00", + "benchmark_end_time": "2026-05-18T14:20:00.278335+00:00", + "benchmark_elapsed_minutes": 8.2, + "model_load_seconds": 34.3 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/result.json new file mode 100644 index 0000000..485a0fb --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/result.json @@ -0,0 +1,400 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16", + "model_revision": "38e03ba250017bf8ed3eeecd3a744e21f6b994a9", + "model_name": null, + "model_note": "INT8 weight-only quantization by RedHatAI using llm-compressor. Weights INT8, activations FP16.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W8A16", + "effective_dtype": "bfloat16", + "quantization_method": "compressed-tensors", + "model_format": "HuggingFace original" + }, + "task": { + "scenarios_run": [ + "offline", + "online", + "sustained" + ], + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "num_runs": 3, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + } + }, + "metrics": { + "derived": {}, + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 3533.68, + "throughput_tokens_per_sec_per_chip": 3533.68, + "throughput_tokens_per_sec_total": 6328.84, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 3510.7, + "throughput_tokens_per_sec_per_chip": 3510.7, + "throughput_tokens_per_sec_total": 6292.5, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 3535.13, + "throughput_tokens_per_sec_per_chip": 3535.13, + "throughput_tokens_per_sec_total": 6324.07, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 3547.44, + "throughput_tokens_per_sec_per_chip": 3547.44, + "throughput_tokens_per_sec_total": 6336.33, + "elapsed_seconds_median": 10.1, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + }, + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 46.32, + "ttft_ms_p90": 62.37, + "ttft_ms_p99": 104.49, + "tpot_ms_p50": 16.66, + "tpot_ms_p90": 17.7, + "tpot_ms_p99": 18.28, + "elapsed_seconds_median": 70.3, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 57.25, + "ttft_ms_p90": 72.45, + "ttft_ms_p99": 81.09, + "tpot_ms_p50": 20.79, + "tpot_ms_p90": 22.4, + "tpot_ms_p99": 23.2, + "elapsed_seconds_median": 37.9, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 93.4, + "ttft_ms_p90": 6429.1, + "ttft_ms_p99": 7429.34, + "tpot_ms_p50": 25.13, + "tpot_ms_p90": 28.01, + "tpot_ms_p99": 30.79, + "elapsed_seconds_median": 28.8, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 1306.21, + "ttft_ms_p90": 9937.77, + "ttft_ms_p99": 10640.43, + "tpot_ms_p50": 25.06, + "tpot_ms_p90": 27.42, + "tpot_ms_p99": 34.24, + "elapsed_seconds_median": 26.4, + "sla_met": false + } + ] + }, + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 456.8, + "tokens_out": 27416, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 50.0, + "ttft_ms_p99": 372.0 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 504.0, + "tokens_out": 30242, + "tokens_in": 0, + "requests_completed": 82, + "ttft_ms_p50": 47.1, + "ttft_ms_p99": 71.8 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 486.6, + "tokens_out": 29207, + "tokens_in": 0, + "requests_completed": 83, + "ttft_ms_p50": 47.1, + "ttft_ms_p99": 54.4 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 499.0, + "tokens_out": 29921, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 46.8, + "ttft_ms_p99": 52.6 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 496.0, + "tokens_out": 29768, + "tokens_in": 0, + "requests_completed": 79, + "ttft_ms_p50": 46.9, + "ttft_ms_p99": 49.5 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 498.3, + "tokens_out": 29901, + "tokens_in": 0, + "requests_completed": 84, + "ttft_ms_p50": 47.0, + "ttft_ms_p99": 52.3 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 495.4, + "tokens_out": 29715, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 46.7, + "ttft_ms_p99": 50.2 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 496.1, + "tokens_out": 29779, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 46.9, + "ttft_ms_p99": 53.9 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 503.2, + "tokens_out": 30195, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 47.3, + "ttft_ms_p99": 54.0 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 489.6, + "tokens_out": 29369, + "tokens_in": 0, + "requests_completed": 83, + "ttft_ms_p50": 46.9, + "ttft_ms_p99": 52.3 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 504.8, + "tokens_out": 30299, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 46.8, + "ttft_ms_p99": 52.2 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 500.4, + "tokens_out": 30017, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 46.7, + "ttft_ms_p99": 50.1 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 494.6, + "tokens_out": 29670, + "tokens_in": 0, + "requests_completed": 84, + "ttft_ms_p50": 46.8, + "ttft_ms_p99": 51.2 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 492.1, + "tokens_out": 29528, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 47.0, + "ttft_ms_p99": 51.5 + } + ], + "sustained_throughput_tokens_per_sec": 494.1, + "throttle_ratio": 0.905, + "throttle_onset_minute": null, + "ttft_p99_drift_ms": -320.5 + } + }, + "accuracy": { + "subset_score": 0.58, + "baseline_delta": -0.01, + "valid": true, + "framework": "vLLM", + "precision": "W8A16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "14:10:48", + "run_id": "5b72ecb7", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_5b72ecb7", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T14:08:05.948198+00:00", + "benchmark_end_time": "2026-05-18T14:10:48.869711+00:00", + "benchmark_elapsed_minutes": 26.1, + "model_load_seconds": 25.3, + "benchmark_elapsed_minutes_note": "Total across ['offline', 'online', 'sustained'] scenarios.", + "scenario_dirs": { + "offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/offline", + "online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/online", + "sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/sustained" + } + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/sustained/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/sustained/result.json new file mode 100644 index 0000000..0fa36d8 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a16/sustained/result.json @@ -0,0 +1,279 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16", + "model_revision": "38e03ba250017bf8ed3eeecd3a744e21f6b994a9", + "model_name": null, + "model_note": "INT8 weight-only quantization by RedHatAI using llm-compressor. Weights INT8, activations FP16.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W8A16", + "effective_dtype": null, + "quantization_method": "compressed-tensors", + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "sustained", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 456.8, + "tokens_out": 27416, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 50.0, + "ttft_ms_p99": 372.0 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 504.0, + "tokens_out": 30242, + "tokens_in": 0, + "requests_completed": 82, + "ttft_ms_p50": 47.1, + "ttft_ms_p99": 71.8 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 486.6, + "tokens_out": 29207, + "tokens_in": 0, + "requests_completed": 83, + "ttft_ms_p50": 47.1, + "ttft_ms_p99": 54.4 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 499.0, + "tokens_out": 29921, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 46.8, + "ttft_ms_p99": 52.6 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 496.0, + "tokens_out": 29768, + "tokens_in": 0, + "requests_completed": 79, + "ttft_ms_p50": 46.9, + "ttft_ms_p99": 49.5 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 498.3, + "tokens_out": 29901, + "tokens_in": 0, + "requests_completed": 84, + "ttft_ms_p50": 47.0, + "ttft_ms_p99": 52.3 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 495.4, + "tokens_out": 29715, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 46.7, + "ttft_ms_p99": 50.2 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 496.1, + "tokens_out": 29779, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 46.9, + "ttft_ms_p99": 53.9 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 503.2, + "tokens_out": 30195, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 47.3, + "ttft_ms_p99": 54.0 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 489.6, + "tokens_out": 29369, + "tokens_in": 0, + "requests_completed": 83, + "ttft_ms_p50": 46.9, + "ttft_ms_p99": 52.3 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 504.8, + "tokens_out": 30299, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 46.8, + "ttft_ms_p99": 52.2 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 500.4, + "tokens_out": 30017, + "tokens_in": 0, + "requests_completed": 85, + "ttft_ms_p50": 46.7, + "ttft_ms_p99": 50.1 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 494.6, + "tokens_out": 29670, + "tokens_in": 0, + "requests_completed": 84, + "ttft_ms_p50": 46.8, + "ttft_ms_p99": 51.2 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 492.1, + "tokens_out": 29528, + "tokens_in": 0, + "requests_completed": 81, + "ttft_ms_p50": 47.0, + "ttft_ms_p99": 51.5 + } + ], + "sustained_throughput_tokens_per_sec": 494.1, + "throttle_ratio": 0.905, + "throttle_onset_minute": null, + "ttft_p99_drift_ms": -320.5 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "14:36:58", + "run_id": "5b72ecb7", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_5b72ecb7", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T14:21:44.687162+00:00", + "benchmark_end_time": "2026-05-18T14:36:58.078243+00:00", + "benchmark_elapsed_minutes": 15.2, + "model_load_seconds": 75.8 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/accuracy/accuracy.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/accuracy/accuracy.json new file mode 100644 index 0000000..a4847b6 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/accuracy/accuracy.json @@ -0,0 +1,8 @@ +{ + "subset_score": 0.59, + "baseline_delta": 0.0, + "valid": true, + "framework": "vLLM", + "precision": "W8A8", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/offline/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/offline/result.json new file mode 100644 index 0000000..01e7a3d --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/offline/result.json @@ -0,0 +1,183 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", + "model_revision": "e2bfb7d92784ad7d1b606c2f9644d3cefb2ec708", + "model_name": null, + "model_note": "INT8 quantized by RedHatAI using llm-compressor (compressed-tensors). Both weights and activations quantized to INT8.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W8A8", + "effective_dtype": "bfloat16", + "quantization_method": "compressed-tensors", + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "offline", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 3208.11, + "throughput_tokens_per_sec_per_chip": 3208.11, + "throughput_tokens_per_sec_total": 5840.36, + "elapsed_seconds_median": 10.7, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 3140.16, + "throughput_tokens_per_sec_per_chip": 3140.16, + "throughput_tokens_per_sec_total": 5706.63, + "elapsed_seconds_median": 11.0, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 3193.23, + "throughput_tokens_per_sec_per_chip": 3193.23, + "throughput_tokens_per_sec_total": 5813.28, + "elapsed_seconds_median": 10.7, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 3175.58, + "throughput_tokens_per_sec_per_chip": 3175.58, + "throughput_tokens_per_sec_total": 5786.77, + "elapsed_seconds_median": 10.8, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "13:39:44", + "run_id": "1b79437b", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_1b79437b", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T13:36:50.535504+00:00", + "benchmark_end_time": "2026-05-18T13:39:44.822889+00:00", + "benchmark_elapsed_minutes": 2.9, + "model_load_seconds": 18.2 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/online/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/online/result.json new file mode 100644 index 0000000..5e5bd00 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/online/result.json @@ -0,0 +1,181 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", + "model_revision": "e2bfb7d92784ad7d1b606c2f9644d3cefb2ec708", + "model_name": null, + "model_note": "INT8 quantized by RedHatAI using llm-compressor (compressed-tensors). Both weights and activations quantized to INT8.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W8A8", + "effective_dtype": null, + "quantization_method": "compressed-tensors", + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "online", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 55.34, + "ttft_ms_p90": 63.29, + "ttft_ms_p99": 69.75, + "tpot_ms_p50": 20.67, + "tpot_ms_p90": 20.88, + "tpot_ms_p99": 21.3, + "elapsed_seconds_median": 72.9, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 57.42, + "ttft_ms_p90": 66.6, + "ttft_ms_p99": 69.92, + "tpot_ms_p50": 21.28, + "tpot_ms_p90": 22.19, + "tpot_ms_p99": 22.28, + "elapsed_seconds_median": 39.6, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 74.55, + "ttft_ms_p90": 4438.81, + "ttft_ms_p99": 5421.82, + "tpot_ms_p50": 22.53, + "tpot_ms_p90": 23.69, + "tpot_ms_p99": 25.14, + "elapsed_seconds_median": 27.7, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 985.11, + "ttft_ms_p90": 8331.22, + "ttft_ms_p99": 8868.55, + "tpot_ms_p50": 23.38, + "tpot_ms_p90": 24.38, + "tpot_ms_p99": 26.79, + "elapsed_seconds_median": 25.6, + "sla_met": false + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "13:48:53", + "run_id": "1b79437b", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_1b79437b", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T13:40:36.652448+00:00", + "benchmark_end_time": "2026-05-18T13:48:53.173908+00:00", + "benchmark_elapsed_minutes": 8.3, + "model_load_seconds": 23.6 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/result.json new file mode 100644 index 0000000..ff6e59c --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/result.json @@ -0,0 +1,400 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", + "model_revision": "e2bfb7d92784ad7d1b606c2f9644d3cefb2ec708", + "model_name": null, + "model_note": "INT8 quantized by RedHatAI using llm-compressor (compressed-tensors). Both weights and activations quantized to INT8.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W8A8", + "effective_dtype": "bfloat16", + "quantization_method": "compressed-tensors", + "model_format": "HuggingFace original" + }, + "task": { + "scenarios_run": [ + "offline", + "online", + "sustained" + ], + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "num_runs": 3, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + } + }, + "metrics": { + "derived": {}, + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 3208.11, + "throughput_tokens_per_sec_per_chip": 3208.11, + "throughput_tokens_per_sec_total": 5840.36, + "elapsed_seconds_median": 10.7, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 3140.16, + "throughput_tokens_per_sec_per_chip": 3140.16, + "throughput_tokens_per_sec_total": 5706.63, + "elapsed_seconds_median": 11.0, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 3193.23, + "throughput_tokens_per_sec_per_chip": 3193.23, + "throughput_tokens_per_sec_total": 5813.28, + "elapsed_seconds_median": 10.7, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 3175.58, + "throughput_tokens_per_sec_per_chip": 3175.58, + "throughput_tokens_per_sec_total": 5786.77, + "elapsed_seconds_median": 10.8, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + }, + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 10, + "results_by_qps": [ + { + "target_qps": 5, + "achieved_qps": 5.0, + "ttft_ms_p50": 55.34, + "ttft_ms_p90": 63.29, + "ttft_ms_p99": 69.75, + "tpot_ms_p50": 20.67, + "tpot_ms_p90": 20.88, + "tpot_ms_p99": 21.3, + "elapsed_seconds_median": 72.9, + "sla_met": true + }, + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 57.42, + "ttft_ms_p90": 66.6, + "ttft_ms_p99": 69.92, + "tpot_ms_p50": 21.28, + "tpot_ms_p90": 22.19, + "tpot_ms_p99": 22.28, + "elapsed_seconds_median": 39.6, + "sla_met": true + }, + { + "target_qps": 25, + "achieved_qps": 25.0, + "ttft_ms_p50": 74.55, + "ttft_ms_p90": 4438.81, + "ttft_ms_p99": 5421.82, + "tpot_ms_p50": 22.53, + "tpot_ms_p90": 23.69, + "tpot_ms_p99": 25.14, + "elapsed_seconds_median": 27.7, + "sla_met": false + }, + { + "target_qps": 50, + "achieved_qps": 50.0, + "ttft_ms_p50": 985.11, + "ttft_ms_p90": 8331.22, + "ttft_ms_p99": 8868.55, + "tpot_ms_p50": 23.38, + "tpot_ms_p90": 24.38, + "tpot_ms_p99": 26.79, + "elapsed_seconds_median": 25.6, + "sla_met": false + } + ] + }, + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 366.9, + "tokens_out": 22031, + "tokens_in": 0, + "requests_completed": 63, + "ttft_ms_p50": 59.1, + "ttft_ms_p99": 396.4 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 402.7, + "tokens_out": 24148, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 58.8, + "ttft_ms_p99": 63.4 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 400.8, + "tokens_out": 24050, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.5, + "ttft_ms_p99": 61.0 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 402.1, + "tokens_out": 24127, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.0, + "ttft_ms_p99": 61.2 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 395.3, + "tokens_out": 23722, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 58.7, + "ttft_ms_p99": 64.5 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 398.4, + "tokens_out": 23893, + "tokens_in": 0, + "requests_completed": 64, + "ttft_ms_p50": 58.7, + "ttft_ms_p99": 63.0 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 410.0, + "tokens_out": 24605, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.3, + "ttft_ms_p99": 62.1 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 398.6, + "tokens_out": 23918, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 59.5, + "ttft_ms_p99": 65.7 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 390.3, + "tokens_out": 23418, + "tokens_in": 0, + "requests_completed": 64, + "ttft_ms_p50": 58.3, + "ttft_ms_p99": 62.8 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 417.2, + "tokens_out": 25045, + "tokens_in": 0, + "requests_completed": 69, + "ttft_ms_p50": 58.2, + "ttft_ms_p99": 61.0 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 391.1, + "tokens_out": 23462, + "tokens_in": 0, + "requests_completed": 70, + "ttft_ms_p50": 57.8, + "ttft_ms_p99": 62.4 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 408.7, + "tokens_out": 24514, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.5, + "ttft_ms_p99": 65.6 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 415.2, + "tokens_out": 24925, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 58.2, + "ttft_ms_p99": 61.3 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 395.0, + "tokens_out": 23687, + "tokens_in": 0, + "requests_completed": 67, + "ttft_ms_p50": 58.0, + "ttft_ms_p99": 64.9 + } + ], + "sustained_throughput_tokens_per_sec": 399.4, + "throttle_ratio": 0.879, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -331.5 + } + }, + "accuracy": { + "subset_score": 0.59, + "baseline_delta": 0.0, + "valid": true, + "framework": "vLLM", + "precision": "W8A8", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "13:39:44", + "run_id": "1b79437b", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_1b79437b", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T13:36:50.535504+00:00", + "benchmark_end_time": "2026-05-18T13:39:44.822889+00:00", + "benchmark_elapsed_minutes": 26.5, + "model_load_seconds": 18.2, + "benchmark_elapsed_minutes_note": "Total across ['offline', 'online', 'sustained'] scenarios.", + "scenario_dirs": { + "offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/offline", + "online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/online", + "sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/sustained" + } + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/sustained/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/sustained/result.json new file mode 100644 index 0000000..eaea3a8 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_ffd81462/w8a8/sustained/result.json @@ -0,0 +1,279 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_C", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T05:56:25.789998+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", + "model_revision": "e2bfb7d92784ad7d1b606c2f9644d3cefb2ec708", + "model_name": null, + "model_note": "INT8 quantized by RedHatAI using llm-compressor (compressed-tensors). Both weights and activations quantized to INT8.", + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "W8A8", + "effective_dtype": null, + "quantization_method": "compressed-tensors", + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "sustained", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": true, + "max_num_seqs": 512, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 366.9, + "tokens_out": 22031, + "tokens_in": 0, + "requests_completed": 63, + "ttft_ms_p50": 59.1, + "ttft_ms_p99": 396.4 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 402.7, + "tokens_out": 24148, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 58.8, + "ttft_ms_p99": 63.4 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 400.8, + "tokens_out": 24050, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.5, + "ttft_ms_p99": 61.0 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 402.1, + "tokens_out": 24127, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.0, + "ttft_ms_p99": 61.2 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 395.3, + "tokens_out": 23722, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 58.7, + "ttft_ms_p99": 64.5 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 398.4, + "tokens_out": 23893, + "tokens_in": 0, + "requests_completed": 64, + "ttft_ms_p50": 58.7, + "ttft_ms_p99": 63.0 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 410.0, + "tokens_out": 24605, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.3, + "ttft_ms_p99": 62.1 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 398.6, + "tokens_out": 23918, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 59.5, + "ttft_ms_p99": 65.7 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 390.3, + "tokens_out": 23418, + "tokens_in": 0, + "requests_completed": 64, + "ttft_ms_p50": 58.3, + "ttft_ms_p99": 62.8 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 417.2, + "tokens_out": 25045, + "tokens_in": 0, + "requests_completed": 69, + "ttft_ms_p50": 58.2, + "ttft_ms_p99": 61.0 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 391.1, + "tokens_out": 23462, + "tokens_in": 0, + "requests_completed": 70, + "ttft_ms_p50": 57.8, + "ttft_ms_p99": 62.4 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 408.7, + "tokens_out": 24514, + "tokens_in": 0, + "requests_completed": 66, + "ttft_ms_p50": 58.5, + "ttft_ms_p99": 65.6 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 415.2, + "tokens_out": 24925, + "tokens_in": 0, + "requests_completed": 71, + "ttft_ms_p50": 58.2, + "ttft_ms_p99": 61.3 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 395.0, + "tokens_out": 23687, + "tokens_in": 0, + "requests_completed": 67, + "ttft_ms_p50": 58.0, + "ttft_ms_p99": 64.9 + } + ], + "sustained_throughput_tokens_per_sec": 399.4, + "throttle_ratio": 0.879, + "throttle_onset_minute": 1.0, + "ttft_p99_drift_ms": -331.5 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "14:04:58", + "run_id": "1b79437b", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_C_nvidia_vllm020_0f6c56e4_1b79437b", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T13:49:43.407380+00:00", + "benchmark_end_time": "2026-05-18T14:04:58.852879+00:00", + "benchmark_elapsed_minutes": 15.3, + "model_load_seconds": 22.2 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/accuracy/accuracy.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/accuracy/accuracy.json new file mode 100644 index 0000000..95fced5 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/accuracy/accuracy.json @@ -0,0 +1,8 @@ +{ + "subset_score": 0.56, + "baseline_delta": 0.0, + "valid": true, + "framework": "vLLM", + "precision": "BF16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/env_info.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/env_info.json new file mode 100644 index 0000000..327ddbe --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/env_info.json @@ -0,0 +1,49 @@ +{ + "collected_at": "2026-05-18T07:00:53.162228+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/interactive/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/interactive/result.json new file mode 100644 index 0000000..40a91c4 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/interactive/result.json @@ -0,0 +1,132 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_D", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T07:00:53.162228+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "interactive", + "num_runs": 2, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "interactive": { + "ttft_ms_p50": 3178.55, + "ttft_ms_p90": 3335.27, + "ttft_ms_p99": 3376.37, + "tpot_ms_p50": 13.1, + "tpot_ms_p90": 13.17, + "tpot_ms_p99": 13.2, + "peak_memory_gb": null, + "elapsed_seconds_median": 651.9 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "07:46:17", + "run_id": "43e96189", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T07:24:33.202740+00:00", + "benchmark_end_time": "2026-05-18T07:46:17.004385+00:00", + "benchmark_elapsed_minutes": 21.7, + "model_load_seconds": 65.2 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/offline/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/offline/result.json new file mode 100644 index 0000000..0778b9c --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/offline/result.json @@ -0,0 +1,152 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_D", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T07:00:53.162228+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": "bfloat16", + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "offline", + "num_runs": 2, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 65.15, + "throughput_tokens_per_sec_per_chip": 65.15, + "throughput_tokens_per_sec_total": 7353.82, + "elapsed_seconds_median": 196.5, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 65.12, + "throughput_tokens_per_sec_per_chip": 65.12, + "throughput_tokens_per_sec_total": 7349.93, + "elapsed_seconds_median": 196.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "07:23:02", + "run_id": "43e96189", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T07:03:23.326807+00:00", + "benchmark_end_time": "2026-05-18T07:23:02.065079+00:00", + "benchmark_elapsed_minutes": 19.6, + "model_load_seconds": 41.7 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/online/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/online/result.json new file mode 100644 index 0000000..fd6b20e --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/online/result.json @@ -0,0 +1,169 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_D", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T07:00:53.162228+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "online", + "num_runs": 2, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": false, + "max_num_seqs": 64, + "gpu_memory_utilization": 0.85 + }, + "runtime_metrics": null + }, + "metrics": { + "online": { + "sla_ttft_ms": 5000, + "max_valid_qps": 0.0, + "results_by_qps": [ + { + "target_qps": 0.5, + "achieved_qps": 0.5, + "ttft_ms_p50": 129241.71, + "ttft_ms_p90": 238515.99, + "ttft_ms_p99": 255266.98, + "tpot_ms_p50": 231.96, + "tpot_ms_p90": 236.23, + "tpot_ms_p99": 238.5, + "elapsed_seconds_median": 459.2, + "sla_met": false + }, + { + "target_qps": 1, + "achieved_qps": 1.0, + "ttft_ms_p50": 163924.47, + "ttft_ms_p90": 304663.59, + "ttft_ms_p99": 340432.59, + "tpot_ms_p50": 232.21, + "tpot_ms_p90": 236.44, + "tpot_ms_p99": 238.73, + "elapsed_seconds_median": 461.6, + "sla_met": false + }, + { + "target_qps": 2, + "achieved_qps": 2.0, + "ttft_ms_p50": 197613.68, + "ttft_ms_p90": 361816.51, + "ttft_ms_p99": 400408.53, + "tpot_ms_p50": 232.17, + "tpot_ms_p90": 236.5, + "tpot_ms_p99": 238.76, + "elapsed_seconds_median": 459.2, + "sla_met": false + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "10:05:01", + "run_id": "43e96189", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T09:19:01.620038+00:00", + "benchmark_end_time": "2026-05-18T10:05:01.503808+00:00", + "benchmark_elapsed_minutes": 46.0, + "model_load_seconds": 51.9 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/result.json new file mode 100644 index 0000000..ac745b1 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/result.json @@ -0,0 +1,519 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_D", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T07:00:53.162228+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": "bfloat16", + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenarios_run": [ + "offline", + "interactive", + "sustained", + "online" + ], + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "num_runs": 2, + "extra_config": null + }, + "metrics": { + "derived": {}, + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 1, + "throughput_tokens_per_sec": 65.15, + "throughput_tokens_per_sec_per_chip": 65.15, + "throughput_tokens_per_sec_total": 7353.82, + "elapsed_seconds_median": 196.5, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 65.12, + "throughput_tokens_per_sec_per_chip": 65.12, + "throughput_tokens_per_sec_total": 7349.93, + "elapsed_seconds_median": 196.6, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + }, + "interactive": { + "ttft_ms_p50": 3178.55, + "ttft_ms_p90": 3335.27, + "ttft_ms_p99": 3376.37, + "tpot_ms_p50": 13.1, + "tpot_ms_p90": 13.17, + "tpot_ms_p99": 13.2, + "peak_memory_gb": null, + "elapsed_seconds_median": 651.9 + }, + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 30, + "warmup_minutes": 2, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": true, + "throughput_tokens_per_sec": 34.1, + "tokens_out": 2048, + "tokens_in": 0, + "requests_completed": 8, + "ttft_ms_p50": 15012.6, + "ttft_ms_p99": 27511.4 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.4, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 6118.2, + "ttft_ms_p99": 6481.6 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5323.4, + "ttft_ms_p99": 6114.9 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5619.9, + "ttft_ms_p99": 6149.6 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5341.4, + "ttft_ms_p99": 6110.4 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5932.5, + "ttft_ms_p99": 6440.4 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 4964.1, + "ttft_ms_p99": 5812.6 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.8, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5597.6, + "ttft_ms_p99": 6251.9 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5486.4, + "ttft_ms_p99": 6180.1 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5472.0, + "ttft_ms_p99": 6505.0 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5850.6, + "ttft_ms_p99": 6694.6 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5208.5, + "ttft_ms_p99": 5840.7 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5909.3, + "ttft_ms_p99": 6251.2 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5593.9, + "ttft_ms_p99": 6073.0 + }, + { + "minute": 15.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5297.1, + "ttft_ms_p99": 6684.1 + }, + { + "minute": 16.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5956.8, + "ttft_ms_p99": 6615.6 + }, + { + "minute": 17.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.4, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5954.7, + "ttft_ms_p99": 6462.3 + }, + { + "minute": 18.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5267.7, + "ttft_ms_p99": 6152.2 + }, + { + "minute": 19.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5455.5, + "ttft_ms_p99": 5958.6 + }, + { + "minute": 20.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5614.7, + "ttft_ms_p99": 6275.4 + }, + { + "minute": 21.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5592.8, + "ttft_ms_p99": 6443.6 + }, + { + "minute": 22.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5407.0, + "ttft_ms_p99": 6248.9 + }, + { + "minute": 23.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5348.3, + "ttft_ms_p99": 5840.6 + }, + { + "minute": 24.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.4, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5893.0, + "ttft_ms_p99": 6513.5 + }, + { + "minute": 25.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.8, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 4939.8, + "ttft_ms_p99": 5825.9 + }, + { + "minute": 26.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 4900.3, + "ttft_ms_p99": 6665.7 + }, + { + "minute": 27.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5623.0, + "ttft_ms_p99": 6163.1 + }, + { + "minute": 28.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5881.5, + "ttft_ms_p99": 6217.3 + }, + { + "minute": 29.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 6084.9, + "ttft_ms_p99": 6683.6 + } + ], + "sustained_throughput_tokens_per_sec": 58.7, + "throttle_ratio": 0.866, + "throttle_onset_minute": 2.0, + "ttft_p99_drift_ms": 202.0 + }, + "online": { + "sla_ttft_ms": 5000, + "max_valid_qps": 0.0, + "results_by_qps": [ + { + "target_qps": 0.5, + "achieved_qps": 0.5, + "ttft_ms_p50": 129241.71, + "ttft_ms_p90": 238515.99, + "ttft_ms_p99": 255266.98, + "tpot_ms_p50": 231.96, + "tpot_ms_p90": 236.23, + "tpot_ms_p99": 238.5, + "elapsed_seconds_median": 459.2, + "sla_met": false + }, + { + "target_qps": 1, + "achieved_qps": 1.0, + "ttft_ms_p50": 163924.47, + "ttft_ms_p90": 304663.59, + "ttft_ms_p99": 340432.59, + "tpot_ms_p50": 232.21, + "tpot_ms_p90": 236.44, + "tpot_ms_p99": 238.73, + "elapsed_seconds_median": 461.6, + "sla_met": false + }, + { + "target_qps": 2, + "achieved_qps": 2.0, + "ttft_ms_p50": 197613.68, + "ttft_ms_p90": 361816.51, + "ttft_ms_p99": 400408.53, + "tpot_ms_p50": 232.17, + "tpot_ms_p90": 236.5, + "tpot_ms_p99": 238.76, + "elapsed_seconds_median": 459.2, + "sla_met": false + } + ] + } + }, + "accuracy": { + "subset_score": 0.56, + "baseline_delta": 0.0, + "valid": true, + "framework": "vLLM", + "precision": "BF16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "07:23:02", + "run_id": "43e96189", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T07:03:23.326807+00:00", + "benchmark_end_time": "2026-05-18T07:23:02.065079+00:00", + "benchmark_elapsed_minutes": 118.1, + "model_load_seconds": 41.7, + "benchmark_elapsed_minutes_note": "Total across ['offline', 'interactive', 'sustained', 'online'] scenarios.", + "scenario_dirs": { + "offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/offline", + "interactive": "results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/interactive", + "sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/sustained", + "online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/online" + } + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/sustained/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/sustained/result.json new file mode 100644 index 0000000..097e0e9 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189/sustained/result.json @@ -0,0 +1,424 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_D", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T07:00:53.162228+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "meta-llama/Llama-3.1-8B-Instruct", + "model_revision": "0e9e39f249a16976918f6564b8830bc894c89659", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 8.0, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "sustained", + "num_runs": 2, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": null, + "runtime_metrics": null + }, + "metrics": { + "sustained": { + "sustained_concurrency": 8, + "duration_minutes": 30, + "warmup_minutes": 2, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": true, + "throughput_tokens_per_sec": 34.1, + "tokens_out": 2048, + "tokens_in": 0, + "requests_completed": 8, + "ttft_ms_p50": 15012.6, + "ttft_ms_p99": 27511.4 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.4, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 6118.2, + "ttft_ms_p99": 6481.6 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5323.4, + "ttft_ms_p99": 6114.9 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5619.9, + "ttft_ms_p99": 6149.6 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5341.4, + "ttft_ms_p99": 6110.4 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5932.5, + "ttft_ms_p99": 6440.4 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 4964.1, + "ttft_ms_p99": 5812.6 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.8, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5597.6, + "ttft_ms_p99": 6251.9 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5486.4, + "ttft_ms_p99": 6180.1 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5472.0, + "ttft_ms_p99": 6505.0 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5850.6, + "ttft_ms_p99": 6694.6 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5208.5, + "ttft_ms_p99": 5840.7 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5909.3, + "ttft_ms_p99": 6251.2 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5593.9, + "ttft_ms_p99": 6073.0 + }, + { + "minute": 15.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5297.1, + "ttft_ms_p99": 6684.1 + }, + { + "minute": 16.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5956.8, + "ttft_ms_p99": 6615.6 + }, + { + "minute": 17.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.4, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5954.7, + "ttft_ms_p99": 6462.3 + }, + { + "minute": 18.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5267.7, + "ttft_ms_p99": 6152.2 + }, + { + "minute": 19.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5455.5, + "ttft_ms_p99": 5958.6 + }, + { + "minute": 20.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5614.7, + "ttft_ms_p99": 6275.4 + }, + { + "minute": 21.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5592.8, + "ttft_ms_p99": 6443.6 + }, + { + "minute": 22.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5407.0, + "ttft_ms_p99": 6248.9 + }, + { + "minute": 23.0, + "is_warmup": false, + "throughput_tokens_per_sec": 64.0, + "tokens_out": 3840, + "tokens_in": 0, + "requests_completed": 15, + "ttft_ms_p50": 5348.3, + "ttft_ms_p99": 5840.6 + }, + { + "minute": 24.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.4, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 5893.0, + "ttft_ms_p99": 6513.5 + }, + { + "minute": 25.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.8, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 4939.8, + "ttft_ms_p99": 5825.9 + }, + { + "minute": 26.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 4900.3, + "ttft_ms_p99": 6665.7 + }, + { + "minute": 27.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5623.0, + "ttft_ms_p99": 6163.1 + }, + { + "minute": 28.0, + "is_warmup": false, + "throughput_tokens_per_sec": 59.7, + "tokens_out": 3584, + "tokens_in": 0, + "requests_completed": 14, + "ttft_ms_p50": 5881.5, + "ttft_ms_p99": 6217.3 + }, + { + "minute": 29.0, + "is_warmup": false, + "throughput_tokens_per_sec": 55.5, + "tokens_out": 3328, + "tokens_in": 0, + "requests_completed": 13, + "ttft_ms_p50": 6084.9, + "ttft_ms_p99": 6683.6 + } + ], + "sustained_throughput_tokens_per_sec": 58.7, + "throttle_ratio": 0.866, + "throttle_onset_minute": 2.0, + "ttft_p99_drift_ms": 202.0 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "08:18:12", + "run_id": "43e96189", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_D_nvidia_vllm020_0f6c56e4_43e96189", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T07:47:24.813895+00:00", + "benchmark_end_time": "2026-05-18T08:18:12.448165+00:00", + "benchmark_elapsed_minutes": 30.8, + "model_load_seconds": 40.8 + } +} diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/accuracy/accuracy.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/accuracy/accuracy.json new file mode 100644 index 0000000..21e4fec --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/accuracy/accuracy.json @@ -0,0 +1,8 @@ +{ + "subset_score": 0.41, + "baseline_delta": 0.03, + "valid": true, + "framework": "vLLM", + "precision": "BF16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/env_info.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/env_info.json new file mode 100644 index 0000000..538f8e4 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/env_info.json @@ -0,0 +1,49 @@ +{ + "collected_at": "2026-05-18T10:05:28.924925+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/interactive/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/interactive/result.json new file mode 100644 index 0000000..f676540 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/interactive/result.json @@ -0,0 +1,137 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_F", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T10:05:28.924925+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "Qwen/Qwen2.5-0.5B-Instruct", + "model_revision": "7ae557604adf67be50417f59c2c2f167def9a775", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 0.5, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "interactive", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": false, + "max_num_seqs": 128, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "interactive": { + "ttft_ms_p50": 11.24, + "ttft_ms_p90": 13.36, + "ttft_ms_p99": 14.74, + "tpot_ms_p50": 1.83, + "tpot_ms_p90": 1.83, + "tpot_ms_p99": 1.87, + "peak_memory_gb": null, + "elapsed_seconds_median": 59.1 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "10:14:20", + "run_id": "a4e6a6e4", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T10:11:23.809781+00:00", + "benchmark_end_time": "2026-05-18T10:14:20.932444+00:00", + "benchmark_elapsed_minutes": 3.0, + "model_load_seconds": 27.1 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/offline/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/offline/result.json new file mode 100644 index 0000000..8c532a4 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/offline/result.json @@ -0,0 +1,170 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_F", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T10:05:28.924925+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "Qwen/Qwen2.5-0.5B-Instruct", + "model_revision": "7ae557604adf67be50417f59c2c2f167def9a775", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 0.5, + "precision": "BF16", + "effective_dtype": "bfloat16", + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "offline", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": false, + "max_num_seqs": 128, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 22462.36, + "throughput_tokens_per_sec_per_chip": 22462.36, + "throughput_tokens_per_sec_total": 33497.61, + "elapsed_seconds_median": 1.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 22493.12, + "throughput_tokens_per_sec_per_chip": 22493.12, + "throughput_tokens_per_sec_total": 33569.32, + "elapsed_seconds_median": 1.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 22884.92, + "throughput_tokens_per_sec_per_chip": 22884.92, + "throughput_tokens_per_sec_total": 34039.23, + "elapsed_seconds_median": 1.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "10:07:40", + "run_id": "a4e6a6e4", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T10:07:16.851531+00:00", + "benchmark_end_time": "2026-05-18T10:07:40.157696+00:00", + "benchmark_elapsed_minutes": 0.4, + "model_load_seconds": 28.0 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/online/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/online/result.json new file mode 100644 index 0000000..7df4890 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/online/result.json @@ -0,0 +1,157 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_F", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T10:05:28.924925+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "Qwen/Qwen2.5-0.5B-Instruct", + "model_revision": "7ae557604adf67be50417f59c2c2f167def9a775", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 0.5, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "online", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": false, + "max_num_seqs": 128, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 40, + "results_by_qps": [ + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 10.04, + "ttft_ms_p90": 12.66, + "ttft_ms_p99": 18.27, + "tpot_ms_p50": 2.13, + "tpot_ms_p90": 2.19, + "tpot_ms_p99": 2.37, + "elapsed_seconds_median": 32.0, + "sla_met": true + }, + { + "target_qps": 40, + "achieved_qps": 40.0, + "ttft_ms_p50": 11.96, + "ttft_ms_p90": 15.96, + "ttft_ms_p99": 19.93, + "tpot_ms_p50": 2.5, + "tpot_ms_p90": 2.65, + "tpot_ms_p99": 2.87, + "elapsed_seconds_median": 7.9, + "sla_met": true + } + ] + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "10:10:30", + "run_id": "a4e6a6e4", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T10:08:31.412469+00:00", + "benchmark_end_time": "2026-05-18T10:10:30.424981+00:00", + "benchmark_elapsed_minutes": 2.0, + "model_load_seconds": 24.1 + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/result.json new file mode 100644 index 0000000..2e7e0ce --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/result.json @@ -0,0 +1,375 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_F", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T10:05:28.924925+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "Qwen/Qwen2.5-0.5B-Instruct", + "model_revision": "7ae557604adf67be50417f59c2c2f167def9a775", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 0.5, + "precision": "BF16", + "effective_dtype": "bfloat16", + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenarios_run": [ + "offline", + "online", + "interactive", + "sustained" + ], + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "num_runs": 3, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": false, + "max_num_seqs": 128, + "gpu_memory_utilization": 0.9 + } + }, + "metrics": { + "derived": {}, + "offline": { + "results_by_concurrency": [ + { + "client_concurrency": 4, + "throughput_tokens_per_sec": 22462.36, + "throughput_tokens_per_sec_per_chip": 22462.36, + "throughput_tokens_per_sec_total": 33497.61, + "elapsed_seconds_median": 1.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 16, + "throughput_tokens_per_sec": 22493.12, + "throughput_tokens_per_sec_per_chip": 22493.12, + "throughput_tokens_per_sec_total": 33569.32, + "elapsed_seconds_median": 1.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + }, + { + "client_concurrency": 64, + "throughput_tokens_per_sec": 22884.92, + "throughput_tokens_per_sec_per_chip": 22884.92, + "throughput_tokens_per_sec_total": 34039.23, + "elapsed_seconds_median": 1.9, + "peak_memory_gb": null, + "power_watts_avg": null, + "power_watts_peak": null, + "oom": false, + "_throughput_note": "output_only", + "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs." + } + ] + }, + "online": { + "sla_ttft_ms": 500, + "max_valid_qps": 40, + "results_by_qps": [ + { + "target_qps": 10, + "achieved_qps": 10.0, + "ttft_ms_p50": 10.04, + "ttft_ms_p90": 12.66, + "ttft_ms_p99": 18.27, + "tpot_ms_p50": 2.13, + "tpot_ms_p90": 2.19, + "tpot_ms_p99": 2.37, + "elapsed_seconds_median": 32.0, + "sla_met": true + }, + { + "target_qps": 40, + "achieved_qps": 40.0, + "ttft_ms_p50": 11.96, + "ttft_ms_p90": 15.96, + "ttft_ms_p99": 19.93, + "tpot_ms_p50": 2.5, + "tpot_ms_p90": 2.65, + "tpot_ms_p99": 2.87, + "elapsed_seconds_median": 7.9, + "sla_met": true + } + ] + }, + "interactive": { + "ttft_ms_p50": 11.24, + "ttft_ms_p90": 13.36, + "ttft_ms_p99": 14.74, + "tpot_ms_p50": 1.83, + "tpot_ms_p90": 1.83, + "tpot_ms_p99": 1.87, + "peak_memory_gb": null, + "elapsed_seconds_median": 59.1 + }, + "sustained": { + "sustained_concurrency": 32, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11541.5, + "tokens_out": 692740, + "tokens_in": 0, + "requests_completed": 3291, + "ttft_ms_p50": 12.9, + "ttft_ms_p99": 36.1 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11672.6, + "tokens_out": 700107, + "tokens_in": 0, + "requests_completed": 3324, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 20.3 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11721.9, + "tokens_out": 703664, + "tokens_in": 0, + "requests_completed": 3337, + "ttft_ms_p50": 12.7, + "ttft_ms_p99": 19.1 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11526.8, + "tokens_out": 691780, + "tokens_in": 0, + "requests_completed": 3289, + "ttft_ms_p50": 13.3, + "ttft_ms_p99": 20.6 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11228.1, + "tokens_out": 673578, + "tokens_in": 0, + "requests_completed": 3190, + "ttft_ms_p50": 13.8, + "ttft_ms_p99": 21.5 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11380.9, + "tokens_out": 682977, + "tokens_in": 0, + "requests_completed": 3245, + "ttft_ms_p50": 13.8, + "ttft_ms_p99": 21.0 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11711.7, + "tokens_out": 702201, + "tokens_in": 0, + "requests_completed": 3331, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 20.1 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11643.5, + "tokens_out": 698683, + "tokens_in": 0, + "requests_completed": 3317, + "ttft_ms_p50": 12.7, + "ttft_ms_p99": 20.3 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11662.7, + "tokens_out": 700038, + "tokens_in": 0, + "requests_completed": 3323, + "ttft_ms_p50": 12.7, + "ttft_ms_p99": 20.2 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11612.6, + "tokens_out": 696555, + "tokens_in": 0, + "requests_completed": 3294, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 19.0 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11623.8, + "tokens_out": 697495, + "tokens_in": 0, + "requests_completed": 3317, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 19.7 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11639.3, + "tokens_out": 698222, + "tokens_in": 0, + "requests_completed": 3311, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 20.2 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11651.4, + "tokens_out": 699229, + "tokens_in": 0, + "requests_completed": 3321, + "ttft_ms_p50": 12.7, + "ttft_ms_p99": 20.7 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11450.2, + "tokens_out": 686752, + "tokens_in": 0, + "requests_completed": 3257, + "ttft_ms_p50": 13.6, + "ttft_ms_p99": 20.7 + } + ], + "sustained_throughput_tokens_per_sec": 11576.2, + "throttle_ratio": 0.958, + "throttle_onset_minute": null, + "ttft_p99_drift_ms": -15.4 + } + }, + "accuracy": { + "subset_score": 0.41, + "baseline_delta": 0.03, + "valid": true, + "framework": "vLLM", + "precision": "BF16", + "notes": "Integrated accuracy check \u2014 used same vLLM instance as benchmark." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "10:07:40", + "run_id": "a4e6a6e4", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T10:07:16.851531+00:00", + "benchmark_end_time": "2026-05-18T10:07:40.157696+00:00", + "benchmark_elapsed_minutes": 20.4, + "model_load_seconds": 28.0, + "benchmark_elapsed_minutes_note": "Total across ['offline', 'online', 'interactive', 'sustained'] scenarios.", + "scenario_dirs": { + "offline": "results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/offline", + "online": "results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/online", + "interactive": "results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/interactive", + "sustained": "results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/sustained" + } + } +} \ No newline at end of file diff --git a/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/sustained/result.json b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/sustained/result.json new file mode 100644 index 0000000..6851ff6 --- /dev/null +++ b/results/verified/nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4/sustained/result.json @@ -0,0 +1,279 @@ +{ + "schema_version": "1.0", + "suite_id": "suite_F", + "implementation_id": "nvidia_vllm020_0f6c56e4", + "chip": { + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "count": 1, + "memory_gb": 80.0, + "interconnect_intra_node": null, + "interconnect_inter_node": null + }, + "environment": { + "collected_at": "2026-05-18T10:05:28.924925+00:00", + "accelerators": [ + { + "index": 0, + "name": "NVIDIA A100-SXM4-80GB", + "vendor": "NVIDIA", + "memory_gb": 80.0, + "driver_version": "580.65.06", + "firmware_version": null, + "compute_capability": "8.0", + "supports_bf16": true + } + ], + "accelerator_platform": "nvidia", + "accelerator_topology": "\tGPU0\tNIC0\tNIC1\tNIC2\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\nGPU0\t X \tPXB\tSYS\tSYS\t0-63,128-191\t0\t\tN/A\nNIC0\tPXB\t X \tSYS\tSYS\t\t\t\t\nNIC1\tSYS\tSYS\t X \tPIX\t\t\t\t\nNIC2\tSYS\tSYS\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n\n", + "intra_node_interconnect": null, + "cpu": { + "model": "AMD EPYC 7742 64-Core Processor", + "physical_cores": 128, + "logical_cores": 255, + "numa_nodes": 2 + }, + "system_memory_gb": 1007.7, + "pcie_generation": "PCIe Gen 4", + "cpu_accelerator_bandwidth_gbs": null, + "network_interfaces": [ + { + "name": "mlx5_0", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_1", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + }, + { + "name": "mlx5_2", + "type": "InfiniBand/RoCE", + "bandwidth_gbps": null + } + ], + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0", + "kernel_version": "5.15.0-60-generic", + "runtime_version": "CUDA 13.0", + "pytorch_version": "2.11.0+cu130" + }, + "software": { + "framework": "vLLM", + "framework_version": "0.20.1+transformers-5.8.1", + "driver_version": "580.65.06", + "runtime_version": "CUDA 13.0", + "os": "Ubuntu 22.04.4 LTS", + "python_version": "3.12.0" + }, + "model": { + "model_id": "Qwen/Qwen2.5-0.5B-Instruct", + "model_revision": "7ae557604adf67be50417f59c2c2f167def9a775", + "model_name": null, + "model_note": null, + "model_source": "local", + "architecture": "dense", + "parameter_count_b": 0.5, + "precision": "BF16", + "effective_dtype": null, + "quantization_method": null, + "model_format": "HuggingFace original" + }, + "task": { + "scenario": "sustained", + "num_runs": 3, + "warmup_runs": 1, + "parallelism": { + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "expert_parallel_size": 1, + "data_parallel_size": 1 + }, + "extra_config": { + "tensor_parallel_size": 1, + "enforce_eager": false, + "max_num_seqs": 128, + "gpu_memory_utilization": 0.9 + }, + "runtime_metrics": null + }, + "metrics": { + "sustained": { + "sustained_concurrency": 32, + "duration_minutes": 15, + "warmup_minutes": 1, + "sample_interval_seconds": 60, + "samples": [ + { + "minute": 1.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11541.5, + "tokens_out": 692740, + "tokens_in": 0, + "requests_completed": 3291, + "ttft_ms_p50": 12.9, + "ttft_ms_p99": 36.1 + }, + { + "minute": 2.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11672.6, + "tokens_out": 700107, + "tokens_in": 0, + "requests_completed": 3324, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 20.3 + }, + { + "minute": 3.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11721.9, + "tokens_out": 703664, + "tokens_in": 0, + "requests_completed": 3337, + "ttft_ms_p50": 12.7, + "ttft_ms_p99": 19.1 + }, + { + "minute": 4.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11526.8, + "tokens_out": 691780, + "tokens_in": 0, + "requests_completed": 3289, + "ttft_ms_p50": 13.3, + "ttft_ms_p99": 20.6 + }, + { + "minute": 5.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11228.1, + "tokens_out": 673578, + "tokens_in": 0, + "requests_completed": 3190, + "ttft_ms_p50": 13.8, + "ttft_ms_p99": 21.5 + }, + { + "minute": 6.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11380.9, + "tokens_out": 682977, + "tokens_in": 0, + "requests_completed": 3245, + "ttft_ms_p50": 13.8, + "ttft_ms_p99": 21.0 + }, + { + "minute": 7.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11711.7, + "tokens_out": 702201, + "tokens_in": 0, + "requests_completed": 3331, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 20.1 + }, + { + "minute": 8.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11643.5, + "tokens_out": 698683, + "tokens_in": 0, + "requests_completed": 3317, + "ttft_ms_p50": 12.7, + "ttft_ms_p99": 20.3 + }, + { + "minute": 9.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11662.7, + "tokens_out": 700038, + "tokens_in": 0, + "requests_completed": 3323, + "ttft_ms_p50": 12.7, + "ttft_ms_p99": 20.2 + }, + { + "minute": 10.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11612.6, + "tokens_out": 696555, + "tokens_in": 0, + "requests_completed": 3294, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 19.0 + }, + { + "minute": 11.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11623.8, + "tokens_out": 697495, + "tokens_in": 0, + "requests_completed": 3317, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 19.7 + }, + { + "minute": 12.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11639.3, + "tokens_out": 698222, + "tokens_in": 0, + "requests_completed": 3311, + "ttft_ms_p50": 12.8, + "ttft_ms_p99": 20.2 + }, + { + "minute": 13.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11651.4, + "tokens_out": 699229, + "tokens_in": 0, + "requests_completed": 3321, + "ttft_ms_p50": 12.7, + "ttft_ms_p99": 20.7 + }, + { + "minute": 14.0, + "is_warmup": false, + "throughput_tokens_per_sec": 11450.2, + "tokens_out": 686752, + "tokens_in": 0, + "requests_completed": 3257, + "ttft_ms_p50": 13.6, + "ttft_ms_p99": 20.7 + } + ], + "sustained_throughput_tokens_per_sec": 11576.2, + "throttle_ratio": 0.958, + "throttle_onset_minute": null, + "ttft_p99_drift_ms": -15.4 + } + }, + "accuracy": { + "subset_score": null, + "baseline_delta": null, + "valid": false, + "notes": "Run --scenario accuracy to check model accuracy." + }, + "meta": { + "submitted_by": "JuhaoLiang1997", + "submission_type": "individual", + "date": "2026-05-18", + "time": "10:30:10", + "run_id": "a4e6a6e4", + "run_name": "nvidia_a100_sxm4_80gbx1_suite_F_nvidia_vllm020_0f6c56e4_a4e6a6e4", + "flagged": null, + "reproduce_script": "runners/nvidia_vllm020_0f6c56e4/runner.py", + "env_info_file": "../env_info.json", + "log_file": "run.log", + "samples_file": "samples.jsonl", + "notes": null, + "benchmark_start_time": "2026-05-18T10:15:08.957236+00:00", + "benchmark_end_time": "2026-05-18T10:30:10.126252+00:00", + "benchmark_elapsed_minutes": 15.0, + "model_load_seconds": 21.3 + } +} \ No newline at end of file diff --git a/runners/benchmark_runner.py b/runners/benchmark_runner.py index 30afb70..5b0c274 100644 --- a/runners/benchmark_runner.py +++ b/runners/benchmark_runner.py @@ -561,8 +561,19 @@ def _compute_implementation_id(self) -> str | None: unexpected path or from the base class directly). """ try: - # Get the path of the concrete subclass file (not benchmark_runner.py) - runner_file = Path(inspect.getfile(self.__class__)) + # Resolve runner.py path. Prefer the defining module's __file__ because + # torch may patch inspect.getfile() and break dynamic imports. + runner_file = None + mod = sys.modules.get(self.__class__.__module__) + if mod is not None: + mod_file = getattr(mod, "__file__", None) + if mod_file: + runner_file = Path(mod_file).resolve() + if runner_file is None or runner_file.name != "runner.py": + try: + runner_file = Path(inspect.getfile(self.__class__)).resolve() + except (TypeError, OSError): + return None # The runner must be inside a folder named {platform}_{name}_{hash8} folder = runner_file.parent diff --git a/runners/nvidia_vllm020_0f6c56e4/README.md b/runners/nvidia_vllm020_0f6c56e4/README.md index 581e54f..2d1657f 100644 --- a/runners/nvidia_vllm020_0f6c56e4/README.md +++ b/runners/nvidia_vllm020_0f6c56e4/README.md @@ -105,11 +105,9 @@ cp configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml.example \ Merge priority: CLI flags > suite-specific section > global defaults. -### Suite C — quantization (`enforce_eager` required) +### Suite C — quantization -vLLM 0.20 enables CUDA graphs by default. With `compressed-tensors` checkpoints (FP8, W8A8, W8A16), graphs can produce **repetitive garbage output**: offline throughput looks normal but MMLU accuracy drops to ~0. - -The example config sets this only for Suite C so other suites keep CUDA graphs: +Copy `runner_nvidia_vllm020_0f6c56e4.yaml` from the example and keep the `suite_C` override: ```yaml suites: @@ -117,7 +115,9 @@ suites: enforce_eager: true ``` -CLI override: `--enforce-eager`. Without it, Suite C accuracy results are invalid even if throughput is high. +**`enforce_eager` (required for W8A8 / W8A16 on all GPUs):** vLLM 0.20 + CUDA graphs + `compressed-tensors` can yield repetitive garbage (`-addon-addon-…`) with normal-looking offline throughput. Suite C must set `enforce_eager: true` (or pass `--enforce-eager`). + +**FP8 on Ampere (A100 / A800 / RTX 30xx, compute capability < 8.9):** vLLM 0.20 does **not** run RedHatAI FP8 checkpoints correctly. The engine falls back to weight-only Marlin FP8 (`marlin_utils_fp8` warning in the log) and accuracy stays ~0 even with `enforce_eager: true`. This is a vLLM 0.20 limitation, not an AccelMark bug. On these GPUs, Suite C **W8A8 / W8A16 / BF16** are valid; for FP8 use **H100+** (sm ≥ 8.9) or the [`nvidia_vllm_47f5d58e`](../nvidia_vllm_47f5d58e/) runner on vLLM 0.7.3. ### Optional `engine_kwargs` (0.20) @@ -149,7 +149,9 @@ This runner targets Ampere+ with CUDA 12.8/13.0. For Volta/Turing, use [`nvidia_ ### Suite C accuracy ~0 but offline OK -Enable `enforce_eager` for `suite_C` in the runner config (see above) and re-run the accuracy scenario. +1. Confirm `configs/runner_configs/runner_nvidia_vllm020_0f6c56e4.yaml` exists and has `suites.suite_C.enforce_eager: true`. +2. Re-run accuracy with `--force` (or delete the format’s `accuracy/` folder). +3. If the log shows `Weight-only FP8 compression will be used leveraging the Marlin kernel` on an **A100**, FP8 will stay ~0 on vLLM 0.20 — use W8A8/W8A16 or H100+ for FP8 (see Suite C section above). ## Hardware matrix