Merged
75 changes: 75 additions & 0 deletions .github/ISSUE_TEMPLATE/new_suite.md
@@ -0,0 +1,75 @@
---
name: Propose a new suite
about: Propose a new benchmark suite (new model, scenario mix, or scaling axis)
title: "[Suite] <short description, e.g. 'Suite H — Llama-3.1-405B'>"
labels: suite-proposal
assignees: ''
---

<!--
This template starts the discussion for a new AccelMark suite. The final
contract goes into suites/<suite_id>/suite.json (see
schema/suite.schema.json) — please fill in as many of the fields below as
you can. Anything you leave blank we'll work out in the thread before
merging.

Full walk-through: DEVELOPMENT.md "Adding a new suite"
https://github.com/JuhaoLiang1997/AccelMark/blob/main/DEVELOPMENT.md
-->

## Why this suite?

<!-- One sentence: the question this suite answers that no existing suite
(A–G) covers. Example: "How fast is this chip on 405B-parameter
dense models?" -->

## Suite contract (draft)

| Field | Proposed value |
|---|---|
| **Suite ID** | `suite_<X>` |
| **Model** | `<huggingface/repo-id>` |
| **Model revision** | `<commit sha or tag>` |
| **Chip count** | `1` / `auto` / specific number |
| **Precision** | `BF16` / `FP16` / list of allowed precisions |
| **Dataset** | existing (`sharegpt_standard_v1`, `sharegpt_edge_v1`, `sharegpt_longctx_v1`) or new |
| **Max model length** | tokens |
| **Output tokens (max)** | tokens |
| **Concurrency levels** | e.g. `[8, 32, 128]` |
| **Default scenarios** | subset of `accuracy / offline / online / interactive / sustained` |
| **Extra scenarios** | optional: `sustained / speculative / burst / …` |
| **Primary metric** | `offline_throughput`, `max_valid_qps`, … |
| **Expected run time on A100** | minutes |

## Accuracy baseline

<!-- Required before the suite can land on the main leaderboard. -->

- [ ] I will provide an A100 (or equivalent reference) BF16 baseline score
to add to `schema/accuracy_baselines.json`.
- [ ] If a new dataset is required, I will submit it under
`datasets/<name>_v1/` with a `README.md` that documents the source
and upstream license (see [`datasets/README.md`](../../datasets/README.md)).

## Custom orchestration?

<!-- Most suites only need `suite.json`. Mark these only if you genuinely
need a `suite.py` plugin (multiple subprocesses, custom merge logic,
similar to Suite C/E). -->

- [ ] Standard scenario dispatch is enough — no `suite.py` needed.
- [ ] A `suite.py` plugin is required. Reason:

## Reference result plan

<!-- New suites do not appear on the main leaderboard until at least one
verified reference result is submitted. -->

- Reference hardware: <e.g. NVIDIA A100-SXM4-80GB ×1>
- Runner: `<runner_id>`
- Who will run it: <@your-handle / vendor / community member>

## Open questions

<!-- Anything you'd like community / maintainer feedback on before opening
the PR. -->
6 changes: 5 additions & 1 deletion .github/workflows/generate_leaderboard.yml
@@ -11,8 +11,9 @@ on:
paths:
- 'results/**'
- 'leaderboard/**'
- 'suites/**'
- 'schema/**'
- 'tools/generate_platforms_matrix.py'
- 'schema/platforms.json'
- 'runners/*/meta.json'

# Allow manual trigger from Actions tab (useful for first deploy or to
@@ -37,6 +38,9 @@ jobs:
- name: Validate all runner meta.json files and hashes
run: python runners/validate_runners.py

- name: Validate all suite definitions
run: python runners/validate_suites.py

generate:
name: Generate and deploy leaderboard
runs-on: ubuntu-latest
71 changes: 69 additions & 2 deletions .github/workflows/validate_pr.yml
@@ -8,7 +8,8 @@ on:
paths:
- 'results/**'
- 'runners/**'
- 'schema/platforms.json'
- 'suites/**'
- 'schema/**'
- 'tools/generate_platforms_matrix.py'
- 'README.md'
- 'leaderboard/site/**'
@@ -89,6 +90,29 @@ jobs:
python tools/generate_platforms_matrix.py --check
echo "::endgroup::"

validate-suites:
name: Validate suite definitions
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: pip

- name: Install dependencies
run: pip install jsonschema

# Always validate every suite (and re-validate on schema changes too).
# This catches drift introduced by shared changes — e.g. a
# suite.schema.json edit that breaks an unrelated existing suite.
- name: Validate all suite folders (drift check)
run: |
echo "::group::Validating every suite folder in the repo"
python runners/validate_suites.py
echo "::endgroup::"

validate:
name: Validate result submissions
runs-on: ubuntu-latest
@@ -225,4 +249,47 @@ jobs:
# extra files to leaderboard/site/test/ to widen coverage; the
# glob below picks them up automatically.
- name: Run leaderboard frontend tests
run: node --test leaderboard/site/test/*.test.mjs

python-tests:
name: Python unit tests (serve + skill)
runs-on: ubuntu-latest
# Lightweight checks for the FastAPI serve layer and the OpenClaw skill
# entry point. No GPU, no real model — everything is mocked. Tests are
# opt-in per package so missing deps in one folder don't take the rest
# of the suite down with them.
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: pip

- name: Install test dependencies
# numpy is pulled in transitively by loadgen (imported when serve.server
# touches runners.benchmark_runner). Keep this list lean — these are the
# only packages required to *collect and run* the unit tests; no torch,
# no vendor SDKs, no real runner.
run: |
pip install --quiet pytest pydantic fastapi httpx pyyaml jsonschema numpy

- name: Run serve unit tests
run: |
if [ -d serve/tests ]; then
echo "::group::pytest serve/tests"
python -m pytest serve/tests -q --no-header --color=no
echo "::endgroup::"
else
echo "serve/tests/ not present — skipping."
fi

- name: Run OpenClaw skill unit tests
run: |
if [ -d openclaw_skill/tests ]; then
echo "::group::pytest openclaw_skill/tests"
python -m pytest openclaw_skill/tests -q --no-header --color=no
echo "::endgroup::"
else
echo "openclaw_skill/tests/ not present — skipping."
fi
9 changes: 9 additions & 0 deletions .gitignore
@@ -12,11 +12,20 @@ env/
# ── Editor / IDE ────────────────────────────────────────────────────────────
.idea/
.vscode/
.cursor/
*.swp
*.swo
*~
*.tmp
.DS_Store
.aider*
.envrc
.direnv/

# ── Node / frontend tooling ─────────────────────────────────────────────────
node_modules/
.eslintcache
npm-debug.log*

# ── Test / lint caches ──────────────────────────────────────────────────────
.pytest_cache/
15 changes: 15 additions & 0 deletions CONTRIBUTING.md
@@ -320,6 +320,21 @@ CI then re-runs the schema validator and the runner-folder integrity check.
When both pass and a contributor reviews the diff, the PR is merged and your
result shows up on the leaderboard on the next site build.

### Optional: preview the leaderboard locally

The static site is generated from `results/` by `leaderboard/generate.py`.
After dropping your result into `results/community/<run_name>/`, you can
preview the final UI before opening the PR:

```bash
python leaderboard/generate.py # writes leaderboard/site/leaderboard.js + api/
python -m http.server -d leaderboard/site 8000 # serve the static site
# open http://localhost:8000
```

Both `leaderboard.js` and `leaderboard/site/api/` are gitignored — the GitHub
Actions workflow regenerates them on every merge to `main`.

### Alternative: open a submission issue (no git required)

If you'd rather not use git, paste your `result.json` into a
60 changes: 48 additions & 12 deletions DEVELOPMENT.md
@@ -32,13 +32,14 @@ AccelMark/
│ ├── loadgen.py ← Shared timing and measurement engine
│ └── types.py ← InferenceResult, SampleRecord
├── suites/
│ ├── suite_A/suite.json + requests.jsonl
│ ├── suite_B/suite.json + requests.jsonl
│ ├── suite_C/suite.json + suite.py + requests.jsonl
│ ├── suite_D/suite.json + requests.jsonl
│ ├── suite_E/suite.json + suite.py + requests.jsonl
│ ├── suite_F/suite.json + requests.jsonl
│ └── suite_G/suite.json + requests.jsonl
│ ├── suite_A/suite.json
│ ├── suite_B/suite.json
│ ├── suite_C/suite.json + suite.py ← suite.py is optional; only C and E ship one
│ ├── suite_D/suite.json
│ ├── suite_E/suite.json + suite.py
│ ├── suite_F/suite.json
│ └── suite_G/suite.json
│ (request data lives in datasets/, referenced by "dataset" in suite.json)
├── datasets/
│ ├── sharegpt_standard_v1/requests.jsonl ← 500 prompts, ~280/310 tok
│ ├── sharegpt_longctx_v1/requests.jsonl ← 200 prompts, ~28K input tok (Suite D)
@@ -554,12 +555,15 @@ descriptions and distributions.
If you need a custom distribution:

1. Create `datasets/{your_dataset}_v1/requests.jsonl`
2. Create `datasets/{your_dataset}_v1/README.md`
2. Create `datasets/{your_dataset}_v1/README.md` (must document source +
upstream license — see `datasets/README.md`)
3. Set `"dataset": "{your_dataset}_v1"` in your suite.json

If your suite needs a custom dataset only used by that suite, you can
also place `requests.jsonl` directly in `suites/suite_X/` — the
benchmark runner checks there as a fallback.
The `dataset` field is **required** — `BenchmarkRunner._resolve_requests_path`
loads `datasets/<name>/requests.jsonl` and raises `FileNotFoundError` if it
cannot find the file. Earlier versions allowed putting `requests.jsonl`
directly under `suites/suite_X/`; that fallback has been removed in favor
of the immutable, versioned `datasets/` layout.
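
As a rough illustration of the lookup described above, the sketch below resolves a suite's `dataset` field to `datasets/<name>/requests.jsonl` and fails loudly when the file is missing. The function name `resolve_requests_path`, its arguments, and the `ValueError` for a missing field are assumptions made for this example; the real `BenchmarkRunner._resolve_requests_path` may differ in signature and error details.

```python
from pathlib import Path


def resolve_requests_path(repo_root: Path, suite_config: dict) -> Path:
    """Hypothetical stand-in for BenchmarkRunner._resolve_requests_path."""
    # The "dataset" field is required; there is no suites/suite_X/ fallback.
    # Raising ValueError here is an assumption of this sketch.
    dataset = suite_config.get("dataset")
    if not dataset:
        raise ValueError('suite.json must set a "dataset" field')

    # Datasets live under the versioned datasets/ layout.
    path = repo_root / "datasets" / dataset / "requests.jsonl"
    if not path.is_file():
        raise FileNotFoundError(f"dataset file not found: {path}")
    return path
```

Calling it with `{"dataset": "sharegpt_standard_v1"}` from the repository root would return the path to that dataset's `requests.jsonl`, provided the file exists.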

Dataset format (one JSON object per line):
```json
```
@@ -622,6 +626,38 @@ not shown on the main leaderboard.

---

## Adding a new scenario type

If you need a scenario name that none of `accuracy / offline / online /
interactive / sustained / speculative / burst` covers, you can register
one without forking the dispatch logic:

1. Open `runners/benchmark_runner.py` and add a row to
`_SCENARIO_REGISTRY` near the top of the file:

```python
"your_scenario": ScenarioSpec(
name="your_scenario",
inference_kind="streaming", # or "offline"
needs_streaming=True, # require SUPPORTS_STREAMING?
use_async=True, # passed to load_model()
merge_key="your_scenario", # None = no-merge (e.g. accuracy)
),
```

2. If the scenario needs special LoadGen behaviour (e.g. like `sustained`),
add a branch under "Run benchmark" inside `_run_single_scenario`.

3. List the new scenario name in your suite's
`scenarios.{default,extra}` array — the merge order is derived from
the registry automatically.

Without a registry entry the base class falls back to a streaming
inference path with `merge_key = <scenario>`. Register an entry whenever
you want the scenario to be treated differently (offline, no merge, etc.).
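
The registry-plus-fallback behaviour can be pictured with the minimal sketch below. The `ScenarioSpec` fields mirror the snippet in step 1, but the dataclass definition, the `lookup_scenario` helper, the `accuracy` entry's non-merge fields, and the fallback's `needs_streaming` / `use_async` values are assumptions for illustration, not the actual contents of `runners/benchmark_runner.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScenarioSpec:
    name: str
    inference_kind: str        # "streaming" or "offline"
    needs_streaming: bool      # require SUPPORTS_STREAMING?
    use_async: bool            # passed to load_model()
    merge_key: Optional[str]   # None = result is not merged


# Illustrative entries only; the real registry's values may differ.
_SCENARIO_REGISTRY = {
    "accuracy": ScenarioSpec("accuracy", "offline", False, False, None),
    "your_scenario": ScenarioSpec(
        "your_scenario", "streaming", True, True, "your_scenario"
    ),
}


def lookup_scenario(name: str) -> ScenarioSpec:
    """Hypothetical helper: return the registered spec or the default."""
    spec = _SCENARIO_REGISTRY.get(name)
    if spec is not None:
        return spec
    # Unregistered names fall back to a streaming path with
    # merge_key equal to the scenario name.
    return ScenarioSpec(name, "streaming", True, True, name)
```

Under these assumptions, `lookup_scenario("accuracy").merge_key` is `None`, while an unregistered name such as `"warmup"` gets a streaming spec whose `merge_key` is `"warmup"`.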

---

## Suite plugin system

Suites with custom orchestration logic (multiple subprocesses, special
@@ -1098,6 +1134,6 @@ python runners/validate_submission.py --dir /tmp/accelmark_test/
## Questions and Support

- **Bug in LoadGen or schema:** Open a GitHub Issue
- **New suite proposal:** Open a GitHub Issue with the "Request new suite" template
- **New suite proposal:** Open a GitHub Issue with the [**Propose a new suite**](https://github.com/JuhaoLiang1997/AccelMark/issues/new?template=new_suite.md) template
- **New platform support:** Open a PR with a working platform script and at least one verified result
- **Leaderboard question:** Check `leaderboard/generate.py` — it's well-commented
71 changes: 71 additions & 0 deletions NOTICE
@@ -0,0 +1,71 @@
AccelMark
Copyright 2024-2026 Juhao Liang and The AccelMark Contributors

This product includes software developed as part of the AccelMark project
(https://github.com/JuhaoLiang1997/AccelMark).

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

================================================================================
Third-party bundled data
================================================================================

The AccelMark source tree includes a small amount of third-party data so that
benchmark runs are fully reproducible without network access. Each bundled
dataset retains its upstream license; the Apache 2.0 license above covers only
the AccelMark code, schemas, and configuration around it.

--------------------------------------------------------------------------------
1. datasets/sharegpt_standard_v1/requests.jsonl (500 prompts)
datasets/sharegpt_edge_v1/requests.jsonl (500 prompts)
datasets/sharegpt_longctx_v1/requests.jsonl (200 prompts)
--------------------------------------------------------------------------------

Derived from the ShareGPT GPT-4 conversational dataset curated by:

shibing624/sharegpt_gpt4
https://huggingface.co/datasets/shibing624/sharegpt_gpt4
License: CC BY 4.0
(https://creativecommons.org/licenses/by/4.0/)

The upstream corpus was assembled from publicly shared ChatGPT/GPT-4
conversations. AccelMark's variants are filtered subsets used as fixed
benchmark inputs; no derivation is intended as the authoritative copy.

Attribution: shibing624/sharegpt_gpt4 contributors, distributed under CC BY 4.0.

See datasets/<name>/README.md for the per-subset filtering criteria and
token statistics.

--------------------------------------------------------------------------------
2. schema/accuracy_subset.jsonl (100 multiple-choice items)
--------------------------------------------------------------------------------

A 100-question subset of MMLU (Massive Multitask Language Understanding):

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D.,
& Steinhardt, J. (2021). "Measuring Massive Multitask Language
Understanding." International Conference on Learning Representations.
https://arxiv.org/abs/2009.03300
https://github.com/hendrycks/test

License: MIT
(https://opensource.org/licenses/MIT)

AccelMark uses this subset purely as an accuracy gate (model-quality
sanity check) — it is NOT a measurement of MMLU performance. The subset
is immutable; see CONTRIBUTING.md "A few rules".

================================================================================
Third-party software dependencies
================================================================================

AccelMark's Python runtime dependencies (jsonschema, numpy, pyyaml, …) and
the framework backends invoked by each runner (vLLM, SGLang, mlx-lm,
vllm-ascend, vllm-rocm, vllm-tpu, vllm-musa, …) retain their own licenses.
See each runner's requirements.txt for pinned versions; see the upstream
projects for the corresponding license terms.