FreedomIntelligence · JuhaoLiang1997 · May 19, 2026
diff --git a/.github/ISSUE_TEMPLATE/new_suite.md b/.github/ISSUE_TEMPLATE/new_suite.md
@@ -0,0 +1,75 @@
+---
+name: Propose a new suite
+about: Propose a new benchmark suite (new model, scenario mix, or scaling axis)
+title: "[Suite] <short description, e.g. 'Suite H — Llama-3.1-405B'>"
+labels: suite-proposal
+assignees: ''
+---
+
+<!--
+  This template starts the discussion for a new AccelMark suite. The final
+  contract goes into suites/<suite_id>/suite.json (see
+  schema/suite.schema.json) — please fill in as many of the fields below as
+  you can. Anything you leave blank we'll work out in the thread before
+  merging.
+
+  Full walk-through: DEVELOPMENT.md "Adding a new suite"
+                     https://github.com/JuhaoLiang1997/AccelMark/blob/main/DEVELOPMENT.md
+-->
+
+## Why this suite?
+
+<!-- One sentence: the question this suite answers that no existing suite
+     (A–G) covers. Example: "How fast is this chip on 405B-parameter
+     dense models?" -->
+
+## Suite contract (draft)
+
+| Field | Proposed value |
+|---|---|
+| **Suite ID** | `suite_<X>` |
+| **Model** | `<huggingface/repo-id>` |
+| **Model revision** | `<commit sha or tag>` |
+| **Chip count** | `1` / `auto` / specific number |
+| **Precision** | `BF16` / `FP16` / list of allowed precisions |
+| **Dataset** | existing (`sharegpt_standard_v1`, `sharegpt_edge_v1`, `sharegpt_longctx_v1`) or new |
+| **Max model length** | tokens |
+| **Output tokens (max)** | tokens |
+| **Concurrency levels** | e.g. `[8, 32, 128]` |
+| **Default scenarios** | subset of `accuracy / offline / online / interactive / sustained` |
+| **Extra scenarios** | optional: `sustained / speculative / burst / …` |
+| **Primary metric** | `offline_throughput`, `max_valid_qps`, … |
+| **Expected run time on A100** | minutes |
+
+## Accuracy baseline
+
+<!-- Required before the suite can land on the main leaderboard. -->
+
+- [ ] I will provide an A100 (or equivalent reference) BF16 baseline score
+      to add to `schema/accuracy_baselines.json`.
+- [ ] If a new dataset is required, I will submit it under
+      `datasets/<name>_v1/` with a `README.md` that documents the source
+      and upstream license (see [`datasets/README.md`](../../datasets/README.md)).
+
+## Custom orchestration?
+
+<!-- Most suites only need `suite.json`. Mark these only if you genuinely
+     need a `suite.py` plugin (multiple subprocesses, custom merge logic,
+     similar to Suite C/E). -->
+
+- [ ] Standard scenario dispatch is enough — no `suite.py` needed.
+- [ ] A `suite.py` plugin is required. Reason:
+
+## Reference result plan
+
+<!-- New suites do not appear on the main leaderboard until at least one
+     verified reference result is submitted. -->
+
+- Reference hardware: <e.g. NVIDIA A100-SXM4-80GB ×1>
+- Runner: `<runner_id>`
+- Who will run it: <@your-handle / vendor / community member>
+
+## Open questions
+
+<!-- Anything you'd like community / maintainer feedback on before opening
+     the PR. -->
diff --git a/.github/workflows/generate_leaderboard.yml b/.github/workflows/generate_leaderboard.yml
@@ -11,8 +11,9 @@ on:
     paths:
       - 'results/**'
       - 'leaderboard/**'
+      - 'suites/**'
+      - 'schema/**'
       - 'tools/generate_platforms_matrix.py'
-      - 'schema/platforms.json'
       - 'runners/*/meta.json'
 
   # Allow manual trigger from Actions tab (useful for first deploy or to
@@ -37,6 +38,9 @@ jobs:
       - name: Validate all runner meta.json files and hashes
         run: python runners/validate_runners.py
 
+      - name: Validate all suite definitions
+        run: python runners/validate_suites.py
+
   generate:
     name: Generate and deploy leaderboard
     runs-on: ubuntu-latest

diff --git a/.github/workflows/validate_pr.yml b/.github/workflows/validate_pr.yml
@@ -8,7 +8,8 @@ on:
     paths:
       - 'results/**'
       - 'runners/**'
-      - 'schema/platforms.json'
+      - 'suites/**'
+      - 'schema/**'
       - 'tools/generate_platforms_matrix.py'
       - 'README.md'
       - 'leaderboard/site/**'
@@ -89,6 +90,29 @@ jobs:
           python tools/generate_platforms_matrix.py --check
           echo "::endgroup::"
 
+  validate-suites:
+    name: Validate suite definitions
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+
+      - name: Install dependencies
+        run: pip install jsonschema
+
+      # Always validate every suite (and re-validate on schema changes too).
+      # This catches drift introduced by shared changes — e.g. a
+      # suite.schema.json edit that breaks an unrelated existing suite.
+      - name: Validate all suite folders (drift check)
+        run: |
+          echo "::group::Validating every suite folder in the repo"
+          python runners/validate_suites.py
+          echo "::endgroup::"
+
   validate:
     name: Validate result submissions
     runs-on: ubuntu-latest
@@ -225,4 +249,47 @@ jobs:
       # extra files to leaderboard/site/test/ to widen coverage; the
       # glob below picks them up automatically.
       - name: Run leaderboard frontend tests
-        run: node --test leaderboard/site/test/*.test.mjs
+        run: node --test leaderboard/site/test/*.test.mjs
+
+  python-tests:
+    name: Python unit tests (serve + skill)
+    runs-on: ubuntu-latest
+    # Lightweight checks for the FastAPI serve layer and the OpenClaw skill
+    # entry point. No GPU, no real model — everything is mocked. Tests are
+    # opt-in per package so missing deps in one folder don't take the rest
+    # of the suite down with them.
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+
+      - name: Install test dependencies
+        # numpy is pulled in transitively by loadgen (imported when serve.server
+        # touches runners.benchmark_runner). Keep this list lean — these are the
+        # only packages required to *collect and run* the unit tests; no torch,
+        # no vendor SDKs, no real runner.
+        run: |
+          pip install --quiet pytest pydantic fastapi httpx pyyaml jsonschema numpy
+
+      - name: Run serve unit tests
+        run: |
+          if [ -d serve/tests ]; then
+            echo "::group::pytest serve/tests"
+            python -m pytest serve/tests -q --no-header --color=no
+            echo "::endgroup::"
+          else
+            echo "serve/tests/ not present — skipping."
+          fi
+
+      - name: Run OpenClaw skill unit tests
+        run: |
+          if [ -d openclaw_skill/tests ]; then
+            echo "::group::pytest openclaw_skill/tests"
+            python -m pytest openclaw_skill/tests -q --no-header --color=no
+            echo "::endgroup::"
+          else
+            echo "openclaw_skill/tests/ not present — skipping."
+          fi
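Reviewer note on the `validate-suites` drift check above: `runners/validate_suites.py` is not part of this diff, so the following is only a minimal sketch of the "validate every suite, report every problem" idea. It uses the standard library alone; the real script presumably validates against `schema/suite.schema.json` with `jsonschema`, which the workflow installs. The function name `validate_all_suites` and the `REQUIRED_KEYS` tuple are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Assumption: illustrative keys only. The authoritative contract lives in
# schema/suite.schema.json, which this diff does not show.
REQUIRED_KEYS = ("dataset", "scenarios")


def validate_all_suites(suites_root: Path) -> list[str]:
    """Check every <suites_root>/<suite_id>/suite.json; return all problems found."""
    errors: list[str] = []
    for suite_dir in sorted(p for p in suites_root.iterdir() if p.is_dir()):
        suite_file = suite_dir / "suite.json"
        if not suite_file.is_file():
            errors.append(f"{suite_dir.name}: missing suite.json")
            continue
        try:
            suite = json.loads(suite_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            errors.append(f"{suite_dir.name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(suite, dict):
            errors.append(f"{suite_dir.name}: suite.json must be a JSON object")
            continue
        for key in REQUIRED_KEYS:
            if key not in suite:
                errors.append(f"{suite_dir.name}: missing required key '{key}'")
    return errors


if __name__ == "__main__":
    # Build a throwaway suites/ tree with one valid and one broken suite.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "suite_A").mkdir()
        (root / "suite_A" / "suite.json").write_text(
            json.dumps({"dataset": "sharegpt_standard_v1",
                        "scenarios": {"default": ["offline"]}})
        )
        (root / "suite_B").mkdir()
        (root / "suite_B" / "suite.json").write_text(json.dumps({"scenarios": {}}))
        for problem in validate_all_suites(root):
            print(problem)
```

Collecting errors instead of stopping at the first one matches the intent of the drift check: a single schema edit can break several unrelated suites, and one CI run should list all of them.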
diff --git a/.gitignore b/.gitignore
@@ -12,11 +12,20 @@ env/
 # ── Editor / IDE ────────────────────────────────────────────────────────────
 .idea/
 .vscode/
+.cursor/
 *.swp
 *.swo
 *~
 *.tmp
 .DS_Store
+.aider*
+.envrc
+.direnv/
+
+# ── Node / frontend tooling ─────────────────────────────────────────────────
+node_modules/
+.eslintcache
+npm-debug.log*
 
 # ── Test / lint caches ──────────────────────────────────────────────────────
 .pytest_cache/

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -320,6 +320,21 @@ CI then re-runs the schema validator and the runner-folder integrity check.
 When both pass and a contributor reviews the diff, the PR is merged and your
 result shows up on the leaderboard on the next site build.
 
+### Optional: preview the leaderboard locally
+
+The static site is generated from `results/` by `leaderboard/generate.py`.
+After dropping your result into `results/community/<run_name>/`, you can
+preview the final UI before opening the PR:
+
+```bash
+python leaderboard/generate.py                       # writes leaderboard/site/leaderboard.js + api/
+python -m http.server -d leaderboard/site 8000       # serve the static site
+# open http://localhost:8000
+```
+
+Both `leaderboard/site/leaderboard.js` and `leaderboard/site/api/` are
+gitignored; the GitHub Actions workflow regenerates them on every merge to `main`.
+
 ### Alternative: open a submission issue (no git required)
 
 If you'd rather not use git, paste your `result.json` into a

diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
@@ -32,13 +32,14 @@ AccelMark/
 │   ├── loadgen.py          ← Shared timing and measurement engine
 │   └── types.py            ← InferenceResult, SampleRecord
 ├── suites/
-│   ├── suite_A/suite.json + requests.jsonl
-│   ├── suite_B/suite.json + requests.jsonl
-│   ├── suite_C/suite.json + suite.py + requests.jsonl
-│   ├── suite_D/suite.json + requests.jsonl
-│   ├── suite_E/suite.json + suite.py + requests.jsonl
-│   ├── suite_F/suite.json + requests.jsonl
-│   └── suite_G/suite.json + requests.jsonl
+│   ├── suite_A/suite.json
+│   ├── suite_B/suite.json
+│   ├── suite_C/suite.json + suite.py     ← suite.py is optional; only C and E ship one
+│   ├── suite_D/suite.json
+│   ├── suite_E/suite.json + suite.py
+│   ├── suite_F/suite.json
+│   └── suite_G/suite.json
+│   (request data lives in datasets/, referenced by "dataset" in suite.json)
 ├── datasets/
 │   ├── sharegpt_standard_v1/requests.jsonl  ← 500 prompts, ~280/310 tok
 │   ├── sharegpt_longctx_v1/requests.jsonl   ← 200 prompts, ~28K input tok (Suite D)
@@ -554,12 +555,15 @@ descriptions and distributions.
 If you need a custom distribution:
 
 1. Create `datasets/{your_dataset}_v1/requests.jsonl`
-2. Create `datasets/{your_dataset}_v1/README.md`
+2. Create `datasets/{your_dataset}_v1/README.md` (must document source +
+   upstream license — see `datasets/README.md`)
 3. Set `"dataset": "{your_dataset}_v1"` in your suite.json
 
-If your suite needs a custom dataset only used by that suite, you can
-also place `requests.jsonl` directly in `suites/suite_X/` — the
-benchmark runner checks there as a fallback.
+The `dataset` field is **required** — `BenchmarkRunner._resolve_requests_path`
+loads `datasets/<name>/requests.jsonl` and raises `FileNotFoundError` if it
+cannot find the file. Earlier versions allowed putting `requests.jsonl`
+directly under `suites/suite_X/`; that fallback has been removed in favor
+of the immutable, versioned `datasets/` layout.
 
 Dataset format (one JSON object per line):
 ```json
@@ -622,6 +626,38 @@ not shown on the main leaderboard.
 
 ---
 
+## Adding a new scenario type
+
+If you need a scenario name that none of `accuracy / offline / online /
+interactive / sustained / speculative / burst` covers, you can register
+one without forking the dispatch logic:
+
+1. Open `runners/benchmark_runner.py` and add a row to
+   `_SCENARIO_REGISTRY` near the top of the file:
+
+   ```python
+   "your_scenario": ScenarioSpec(
+       name="your_scenario",
+       inference_kind="streaming",   # or "offline"
+       needs_streaming=True,         # require SUPPORTS_STREAMING?
+       use_async=True,               # passed to load_model()
+       merge_key="your_scenario",    # None = no-merge (e.g. accuracy)
+   ),
+   ```
+
+2. If the scenario needs special LoadGen behaviour (as `sustained` does),
+   add a branch under "Run benchmark" inside `_run_single_scenario`.
+
+3. List the new scenario name in your suite's
+   `scenarios.{default,extra}` array — the merge order is derived from
+   the registry automatically.
+
+Without a registry entry the base class falls back to a streaming
+inference path with `merge_key = <scenario>`. Register an entry whenever
+you want the scenario to be treated differently (offline, no merge, etc.).
+
+---
+
 ## Suite plugin system
 
 Suites with custom orchestration logic (multiple subprocesses, special
@@ -1098,6 +1134,6 @@ python runners/validate_submission.py --dir /tmp/accelmark_test/
 ## Questions and Support
 
 - **Bug in LoadGen or schema:** Open a GitHub Issue
-- **New suite proposal:** Open a GitHub Issue with the "Request new suite" template
+- **New suite proposal:** Open a GitHub Issue with the [**Propose a new suite**](https://github.com/JuhaoLiang1997/AccelMark/issues/new?template=new_suite.md) template
 - **New platform support:** Open a PR with a working platform script and at least one verified result
 - **Leaderboard question:** Check `leaderboard/generate.py` — it's well-commented
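Reviewer note on the "Adding a new scenario type" section added to DEVELOPMENT.md above: the fallback rule and the registry-derived merge order can be sketched as follows. Only the `ScenarioSpec` field names and the `_SCENARIO_REGISTRY` name come from the diff. The registry contents, the helper names `resolve_scenario` and `merge_order`, the fallback values for `needs_streaming` and `use_async`, and the choice of registry insertion order as the merge order are assumptions made for illustration, not the actual code in `runners/benchmark_runner.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScenarioSpec:
    name: str
    inference_kind: str        # "streaming" or "offline"
    needs_streaming: bool      # require SUPPORTS_STREAMING?
    use_async: bool            # passed to load_model()
    merge_key: Optional[str]   # None = result is not merged (e.g. accuracy)


# Hypothetical registry contents, for illustration only.
_SCENARIO_REGISTRY = {
    "accuracy": ScenarioSpec("accuracy", "offline", False, False, None),
    "offline": ScenarioSpec("offline", "offline", False, False, "offline"),
    "online": ScenarioSpec("online", "streaming", True, True, "online"),
}


def resolve_scenario(name: str) -> ScenarioSpec:
    """Return the registered spec, or the documented streaming fallback."""
    spec = _SCENARIO_REGISTRY.get(name)
    if spec is not None:
        return spec
    # Fallback per the doc: streaming path with merge_key = <scenario>.
    # needs_streaming/use_async values here are assumptions.
    return ScenarioSpec(
        name=name,
        inference_kind="streaming",
        needs_streaming=True,
        use_async=True,
        merge_key=name,
    )


def merge_order(requested: list[str]) -> list[str]:
    """Merge keys for the requested scenarios.

    Assumption: registered scenarios merge in registry insertion order,
    unregistered ones follow in request order, and no-merge scenarios
    (merge_key is None) are skipped.
    """
    wanted = set(requested)
    ordered = [
        spec.merge_key
        for name, spec in _SCENARIO_REGISTRY.items()
        if name in wanted and spec.merge_key is not None
    ]
    ordered += [name for name in requested if name not in _SCENARIO_REGISTRY]
    return ordered


if __name__ == "__main__":
    print(resolve_scenario("my_new_scenario"))
    print(merge_order(["my_new_scenario", "online", "accuracy", "offline"]))
```

Under these assumptions, an unregistered scenario still runs through the streaming path, while registering an entry is what lets a scenario opt into offline inference or opt out of merging.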
diff --git a/NOTICE b/NOTICE
@@ -0,0 +1,71 @@
+AccelMark
+Copyright 2024-2026 Juhao Liang and The AccelMark Contributors
+
+This product includes software developed as part of the AccelMark project
+(https://github.com/JuhaoLiang1997/AccelMark).
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+================================================================================
+Third-party bundled data
+================================================================================
+
+The AccelMark source tree includes a small amount of third-party data so that
+benchmark runs are fully reproducible without network access. Each bundled
+dataset retains its upstream license; the Apache 2.0 license above covers only
+the AccelMark code, schemas, and configuration around it.
+
+--------------------------------------------------------------------------------
+1. datasets/sharegpt_standard_v1/requests.jsonl   (500 prompts)
+   datasets/sharegpt_edge_v1/requests.jsonl       (500 prompts)
+   datasets/sharegpt_longctx_v1/requests.jsonl    (200 prompts)
+--------------------------------------------------------------------------------
+
+  Derived from the ShareGPT GPT-4 conversational dataset curated by:
+
+    shibing624/sharegpt_gpt4
+    https://huggingface.co/datasets/shibing624/sharegpt_gpt4
+    License: CC BY 4.0
+            (https://creativecommons.org/licenses/by/4.0/)
+
+  The upstream corpus was assembled from publicly shared ChatGPT/GPT-4
+  conversations. AccelMark's variants are filtered subsets used as fixed
+  benchmark inputs; none of them is intended as an authoritative copy.
+
+  Attribution: shibing624/sharegpt_gpt4 contributors, distributed under CC BY 4.0.
+
+  See datasets/<name>/README.md for the per-subset filtering criteria and
+  token statistics.
+
+--------------------------------------------------------------------------------
+2. schema/accuracy_subset.jsonl                   (100 multiple-choice items)
+--------------------------------------------------------------------------------
+
+  A 100-question subset of MMLU (Massive Multitask Language Understanding):
+
+    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D.,
+    & Steinhardt, J. (2021). "Measuring Massive Multitask Language
+    Understanding." International Conference on Learning Representations.
+    https://arxiv.org/abs/2009.03300
+    https://github.com/hendrycks/test
+
+    License: MIT
+            (https://opensource.org/licenses/MIT)
+
+  AccelMark uses this subset purely as an accuracy gate (model-quality
+  sanity check) — it is NOT a measurement of MMLU performance. The subset
+  is immutable; see CONTRIBUTING.md "A few rules".
+
+================================================================================
+Third-party software dependencies
+================================================================================
+
+AccelMark's Python runtime dependencies (jsonschema, numpy, pyyaml, …) and
+the framework backends invoked by each runner (vLLM, SGLang, mlx-lm,
+vllm-ascend, vllm-rocm, vllm-tpu, vllm-musa, …) retain their own licenses.
+See each runner's requirements.txt for pinned versions; see the upstream
+projects for the corresponding license terms.