
speed-bench: add M5 Max 128GB q2-q4-imatrix curve (addresses #226) #255

Open
kenahrens wants to merge 1 commit into antirez:main from kenahrens:m5max-q2q4-bench

@kenahrens

Bench data for the q2-q4-imatrix mixed Flash quant on M5 Max 128GB (macOS 26.4.1), 14 frontiers from 2048 to 200000 tokens. Adds speed-bench/m5_max_q2q4_imatrix.csv + the auto-generated _ts.svg.

Addresses #226 (q2 vs q2-q4-imatrix benchmark request — was unanswered) and fills the M5 Max coverage gap between the 65K point in #97 and the 256K point in #143.

Run on build ad0209f with the Metal 4 tensor API path (#15) and the decode-indexer top-k optimization (#169) enabled:

./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 200000 --step-incr 16384 --gen-tokens 128
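
The "14 frontiers" figure can be cross-checked against the flags above. The sketch below is a hypothetical reconstruction of the sweep schedule, not ds4-bench's actual code: it assumes the tool steps from `--ctx-start` by `--step-incr` and adds a final point at `--ctx-max`, which is inferred only from the 198656 and 200000 rows reported later in this thread.

```python
def frontiers(ctx_start, ctx_max, step_incr):
    """Assumed schedule: fixed steps from ctx_start, plus a final point at ctx_max."""
    points = list(range(ctx_start, ctx_max + 1, step_incr))
    if points[-1] != ctx_max:
        # Assumption: the last frontier is clamped to --ctx-max.
        points.append(ctx_max)
    return points


sweep = frontiers(2048, 200000, 16384)
print(len(sweep), sweep[:3], sweep[-2:])
```

Under that assumption the rows labelled 2K, 32K and 65K in the comparison correspond to the 2048, 34816 and 67584 frontiers.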

Comparison vs M5 Max q2-imatrix from #97 (same hardware tier):

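| ctx | q2 decode (#97), t/s | q2-q4 decode (this PR), t/s | Δ |
|---|---|---|---|
| 2K | 31.48 | 34.42 | +9.3% |
| 32K | 28.93 | 27.75 | -4.1% |
| 65K | 26.97 | 25.75 | -4.5% |

| ctx | q2 prefill (#97), t/s | q2-q4 prefill (this PR), t/s | Δ |
|---|---|---|---|
| 2K | 372.15 | 413.85 | +11.2% |
| 32K | 287.67 | 374.49 | +30.2% |
| 65K | 244.73 | 298.66 | +22.0% |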
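
For anyone re-deriving the Δ column, it is the relative change of the q2-q4 figure against the q2 baseline. A minimal sketch using only the numbers quoted above (the helper name `pct_delta` is illustrative, not part of the repo):

```python
def pct_delta(baseline, candidate):
    """Relative change of candidate vs baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (q2 from #97, q2-q4 from this PR), tokens per second
decode = {"2K": (31.48, 34.42), "32K": (28.93, 27.75), "65K": (26.97, 25.75)}
prefill = {"2K": (372.15, 413.85), "32K": (287.67, 374.49), "65K": (244.73, 298.66)}

for name, rows in (("decode", decode), ("prefill", prefill)):
    for ctx, (q2, q2q4) in rows.items():
        print(f"{name:8s}{ctx:>4s}  {pct_delta(q2, q2q4):+.1f}%")
```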

q2-q4 is faster than q2 at every compared context for prefill (Q4 last-6-layers + Metal 4 tensor path), and faster for decode below ~16K, with a small drop (~4%) at 32K and above, where the extra Q4 bandwidth cost shows up. The headline finding: the mixed quant is a net win for interactive use, and the long-context decode cost is much smaller than the README's suggestion would imply.

KV cache observed at ~13.4 KB/token marginal, 2.78 GB at the 200K frontier. (Worth noting: this is the bench's reported kvcache_bytes column, which is much higher than what #164 saw via RSS on M4 Max. The two figures come from different measurement windows.)
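
The 2.78 GB figure can be checked against the raw byte count. The sketch below uses the final-row kvcache_bytes value posted in the reproduction later in this thread (2776775308 at 200000 tokens) and assumes decimal GB. It computes the average bytes per token, which is not the same quantity as the marginal (slope between frontiers) figure quoted above; the two would differ if part of the cache is a fixed allocation, which is an assumption here rather than something the thread confirms.

```python
KV_BYTES_AT_200K = 2_776_775_308  # kvcache_bytes in the final row of the reproduction
CTX_TOKENS = 200_000

total_gb = KV_BYTES_AT_200K / 1e9                 # decimal gigabytes
avg_bytes_per_token = KV_BYTES_AT_200K / CTX_TOKENS

print(f"total: {total_gb:.2f} GB")
print(f"average: {avg_bytes_per_token:.0f} bytes/token")
```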

Bench data for the q2-q4-imatrix mixed Flash quant (last 6 expert
layers Q4K, rest IQ2XXS) on M5 Max 128GB, macOS 26.4.1.

Fills the unanswered request in antirez#226 for q2-q4-imatrix benchmark
numbers, and extends published M5 Max coverage past the 65K point
from antirez#97 into the 100K-200K range.

Command: ds4-bench -m ds4flash.gguf --prompt-file
speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 200000
--step-incr 16384 --gen-tokens 128

Build: ad0209f (Metal 4 tensor API + decode-indexer top-k path
from antirez#169 enabled).

Highlights vs M5 Max q2-imatrix from antirez#97 (same hardware tier):
- 2K decode: 34.4 t/s (vs 31.5 t/s, +9%)
- 2K prefill: 413.9 t/s (vs 372.2 t/s, +11%)
- 32K decode: 27.8 t/s (vs 28.9 t/s, -4%)
- 65K decode: 25.8 t/s (vs 27.0 t/s, -4%)

q2-q4 is faster than q2 at low ctx (Q4 layers + Metal 4 win) and
~4% slower above 32K (more bandwidth-bound). Closes antirez#226 with data.
@nikolai-vysotskyi

I tested it on an M5 Max with 128GB RAM and got about the same results, so I think it is worth adding. If needed, I can run more tests and share benchmarks from my side.

@STRML

STRML commented May 27, 2026

Quality eval to pair with the speed numbers

Ran the repo's existing gguf-tools/quality-testing/score_official on 100 official DeepSeek-Flash continuations (24 tokens each, ctx=4096) for both quants on the same M5 Max box used in this PR.

| Metric | q2-imatrix | q2-q4-imatrix | Δ |
|---|---|---|---|
| avg NLL | 0.3726 | 0.3460 | −7.1% |
| first-token match | 63/100 | 68/100 | +5 |
| greedy LCP (avg tokens) | 6.08 | 7.08 | +1.00 |
| per-case wins (q2 / q2-q4 / ties) | 31 | 69 | 0 ties |

q2-q4-imatrix is the better average fit on every aggregate metric and wins 69 of 100 individual prompts. The −7.1% NLL is on the same target token sequence, so it's a strict fit improvement, not a sampling artifact.

Caveat: largest single-case deltas favor q2 (cases 35, 7, 42 by 1.1–2.9 NLL). Per-case variance is higher with n=100 than aggregate stats suggest.

Combined with the prefill+decode numbers already in this PR, q2-q4-imatrix shows tighter agreement with the official model on next-token distributions at this sample size. Worth re-running at n≥500 before drawing strong conclusions — but the direction is consistent with the speed result.
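
One rough way to gauge how much weight the 69-to-31 per-case split can carry at n=100 is an exact two-sided sign test. This is a sketch, not part of the repo's quality tooling; it assumes the 100 prompts are independent and ignores the size of each per-case difference, so it says nothing about the large outliers noted in the caveat. It also re-derives the NLL delta from the table.

```python
from math import comb


def sign_test_two_sided(wins, losses):
    """Exact two-sided binomial sign test under H0: each side wins with p = 0.5."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)


p_value = sign_test_two_sided(69, 31)             # q2-q4 wins vs q2 wins, 0 ties
nll_delta_pct = (0.3460 - 0.3726) / 0.3726 * 100  # avg NLL, q2-q4 vs q2

print(f"sign test p = {p_value:.2e}")
print(f"avg NLL delta = {nll_delta_pct:+.1f}%")
```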

Reproduce:

make -C gguf-tools quality-score
python3 gguf-tools/quality-testing/collect_official.py \
  --prompts gguf-tools/quality-testing/prompts.jsonl \
  --out gguf-tools/quality-testing/data/flash --count 100 --max-tokens 24
./gguf-tools/quality-testing/score_official ./q2-imatrix.gguf gguf-tools/quality-testing/data/flash/manifest.tsv /tmp/q2.tsv 4096
./gguf-tools/quality-testing/score_official ./q2-q4-imatrix.gguf gguf-tools/quality-testing/data/flash/manifest.tsv /tmp/q2q4.tsv 4096
python3 gguf-tools/quality-testing/compare_scores.py /tmp/q2.tsv /tmp/q2q4.tsv

@nhwaani

nhwaani commented May 28, 2026

cc @antirez — adding an independent M5 Max 128GB reproduction plus the practical q2 → q2-q4 speed-gain summary for review.

Value summary

From this PR's q2 vs q2-q4 comparison on M5 Max 128GB:

Prefill / prompt ingestion

This is the main win, especially for coding-agent and long-context use where ds4 needs to ingest repo context, tool history, or large prompts.

| ctx | q2 prefill, t/s | q2-q4 prefill, t/s | gain |
|---|---|---|---|
| 2K | 372.15 | 413.85 | +11.2% |
| 32K | 287.67 | 374.49 | +30.2% |
| 65K | 244.73 | 298.66 | +22.0% |

Decode / generation

Decode is mixed: faster at short context, slightly slower at longer context.

| ctx | q2 decode, t/s | q2-q4 decode, t/s | gain |
|---|---|---|---|
| 2K | 31.48 | 34.42 | +9.3% |
| 32K | 28.93 | 27.75 | -4.1% |
| 65K | 26.97 | 25.75 | -4.5% |

So the practical result is: q2-q4 looks like a net win for interactive use on M5 Max 128GB — much faster prefill, faster short-context decode, and only a small long-context decode penalty.

Independent reproduction

Setup:

  • MacBook Pro M5 Max, 128GB
  • macOS 26.5
  • Current upstream main: 072bc0f (Revert "Merge PR #264: Add wide-token MoE prefill tiles")
  • Model: DeepSeek-V4-Flash-Layers37-42Q4KExperts-OtherExpertLayersIQ2XXSGateUp-Q2KDown-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-fixed.gguf
  • Command shape:
./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 200000 \
  --step-incr 16384 \
  --gen-tokens 128 \
  --warm-weights \
  --csv /tmp/m5max_q2q4_imatrix_current_main.csv

Result: I can reproduce the PR's curve. Decode is very close across the sweep, and KV bytes match exactly. Prefill differs more in the early/mid rows, likely thermal/run-to-run variance, but converges closely at long context.

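| ctx | PR prefill, t/s | my prefill, t/s | prefill Δ | PR gen, t/s | my gen, t/s | gen Δ |
|---|---|---|---|---|---|---|
| 2048 | 413.85 | 456.35 | +10.3% | 34.42 | 35.93 | +4.4% |
| 18432 | 405.31 | 383.86 | -5.3% | 28.42 | 29.21 | +2.8% |
| 34816 | 374.49 | 340.16 | -9.2% | 27.75 | 27.64 | -0.4% |
| 67584 | 298.66 | 289.28 | -3.1% | 25.75 | 26.08 | +1.3% |
| 100352 | 248.99 | 247.81 | -0.5% | 24.36 | 24.26 | -0.4% |
| 133120 | 215.12 | 212.35 | -1.3% | 22.37 | 22.39 | +0.1% |
| 165888 | 187.32 | 186.65 | -0.4% | 20.72 | 21.02 | +1.4% |
| 198656 | 165.14 | 166.92 | +1.1% | 19.54 | 20.05 | +2.6% |
| 200000 | 157.02 | 157.40 | +0.2% | 19.37 | 19.97 | +3.1% |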

My final row:

200000,1344,157.40,128,19.97,2776775308
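
The row above is posted without a header, so here is one possible reading of it. Only `kvcache_bytes` is named in this thread; the other column names in this sketch are guesses, made by matching values to the table (157.40 and 19.97 are the prefill and gen speeds at 200000, 128 matches `--gen-tokens`, and 1344 equals 200000 − 198656, the size of the last step).

```python
import csv
import io

# Hypothetical column names; only "kvcache_bytes" appears in the PR text.
COLUMNS = ["ctx_tokens", "prefill_tokens", "prefill_tps",
           "gen_tokens", "gen_tps", "kvcache_bytes"]

row_text = "200000,1344,157.40,128,19.97,2776775308"
values = next(csv.reader(io.StringIO(row_text)))
row = dict(zip(COLUMNS, values))

ctx = int(row["ctx_tokens"])
step_tokens = int(row["prefill_tokens"])
kv_bytes = int(row["kvcache_bytes"])

print(f"ctx={ctx} prefill={float(row['prefill_tps']):.2f} t/s "
      f"gen={float(row['gen_tps']):.2f} t/s kv={kv_bytes / 1e9:.2f} GB")
print("last step matches 200000 - 198656:", step_tokens == ctx - 198656)
```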

So from a second M5 Max 128GB run: this PR's q2-q4-imatrix speed curve looks reproducible and worth merging.
