speed-bench: add M5 Max 128GB q2-q4-imatrix curve (addresses #226) by kenahrens · Pull Request #255 · antirez/ds4

kenahrens · 2026-05-26T00:40:55Z

Bench data for the q2-q4-imatrix mixed Flash quant on M5 Max 128GB (macOS 26.4.1), 14 frontiers from 2048 to 200000 tokens. Adds speed-bench/m5_max_q2q4_imatrix.csv + the auto-generated _ts.svg.

Addresses #226 (the q2 vs q2-q4-imatrix benchmark request, which had gone unanswered) and fills the M5 Max coverage gap between the 65K point in #97 and the 256K point in #143.

Run on build ad0209f with the Metal 4 tensor API path (#15) and the decode-indexer top-k optimization (#169) enabled:

./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 200000 --step-incr 16384 --gen-tokens 128
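As a quick sanity check on the "14 frontiers" figure, the sweep implied by these flags can be enumerated. This is a minimal sketch that assumes ds4-bench steps linearly from `--ctx-start` by `--step-incr` and finishes with one measurement at `--ctx-max`; the `frontiers` helper is hypothetical and not part of the tool.

```python
def frontiers(ctx_start: int, ctx_max: int, step_incr: int) -> list[int]:
    """Context sizes visited by a linear sweep that ends exactly at ctx_max."""
    # Every start + k * step that is still below the maximum.
    points = list(range(ctx_start, ctx_max, step_incr))
    # Assumption: the sweep always closes with a measurement at ctx_max itself.
    points.append(ctx_max)
    return points


sweep = frontiers(2048, 200000, 16384)
print(len(sweep), sweep[:3], sweep[-2:])
# prints: 14 [2048, 18432, 34816] [198656, 200000]
```

Under that assumption the last two frontiers are 198656 and 200000, which is why the final step covers only 1344 tokens.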

Comparison vs M5 Max q2-imatrix from #97 (same hardware tier). The 32K and 65K values for this PR come from its nearest frontiers, 34816 and 67584 tokens:

ctx	q2 decode (#97)	q2-q4 decode (this PR)	Δ
2K	31.48	34.42	+9.3%
32K	28.93	27.75	-4.1%
65K	26.97	25.75	-4.5%

ctx	q2 prefill (#97)	q2-q4 prefill (this PR)	Δ
2K	372.15	413.85	+11.2%
32K	287.67	374.49	+30.2%
65K	244.73	298.66	+22.0%

q2-q4 is faster than q2 across the board for prefill (Q4 last-6-layers + Metal 4 tensor path), and faster for decode below ~16K, with a small drop (~4%) at 32K and 65K, where the extra Q4 bandwidth cost shows up. The headline finding: the mixed quant is a net win for interactive use, and the long-context decode cost is much smaller than the README's suggestion would imply.
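The Δ columns above are plain relative changes against the q2 baseline. A minimal sketch that recomputes them from the table values (the `pct_change` helper is illustrative, not something shipped in the repo):

```python
def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


# (q2 from #97, q2-q4 from this PR), tokens per second, copied from the tables.
decode = {"2K": (31.48, 34.42), "32K": (28.93, 27.75), "65K": (26.97, 25.75)}
prefill = {"2K": (372.15, 413.85), "32K": (287.67, 374.49), "65K": (244.73, 298.66)}

for name, table in (("decode", decode), ("prefill", prefill)):
    for ctx, (q2, q2q4) in table.items():
        print(f"{name:8s}{ctx:>4s}  {pct_change(q2, q2q4):+.1f}%")
```

Positive values mean q2-q4 is faster; only the 32K and 65K decode rows come out negative.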

KV cache observed at ~13.4 KB/token marginal — 2.78 GB at the 200K frontier. (Worth noting: this is the bench's reported kvcache_bytes column, which is much higher than what #164 saw via RSS on M4 Max. Different measurement window.)

Bench data for the q2-q4-imatrix mixed Flash quant (last 6 expert layers Q4K, rest IQ2XXS) on M5 Max 128GB, macOS 26.4.1. Fills the unanswered request in antirez#226 for q2-q4-imatrix benchmark numbers, and extends published M5 Max coverage past the 65K point from antirez#97 into the 100K-200K range.

Command:

ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 200000 --step-incr 16384 --gen-tokens 128

Build: ad0209f (Metal 4 tensor API + decode-indexer top-k path from antirez#169 enabled).

Highlights vs M5 Max q2-imatrix from antirez#97 (same hardware tier):

- 2K decode: 34.4 t/s (vs 31.5 t/s, +9%)
- 2K prefill: 413.9 t/s (vs 372.2 t/s, +11%)
- 32K decode: 27.8 t/s (vs 28.9 t/s, -4%)
- 65K decode: 25.8 t/s (vs 27.0 t/s, -4%)

q2-q4 is faster than q2 at low ctx (Q4 layers + Metal 4 win) and ~4% slower in decode at 32K and above (more bandwidth-bound). Closes antirez#226 with data.

nikolai-vysotskyi · 2026-05-26T10:30:59Z

I tested it on an M5 Max with 128GB RAM and got about the same results, so I think it is worth adding. If needed, I can run more tests and share benchmarks from my side.

STRML · 2026-05-27T18:35:36Z

Quality eval to pair with the speed numbers

Ran the repo's existing gguf-tools/quality-testing/score_official on 100 official DeepSeek-Flash continuations (24 tokens each, ctx=4096) for both quants on the same M5 Max box used in this PR.

Metric	q2-imatrix	q2-q4-imatrix	Δ
avg NLL	0.3726	0.3460	−7.1%
first-token match	63/100	68/100	+5
greedy LCP (avg tokens)	6.08	7.08	+1.00
per-case wins	31	69	(0 ties)

q2-q4-imatrix is the better average fit on every aggregate metric and wins 69 of 100 individual prompts. The −7.1% NLL is on the same target token sequence, so it's a strict fit improvement, not a sampling artifact.

Caveat: the largest single-case deltas favor q2 (cases 35, 7, and 42, by 1.1–2.9 NLL). Per-case variance at n=100 is higher than the aggregate stats suggest.

Taken together with the prefill and decode numbers already in this PR, q2-q4-imatrix shows tighter agreement with the official model on next-token distributions at this sample size. It is worth re-running at n≥500 before drawing strong conclusions, but the direction is consistent with the speed result.
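One rough way to gauge how much weight the 69/31 per-case split can bear at n=100 is an exact sign test. This is an added illustration, not something run as part of the eval above: it assumes the 100 prompts are independent and, like the win counts themselves, it ignores the size of each per-case delta, so it does not address the caveat about large single-case swings toward q2.

```python
from math import comb


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial p-value for a paired win/loss split (ties excluded)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction under p = 0.5.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test and cap at 1.
    return min(1.0, 2.0 * tail)


p = sign_test_p(31, 69)
print(f"two-sided sign test p = {p:.2e}")
```

A small p-value here only says the direction of the win count is unlikely under a coin flip; the suggestion to re-run at a larger n still stands for estimating the size of the NLL gap.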

Reproduce:

make -C gguf-tools quality-score
python3 gguf-tools/quality-testing/collect_official.py \
  --prompts gguf-tools/quality-testing/prompts.jsonl \
  --out gguf-tools/quality-testing/data/flash --count 100 --max-tokens 24
./gguf-tools/quality-testing/score_official ./q2-imatrix.gguf gguf-tools/quality-testing/data/flash/manifest.tsv /tmp/q2.tsv 4096
./gguf-tools/quality-testing/score_official ./q2-q4-imatrix.gguf gguf-tools/quality-testing/data/flash/manifest.tsv /tmp/q2q4.tsv 4096
python3 gguf-tools/quality-testing/compare_scores.py /tmp/q2.tsv /tmp/q2q4.tsv

nhwaani · 2026-05-28T08:06:14Z

cc @antirez — adding an independent M5 Max 128GB reproduction plus the practical q2 → q2-q4 speed-gain summary for review.

Value summary

From this PR's q2 vs q2-q4 comparison on M5 Max 128GB:

Prefill / prompt ingestion

This is the main win, especially for coding-agent and long-context use where ds4 needs to ingest repo context, tool history, or large prompts.

ctx	q2 prefill	q2-q4 prefill	gain
2K	372.15	413.85	+11.2%
32K	287.67	374.49	+30.2%
65K	244.73	298.66	+22.0%

Decode / generation

Decode is mixed: faster at short context, slightly slower at longer context.

ctx	q2 decode	q2-q4 decode	gain
2K	31.48	34.42	+9.3%
32K	28.93	27.75	-4.1%
65K	26.97	25.75	-4.5%

So the practical result is: q2-q4 looks like a net win for interactive use on M5 Max 128GB — much faster prefill, faster short-context decode, and only a small long-context decode penalty.

Independent reproduction

Setup:

MacBook Pro M5 Max, 128GB
macOS 26.5
Current upstream main: 072bc0f (Revert "Merge PR #264: Add wide-token MoE prefill tiles")
Model: DeepSeek-V4-Flash-Layers37-42Q4KExperts-OtherExpertLayersIQ2XXSGateUp-Q2KDown-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-fixed.gguf
Command shape:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 200000 \
  --step-incr 16384 \
  --gen-tokens 128 \
  --warm-weights \
  --csv /tmp/m5max_q2q4_imatrix_current_main.csv

Result: I can reproduce the PR's curve. Decode is very close across the sweep, and KV bytes match exactly. Prefill differs more in the early/mid rows, likely thermal/run-to-run variance, but converges closely at long context.

ctx	PR prefill	my prefill	Δ	PR gen	my gen	Δ
2048	413.85	456.35	+10.3%	34.42	35.93	+4.4%
18432	405.31	383.86	-5.3%	28.42	29.21	+2.8%
34816	374.49	340.16	-9.2%	27.75	27.64	-0.4%
67584	298.66	289.28	-3.1%	25.75	26.08	+1.3%
100352	248.99	247.81	-0.5%	24.36	24.26	-0.4%
133120	215.12	212.35	-1.3%	22.37	22.39	+0.1%
165888	187.32	186.65	-0.4%	20.72	21.02	+1.4%
198656	165.14	166.92	+1.1%	19.54	20.05	+2.6%
200000	157.02	157.40	+0.2%	19.37	19.97	+3.1%
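To put a number on "decode is very close, prefill differs more", the run-to-run spread can be summarized from the rows above. A minimal sketch using only the table values; the variable and helper names are illustrative:

```python
# (ctx, PR prefill, repro prefill, PR gen, repro gen), tokens per second,
# copied from the reproduction table.
rows = [
    (2048, 413.85, 456.35, 34.42, 35.93),
    (18432, 405.31, 383.86, 28.42, 29.21),
    (34816, 374.49, 340.16, 27.75, 27.64),
    (67584, 298.66, 289.28, 25.75, 26.08),
    (100352, 248.99, 247.81, 24.36, 24.26),
    (133120, 215.12, 212.35, 22.37, 22.39),
    (165888, 187.32, 186.65, 20.72, 21.02),
    (198656, 165.14, 166.92, 19.54, 20.05),
    (200000, 157.02, 157.40, 19.37, 19.97),
]


def abs_pct_diff(reference: float, other: float) -> float:
    """Absolute relative difference of other versus reference, in percent."""
    return abs(other / reference - 1.0) * 100.0


prefill_diffs = [abs_pct_diff(r[1], r[2]) for r in rows]
gen_diffs = [abs_pct_diff(r[3], r[4]) for r in rows]

for name, diffs in (("prefill", prefill_diffs), ("decode", gen_diffs)):
    print(f"{name}: max {max(diffs):.1f}%, mean {sum(diffs) / len(diffs):.1f}%")
```

The largest gaps for both metrics are at the 2048 frontier, and the prefill spread shrinks to about 1% or less from 100352 tokens onward.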

My final row:

200000,1344,157.40,128,19.97,2776775308

So from a second M5 Max 128GB run: this PR's q2-q4-imatrix speed curve looks reproducible and worth merging.
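The final CSV row also gives a quick cross-check of the KV cache size. The column names below are assumptions inferred from the values (the last field matches the roughly 2.78 GB kvcache_bytes figure quoted in the PR description); only the numbers come from the row itself.

```python
import csv
import io

# Hypothetical column names; the real CSV header is not shown in this thread.
FIELDS = ["ctx", "prefill_tokens", "prefill_tps", "gen_tokens", "gen_tps", "kvcache_bytes"]
final_row = "200000,1344,157.40,128,19.97,2776775308"

row = next(csv.DictReader(io.StringIO(final_row), fieldnames=FIELDS))
ctx = int(row["ctx"])
kv_bytes = int(row["kvcache_bytes"])

# Average over the whole context, not the marginal cost per extra token.
avg_bytes_per_token = kv_bytes / ctx
print(f"{kv_bytes / 1e9:.2f} GB total, {avg_bytes_per_token / 1000:.1f} KB/token average")
# prints: 2.78 GB total, 13.9 KB/token average
```

This is an average over all 200000 tokens, so it is not directly comparable to the ~13.4 KB/token marginal figure in the PR description; reproducing that would need the kvcache_bytes values from two different frontiers.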

Flux159 mentioned this pull request May 26, 2026

Supporting Q2 Q4 mixed model download & how to evaluate perf between it & Q2 #226

Closed

kenahrens wants to merge 1 commit into antirez:main from kenahrens:m5max-q2q4-bench
