ggml: imatrix-aware NVFP4 quantization (scale search) + wire NVFP4 ftype by avifenesh · Pull Request #25153 · ggml-org/llama.cpp

avifenesh · 2026-06-30T00:59:37Z

Overview

In short, this PR improves NVFP4 quantization quality by replacing the fixed amax/6
per-sub-block scale with a small scale search that minimizes (optionally imatrix-weighted)
reconstruction error, and it wires up the NVFP4 ftype so llama-quantize can actually
produce NVFP4 GGUFs.

Today quantize_nvfp4 does GGML_UNUSED(quant_weights) and quantizes each 16-element
sub-block with a fixed scale amax/6. The imatrix is ignored, and LLAMA_FTYPE_MOSTLY_NVFP4
is defined but unreferenced - there is no CLI type, no ftype->ggml_type mapping, and no
per-tensor selection - so NVFP4 GGUFs can only come from external/convert tools, not from
llama-quantize.

This change:

- adds quantize_row_nvfp4_core(x, y, k, quant_weights, search); the existing
  quantize_row_nvfp4_ref becomes core(..., NULL, 0), so there is one encoder, not two.
- for each sub-block, searches UE4M3 scale codes in [base-search, base+search] around the
  amax/6 baseline and keeps the code with the lowest reconstruction error; with an imatrix
  the error is importance-weighted per input column. The winning E2M1 nibbles are cached
  during the search and reused for packing.
- wires LLAMA_FTYPE_MOSTLY_NVFP4: CLI type in quantize.cpp, ftype->GGML_TYPE_NVFP4
  mapping, per-tensor selection (2D weights -> NVFP4, output/embed/1D -> Q8_0), and a
  tensor_type_fallback case (ne[0] % 64 != 0 -> Q8_0) so odd-shaped tensors do not abort
  the run.
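
The per-tensor selection and fallback rule in the last item can be sketched as a small decision function. This is an illustration of the rule as described here, not the code in llama-quant.cpp: the enum, the function name, and the `is_output_or_embed` flag are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two ggml types involved. */
typedef enum { SKETCH_TYPE_NVFP4, SKETCH_TYPE_Q8_0 } sketch_type;

/* Pick the storage type for one tensor under an NVFP4 ftype.
 * n_dims             number of tensor dimensions
 * ne0                size of the innermost (row) dimension
 * is_output_or_embed nonzero for output / token-embedding tensors */
static sketch_type sketch_pick_type(int n_dims, int64_t ne0, int is_output_or_embed) {
    assert(n_dims >= 1 && ne0 > 0);
    if (n_dims != 2 || is_output_or_embed) {
        return SKETCH_TYPE_Q8_0;   /* output/embed/1D -> Q8_0 */
    }
    if (ne0 % 64 != 0) {
        return SKETCH_TYPE_Q8_0;   /* row length not a multiple of 64: fall back */
    }
    return SKETCH_TYPE_NVFP4;      /* ordinary 2D weight */
}
```

Under these assumptions a 4096-wide 2D weight stays NVFP4, while a 2D weight with ne[0] = 1000 quietly becomes Q8_0 instead of aborting the run.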

The output is ordinary NVFP4, so the result runs unmodified on the existing NVFP4 kernels.
With search > 0 and no imatrix the scale search alone already improves quality; an imatrix
adds further on top.

Results below are on Qwen3.5-9B, Gemma-4-26B-A4B, and Qwen3.6-27B (+ MTP), the last being
my own daily-driver model and including MTP speculative-decoding acceptance.

Benchmark Setup

Base: upstream/master at c818263f2
Patch commit: c73069749
Hardware: NVIDIA GeForce RTX 5090 Laptop GPU, 24463 MiB VRAM
Driver: 595.71.05
Build: CUDA 13.1, build/bin/{llama-quantize,llama-imatrix,llama-perplexity}
Quantization: source f16 GGUF -> llama-quantize ... NVFP4
imatrix: llama-imatrix over a calibration corpus, 40 chunks
Perplexity: llama-perplexity -f wiki.test.raw --chunks <N>
Three arms compared:
- RTN: legacy amax/6 scale (search window 0, no imatrix) - baseline
- scale-search: scale search, no imatrix
- scale-search + imatrix: scale search using the imatrix

Results

Lower perplexity is better. All three GGUFs for a given model are identical in size (the
change is in scale selection only, not the format).

Qwen3.5-9B (wikitext-2 test, 60 chunks, imatrix on wiki.train)

arm	PPL	vs RTN
RTN (amax/6)	8.497 +/- 0.180	-
scale-search	8.247 +/- 0.172	-2.9%
scale-search + imatrix	7.984 +/- 0.165	-6.0%
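
For reference, the "vs RTN" column is the relative change in perplexity against the RTN arm. A small helper (hypothetical name, illustration only) makes the arithmetic explicit:

```c
#include <assert.h>

/* Relative change of `value` against `baseline`, in percent.
 * Negative means lower (better) perplexity than the baseline. */
static double rel_change_pct(double baseline, double value) {
    assert(baseline != 0.0);
    return 100.0 * (value - baseline) / baseline;
}
```

For example, `rel_change_pct(8.497, 8.247)` is about -2.9 and `rel_change_pct(8.497, 7.984)` is about -6.0, matching the table.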

Gemma-4-26B-A4B (60 chunks, imatrix built on a code corpus)

Cross-domain check: the imatrix was built on code yet improves both eval corpora.

corpus	RTN	scale-search	scale-search + imatrix
wikitext-2	5.888 +/- 0.109	5.827 (-1.0%)	5.675 +/- 0.103 (-3.6%)
code	1.933 +/- 0.024	1.909 (-1.2%)	1.885 +/- 0.023 (-2.5%)

Qwen3.6-27B + MTP (hybrid SSM/attn, daily-driver model; 40 chunks, imatrix on code)

corpus	RTN	scale-search	scale-search + imatrix
wikitext-2	6.123 +/- 0.149	6.128 (+0.1%)	6.167 (+0.7%)
code	1.950 +/- 0.031	1.925 (-1.3%)	1.903 +/- 0.030 (-2.4%)

Here the imatrix was built on code and the wikitext numbers are off-domain: scale-search is
roughly flat on wikitext and helps on code, while the code-built imatrix improves the code
eval (-2.4%) but slightly regresses wikitext (+0.7%). So scale-search is domain-neutral and
safe on its own; the imatrix benefit is largest when the calibration domain matches the
target workload. (Unlike Gemma above, where a code imatrix happened to also help wikitext,
this hybrid model does not generalize the code imatrix to wikitext.)

MTP acceptance, same draft model in both arms (so the comparison isolates the requantized
trunk), code/reasoning prompts, greedy decoding:

arm	draft_n	accepted	accept_rate
RTN (amax/6)	525	343	65.3%
scale-search + imatrix	642	489	76.2%

On the matching (code) domain the better-quantized trunk raises MTP acceptance by ~11 points,
which directly reduces speculative verification passes.
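
The acceptance rates above are simply accepted / draft_n. As a sketch (hypothetical helper name), the computation and the point difference are:

```c
#include <assert.h>

/* MTP acceptance rate in percent: accepted draft tokens over drafted tokens. */
static double accept_rate_pct(int accepted, int drafted) {
    assert(drafted > 0 && accepted >= 0 && accepted <= drafted);
    return 100.0 * (double)accepted / (double)drafted;
}
```

`accept_rate_pct(343, 525)` is about 65.3 and `accept_rate_pct(489, 642)` is about 76.2, a difference of roughly 10.8 points.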

Notes on the scale search

Window width NVFP4_SCALE_SEARCH = 12. The optimal code sits within a few codes of
amax/6, skewed positive (empirically an offset in [-2,+7]): with the non-uniform E2M1 codes
{..6,8,12}, scaling up to clamp an outlier and fit the bulk is often optimal. Narrowing to
+/-8 measurably regressed 9B search-only PPL (8.240 -> 8.287), so the margin is kept.
An iterative coordinate-descent solve (make_qkx-style: assign nibbles -> weighted-LSQ scale
-> snap to UE4M3 -> repeat) was prototyped and performed worse than the brute-force window:
the non-uniform code spacing creates local minima that the dense window steps over.
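
For context, the weighted-LSQ scale step of that prototype has a closed form: with the nibble values q_i fixed, the scale minimizing sum_i w_i (x_i - s*q_i)^2 is s = sum_i(w_i x_i q_i) / sum_i(w_i q_i^2). A minimal sketch of just that step, with a hypothetical function name and without the snap to UE4M3:

```c
#include <assert.h>
#include <stddef.h>

/* Closed-form weighted least-squares scale for fixed quantized values q.
 * w may be NULL, in which case all weights are 1. Returns 0 if every q is 0. */
static float weighted_lsq_scale(const float *x, const float *q, const float *w, int n) {
    assert(x != NULL && q != NULL && n > 0);
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; ++i) {
        double wi = w ? (double)w[i] : 1.0;
        num += wi * (double)x[i] * (double)q[i];
        den += wi * (double)q[i] * (double)q[i];
    }
    return den > 0.0 ? (float)(num / den) : 0.0f;
}
```

Each such step is optimal only for the current nibble assignment, which is consistent with the observation that the iteration can stall where the brute-force window does not.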

Validation

git diff --check
cmake --build build --target llama-quantize llama-perplexity test-quant-type-selection -j 8
Verified the refactor is byte-identical to the pre-refactor implementation for both the
scale-search and scale-search+imatrix outputs (so the measured PPL is unchanged by the
cleanup).
A/B baseline: a local build with the search window set to 0 reproduces the legacy amax/6
reference exactly (this is what the RTN arm above was quantized with).
test-quant-type-selection: the pre-existing MXFP4_MOE/IQ snapshot diffs are present on
master without this change; NVFP4 sections are unaffected.

AI usage disclosure: YES - Claude opus 4.8 xhigh was used to write this code. It assisted with code writing, local code review, benchmarking, cleanup, and preparing the MD-styled parts of this PR, like the tables.

While an LLM wrote the code, I did the design, the research, and the codebase reading. I directed, reviewed, and steered development, designed the benchmarks, reviewed the code more than once and requested changes, and I am doing the final sign-off on this code and PR.
I own this code, and I'm responsible for the output of the tools I use.

quantize_nvfp4 previously did GGML_UNUSED(quant_weights) and used a fixed amax/6 per-sub-block scale. Add quantize_row_nvfp4_impl: search a window of UE4M3 scale codes around amax/6, minimizing (optionally imatrix-weighted) sub-block reconstruction error. Output format unchanged -> runs on stock NVFP4 kernels. NVFP4_NO_SEARCH=1 forces legacy amax/6 for A/B.

Also wire LLAMA_FTYPE_MOSTLY_NVFP4 (was a dead enum, value 39, unreferenced): add to llama-quantize CLI table, ftype->ggml_type map, and per-tensor type selection (2D weights -> NVFP4, output/embed/1D -> Q8_0). llama-quantize can now emit NVFP4 GGUFs for the first time.

Qwen2.5-0.5B wikitext-2 PPL (100 chunks): RTN 22.03 -> scale-search 19.40 (-11.9%) -> +imatrix 19.37. Gap ~6x error bar.

…orse)

Tried an iterative coordinate-descent scale solve (make_qkx-style) and a narrower window. Findings, now in the comment:
- iterative CD gets stuck in local minima from NVFP4's non-uniform codes; -26% (single-seed) / -5% (multi-seed) vs brute-force window on weight-MSE.
- optimal code offset from amax/6 is empirically [-2,+7]; +/-12 is ample.
- +/-8 regressed real 9B search-only PPL (8.240 -> 8.287), so keep +/-12.

No code change to the search itself; window stays +/-12 brute force.

Review fixes (all 6):
1. llama-quant.cpp: add GGML_TYPE_NVFP4 case to tensor_type_fallback() - 2D weights with ne[0] % 64 != 0 now fall back to Q8_0 instead of throwing 'no tensor type fallback defined' and aborting the whole quantize.
2. fix the false comment that claimed a fallback existed when it did not.
3. drop the NVFP4_NO_SEARCH getenv() A/B flag - was read once per row on the hot path and is the wrong mechanism for a library; search width is now a plain parameter to the core encoder.
4. (same) env-var removed from production code path.
5. cache the winning sub-block E2M1 nibbles during the scale search instead of recomputing best_index_mxfp4 for the chosen scale after the loop.
6. fold quantize_row_nvfp4_ref + the importance-aware path into one quantize_row_nvfp4_core(x,y,k,weights,search); ref = core(...,NULL,0). No more two divergent copies of the amax/pack encode.

Verified byte-identical NVFP4 output to the pre-refactor commit (search and search+imatrix GGUFs), so all measured PPL results are unchanged. Net -16 lines.

- shrink the verbose header/justification comments to terse one-liners matching neighboring quantize_row_*_impl functions; move the 'why window=12' rationale out of the code (belongs in the PR, not a permanent comment).
- remove the explicit amax==0 early-out: base=ue4m3(0)=0 skips the whole search window, leaving the zero-initialized best_code/best_idx -> same zero block, one code path instead of two.
- match the sibling quantize_q*_K wrapper idiom (src += n_per_row).

Verified byte-identical NVFP4 output (search + search+imatrix) vs prior commit.

avifenesh added 4 commits June 30, 2026 01:51

github-actions bot added the examples and ggml labels Jun 30, 2026

avifenesh marked this pull request as ready for review June 30, 2026 02:04

avifenesh requested a review from ggerganov as a code owner June 30, 2026 02:04

avifenesh wants to merge 4 commits into ggml-org:master from avifenesh:nvfp4-imatrix-scale-search
