ggml: imatrix-aware NVFP4 quantization (scale search) + wire NVFP4 ftype #25153

Open
avifenesh wants to merge 4 commits into ggml-org:master from avifenesh:nvfp4-imatrix-scale-search

@avifenesh avifenesh commented Jun 30, 2026

Overview

In short, this PR improves NVFP4 quantization quality by replacing the fixed amax/6
per-sub-block scale with a small scale search that minimizes (optionally imatrix-weighted)
reconstruction error, and it wires up the NVFP4 ftype so llama-quantize can actually
produce NVFP4 GGUFs.

Today quantize_nvfp4 does GGML_UNUSED(quant_weights) and quantizes each 16-element
sub-block with a fixed scale amax/6. The imatrix is ignored, and LLAMA_FTYPE_MOSTLY_NVFP4
is defined but unreferenced - there is no CLI type, no ftype->ggml_type mapping, and no
per-tensor selection - so NVFP4 GGUFs can only come from external/convert tools, not from
llama-quantize.

This change:

  • adds quantize_row_nvfp4_core(x, y, k, quant_weights, search); the existing
    quantize_row_nvfp4_ref becomes core(..., NULL, 0), so there is one encoder, not two.
  • for each sub-block, searches UE4M3 scale codes in [base-search, base+search] around the
    amax/6 baseline and keeps the code with the lowest reconstruction error; with an imatrix
    the error is importance-weighted per input column. The winning E2M1 nibbles are cached
    during the search and reused for packing.
  • wires LLAMA_FTYPE_MOSTLY_NVFP4: CLI type in quantize.cpp, ftype->GGML_TYPE_NVFP4
    mapping, per-tensor selection (2D weights -> NVFP4, output/embed/1D -> Q8_0), and a
    tensor_type_fallback case (ne[0] % 64 != 0 -> Q8_0) so odd-shaped tensors do not abort
    the run.
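
The per-sub-block search in the second bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the names `ue4m3_decode`, `ue4m3_encode`, `nvfp4_sub_block_error`, and `nvfp4_search_scale` are hypothetical, the E2M1 table holds the nominal magnitudes (0 to 6), code 127 is skipped as the E4M3 NaN pattern, and packing into the NVFP4 block layout is omitted.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

#define NVFP4_SUB_BLOCK 16   // elements sharing one UE4M3 scale
#define UE4M3_MAX_CODE  126  // assumption: 127 is the NaN pattern, so skip it

// E2M1 magnitudes; the sign is kept in bit 3 of the nibble.
static const float e2m1_values[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Decode an unsigned E4M3 code: 4 exponent bits, 3 mantissa bits, bias 7.
float ue4m3_decode(int code) {
    int e = (code >> 3) & 0xF;
    int m = code & 0x7;
    if (e == 0) {
        return (float) m / 512.0f;            // subnormal: (m/8) * 2^-6
    }
    return (float) ((8 + m) << e) / 1024.0f;  // (1 + m/8) * 2^(e-7)
}

// Nearest UE4M3 code to v, by brute force over the finite codes.
int ue4m3_encode(float v) {
    int best = 0;
    float best_dist = FLT_MAX;
    for (int c = 0; c <= UE4M3_MAX_CODE; c++) {
        float dist = fabsf(ue4m3_decode(c) - v);
        if (dist < best_dist) { best_dist = dist; best = c; }
    }
    return best;
}

// Quantize one sub-block with scale code `code`. Returns the (optionally
// weighted) squared error and writes sign|index nibbles to nib[].
float nvfp4_sub_block_error(const float *x, const float *w, int code, uint8_t *nib) {
    float d = ue4m3_decode(code);
    float err = 0.0f;
    for (int i = 0; i < NVFP4_SUB_BLOCK; i++) {
        float a = d > 0.0f ? fabsf(x[i]) / d : 0.0f;
        int idx = 0;  // nearest E2M1 magnitude; anything above 6 clamps to index 7
        for (int j = 1; j < 8; j++) {
            if (fabsf(e2m1_values[j] - a) < fabsf(e2m1_values[idx] - a)) idx = j;
        }
        float diff = fabsf(x[i]) - e2m1_values[idx] * d;
        err += (w ? w[i] : 1.0f) * diff * diff;
        nib[i] = (uint8_t) (idx | (x[i] < 0.0f ? 0x8 : 0x0));
    }
    return err;
}

// Try every code in [base - search, base + search] around the amax/6 baseline
// and keep the one with the lowest error, caching its nibbles for packing.
int nvfp4_search_scale(const float *x, const float *w, int search, uint8_t *best_nib) {
    assert(search >= 0);
    float amax = 0.0f;
    for (int i = 0; i < NVFP4_SUB_BLOCK; i++) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    }
    int base = ue4m3_encode(amax / 6.0f);
    int lo = base - search < 0 ? 0 : base - search;
    int hi = base + search > UE4M3_MAX_CODE ? UE4M3_MAX_CODE : base + search;

    int best_code = base;
    float best_err = FLT_MAX;
    uint8_t nib[NVFP4_SUB_BLOCK];
    for (int code = lo; code <= hi; code++) {
        float err = nvfp4_sub_block_error(x, w, code, nib);
        if (err < best_err) {
            best_err = err;
            best_code = code;
            memcpy(best_nib, nib, sizeof(nib));
        }
    }
    return best_code;
}
```

In this sketch, passing `w = NULL` and `search = 0` reduces to the fixed amax/6 baseline, which mirrors how the PR describes `quantize_row_nvfp4_ref` as `core(..., NULL, 0)`. Because the window always contains the baseline code, the selected code can never have a higher error than the baseline on the same sub-block.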

The output is ordinary NVFP4, so the result runs unmodified on the existing NVFP4 kernels.
With search > 0 and no imatrix the scale search alone already improves quality; an imatrix
adds further on top.

Results below are on Qwen3.5-9B, Gemma-4-26B-A4B, and Qwen3.6-27B (+ MTP). The last is my
own daily-driver model, and its results also include MTP speculative-decoding acceptance.

Benchmark Setup

  • Base: upstream/master at c818263f2
  • Patch commit: c73069749
  • Hardware: NVIDIA GeForce RTX 5090 Laptop GPU, 24463 MiB VRAM
  • Driver: 595.71.05
  • Build: CUDA 13.1, build/bin/{llama-quantize,llama-imatrix,llama-perplexity}
  • Quantization: source f16 GGUF -> llama-quantize ... NVFP4
  • imatrix: llama-imatrix over a calibration corpus, 40 chunks
  • Perplexity: llama-perplexity -f wiki.test.raw --chunks <N>
  • Three arms compared:
    • RTN: legacy amax/6 scale (search window 0, no imatrix) - baseline
    • scale-search: scale search, no imatrix
    • scale-search + imatrix: scale search using the imatrix

Results

Lower perplexity is better. All three GGUFs for a given model are identical in size (the
change is in scale selection only, not the format).

Qwen3.5-9B (wikitext-2 test, 60 chunks, imatrix on wiki.train)

| arm | PPL | vs RTN |
| --- | --- | --- |
| RTN (amax/6) | 8.497 +/- 0.180 | - |
| scale-search | 8.247 +/- 0.172 | -2.9% |
| scale-search + imatrix | 7.984 +/- 0.165 | -6.0% |
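
As a worked check of the "vs RTN" column, assuming it is the relative change in perplexity against the RTN arm (the PR does not spell out the formula, but the numbers are consistent with it):

```math
\Delta_{\text{rel}} = \frac{\mathrm{PPL}_{\text{arm}} - \mathrm{PPL}_{\text{RTN}}}{\mathrm{PPL}_{\text{RTN}}},
\qquad
\frac{8.247 - 8.497}{8.497} \approx -2.9\%,
\qquad
\frac{7.984 - 8.497}{8.497} \approx -6.0\%
```

The same formula reproduces the percentages in parentheses in the per-corpus tables for the other two models.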

Gemma-4-26B-A4B (60 chunks, imatrix built on a code corpus)

Cross-domain check: the imatrix was built on code yet improves both eval corpora.

| corpus | RTN | scale-search | scale-search + imatrix |
| --- | --- | --- | --- |
| wikitext-2 | 5.888 +/- 0.109 | 5.827 (-1.0%) | 5.675 +/- 0.103 (-3.6%) |
| code | 1.933 +/- 0.024 | 1.909 (-1.2%) | 1.885 +/- 0.023 (-2.5%) |

Qwen3.6-27B + MTP (hybrid SSM/attn, daily-driver model; 40 chunks, imatrix on code)

| corpus | RTN | scale-search | scale-search + imatrix |
| --- | --- | --- | --- |
| wikitext-2 | 6.123 +/- 0.149 | 6.128 (+0.1%) | 6.167 (+0.7%) |
| code | 1.950 +/- 0.031 | 1.925 (-1.3%) | 1.903 +/- 0.030 (-2.4%) |

Here the imatrix was built on code and the wikitext numbers are off-domain: scale-search is
roughly flat on wikitext and helps on code, while the code-built imatrix improves the code
eval (-2.4%) but slightly regresses wikitext (+0.7%). So scale-search is domain-neutral and
safe on its own; the imatrix benefit is largest when the calibration domain matches the
target workload. (Unlike Gemma above, where a code imatrix happened to also help wikitext,
this hybrid model does not generalize the code imatrix to wikitext.)

MTP acceptance, same draft model in both arms (so the comparison isolates the requantized
trunk), code/reasoning prompts, greedy decoding:

| arm | draft_n | accepted | accept_rate |
| --- | --- | --- | --- |
| RTN (amax/6) | 525 | 343 | 65.3% |
| scale-search + imatrix | 642 | 489 | 76.2% |

On the matching (code) domain the better-quantized trunk raises MTP acceptance by ~11 points,
which directly reduces speculative verification passes.
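
For reference, the acceptance rates follow from accepted / draft_n, and the "~11 points" figure is an absolute difference in percentage points:

```math
\frac{343}{525} \approx 65.3\%,
\qquad
\frac{489}{642} \approx 76.2\%,
\qquad
\frac{489}{642} - \frac{343}{525} \approx 10.8 \text{ percentage points}
```

Relative to the RTN acceptance rate, that is an increase of roughly 17%.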

Notes on the scale search

  • Window width NVFP4_SCALE_SEARCH = 12. The optimal code sits within a few codes of
    amax/6, skewed positive (empirically the offset falls in [-2,+7]; the non-uniform E2M1
    codes {..6,8,12} mean scaling up to clamp an outlier and fit the bulk is often optimal).
    Narrowing to +/-8 measurably regressed 9B search-only PPL (8.240 -> 8.287), so the
    margin is kept.
  • An iterative coordinate-descent solve (make_qkx-style: assign nibbles -> weighted-LSQ scale
    -> snap to UE4M3 -> repeat) was prototyped and performed worse than the brute-force window;
    the non-uniform code spacing creates local boundaries that the dense window steps over.
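
For context, the "weighted-LSQ scale" step of that prototyped loop has a closed form once the nibbles are fixed. The sketch below shows only that step, under the assumption that `q[i]` holds the signed E2M1 value currently assigned to `x[i]`; the function name `wlsq_scale` is hypothetical, and this is not code from the PR (which keeps the brute-force window instead).

```c
#include <assert.h>
#include <stddef.h>

// Weighted least-squares scale for fixed quantized values q[i]:
// minimizes sum_i w[i] * (x[i] - d * q[i])^2, giving
//   d = sum_i(w[i] * x[i] * q[i]) / sum_i(w[i] * q[i]^2).
// Pass w == NULL for uniform weights. Returns 0 if every q[i] is zero.
float wlsq_scale(const float *x, const float *q, const float *w, size_t n) {
    assert(x != NULL && q != NULL);
    float num = 0.0f;
    float den = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float wi = w ? w[i] : 1.0f;
        num += wi * x[i] * q[i];
        den += wi * q[i] * q[i];
    }
    return den > 0.0f ? num / den : 0.0f;
}
```

The continuous `d` returned here still has to be snapped to a representable UE4M3 code before the nibbles are reassigned, and that snapping is where the PR reports the iteration getting stuck.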

Validation

  • git diff --check
  • cmake --build build --target llama-quantize llama-perplexity test-quant-type-selection -j 8
  • Verified the refactor is byte-identical to the pre-refactor implementation for both the
    scale-search and scale-search+imatrix outputs (so the measured PPL is unchanged by the
    cleanup).
  • A/B baseline: a local build with the search window set to 0 reproduces the legacy amax/6
    reference exactly (this is what the RTN arm above was quantized with).
  • test-quant-type-selection: the pre-existing MXFP4_MOE/IQ snapshot diffs are present on
    master without this change; NVFP4 sections are unaffected.

AI usage disclosure: YES - Claude opus 4.8 xhigh was used to write this code. It assisted with code writing, local code review, benchmarking, cleanup, and preparing the MD-styled parts of this PR, like the tables.

While an LLM wrote the code, I did the design, the research, and the codebase reading. I directed, reviewed, and steered development, designed the benchmarks, reviewed the code more than once and requested changes, and I am doing the final sign-off on this code and PR.
I own this code, and I'm responsible for the output of the tools I use.

quantize_nvfp4 previously did GGML_UNUSED(quant_weights) and used a fixed
amax/6 per-sub-block scale. Add quantize_row_nvfp4_impl: search a window of
UE4M3 scale codes around amax/6, minimizing (optionally imatrix-weighted)
sub-block reconstruction error. Output format unchanged -> runs on stock
NVFP4 kernels. NVFP4_NO_SEARCH=1 forces legacy amax/6 for A/B.

Also wire LLAMA_FTYPE_MOSTLY_NVFP4 (was a dead enum, value 39, unreferenced):
add to llama-quantize CLI table, ftype->ggml_type map, and per-tensor type
selection (2D weights -> NVFP4, output/embed/1D -> Q8_0). llama-quantize can
now emit NVFP4 GGUFs for the first time.
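
The per-tensor selection and the Q8_0 fallback described in this PR can be sketched as below. This is a minimal illustration, not the code in llama-quant.cpp: the enum, the function name `pick_nvfp4_type`, and the `is_output_or_embed` flag are hypothetical stand-ins, and treating tensors with more than two dimensions like 2D weights is an assumption the PR does not confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical stand-ins for the two ggml types involved; not the real enum.
enum sketch_type { SKETCH_TYPE_NVFP4, SKETCH_TYPE_Q8_0 };

// ne0: row length (ne[0]); n_dims: tensor rank;
// is_output_or_embed: set by the caller for output/embedding tensors.
enum sketch_type pick_nvfp4_type(int64_t ne0, int n_dims, bool is_output_or_embed) {
    assert(ne0 > 0 && n_dims >= 1);
    if (n_dims < 2 || is_output_or_embed) {
        return SKETCH_TYPE_Q8_0;   // output/embed/1D tensors stay at Q8_0
    }
    if (ne0 % 64 != 0) {
        return SKETCH_TYPE_Q8_0;   // fallback: row length is not a multiple of 64
    }
    return SKETCH_TYPE_NVFP4;      // regular weight matrices
}
```

For example, a 2D weight with `ne0 = 4096` maps to NVFP4 in this sketch, while one with `ne0 = 100` falls back to Q8_0 rather than aborting the run.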

Qwen2.5-0.5B wikitext-2 PPL (100 chunks): RTN 22.03 -> scale-search 19.40
(-11.9%) -> +imatrix 19.37. Gap ~6x error bar.
…orse)

Tried an iterative coordinate-descent scale solve (make_qkx-style) and a
narrower window. Findings, now in the comment:
- iterative CD gets stuck in local minima from NVFP4's non-uniform codes;
  -26% (single-seed) / -5% (multi-seed) vs brute-force window on weight-MSE.
- optimal code offset from amax/6 is empirically [-2,+7]; +/-12 is ample.
- +/-8 regressed real 9B search-only PPL (8.240 -> 8.287), so keep +/-12.
No code change to the search itself; window stays +/-12 brute force.

Review fixes (all 6):
1. llama-quant.cpp: add GGML_TYPE_NVFP4 case to tensor_type_fallback() - 2D
   weights with ne[0] % 64 != 0 now fall back to Q8_0 instead of throwing
   'no tensor type fallback defined' and aborting the whole quantize.
2. fix the false comment that claimed a fallback existed when it did not.
3. drop the NVFP4_NO_SEARCH getenv() A/B flag - was read once per row on the
   hot path and is the wrong mechanism for a library; search width is now a
   plain parameter to the core encoder.
4. (same) env-var removed from production code path.
5. cache the winning sub-block E2M1 nibbles during the scale search instead of
   recomputing best_index_mxfp4 for the chosen scale after the loop.
6. fold quantize_row_nvfp4_ref + the importance-aware path into one
   quantize_row_nvfp4_core(x,y,k,weights,search); ref = core(...,NULL,0). No
   more two divergent copies of the amax/pack encode.

Verified byte-identical NVFP4 output to the pre-refactor commit (search and
search+imatrix GGUFs), so all measured PPL results are unchanged. Net -16 lines.

- shrink the verbose header/justification comments to terse one-liners matching
  neighboring quantize_row_*_impl functions; move the 'why window=12' rationale
  out of the code (belongs in the PR, not a permanent comment).
- remove the explicit amax==0 early-out: base=ue4m3(0)=0 skips the whole search
  window, leaving the zero-initialized best_code/best_idx -> same zero block, one
  code path instead of two.
- match the sibling quantize_q*_K wrapper idiom (src += n_per_row).
Verified byte-identical NVFP4 output (search + search+imatrix) vs prior commit.
@github-actions github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 30, 2026
@avifenesh avifenesh marked this pull request as ready for review June 30, 2026 02:04
@avifenesh avifenesh requested a review from ggerganov as a code owner June 30, 2026 02:04