ggml: imatrix-aware NVFP4 quantization (scale search) + wire NVFP4 ftype #25153
Open
avifenesh wants to merge 4 commits into
quantize_nvfp4 previously did GGML_UNUSED(quant_weights) and used a fixed amax/6 per-sub-block scale. Add quantize_row_nvfp4_impl: search a window of UE4M3 scale codes around amax/6, minimizing (optionally imatrix-weighted) sub-block reconstruction error. Output format unchanged -> runs on stock NVFP4 kernels. NVFP4_NO_SEARCH=1 forces legacy amax/6 for A/B.

Also wire LLAMA_FTYPE_MOSTLY_NVFP4 (was a dead enum, value 39, unreferenced): add to llama-quantize CLI table, ftype->ggml_type map, and per-tensor type selection (2D weights -> NVFP4, output/embed/1D -> Q8_0). llama-quantize can now emit NVFP4 GGUFs for the first time.

Qwen2.5-0.5B wikitext-2 PPL (100 chunks): RTN 22.03 -> scale-search 19.40 (-11.9%) -> +imatrix 19.37. Gap ~6x error bar.
…orse)

Tried an iterative coordinate-descent scale solve (make_qkx-style) and a narrower window. Findings, now in the comment:
- iterative CD gets stuck in local minima from NVFP4's non-uniform codes; -26% (single-seed) / -5% (multi-seed) vs brute-force window on weight-MSE.
- optimal code offset from amax/6 is empirically [-2,+7]; +/-12 is ample.
- +/-8 regressed real 9B search-only PPL (8.240 -> 8.287), so keep +/-12.

No code change to the search itself; window stays +/-12 brute force.
Review fixes (all 6):
1. llama-quant.cpp: add GGML_TYPE_NVFP4 case to tensor_type_fallback() - 2D weights with ne[0] % 64 != 0 now fall back to Q8_0 instead of throwing 'no tensor type fallback defined' and aborting the whole quantize.
2. fix the false comment that claimed a fallback existed when it did not.
3. drop the NVFP4_NO_SEARCH getenv() A/B flag - was read once per row on the hot path and is the wrong mechanism for a library; search width is now a plain parameter to the core encoder.
4. (same) env-var removed from production code path.
5. cache the winning sub-block E2M1 nibbles during the scale search instead of recomputing best_index_mxfp4 for the chosen scale after the loop.
6. fold quantize_row_nvfp4_ref + the importance-aware path into one quantize_row_nvfp4_core(x,y,k,weights,search); ref = core(...,NULL,0). No more two divergent copies of the amax/pack encode.

Verified byte-identical NVFP4 output to the pre-refactor commit (search and search+imatrix GGUFs), so all measured PPL results are unchanged. Net -16 lines.
- shrink the verbose header/justification comments to terse one-liners matching neighboring quantize_row_*_impl functions; move the 'why window=12' rationale out of the code (belongs in the PR, not a permanent comment).
- remove the explicit amax==0 early-out: base=ue4m3(0)=0 skips the whole search window, leaving the zero-initialized best_code/best_idx -> same zero block, one code path instead of two.
- match the sibling quantize_q*_K wrapper idiom (src += n_per_row).

Verified byte-identical NVFP4 output (search + search+imatrix) vs prior commit.
Overview
In short, this PR improves NVFP4 quantization quality by replacing the fixed amax/6 per-sub-block scale with a small scale search that minimizes (optionally imatrix-weighted) reconstruction error, and it wires up the NVFP4 ftype so llama-quantize can actually produce NVFP4 GGUFs.
Today quantize_nvfp4 does GGML_UNUSED(quant_weights) and quantizes each 16-element sub-block with a fixed scale amax/6. The imatrix is ignored, and LLAMA_FTYPE_MOSTLY_NVFP4 is defined but unreferenced: there is no CLI type, no ftype->ggml_type mapping, and no per-tensor selection, so NVFP4 GGUFs can only come from external/convert tools, not from llama-quantize.

This change:
- Adds a single core encoder, quantize_row_nvfp4_core(x, y, k, quant_weights, search); the existing quantize_row_nvfp4_ref becomes core(..., NULL, 0), so there is one encoder, not two.
- Searches the UE4M3 scale codes in [base-search, base+search] around the amax/6 baseline and keeps the code with the lowest reconstruction error; with an imatrix the error is importance-weighted per input column. The winning E2M1 nibbles are cached during the search and reused for packing.
- Wires LLAMA_FTYPE_MOSTLY_NVFP4: CLI type in quantize.cpp, ftype -> GGML_TYPE_NVFP4 mapping, per-tensor selection (2D weights -> NVFP4, output/embed/1D -> Q8_0), and a tensor_type_fallback case (ne[0] % 64 != 0 -> Q8_0) so odd-shaped tensors do not abort the run.
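The per-tensor selection and fallback rule in the last bullet can be sketched roughly as follows. This is only an illustration of the stated rule, not the code in llama-quant.cpp: the enum, the struct, and the function name (sketch_type, sketch_tensor, sketch_pick_type) are hypothetical stand-ins, and the real selection logic handles more cases than shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real ggml type enum; only the two
 * types named in the PR text are modeled. */
typedef enum { SKETCH_TYPE_NVFP4, SKETCH_TYPE_Q8_0 } sketch_type;

/* Hypothetical tensor descriptor (not a ggml struct). */
typedef struct {
    int64_t ne0;                /* row length, i.e. ne[0]            */
    bool    is_2d_weight;       /* true for ordinary 2D weights      */
    bool    is_output_or_embed; /* output / embedding tensors        */
} sketch_tensor;

/* From the PR text: rows with ne[0] % 64 != 0 fall back to Q8_0. */
#define SKETCH_NVFP4_ROW_MULTIPLE 64

static sketch_type sketch_pick_type(const sketch_tensor *t) {
    /* output/embed/1D -> Q8_0 */
    if (t->is_output_or_embed || !t->is_2d_weight) {
        return SKETCH_TYPE_Q8_0;
    }
    /* odd-shaped 2D weights fall back instead of aborting the run */
    if (t->ne0 % SKETCH_NVFP4_ROW_MULTIPLE != 0) {
        return SKETCH_TYPE_Q8_0;
    }
    /* 2D weights -> NVFP4 */
    return SKETCH_TYPE_NVFP4;
}
```

Under these assumptions a 2D weight with ne[0] = 4096 is quantized as NVFP4, while one with ne[0] = 4000 (4000 % 64 = 32) takes the Q8_0 fallback.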
The output is ordinary NVFP4, so the result runs unmodified on the existing NVFP4 kernels. With search > 0 and no imatrix the scale search alone already improves quality; an imatrix adds further on top.
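To make the windowed search concrete, here is a minimal sketch for a single 16-element sub-block. It is not the PR's quantize_row_nvfp4_core: it assumes a plain unsigned E4M3 decode (bias 7, codes 0..126) and the standard E2M1 magnitude set, and all helper names (ue4m3_decode, e2m1_round, block_error, search_scale_code) are hypothetical. It omits nibble caching and packing.

```c
#include <stddef.h>

#define SUB_BLOCK 16 /* NVFP4 sub-block size, per the PR text */

/* Assumed E2M1 magnitudes; the sign is handled separately. */
static const float E2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Assumed decode: unsigned E4M3, bias 7, subnormals when exponent == 0. */
static float ue4m3_decode(int code) {
    int e = (code >> 3) & 0xF, m = code & 0x7;
    if (e == 0) return (float)m / 512.0f;               /* (m/8) * 2^-6 */
    return (1.0f + (float)m / 8.0f) * (float)(1 << e) / 128.0f;
}

/* Scale code whose decoded value is nearest to target (linear scan). */
static int ue4m3_nearest(float target) {
    int best = 0;
    for (int c = 1; c <= 126; c++)
        if (absf(ue4m3_decode(c) - target) < absf(ue4m3_decode(best) - target))
            best = c;
    return best;
}

/* Nearest signed E2M1 value to v. */
static float e2m1_round(float v) {
    float a = absf(v), best = 0.0f;
    for (int i = 1; i < 8; i++)
        if (absf(a - E2M1[i]) < absf(a - best)) best = E2M1[i];
    return v < 0.0f ? -best : best;
}

/* Squared reconstruction error at scale d; w holds optional per-column
 * importance weights (NULL means unweighted). */
static float block_error(const float *x, const float *w, float d) {
    float err = 0.0f;
    for (int i = 0; i < SUB_BLOCK; i++) {
        float q = d > 0.0f ? e2m1_round(x[i] / d) * d : 0.0f;
        float diff = x[i] - q;
        err += (w ? w[i] : 1.0f) * diff * diff;
    }
    return err;
}

/* Try every code in [base-search, base+search] around the amax/6 baseline
 * and keep the one with the lowest error; search == 0 returns the baseline. */
static int search_scale_code(const float *x, const float *w, int search) {
    float amax = 0.0f;
    for (int i = 0; i < SUB_BLOCK; i++)
        if (absf(x[i]) > amax) amax = absf(x[i]);
    int base = ue4m3_nearest(amax / 6.0f);
    int best = base;
    float best_err = block_error(x, w, ue4m3_decode(base));
    for (int c = base - search; c <= base + search; c++) {
        if (c < 0 || c > 126) continue;
        float err = block_error(x, w, ue4m3_decode(c));
        if (err < best_err) { best_err = err; best = c; }
    }
    return best;
}
```

Because the baseline code is always part of the window, the selected code can never have a higher (weighted) sub-block error than the amax/6 choice in this sketch; whether that translates into lower perplexity is what the benchmarks below measure.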
Results below are on Qwen3.5-9B, Gemma-4-26B-A4B, and Qwen3.6-27B (+ MTP). The last is my own daily-driver model, and its results include MTP speculative-decoding acceptance.
Benchmark Setup
- upstream/master at c818263f2c73069749
- build/bin/{llama-quantize,llama-imatrix,llama-perplexity}
- llama-quantize ... NVFP4
- llama-imatrix over a calibration corpus, 40 chunks
- llama-perplexity -f wiki.test.raw --chunks <N>
- RTN: legacy amax/6 scale (search window 0, no imatrix) - baseline
- scale-search: scale search, no imatrix
- scale-search + imatrix: scale search using the imatrix

Results
Lower perplexity is better. All three GGUFs for a given model are identical in size (the
change is in scale selection only, not the format).
Qwen3.5-9B (wikitext-2 test, 60 chunks, imatrix on wiki.train)
Gemma-4-26B-A4B (60 chunks, imatrix built on a code corpus)
Cross-domain check: the imatrix was built on code yet improves both eval corpora.
Qwen3.6-27B + MTP (hybrid SSM/attn, daily-driver model; 40 chunks, imatrix on code)
Here the imatrix was built on code and the wikitext numbers are off-domain: scale-search is
roughly flat on wikitext and helps on code, while the code-built imatrix improves the code
eval (-2.4%) but slightly regresses wikitext (+0.7%). So scale-search is domain-neutral and
safe on its own; the imatrix benefit is largest when the calibration domain matches the
target workload. (Unlike Gemma above, where a code imatrix happened to also help wikitext,
this hybrid model does not generalize the code imatrix to wikitext.)
MTP acceptance, same draft model in both arms (so the comparison isolates the requantized
trunk), code/reasoning prompts, greedy decoding:
On the matching (code) domain the better-quantized trunk raises MTP acceptance by ~11 points,
which directly reduces speculative verification passes.
Notes on the scale search
- NVFP4_SCALE_SEARCH = 12. The optimal code sits within a few codes of amax/6, skewed positive (empirically offsets of [-2, +7]): the non-uniform E2M1 codes {..6,8,12} mean scaling up to clamp an outlier and fit the bulk is often optimal. Narrowing to +/-8 measurably regressed 9B search-only PPL (8.240 -> 8.287), so the margin is kept.
- An iterative coordinate-descent scale solve (… -> snap to UE4M3 -> repeat) was prototyped and performed worse than the brute-force window; the non-uniform code spacing creates local minima that the dense window steps over.
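For scale, the +/-8 regression quoted above can be expressed as a relative change. The helper below is a hypothetical one-liner, not part of the PR; for 8.240 -> 8.287 it gives roughly +0.57%.

```c
/* Relative change in percent from a baseline to a new value;
 * for perplexity, a negative result means an improvement. */
static double pct_change(double baseline, double value) {
    return 100.0 * (value - baseline) / baseline;
}
```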
Validation
- git diff --check
- cmake --build build --target llama-quantize llama-perplexity test-quant-type-selection -j 8
- Byte-identical NVFP4 output versus the pre-refactor commit for the scale-search and scale-search+imatrix outputs (so the measured PPL is unchanged by the cleanup).
- With search = 0, the encoder reproduces the amax/6 reference exactly (this is what the RTN arm above was quantized with).
- test-quant-type-selection: the pre-existing MXFP4_MOE/IQ snapshot diffs are present on master without this change; NVFP4 sections are unaffected.

AI usage disclosure: YES - Claude opus 4.8 xhigh was used to write this code. It assisted with code writing, local code review, benchmarking, cleanup, and preparing the MD-styled parts of this PR, like the tables.
While an LLM wrote the code, I did the design, the research, and the codebase reading; I directed, reviewed, and steered development; I designed the benchmarks; I reviewed the code more than once and instructed changes; and I am doing the final sign-off on this code and PR.
I own this code, and I'm responsible for the output of the tools I use.