Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762) by Edwardf0t1 · Pull Request #1691 · NVIDIA/Model-Optimizer

Edwardf0t1 · 2026-06-11T21:29:49Z

What does this PR do?

Type of change: Bug fix

Fixes two sglang deployment failures on multimodal Gemma (gemma-4-31B-it) caused by general PTQ presets leaking quantization into the SigLIP vision branch via broad wildcards:

- NVBug 6293731, `general/ptq/fp8_default-kv_fp8`: the `w8a8_fp8_fp8` unit enables bare `*weight_quantizer` / `*input_quantizer`, which also match the vision tower (`model.vision_tower.*`, `model.visual.*`) and the vision embedding projection (`model.embed_vision.*`). The exported checkpoint deploys but emits garbled text in sglang.
- NVBug 6293762, `general/ptq/nvfp4_mlp_only-kv_fp8`: the `*mlp*` enables also match the vision tower's block MLPs (`model.vision_tower.encoder.layers.*.mlp`), and an image request crashes the FP4 kernel at decode: `ValueError: too many values to unpack (expected 2)` in sglang's `modelopt_quant.py` apply.
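To make the overreach concrete, here is a minimal sketch that uses Python's `fnmatch` as a stand-in for the recipe's wildcard matching. That stand-in is an assumption, and the actual ModelOpt matcher may behave differently. The quantizer names are hypothetical, modeled on the module paths quoted in the two bug descriptions.

```python
from fnmatch import fnmatchcase

# Hypothetical quantizer names, modeled on the module paths in the bug reports.
QUANTIZER_NAMES = [
    "model.language_model.layers.0.mlp.gate_proj.weight_quantizer",
    "model.language_model.layers.0.self_attn.q_proj.input_quantizer",
    "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer",
    "model.embed_vision.embedding_projection.weight_quantizer",
]


def matched(pattern, names=QUANTIZER_NAMES):
    """Return the names that a glob pattern selects."""
    return [name for name in names if fnmatchcase(name, pattern)]


# The bare FP8 enable selects text and vision weight quantizers alike.
fp8_hits = matched("*weight_quantizer")

# The MLP-only enable also selects the vision tower's block MLP.
mlp_hits = matched("*mlp*")
vision_hits = [name for name in mlp_hits if "vision_tower" in name]
```

In this sketch neither pattern distinguishes the language model from the vision branch, which is why an explicit exclusion is needed.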

Fix

Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to the shared `configs/ptq/units/default_disabled_quantizers` unit, alongside the existing `*router*` / `*lm_head*` entries.

Both the composed general/ptq/* recipes and the configs/ptq/presets/model/* presets import this unit, so:

- every general recipe (`fp8_default`, `nvfp4_default`, `nvfp4_mlp_only`, `nvfp4_omlp_only`, …) keeps the vision branch in BF16 by default, fixing the whole vision-overreach class, not just the two reported recipes;
- the `test_general_ptq_yaml_matches_config_dicts` YAML↔preset parity test stays satisfied (both sides pick up the new entries from the one shared unit).

The rules are no-ops on text-only models (nothing matches). A recipe that intentionally wants to quantize the vision branch can re-enable these after importing the unit.
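The intended behavior can be sketched as an ordered rule list in which the last matching rule wins. That ordering semantics is an assumption inferred from the commit notes that the disables are applied after the enables so that they win, and `fnmatch` again stands in for the real matcher. The `is_enabled` helper and the quantizer names are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumed semantics: rules are applied in order and the last matching rule wins.
ENABLES = [("*weight_quantizer", True), ("*input_quantizer", True)]
DEFAULT_DISABLED = [
    ("*router*", False),
    ("*lm_head*", False),
    ("*embed_vision*", False),
    ("*vision_tower*", False),
    ("*visual*", False),
]


def is_enabled(name, rules):
    """Resolve a quantizer's state by walking the rules in order."""
    state = False  # in this sketch, quantizers with no matching rule stay off
    for pattern, enabled in rules:
        if fnmatchcase(name, pattern):
            state = enabled
    return state


rules = ENABLES + DEFAULT_DISABLED
text_q = "model.language_model.layers.0.mlp.down_proj.weight_quantizer"
vision_q = "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer"

# A recipe that wants the vision branch quantized appends a re-enable
# after the shared unit, so its rule is the last match.
opt_in_rules = rules + [("*vision_tower*weight_quantizer", True)]
```

Under these assumptions the text quantizer resolves the same way with or without the new rules, the vision quantizer resolves to disabled, and the opt-in list turns it back on.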

Files changed:

modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml (+14)

Testing

Re-export gemma-4-31B-it with the affected recipes and re-deploy in sglang (the environment from the bug reports: lmsysorg/sglang:v0.5.12.post1, GB200) to confirm that fp8_default no longer garbles text and nvfp4_mlp_only no longer crashes on image requests. (Results to be appended.)

Unit-level: tests/unit/recipe/test_loader.py::test_general_ptq_yaml_matches_config_dicts (parity) passes for all four general presets.

Before your PR is "Ready for review"

- Is this change backward compatible?: ✅ (text-only checkpoints unaffected; new rules only match vision modules that should never have been quantized by a general recipe)
- If you copied code from any other sources or added a new PIP dependency: N/A
- Did you write any new necessary tests?: N/A (recipe data fix; covered by the existing parity test + verified by real PTQ export + sglang deploy)
- Did you update Changelog?: N/A
- Did you get Claude approval on this PR?: ❌ (pending)

Additional Information

NVBug 6293731 and 6293762. Reported on modelopt 0.45.0rc0, GB200, gemma-4-31B-it, sglang 0.5.12.post1. Tracked under OMNIML-5034. Companion to PR #1690 (same vision-overreach class on the gemma-specific w4a8_awq recipe, NVBug 6294017).

🤖 Generated with Claude Code

Summary by CodeRabbit

Chores
- Updated quantization configuration to preserve BF16 precision for vision encoder components in multimodal models.

…anch in sglang (NVBug 6293731, 6293762)

The general PTQ presets `fp8_default-kv_fp8` and `nvfp4_mlp_only-kv_fp8` (and their `_cast` KV siblings) enable quantization with broad wildcards that, on multimodal Gemma checkpoints (e.g. gemma-4-31B-it), also match the SigLIP vision tower (`model.vision_tower.*`), the vision embedding projection (`model.embed_vision.*`), and the vision block MLPs:

- `fp8_default`: the `w8a8_fp8_fp8` unit enables bare `*weight_quantizer` / `*input_quantizer`, FP8-quantizing the whole vision branch. The exported checkpoint then deploys but emits garbled text in sglang (NVBug 6293731).
- `nvfp4_mlp_only`: the `*mlp*` enables match `vision_tower.encoder.layers.*.mlp`, so the FP4 kernel crashes at decode with `ValueError: too many values to unpack (expected 2)` in sglang's modelopt_quant apply path (NVBug 6293762).

Add trailing `*visual*` / `*vision_tower*` / `*embed_vision*` disable rules (placed after the enables and `default_disabled_quantizers` so the disable wins), keeping the vision branch in BF16. Mirrors the vision exclusions already shipped in the gemma w4a8_awq / qwen3_5 / nemotron_vl recipes. The rules are no-ops on text-only models.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

coderabbitai · 2026-06-11T21:30:02Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c9f8ae1-09ab-40c2-9240-ae3f30f5b2ec

📥 Commits

Reviewing files that changed from the base of the PR and between 9e2acad and 513862e.

📒 Files selected for processing (1)

modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml

Walkthrough

This PR extends the PTQ default quantizer disable configuration to explicitly exclude vision and multimodal components from quantization by adding three new pattern-matching rules (`*embed_vision*`, `*vision_tower*`, `*visual*`) with documentation that these components remain in BF16 format unless downstream recipes re-enable them.

Changes

Quantization Recipe Configuration Update

| Layer / File(s) | Summary |
| --- | --- |
| Vision component exclusion patterns in default quantizer config `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml` | Three new quantizer disable entries for `embed_vision`, `vision_tower`, and `visual` patterns are added to the default disabled quantizers configuration, with accompanying comments explaining that vision encoders and multimodal embedding projections remain in BF16 by default. |

Possibly related PRs

NVIDIA/Model-Optimizer#1687: Both PRs modify the same PTQ quantizer-disabling YAML configuration to add rules keeping vision and multimodal components unquantized for NVFP4.

Suggested reviewers

shengliangxu

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: excluding multimodal vision branches from quantization in PTQ recipes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | git diff vs origin/main shows no Python changes under modelopt/ or examples/ (only YAML plus non-scope test/tool files). No SECURITY.md anti-patterns can be introduced in-scope Python. |


…6293731, 6293762)

The general PTQ presets quantize via broad wildcards: `fp8_default` enables bare `*weight_quantizer` / `*input_quantizer` (the `w8a8_fp8_fp8` unit) and `nvfp4_mlp_only` enables `*mlp*`. On multimodal checkpoints (e.g. gemma-4-31B-it) these also match the SigLIP vision tower (`model.vision_tower.*`, `model.visual.*`) and the vision embedding projection (`model.embed_vision.*`):

- fp8_default-kv_fp8: FP8-quantizes the vision branch; the checkpoint deploys but emits garbled text in sglang (NVBug 6293731).
- nvfp4_mlp_only-kv_fp8: NVFP4-quantizes the vision block MLPs; the FP4 kernel crashes at decode with `too many values to unpack (expected 2)` (NVBug 6293762).

Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to the shared `configs/ptq/units/default_disabled_quantizers` unit, alongside the existing `*router*` / `*lm_head*` entries. Because both the composed `general/ptq/*` recipes and the `configs/ptq/presets/model/*` presets import this unit, every general recipe keeps the vision branch in BF16 by default and the YAML<->preset parity test stays satisfied. No-op on text-only models; a recipe that intentionally quantizes vision can re-enable after importing this unit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

codecov · 2026-06-12T06:45:06Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.73%. Comparing base (dd49a46) to head (513862e).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1691      +/-   ##
==========================================
+ Coverage   67.72%   67.73%   +0.01%     
==========================================
  Files         511      511              
  Lines       56168    56168              
==========================================
+ Hits        38037    38043       +6     
+ Misses      18131    18125       -6

| Flag | Coverage Δ | |
| --- | --- | --- |
| unit | `54.34% <ø> (+0.01%)` | ⬆️ |


meenchen · 2026-06-12T17:07:01Z

+  # crashes export / produces garbage image embeddings on VL models (gemma-4,
+  # Qwen3.5-VL — NVBugs 6293731, 6293762, 6294017). A recipe that intentionally
+  # quantizes vision must re-enable these after importing this unit.
+  - quantizer_name: '*embed_vision*'


I recently added vision tower and visual for qwen3.6:

Model-Optimizer/modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml

Line 41 in 60b1af5

- quantizer_name: '*visual*'

Could you rebase and resolve the overlap?

Edwardf0t1 · Jun 12, 2026

Let me do it in #1690

Edwardf0t1 · Jun 12, 2026

Resolved in #1690 (commit 0cf494b). I rebased #1690 onto main and removed the duplicate bare `*visual*` / `*vision_tower*` entries your qwen3.6 change added, keeping the single documented block that disables `*vision_tower*` / `*visual*` / `*embed_vision*`, so each glob appears exactly once. I also dropped the now-redundant explicit vision excludes from the new huggingface/gemma4/ptq/w4a8_awq-kv_fp8_cast.yaml recipe since they're inherited from the shared unit. Verified `load_recipe` still resolves all three globs as disabled with `*weight_quantizer` enabled (INT4).
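A check for the overlap discussed in this thread could look like the sketch below. It assumes the unit's entries have already been loaded into a list of dicts keyed by `quantizer_name` (the key shown in the diff above); the entry list itself is a hypothetical pre-cleanup state, and `duplicate_globs` is an illustrative helper, not part of ModelOpt.

```python
from collections import Counter

# Hypothetical pre-cleanup state of the shared unit's entries.
entries = [
    {"quantizer_name": "*router*"},
    {"quantizer_name": "*lm_head*"},
    {"quantizer_name": "*visual*"},
    {"quantizer_name": "*vision_tower*"},
    {"quantizer_name": "*embed_vision*"},
    {"quantizer_name": "*vision_tower*"},
    {"quantizer_name": "*visual*"},
]


def duplicate_globs(items):
    """Return the globs that appear more than once, sorted for stable output."""
    counts = Counter(item["quantizer_name"] for item in items)
    return sorted(name for name, count in counts.items() if count > 1)


dupes = duplicate_globs(entries)

# Keep the first occurrence of each glob, preserving order.
seen = set()
deduped = []
for item in entries:
    if item["quantizer_name"] not in seen:
        seen.add(item["quantizer_name"])
        deduped.append(item)
```

Order-preserving deduplication matters if rule order affects which rule wins, which is why the sketch keeps first occurrences rather than converting to a set.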

github-actions · 2026-06-12T17:21:59Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-12 17:21 UTC

…rlap

Follow-up to #1691 (merged) and meenchen's qwen3.6 vision-exclusion addition, both of which landed `*vision_tower*` / `*visual*` in default_disabled_quantizers.

- default_disabled_quantizers.yaml: remove the duplicate bare `*visual*` / `*vision_tower*` entries (qwen3.6) now that the documented block already disables `*vision_tower*` / `*visual*` / `*embed_vision*`. One source of truth.
- gemma4 w4a8_awq recipe: drop the now-redundant explicit `*vision_tower*` / `*embed_vision*` excludes; they are inherited from the shared default_disabled_quantizers unit (imported last so its disables win). The recipe is now just the gemma-specific awq_lite alpha_step=1 numerics.
- Update the gemma4 recipe comment / README to reflect the shared-unit source.

Verified: load_recipe on the gemma4 recipe resolves `*vision_tower*` / `*visual*` / `*embed_vision*` as disabled (via the shared unit) with `*weight_quantizer` still enabled (INT4).

Fixes NVBug 6294017.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

Edwardf0t1 requested a review from a team as a code owner June 11, 2026 21:29

Edwardf0t1 requested a review from sychen52 June 11, 2026 21:29

coderabbitai Bot approved these changes Jun 11, 2026

Edwardf0t1 requested review from juhi10071998, kevalmorabia97 and shengliangxu June 11, 2026 22:49

meenchen reviewed Jun 11, 2026

Comment thread modelopt_recipes/general/ptq/fp8_default-kv_fp8.yaml Outdated

Edwardf0t1 added the cherry-pick-0.45.0 label (label description: "After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc") Jun 12, 2026

Edwardf0t1 changed the title ~~Fix gemma-4 fp8_default / nvfp4_mlp_only recipes quantizing vision branch in sglang (NVBug 6293731, 6293762)~~ Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762) Jun 12, 2026

meenchen approved these changes Jun 12, 2026

Edwardf0t1 merged commit 28c9601 into main Jun 12, 2026
35 checks passed

Edwardf0t1 deleted the fix-gemma4-fp8-nvfp4-vision-exclude branch June 12, 2026 17:21

coderabbitai Bot mentioned this pull request Jun 12, 2026

Fix gemma w4a8_awq recipe crashing export on multimodal checkpoints (NVBug 6294017) #1690

Open

juhi10071998 mentioned this pull request Jun 13, 2026

Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support ] #1707

Open

Edwardf0t1 merged 2 commits into main from fix-gemma4-fp8-nvfp4-vision-exclude
