
Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762) #1691

Merged
Edwardf0t1 merged 2 commits into main from fix-gemma4-fp8-nvfp4-vision-exclude
Jun 12, 2026

Conversation

@Edwardf0t1 Edwardf0t1 commented Jun 11, 2026

Contributor

What does this PR do?

Type of change: Bug fix

Fixes two sglang deployment failures on multimodal Gemma (gemma-4-31B-it) caused by general PTQ presets leaking quantization into the SigLIP vision branch via broad wildcards:

  • NVBug 6293731 (general/ptq/fp8_default-kv_fp8): the w8a8_fp8_fp8 unit enables bare *weight_quantizer / *input_quantizer, which also match the vision tower (model.vision_tower.*, model.visual.*) and the vision embedding projection (model.embed_vision.*). The exported checkpoint deploys but emits garbled text in sglang.
  • NVBug 6293762 (general/ptq/nvfp4_mlp_only-kv_fp8): the *mlp* enables also match the vision tower's block MLPs (model.vision_tower.encoder.layers.*.mlp), and an image request crashes the FP4 kernel at decode with ValueError: too many values to unpack (expected 2), raised from apply in sglang's modelopt_quant.py.
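The over-matching described in the two bullets above can be reproduced with a small sketch. It assumes the recipe patterns behave like shell-style globs (Python's `fnmatch` is used here as a stand-in; the matcher actually used by the library may differ), and the quantizer names are hypothetical, modeled only on the module paths named in this PR.

```python
from fnmatch import fnmatchcase

# Hypothetical quantizer names, modeled on the module paths named in the PR.
QUANTIZER_NAMES = [
    "model.language_model.layers.0.mlp.down_proj.weight_quantizer",
    "model.language_model.layers.0.self_attn.q_proj.input_quantizer",
    "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer",
    "model.embed_vision.embedding_projection.weight_quantizer",
]

# Substrings that identify the vision branch in this sketch.
VISION_MARKERS = ("vision_tower", "visual", "embed_vision")


def vision_hits(pattern: str, names: list[str]) -> list[str]:
    """Return the vision-branch names that a glob pattern matches."""
    return [
        name
        for name in names
        if fnmatchcase(name, pattern) and any(m in name for m in VISION_MARKERS)
    ]


if __name__ == "__main__":
    for pattern in ("*weight_quantizer", "*mlp*"):
        print(pattern, "->", vision_hits(pattern, QUANTIZER_NAMES))
```

Under these assumptions, both the bare `*weight_quantizer` enable and the `*mlp*` enable reach into the vision tower, which is the overreach the two bug reports describe.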

Fix

Add *embed_vision* / *vision_tower* / *visual* disable rules to the shared configs/ptq/units/default_disabled_quantizers unit, alongside the existing *router* / *lm_head* entries.

Both the composed general/ptq/* recipes and the configs/ptq/presets/model/* presets import this unit, so:

  • every general recipe (fp8_default, nvfp4_default, nvfp4_mlp_only, nvfp4_omlp_only, …) keeps the vision branch in BF16 by default — fixing the whole vision-overreach class, not just the two reported recipes;
  • the test_general_ptq_yaml_matches_config_dicts YAML↔preset parity test stays satisfied (both sides pick up the new entries from the one shared unit).

The rules are no-ops on text-only models (nothing matches). A recipe that intentionally wants to quantize the vision branch can re-enable these after importing the unit.
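A minimal sketch of why the shared disable rules win, and why a recipe can still opt back in. It assumes rules are applied in order and the last matching rule decides (consistent with the commit notes that the disables are placed after the enables), that quantizers start disabled, and that patterns are shell-style globs. The rule list, the `is_enabled` helper, and the module names are illustrative, not the library's API.

```python
from fnmatch import fnmatchcase

# (pattern, enabled) pairs in application order; a later match overrides an earlier one.
RULES = [
    ("*weight_quantizer", True),   # broad enables from a general recipe
    ("*input_quantizer", True),
    ("*router*", False),           # existing shared disables
    ("*lm_head*", False),
    ("*embed_vision*", False),     # vision disables added by this PR
    ("*vision_tower*", False),
    ("*visual*", False),
]


def is_enabled(name: str, rules=RULES) -> bool:
    """Resolve a quantizer name against ordered rules; the last match wins."""
    enabled = False  # assumption: a quantizer with no matching rule stays off
    for pattern, flag in rules:
        if fnmatchcase(name, pattern):
            enabled = flag
    return enabled


if __name__ == "__main__":
    text = "model.language_model.layers.0.mlp.down_proj.weight_quantizer"
    vision = "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer"
    print(is_enabled(text))    # text branch stays quantized
    print(is_enabled(vision))  # vision branch is left unquantized

    # A recipe that wants vision quantized appends a re-enable after the shared unit.
    opt_in = RULES + [("*vision_tower*weight_quantizer", True)]
    print(is_enabled(vision, opt_in))
```

On a text-only model no name contains `vision_tower`, `visual`, or `embed_vision`, so the three added rules never fire and the resolved configuration is unchanged.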

Files changed:

  • modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml (+14)

Testing

Re-export gemma-4-31B-it with the affected recipes and re-deploy in sglang (the environment from the bug reports: lmsysorg/sglang:v0.5.12.post1, GB200) to confirm that fp8_default no longer garbles text and nvfp4_mlp_only no longer crashes on image requests. (Results to be appended.) Unit-level: tests/unit/recipe/test_loader.py::test_general_ptq_yaml_matches_config_dicts (parity) passes for all four general presets.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (text-only checkpoints unaffected; new rules only match vision modules that should never have been quantized by a general recipe)
  • If you copied code from any other sources or added a new PIP dependency: N/A
  • Did you write any new necessary tests?: N/A (recipe data fix; covered by the existing parity test + verified by real PTQ export + sglang deploy)
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: ❌ (pending)

Additional Information

NVBug 6293731 and 6293762. Reported on modelopt 0.45.0rc0, GB200, gemma-4-31B-it, sglang 0.5.12.post1. Tracked under OMNIML-5034. Companion to PR #1690 (same vision-overreach class on the gemma-specific w4a8_awq recipe, NVBug 6294017).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Updated quantization configuration to preserve BF16 precision for vision encoder components in multimodal models.

…anch in sglang (NVBug 6293731, 6293762)

The general PTQ presets `fp8_default-kv_fp8` and `nvfp4_mlp_only-kv_fp8`
(and their `_cast` KV siblings) enable quantization with broad wildcards
that, on multimodal Gemma checkpoints (e.g. gemma-4-31B-it), also match the
SigLIP vision tower (`model.vision_tower.*`), the vision embedding projection
(`model.embed_vision.*`), and the vision block MLPs:

  - `fp8_default`: the `w8a8_fp8_fp8` unit enables bare `*weight_quantizer` /
    `*input_quantizer`, FP8-quantizing the whole vision branch. The exported
    checkpoint then deploys but emits garbled text in sglang (NVBug 6293731).
  - `nvfp4_mlp_only`: the `*mlp*` enables match
    `vision_tower.encoder.layers.*.mlp`, so the FP4 kernel crashes at decode
    with `ValueError: too many values to unpack (expected 2)` in sglang's
    modelopt_quant apply path (NVBug 6293762).

Add trailing `*visual*` / `*vision_tower*` / `*embed_vision*` disable rules
(placed after the enables and `default_disabled_quantizers` so the disable
wins), keeping the vision branch in BF16. Mirrors the vision exclusions
already shipped in the gemma w4a8_awq / qwen3_5 / nemotron_vl recipes. The
rules are no-ops on text-only models.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 requested a review from a team as a code owner June 11, 2026 21:29
@Edwardf0t1 Edwardf0t1 requested a review from sychen52 June 11, 2026 21:29
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c9f8ae1-09ab-40c2-9240-ae3f30f5b2ec

📥 Commits

Reviewing files that changed from the base of the PR and between 9e2acad and 513862e.

📒 Files selected for processing (1)
  • modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml

Walkthrough

This PR extends the PTQ default quantizer disable configuration to explicitly exclude vision and multimodal components from quantization by adding three new pattern-matching rules (*embed_vision*, *vision_tower*, *visual*) with documentation that these components remain in BF16 format unless downstream recipes re-enable them.

Changes

Quantization Recipe Configuration Update

| Layer / File(s) | Summary |
|---|---|
| Vision component exclusion patterns in default quantizer config (`modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml`) | Three new quantizer disable entries for `*embed_vision*`, `*vision_tower*`, and `*visual*` patterns are added to the default disabled quantizers configuration, with accompanying comments explaining that vision encoders and multimodal embedding projections remain in BF16 by default. |

Possibly related PRs

  • NVIDIA/Model-Optimizer#1687: Both PRs modify the same PTQ quantizer-disabling YAML configuration to add rules keeping vision and multimodal components unquantized for NVFP4.

Suggested reviewers

  • shengliangxu

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: excluding multimodal vision branches from quantization in PTQ recipes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | git diff vs origin/main shows no Python changes under modelopt/ or examples/ (only YAML plus non-scope test/tool files). No SECURITY.md anti-patterns can be introduced in-scope Python. |


Comment thread modelopt_recipes/general/ptq/fp8_default-kv_fp8.yaml Outdated
@Edwardf0t1 Edwardf0t1 added the cherry-pick-0.45.0 label (After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc) Jun 12, 2026
…6293731, 6293762)

The general PTQ presets quantize via broad wildcards: `fp8_default` enables
bare `*weight_quantizer` / `*input_quantizer` (the `w8a8_fp8_fp8` unit) and
`nvfp4_mlp_only` enables `*mlp*`. On multimodal checkpoints (e.g. gemma-4-31B-it)
these also match the SigLIP vision tower (`model.vision_tower.*`,
`model.visual.*`) and the vision embedding projection (`model.embed_vision.*`):

  - fp8_default-kv_fp8: FP8-quantizes the vision branch; the checkpoint deploys
    but emits garbled text in sglang (NVBug 6293731).
  - nvfp4_mlp_only-kv_fp8: NVFP4-quantizes the vision block MLPs; the FP4 kernel
    crashes at decode with `too many values to unpack (expected 2)` (NVBug 6293762).

Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to the shared
`configs/ptq/units/default_disabled_quantizers` unit, alongside the existing
`*router*` / `*lm_head*` entries. Because both the composed `general/ptq/*`
recipes and the `configs/ptq/presets/model/*` presets import this unit, every
general recipe keeps the vision branch in BF16 by default and the YAML<->preset
parity test stays satisfied. No-op on text-only models; a recipe that
intentionally quantizes vision can re-enable after importing this unit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 changed the title from "Fix gemma-4 fp8_default / nvfp4_mlp_only recipes quantizing vision branch in sglang (NVBug 6293731, 6293762)" to "Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762)" Jun 12, 2026
@codecov

codecov Bot commented Jun 12, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.73%. Comparing base (dd49a46) to head (513862e).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1691      +/-   ##
==========================================
+ Coverage   67.72%   67.73%   +0.01%     
==========================================
  Files         511      511              
  Lines       56168    56168              
==========================================
+ Hits        38037    38043       +6     
+ Misses      18131    18125       -6     
Flag Coverage Δ
unit 54.34% <ø> (+0.01%) ⬆️


# crashes export / produces garbage image embeddings on VL models (gemma-4,
# Qwen3.5-VL — NVBugs 6293731, 6293762, 6294017). A recipe that intentionally
# quantizes vision must re-enable these after importing this unit.
- quantizer_name: '*embed_vision*'

Contributor


I recently added vision tower and visual for qwen3.6:


Could you rebase and resolve the overlap?

Contributor Author


Let me do it in #1690

Contributor Author


Resolved in #1690 (commit 0cf494b). I rebased #1690 onto main and removed the duplicate bare *visual* / *vision_tower* entries your qwen3.6 change added, keeping the single documented block that disables *vision_tower* / *visual* / *embed_vision* — so each glob appears exactly once. I also dropped the now-redundant explicit vision excludes from the new huggingface/gemma4/ptq/w4a8_awq-kv_fp8_cast.yaml recipe since they're inherited from the shared unit. Verified load_recipe still resolves all three globs as disabled with *weight_quantizer enabled (INT4).
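The "each glob appears exactly once" property can be checked mechanically. The sketch below is illustrative only: `duplicate_globs` is a hypothetical helper, and the pattern lists are assumed stand-ins for the unit's contents before and after the cleanup, not copies of the real file.

```python
from collections import Counter


def duplicate_globs(patterns: list[str]) -> list[str]:
    """Return the glob patterns that occur more than once, in sorted order."""
    counts = Counter(patterns)
    return sorted(p for p, n in counts.items() if n > 1)


# Assumed state after both changes landed: the vision globs are listed twice.
before = [
    "*router*", "*lm_head*",
    "*visual*", "*vision_tower*",                    # bare entries
    "*embed_vision*", "*vision_tower*", "*visual*",  # documented block
]
# Assumed state after the cleanup: one documented block only.
after = ["*router*", "*lm_head*", "*embed_vision*", "*vision_tower*", "*visual*"]

if __name__ == "__main__":
    print(duplicate_globs(before))
    print(duplicate_globs(after))
```

A check of this kind could sit next to the existing parity test, though whether duplicates are harmful depends on how the loader merges repeated entries, which this thread does not state.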

@Edwardf0t1 Edwardf0t1 merged commit 28c9601 into main Jun 12, 2026
35 checks passed
@Edwardf0t1 Edwardf0t1 deleted the fix-gemma4-fp8-nvfp4-vision-exclude branch June 12, 2026 17:21
@github-actions

Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-12 17:21 UTC

Edwardf0t1 added a commit that referenced this pull request Jun 12, 2026
…rlap

Follow-up to #1691 (merged) and meenchen's qwen3.6 vision-exclusion addition,
both of which landed `*vision_tower*` / `*visual*` in default_disabled_quantizers.

- default_disabled_quantizers.yaml: remove the duplicate bare `*visual*` /
  `*vision_tower*` entries (qwen3.6) now that the documented block already
  disables `*vision_tower*` / `*visual*` / `*embed_vision*`. One source of truth.
- gemma4 w4a8_awq recipe: drop the now-redundant explicit `*vision_tower*` /
  `*embed_vision*` excludes — they are inherited from the shared
  default_disabled_quantizers unit (imported last so its disables win). The
  recipe is now just the gemma-specific awq_lite alpha_step=1 numerics.
- Update the gemma4 recipe comment / README to reflect the shared-unit source.

Verified: load_recipe on the gemma4 recipe resolves `*vision_tower*` /
`*visual*` / `*embed_vision*` as disabled (via the shared unit) with
`*weight_quantizer` still enabled (INT4). Fixes NVBug 6294017.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

Labels

cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc


3 participants