Qwen-Image diffusers PTQ: FP8 / NVFP4 / NVFP4-SVDQuant HF checkpoints#1706
Qwen-Image diffusers PTQ: FP8 / NVFP4 / NVFP4-SVDQuant HF checkpoints#1706jingyu-ml wants to merge 16 commits into
Conversation
Register Qwen/Qwen-Image as a supported model in the diffusers quantization example: - ModelType.QWEN_IMAGE and lazy-imported QwenImagePipeline (so the example still imports on older diffusers). - MODEL_REGISTRY / MODEL_PIPELINE / MODEL_DEFAULTS entries (backbone="transformer", text-to-image calibration dataset). - An actionable ImportError when the installed diffusers lacks Qwen classes, instead of an opaque failure. - filter_func_qwen_image: quantize only transformer_blocks, keeping the first two and last two of the 60 blocks (and everything outside transformer_blocks) in original precision. Enables the plain FP8/NVFP4 export path for Qwen-Image. Core SVDQuant code is unchanged. (Qwen-Image SVDQuant checkpoint work, RLCR round 0 / M1.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…harness
Implements the Qwen-Image NVFP4/FP8/SVDQuant diffusers quantization feature
(RLCR round 0 / M2-M5), keeping core SVDQuant code unchanged:
M2 (recipe): build_block_range_quant_cfg() emits ordered quant_cfg rules
(disable-all -> enable *.transformer_blocks.* -> disable first/last-N), applied
pre-calibration in Quantizer.get_quant_config so SVDQuant never mutates the
excluded blocks. Driven by a MODEL_DEFAULTS["block_range"] entry for Qwen-Image
(exclude first 2 / last 2; n derived from the model; n>=first+last+1 enforced).
M3 (export): _export_diffusers_checkpoint now promotes quantizer-owned tensors
to clean module-level safetensors keys before hide_quantizers_from_state_dict
(diffusers path only; the transformers path keeps its postprocess_state_dict
rename): input_quantizer._pre_quant_scale -> <module>.pre_quant_scale (AWQ key),
weight_quantizer.svdquant_lora_a/b -> <module>.svdquant_lora_a/b. Adds an
NVFP4_SVD branch to convert_hf_config (modeled on nvfp4_awq: pre_quant_scale +
lora_rank), and process_layer_quant_config now flags SVDQuant with
pre_quant_scale=True. This also resolves the diffusers pre_quant_scale TODO for
AWQ-style exports.
M4 (tests): unit tests for the block-range recipe (first/last-2 exclusion,
n>=6 validation) and the NVFP4_SVD HF config conversion.
M5 (harness): quantize.py --sanity-image-path (in-memory quantized-inference
image, pre-export) + examples/diffusers/quantization/qwen_image_svdquant/
{run_qwen_image_quantization.sh, README.md} (parameterized container/model/
export flow for FP8/NVFP4/SVDQuant).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
… tests Addresses the round-0 Codex review (RLCR round 1): Blocking fixes: - convert_hf_config: NVFP4_SVD config groups now keep `has_zero_point: False` (both convert_hf_quant_config_format and _quant_algo_to_group_config); asserted in the unit test. - build_block_range_quant_cfg: minimum is now first+last+2 (>=2 quantized middle blocks; n>=6 for the 2+2 Qwen recipe); recipe test rejects 5/4/3-block models. - quantize.py --sanity-image-path failures are now fatal (re-raise -> non-zero exit) so the harness cannot report success without the image; the harness also verifies sanity.png + safetensors + config.json exist per format. Qwen export enablement: - diffusers_utils.generate_diffusion_dummy_inputs: add a QwenImageTransformer2DModel branch (packed latents [B,(H//2)(W//2),C], encoder_hidden_states_mask, img_shapes, txt_seq_lens, optional guidance, continuous timestep). - unified_export_hf._fuse_qkv_linears_diffusion gains strict=; Qwen QKV fusion now fails hard instead of silently skipping. Promotion buffers now overwrite on re-export. create_pipeline_from gives the same actionable Qwen import error. Tests: - New tests/unit/torch/quantization/test_svdquant_forward_fold.py: LoRA stays on weight_quantizer, forward includes a nonzero residual, fold_weight folds it and drops the buffers (existing test_svdquant_lora_weights left unmodified). Deferred to Round 2 / cluster: tiny Qwen2_5_VL fixture + full diffusers e2e export test (needs a Qwen-capable diffusers + GPU); the actual AC-7 checkpoint run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…minology Round 2 (addresses round-1 Codex review: the round-1 code had no direct test coverage). Adds tests/unit/torch/export/test_diffusers_qwen_export.py: - Qwen dummy inputs: generate_diffusion_dummy_inputs builds the expected keys for a real tiny QwenImageTransformer2DModel, and the generated dummy forward runs on it (this is what catches any wrong shape/kwarg in the dummy-input builder). - Strict fusion: _fuse_qkv_linears_diffusion(strict=True) re-raises on a failing dummy forward; strict=False does not. - Structural export: _promote_quantizer_tensors_to_module promotes SVDQuant LoRA + pre_quant_scale to clean module keys that survive hide_quantizers_from_state_dict (promoted <module>.svdquant_lora_a/b + <module>.pre_quant_scale present; weight_quantizer / input_quantizer keys absent), on a calibrated tiny SVDQuant MLP. Also removes plan/workflow terminology (DEC-5, "pre-calibration") from source and test comments per the plan code-style note. Still pending (Round 3 / cluster): the full tiny Qwen pipeline fixture + e2e subprocess export test (needs diffusers' tokenizer/text-encoder construction and a GPU) and the AC-7 cluster run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Round 3 (addresses round-2 Codex review): - Fix the tiny Qwen-Image pipeline fixture (tests/_test_utils/torch/diffusers_models.py): build the Qwen2.5-VL text encoder inline from a tiny Qwen2_5_VLConfig (no Hub model load; the previous hf-internal-testing/...Qwen2_5_VL id does not exist), load the tokenizer from the tiny ...Qwen2VL id diffusers' own fast test uses, build the transformer with num_layers=6 (so the corrected first-2/last-2 block-range recipe, which needs >=6 blocks, is valid) and joint_attention_dim=16 matching the text encoder hidden_size, and a z_dim=4 VAE. Mirrors diffusers' QwenImagePipelineFastTests.get_dummy_components. - Add Qwen FP8 / NVFP4 / NVFP4-SVDQuant cases to test_export_diffusers_hf_ckpt.py using the tiny fixture. The test opens transformer/config.json and the exported safetensors and asserts: quant_method=modelopt; no weight_quantizer / input_quantizer._amax keys; for SVDQuant, promoted <module>.svdquant_lora_a/b + <module>.pre_quant_scale keys, config group pre_quant_scale/has_zero_point/ lora_rank, and non-empty ignore (excluded blocks); for plain formats, weight_scale. GPU/diffusers skip-guarded. - Drop remaining workflow terminology (Step 4.5, before-calibration) from the comments I introduced. Still cluster-only (no GPU here): executing these tests and the AC-7 harness run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…ep comments
Round 4 (addresses round-3 Codex review):
- Offline tiny Qwen tokenizer: _build_local_qwen2_tokenizer builds a deterministic
byte-level Qwen2 tokenizer locally (GPT-2 byte->unicode vocab + Qwen specials,
empty merges) instead of a Hub load; removes the tokenizer-unavailable skip path.
- Strengthen test_qwen_image_hf_ckpt_export: assert equal module-prefix sets for
.svdquant_lora_a/.svdquant_lora_b/.pre_quant_scale; promoted linears are a subset
of weight-scaled linears; only the middle blocks {2,3} of 6 are quantized (first-2/
last-2 excluded); lora_a=[rank,in]/lora_b=[out,rank] with rank == --lowrank (8);
NVFP4 weight_scale_2 present; exact config (quant_algo=NVFP4_SVD, lora_rank=8,
pre_quant_scale=True, has_zero_point=False, non-empty ignore).
- Remove the remaining "Step N:" workflow comments from unified_export_hf.py
(the round-3 "grep clean" claim was wrong; verified clean across the whole file).
Still cluster-only (no GPU/torch/diffusers here): executing these tests and the
AC-7 harness run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…port test
Round 5 (addresses round-4 Codex review, which found a regression I introduced):
- The round-4 edit inserted the _module_prefixes/_block_indices helpers between
@pytest.mark.parametrize("qwen_model", ...) and test_qwen_image_hf_ckpt_export,
so the decorator was attached to the helper and the test would request an
undefined qwen_model fixture. Moved the helpers/constants above the decorator so
it directly decorates the test (verified via ast: the test now carries the
qwen_model parametrization and the helper is undecorated).
- Tightened SVDQuant assertions: require a_prefixes == b_prefixes == pqs_prefixes
== weight_scale_prefixes (every quantized linear is promoted, no gaps), and
assert every quantized prefix is under transformer_blocks (nothing outside is
quantized), in addition to the {2,3}-only block check.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Round 6 (round-5 review found no code blocker; only the queued docstring nit): the create_tiny_qwen_image_pipeline_dir docstring still said the tokenizer was fetched from the Hub, but Round 4 switched it to a local offline build (_build_local_qwen2_tokenizer). Updated the wording to "fully offline". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…tale docs
Round 7 (addresses round-6 Codex review's two missing-coverage items):
- AC-2.2 SVDQuant immutability test (test_qwen_block_range_recipe.py): builds a
6-block backbone, snapshots the excluded first/last block linear weights, runs
SVDQuant via build_block_range_quant_cfg, and asserts the excluded blocks'
weights are bit-identical (never calibrated) with no LoRA, while the middle
blocks {2,3} receive LoRA and have their weights modified.
- AC-1 negative-loading tests (new test_qwen_pipeline_loading.py): monkeypatch
MODEL_PIPELINE[QWEN_IMAGE]=None and assert the actionable ImportError; a fake
pipeline asserts create_pipeline does not pass trust_remote_code.
Stale-doc cleanups: the resolved pre_quant_scale TODO wording in
unified_export_hf.py; the build_block_range_quant_cfg docstring (first+last+1 ->
+2); the conftest "SKETCH" wording (the fixture is now a working offline build).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…e-gate Round 8 (addresses round-7 Codex review, which verified against the diffusers source that QwenImageTransformer2DModel.forward has no txt_seq_lens parameter): - _qwen_inputs no longer passes txt_seq_lens (the real forward signature is hidden_states, encoder_hidden_states, encoder_hidden_states_mask, timestep, img_shapes, guidance, return_dict). Passing txt_seq_lens would have raised an unexpected-keyword error and, because Qwen export uses strict QKV fusion, hard-failed the export. - Signature-gate the dummy inputs: filter to the kwargs the installed model's forward actually accepts (via inspect.signature), so diffusers-version drift cannot hard-fail strict fusion either. - Update test_diffusers_qwen_export.py: no longer require txt_seq_lens. - Remove AC- plan terminology from two test docstrings (code-style note). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…after export Round 9 (clears the last queued code item from Codex; no code blockers remain): _promote_quantizer_tensors_to_module left the temporary <module>.svdquant_lora_a/b + <module>.pre_quant_scale buffers on the live module after export. Add _remove_promoted_quantizer_tensors and call it after each quantized diffusers component is saved, so the live module is unchanged post-export (repeated export / module reuse stay correct). The quantizer-owned tensors are untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…dquant) Validated end-to-end on GB200 against the real Qwen/Qwen-Image: all three formats export correct HF checkpoints (only transformer_blocks 2..57; nothing outside), no quantizer-state leak, and the focused tests pass. - models_utils: build_block_range_quant_cfg now uses the top-level enable QuantizerCfgEntry field (a None cfg retains the base preset's params) instead of nesting cfg.enable, which the QuantizerAttributeConfig validator rejects/mis-applies (the old form left every block quantized). - quantize.py: import onnx_utils.export lazily (only needed for --onnx-dir; avoids a hard onnx_graphsurgeon dependency), and pass max_shard_size so the ~20B transformer saves as a single safetensors -- the unified export's layerwise-metadata post-processing does not support sharded files. - diffusers_utils: hide_quantizers_from_state_dict strips quantizer submodules from all modules, not only is_quantlinear, so enabled input quantizers on norm layers no longer leak input_quantizer._amax into the checkpoint. - tests: the tiny QwenImageTransformer2DModel fixture signature-gates its kwargs (diffusers 0.38 removed pooled_projection_dim from the constructor); the recipe test asserts the corrected top-level enable schema. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
📝 WalkthroughWalkthroughThis PR extends the diffusers quantization example with Qwen-Image model support, including a selective block-range quantization strategy that excludes transformer edge blocks, SVDQuant low-rank export infrastructure, offline test utilities, and comprehensive validation tests. ChangesQwen-Image quantization harness
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
run_qwen_image_quantization.sh and its README are cluster-specific experiment/operator scripts (hard-coded /lustre paths) that do not belong in the upstream diffusers example. The feature itself (model registration, block-range recipe, FP8/NVFP4/SVDQuant export) is covered by the committed tests. The scripts are kept locally outside the repo. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
|
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/export/unified_export_hf.py (1)
1174-1221: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winAlways clean up promoted buffers with
try/finally.On Line 1174, promoted export buffers are added, but cleanup on Line 1219-1221 runs only on the success path. Any exception in save/postprocess/config update leaves the live component mutated (
pre_quant_scale/svdquant_lora_*buffers lingering).Proposed fix
- _promote_quantizer_tensors_to_module(component) - - # Build quantization config - quant_config = get_quant_config(component, is_modelopt_qlora=False) - if quant_config: - quantization_details = quant_config.get("quantization", {}) - # Record the SVDQuant low-rank size so consumers know the LoRA shape. - if quantization_details.get("quant_algo") == "NVFP4_SVD": - svdquant_rank = _detect_svdquant_rank(component) - if svdquant_rank is not None: - quantization_details["lora_rank"] = svdquant_rank - hf_quant_config = convert_hf_quant_config_format(quant_config) if quant_config else None - - # Save the component - # - diffusers ModelMixin.save_pretrained does NOT accept state_dict parameter - # - for non-diffusers modules (e.g., LTX-2 transformer), fall back to torch.save - if hasattr(component, "save_pretrained"): - with hide_quantizers_from_state_dict(component): - component.save_pretrained(component_export_dir, max_shard_size=max_shard_size) - else: - with hide_quantizers_from_state_dict(component): - _save_component_state_dict_safetensors(component, component_export_dir) - - # Post-process — merge, metadata, padding, swizzle - _postprocess_safetensors( - component_export_dir, - pipe, - hf_quant_config=hf_quant_config, - **kwargs, - ) - - # Update config.json with quantization info - if hf_quant_config is not None: - config_path = component_export_dir / "config.json" - if config_path.exists(): - with open(config_path) as file: - config_data = json.load(file) - config_data["quantization_config"] = hf_quant_config - with open(config_path, "w") as file: - json.dump(config_data, file, indent=4) - - # Drop the temporary promoted export buffers so the live module is - # unchanged after export (supports repeated export / module reuse). - _remove_promoted_quantizer_tensors(component) + _promote_quantizer_tensors_to_module(component) + try: + # Build quantization config + quant_config = get_quant_config(component, is_modelopt_qlora=False) + if quant_config: + quantization_details = quant_config.get("quantization", {}) + if quantization_details.get("quant_algo") == "NVFP4_SVD": + svdquant_rank = _detect_svdquant_rank(component) + if svdquant_rank is not None: + quantization_details["lora_rank"] = svdquant_rank + hf_quant_config = convert_hf_quant_config_format(quant_config) if quant_config else None + + if hasattr(component, "save_pretrained"): + with hide_quantizers_from_state_dict(component): + component.save_pretrained(component_export_dir, max_shard_size=max_shard_size) + else: + with hide_quantizers_from_state_dict(component): + _save_component_state_dict_safetensors(component, component_export_dir) + + _postprocess_safetensors( + component_export_dir, + pipe, + hf_quant_config=hf_quant_config, + **kwargs, + ) + + if hf_quant_config is not None: + config_path = component_export_dir / "config.json" + if config_path.exists(): + with open(config_path) as file: + config_data = json.load(file) + config_data["quantization_config"] = hf_quant_config + with open(config_path, "w") as file: + json.dump(config_data, file, indent=4) + finally: + _remove_promoted_quantizer_tensors(component)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/export/unified_export_hf.py` around lines 1174 - 1221, You promote quantizer-owned tensors with _promote_quantizer_tensors_to_module but only call _remove_promoted_quantizer_tensors on the success path, so exceptions during save/postprocess/config update leave the module mutated; wrap the work that occurs after promotion (the save path using hide_quantizers_from_state_dict + component.save_pretrained or _save_component_state_dict_safetensors, _postprocess_safetensors, and the config.json update that uses hf_quant_config) in a try/finally and call _remove_promoted_quantizer_tensors(component) in the finally so cleanup always runs; preserve and re-raise any exception after cleanup to avoid swallowing errors.examples/diffusers/quantization/quantize.py (1)
111-121: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winMake the Qwen block-range mask backbone-aware and use a single source of truth.
get_quant_config()always injectsMODEL_DEFAULTS[QWEN_IMAGE]["block_range"], andquantize_model()always follows withget_model_filter_func(). For Qwen-Image that creates two concrete failure modes:--backbone transformer vaewill raise when the VAE path hitsbuild_block_range_quant_cfg()with notransformer_blocks, and any local override checkpoint whose transformer depth is not exactly 60 will calibrate with one exclusion mask but be post-disabled with the hard-coded 60-block mask fromexamples/diffusers/quantization/utils.py. Please gate the recipe/filter to the transformer backbone and derive both from the loaded backbone instead of keeping two independent masks.Also applies to: 171-191, 223-233, 696-709
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/diffusers/quantization/quantize.py` around lines 111 - 121, get_quant_config currently always injects MODEL_DEFAULTS[QWEN_IMAGE]["block_range"] and quantize_model/get_model_filter_func apply a separate hard-coded 60-block mask, causing mismatch when backbone != transformer or transformer depth != 60; fix by making the Qwen block-range mask backbone-aware and deriving it from the loaded backbone (e.g., transformer_blocks, num_layers or backbone.config.*) as the single source of truth: update get_quant_config to consult the actual backbone type and depth and compute block_range via build_block_range_quant_cfg(backbone_depth) instead of using MODEL_DEFAULTS[QWEN_IMAGE]["block_range"], and update quantize_model and get_model_filter_func to use that same computed mask (remove hard-coded masks in examples/diffusers/quantization/utils.py) so all three locations (get_quant_config, quantize_model, get_model_filter_func / build_block_range_quant_cfg) reference the same backbone-derived value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/diffusers/quantization/quantize.py`:
- Around line 581-589: The CLI currently accepts --sanity-image-path
unconditionally and assumes generated outputs have images, causing late failures
for video/non-image pipelines; update argument validation in quantize.py to
reject --sanity-image-path early when the selected pipeline type is not an image
pipeline: after parsing args (or inside the existing validation function / main
pipeline selection flow), detect the pipeline kind via the pipeline ID or class
name used for inference (the same symbol(s) that decide which pipeline to
instantiate) and raise an error or exit if --sanity-image-path is set but the
pipeline is not one of the known image pipelines (e.g., StableDiffusion/Any
Image* pipelines); apply the same guard for the second occurrence of this block
noted around the other lines so non-image pipelines fail at argument validation
time rather than after a full run.
In
`@examples/diffusers/quantization/qwen_image_svdquant/run_qwen_image_quantization.sh`:
- Around line 39-40: The script claims DRY_RUN previews commands but still
performs side effects and hard-fails on missing tokens; update logic so when
DRY_RUN is set (check DRY_RUN or use a helper dry_run() wrapper) you skip/avoid
any real file checks and mutations: only echo planned actions instead of
performing them, skip the HF_TOKEN_FILE existence/readability checks (do not
exit with error) and skip creating OUTPUT_DIR (do not run mkdir -p) and any file
writes; specifically wrap or conditionalize the HF_TOKEN_FILE checks (the
HF_TOKEN_FILE variable handling) and the mkdir -p or other filesystem operations
that create ${OUTPUT_DIR} so they only execute when DRY_RUN is not set, and
ensure any commands that would modify disk are printed when DRY_RUN=1 rather
than executed.
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1024-1035: The code currently returns the first observed SVDQuant
rank via _detect_svdquant_rank by inspecting weight_quantizer.svdquant_lora_a,
which can hide inconsistencies across modules; update the logic to scan all
modules' weight_quantizer.svdquant_lora_a values, collect all unique ranks, and:
if none found return None, if exactly one unique rank use that, otherwise raise
or log an explicit error and refuse to write a single lora_rank metadata value.
Apply this validation before serializing the lora_rank metadata (where lora_rank
is written) so you never serialize an incorrect single rank when multiple
different ranks exist. Ensure you reference the same attributes
(weight_quantizer, svdquant_lora_a) and the _detect_svdquant_rank helper (or
replace it with a function that returns the set/validates) so callers can act on
the validation result.
In `@tests/_test_utils/torch/diffusers_models.py`:
- Line 296: Move the deferred "import inspect" (and any other imports added
inside tests) to the module/top-level in
tests/_test_utils/torch/diffusers_models.py (and the other referenced test
files: tests/examples/diffusers/test_qwen_block_range_recipe.py,
tests/examples/diffusers/test_export_diffusers_hf_ckpt.py,
tests/unit/torch/export/test_diffusers_qwen_export.py) so imports are
module-level by default; if an import truly must be deferred (circular or
optional dependency), keep it but add a one-line comment above the deferred
import explaining the specific reason and link to the offending symbol (e.g.,
the "import inspect" line) so reviewers can verify the justification.
---
Outside diff comments:
In `@examples/diffusers/quantization/quantize.py`:
- Around line 111-121: get_quant_config currently always injects
MODEL_DEFAULTS[QWEN_IMAGE]["block_range"] and
quantize_model/get_model_filter_func apply a separate hard-coded 60-block mask,
causing mismatch when backbone != transformer or transformer depth != 60; fix by
making the Qwen block-range mask backbone-aware and deriving it from the loaded
backbone (e.g., transformer_blocks, num_layers or backbone.config.*) as the
single source of truth: update get_quant_config to consult the actual backbone
type and depth and compute block_range via
build_block_range_quant_cfg(backbone_depth) instead of using
MODEL_DEFAULTS[QWEN_IMAGE]["block_range"], and update quantize_model and
get_model_filter_func to use that same computed mask (remove hard-coded masks in
examples/diffusers/quantization/utils.py) so all three locations
(get_quant_config, quantize_model, get_model_filter_func /
build_block_range_quant_cfg) reference the same backbone-derived value.
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1174-1221: You promote quantizer-owned tensors with
_promote_quantizer_tensors_to_module but only call
_remove_promoted_quantizer_tensors on the success path, so exceptions during
save/postprocess/config update leave the module mutated; wrap the work that
occurs after promotion (the save path using hide_quantizers_from_state_dict +
component.save_pretrained or _save_component_state_dict_safetensors,
_postprocess_safetensors, and the config.json update that uses hf_quant_config)
in a try/finally and call _remove_promoted_quantizer_tensors(component) in the
finally so cleanup always runs; preserve and re-raise any exception after
cleanup to avoid swallowing errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c42d794b-9dd7-41d7-ad4c-25c69901c226
📒 Files selected for processing (18)
examples/diffusers/quantization/models_utils.pyexamples/diffusers/quantization/pipeline_manager.pyexamples/diffusers/quantization/quantize.pyexamples/diffusers/quantization/qwen_image_svdquant/README.mdexamples/diffusers/quantization/qwen_image_svdquant/run_qwen_image_quantization.shexamples/diffusers/quantization/utils.pymodelopt/torch/export/convert_hf_config.pymodelopt/torch/export/diffusers_utils.pymodelopt/torch/export/quant_utils.pymodelopt/torch/export/unified_export_hf.pytests/_test_utils/torch/diffusers_models.pytests/examples/diffusers/conftest.pytests/examples/diffusers/test_export_diffusers_hf_ckpt.pytests/examples/diffusers/test_qwen_block_range_recipe.pytests/examples/diffusers/test_qwen_pipeline_loading.pytests/unit/torch/export/test_convert_hf_config_svdquant.pytests/unit/torch/export/test_diffusers_qwen_export.pytests/unit/torch/quantization/test_svdquant_forward_fold.py
| export_group.add_argument( | ||
| "--sanity-image-path", | ||
| type=str, | ||
| default=None, | ||
| help="If set, generate one image from the in-memory quantized pipeline (after " | ||
| "quantization, before the weights are packed for export) and save it here. This is " | ||
| "a quick functional sanity check of quantized inference; it does NOT reload the " | ||
| "exported checkpoint.", | ||
| ) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Reject --sanity-image-path for non-image pipelines at argument validation time.
This block assumes every supported model returns result.images[0], but the same CLI also supports video pipelines (LTX_*, WAN*). Today those runs will burn a full inference pass and then fail late on the save step instead of being rejected at the interface boundary.
Suggested guard
pipeline_manager.print_quant_summary()
+ if args.sanity_image_path and model_type in {
+ ModelType.LTX_VIDEO_DEV,
+ ModelType.LTX2,
+ ModelType.WAN22_T2V_14b,
+ ModelType.WAN22_T2V_5b,
+ }:
+ parser.error("--sanity-image-path is only supported for image pipelines.")
+
# Optional functional sanity check: generate one image from the in-memory
# quantized pipeline. This runs BEFORE export (while weights are still
# fake-quantized and runnable, not yet packed) and does not reload theAlso applies to: 729-750
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/diffusers/quantization/quantize.py` around lines 581 - 589, The CLI
currently accepts --sanity-image-path unconditionally and assumes generated
outputs have images, causing late failures for video/non-image pipelines; update
argument validation in quantize.py to reject --sanity-image-path early when the
selected pipeline type is not an image pipeline: after parsing args (or inside
the existing validation function / main pipeline selection flow), detect the
pipeline kind via the pipeline ID or class name used for inference (the same
symbol(s) that decide which pipeline to instantiate) and raise an error or exit
if --sanity-image-path is set but the pipeline is not one of the known image
pipelines (e.g., StableDiffusion/Any Image* pipelines); apply the same guard for
the second occurrence of this block noted around the other lines so non-image
pipelines fail at argument validation time rather than after a full run.
| def _detect_svdquant_rank(component: nn.Module) -> int | None: | ||
| """Return the SVDQuant low-rank dimension from the first SVDQuant linear, if any. | ||
|
|
||
| ``svdquant_lora_a`` has shape ``(rank, in_features)``, so its first dimension | ||
| is the low-rank size. | ||
| """ | ||
| for _, sub_module in component.named_modules(): | ||
| weight_quantizer = getattr(sub_module, "weight_quantizer", None) | ||
| lora_a = getattr(weight_quantizer, "svdquant_lora_a", None) | ||
| if lora_a is not None: | ||
| return int(lora_a.shape[0]) | ||
| return None |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Validate SVDQuant rank consistency before writing lora_rank metadata.
On Line 1024, _detect_svdquant_rank() returns the first observed rank. If different quantized modules carry different svdquant_lora_a ranks, Line 1185-1188 will serialize a single incorrect lora_rank, which can misrepresent the exported checkpoint contract.
Proposed fix
def _detect_svdquant_rank(component: nn.Module) -> int | None:
@@
- for _, sub_module in component.named_modules():
+ ranks: set[int] = set()
+ for _, sub_module in component.named_modules():
weight_quantizer = getattr(sub_module, "weight_quantizer", None)
lora_a = getattr(weight_quantizer, "svdquant_lora_a", None)
if lora_a is not None:
- return int(lora_a.shape[0])
- return None
+ ranks.add(int(lora_a.shape[0]))
+ if not ranks:
+ return None
+ if len(ranks) != 1:
+ raise ValueError(f"Inconsistent SVDQuant ranks detected across modules: {sorted(ranks)}")
+ return next(iter(ranks))Also applies to: 1185-1188
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/export/unified_export_hf.py` around lines 1024 - 1035, The
code currently returns the first observed SVDQuant rank via
_detect_svdquant_rank by inspecting weight_quantizer.svdquant_lora_a, which can
hide inconsistencies across modules; update the logic to scan all modules'
weight_quantizer.svdquant_lora_a values, collect all unique ranks, and: if none
found return None, if exactly one unique rank use that, otherwise raise or log
an explicit error and refuse to write a single lora_rank metadata value. Apply
this validation before serializing the lora_rank metadata (where lora_rank is
written) so you never serialize an incorrect single rank when multiple different
ranks exist. Ensure you reference the same attributes (weight_quantizer,
svdquant_lora_a) and the _detect_svdquant_rank helper (or replace it with a
function that returns the set/validates) so callers can act on the validation
result.
Remove the standalone Qwen test files. The fp8/nvfp4/svdquant cases in test_export_diffusers_hf_ckpt.py already cover the block-range recipe (only transformer_blocks 2..57 quantized), the promoted SVDQuant keys + pre_quant_scale, the NVFP4_SVD quantization_config, and the no-leak check -- matching how SDXL/Flux/Wan are tested in the same file. Core SVDQuant forward/fold is unchanged and remains covered by existing upstream tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…motion Covers svdquant calibration -> _promote_quantizer_tensors_to_module -> clean module-level keys (svdquant_lora_a/b, pre_quant_scale) with the quantizers hidden, plus the post-export cleanup. Runs on CPU in <1s (INT8_SMOOTHQUANT + svdquant on a tiny linear stack). The full NVFP4 end-to-end check remains test_qwen_image_hf_ckpt_export[qwen_nvfp4_svdquant]; svdquant calibration is already covered by test_calib.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unit/torch/export/test_export_diffusers.py`:
- Around line 132-137: Move the local imports into the module-level import
section: take the symbols "copy", "torch.nn as nn", "modelopt.torch.quantization
as mtq", and "hide_quantizers_from_state_dict" and add them with the other
top-of-file imports (after the existing imports around line ~32), then remove
the in-function imports currently present in the test body; this ensures the
imports are executed at collection time and preserves the same symbol names used
in the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0a915fc4-6111-479e-a3c9-18d6c9db6bd4
📒 Files selected for processing (1)
tests/unit/torch/export/test_export_diffusers.py
| import copy | ||
|
|
||
| import torch.nn as nn | ||
|
|
||
| import modelopt.torch.quantization as mtq | ||
| from modelopt.torch.export.diffusers_utils import hide_quantizers_from_state_dict |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Move imports to the top of the file.
Per test coding guidelines, imports inside functions or test methods require explicit justification (e.g., circular imports or optional dependencies like TensorRT-LLM/Megatron-Core). None of these imports (copy, torch.nn, modelopt.torch.quantization, hide_quantizers_from_state_dict) are optional dependencies or resolve circular imports. Moving them to the top ensures import errors surface at collection time instead of mid-test.
📦 Suggested fix
Move these imports to the top of the file with the other imports (after line 32):
from modelopt.torch.export.convert_hf_config import convert_hf_quant_config_format
from modelopt.torch.export.diffusers_utils import generate_diffusion_dummy_inputs
from modelopt.torch.export.unified_export_hf import export_hf_checkpoint
+import copy
+import torch.nn as nn
+import modelopt.torch.quantization as mtq
+from modelopt.torch.export.diffusers_utils import hide_quantizers_from_state_dictThen remove the in-function imports (lines 132-137).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/torch/export/test_export_diffusers.py` around lines 132 - 137,
Move the local imports into the module-level import section: take the symbols
"copy", "torch.nn as nn", "modelopt.torch.quantization as mtq", and
"hide_quantizers_from_state_dict" and add them with the other top-of-file
imports (after the existing imports around line ~32), then remove the
in-function imports currently present in the test body; this ensures the imports
are executed at collection time and preserves the same symbol names used in the
test.
Source: Coding guidelines
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1706 +/- ##
==========================================
- Coverage 77.12% 67.73% -9.40%
==========================================
Files 511 511
Lines 56236 56300 +64
==========================================
- Hits 43370 38132 -5238
- Misses 12866 18168 +5302
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
What does this PR do?
Type of change: New feature
Adds Qwen-Image (
Qwen/Qwen-Image,QwenImageTransformer2DModel) to the diffusers quantization example and exports HuggingFace checkpoints in three precisions — FP8, NVFP4, and NVFP4 + SVDQuant — through the unified HF export.--model qwen-image(lazy diffusers import; notrust_remote_code).transformer_blocks, keeping the first 2 / last 2 blocks (and everything outsidetransformer_blocks) in original precision. Applied before calibration so SVDQuant never mutates the excluded blocks. Expressed with the top-levelenableQuantizerCfgEntryfield (disable-all → re-enabletransformer_blocks→ disable first/last-N).weight_quantizer.svdquant_lora_a/b → <module>.svdquant_lora_a/bandinput_quantizer._pre_quant_scale → <module>.pre_quant_scale— with a documentedNVFP4_SVDquantization_config(group_size,has_zero_point: false,pre_quant_scale: true,lora_rank). Core SVDQuant quantization code (modelopt/torch/quantization) is unchanged.onnx_graphsurgeonimport (only needed for--onnx-dir); single-file save for large transformers (the layerwise-metadata post-processing does not support sharded safetensors); andhide_quantizers_from_state_dictnow strips quantizer state from all modules so norm-layer input quantizers no longer leakinput_quantizer._amax.Usage
python examples/diffusers/quantization/quantize.py \ --model qwen-image --override-model-path <Qwen-Image> --model-dtype BFloat16 \ --format fp4 --quant-algo svdquant --lowrank 32 \ --calib-size 64 --n-steps 20 \ --hf-ckpt-dir <out> --sanity-image-path <out>/sanity.png # FP8: --format fp8 --quant-algo max # NVFP4: --format fp4 --quant-algo maxTesting
NVFP4_SVDconfig schema, SVDQuant forward/fold (LoRA stays onweight_quantizer), Qwen dummy-input / strict-QKV-fusion / promotion, pipeline loading, and the diffusers HF-export test for Qwen FP8 / NVFP4 / SVDQuant.tests/examples/diffusers/test_export_diffusers_hf_ckpt.pyis green (SDXL, Flux, Qwen, Wan2.2) — confirms the shared export changes do not regress other models.Qwen/Qwen-Image(~20B): all three formats export valid HF checkpoints — onlytransformer_blocks2..57 quantized, nothing outside, no quantizer/_amaxleak, correctweight_scale(_2)/input_scale, promoted SVDQuant keys (rank-consistent shapes), and the expectedquantization_config— plus a quantized-inference sanity image.Before your PR is "Ready for review"
CONTRIBUTING.md: N/AAdditional Information
All changes are confined to the diffusers example (
examples/diffusers/quantization) plus the shared export path (modelopt/torch/export); the core quantization library is untouched.Follow-up (next step): fused-QKV SVDQuant for sglang / Nunchaku
This export keeps attention
q/k/v(andadd_q/k/v_proj) as separate projections — the diffusers-native layout. That matches sglang's bf16 / FP8 / plain-NVFP4 paths (which also keep QKV separate) and ModelOpt/TRT-LLM consumers, so those load 1:1.sglang's NVFP4-SVDQuant (Nunchaku) path, however, builds a fused
to_qkvwith a single fused rank-r LoRA in Nunchaku-native format (proj_down/proj_up,smooth_factor,wscales/wtscale). Our per-projection tensors (svdquant_lora_a/b+pre_quant_scale; three independent rank-r decompositions) are not directly loadable there — and cannot be fused at load time, because the fp16 weight residual needed to derive a single fused rank-r is not preserved after export.Planned next step: an opt-in fused-QKV SVDQuant export mode that fuses q/k/v before SVDQuant calibration (yielding one rank-r over the fused weight) and emits a Nunchaku-compatible layout, enabling lower-latency fused-QKV inference in sglang. Tracked as a separate follow-up.