Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support] by juhi10071998 · Pull Request #1707 · NVIDIA/Model-Optimizer

juhi10071998 · 2026-06-13T01:04:03Z

What does this PR do?

Type of change: new feature

Adds end-to-end PTQ + HF-checkpoint export support for block-diffusion encoder-decoder LLMs (e.g. DiffusionGemma) whose encoder/decoder stacks share parameters via HF _tied_weights_keys. Six source commits + one test commit + one CHANGELOG entry — purely additive for existing modelopt users (non-tied models see no behavioral change).

Source commits:

1. Onboard DiffusionGemma to hf_ptq.py: substring-list additions in model_type_is_enc_dec and MODEL_NAME_TO_TYPE (so calibration routes through .generate()), and a .sequences unwrap in the preview decode for .generate() calls that return a ModelOutput.
2. MoE experts dedup in _export_fused_experts: when two fused-expert modules share their 3-D source params, alias the per-expert packed weight + scales on cache hit so the downstream postprocess_state_dict dedup catches them. ~42% storage reduction on nvfp4_experts_only for tied 26B MoE checkpoints.
3. Dense Linear dedup in _export_quantized_weight: symmetric to (2) for dense Linears; no-op for nvfp4_experts_only (dense Linears early-return at QUANTIZATION_NONE).
4. Opt-in canonical-side reorder (--canonical_tied_naming, default off): partitions the state_dict so canonical-side tied keys iterate before alias-side keys, letting first-wins dedup keep canonical names.
5. sync_tied_input_amax: max-merges per-side input_quantizer.amax across tied modules BEFORE export, so single-backbone consumers (vLLM) that load one input_scale per parameter don't clip on either side. Extends the alias loops from commits 2 and 3 to include input_scale.
6. Default *self_conditioning* exclude: adds the diffusion-model self-conditioning wildcard to default_disabled_quantizers.yaml. Companion to PR #1691 (Exclude multimodal vision branch from quantization by default; NVBug 6293731, 6293762), which already added the vision-module excludes.
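
The sync_tied_input_amax entry above can be illustrated with a minimal pure-Python sketch. It is not the ModelOpt implementation: FakeQuantizer, FakeLinear, and the integer weight_ptr are hypothetical stand-ins for TensorQuantizer, nn.Linear, and weight.data_ptr(), and plain floats replace scalar amax tensors.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeQuantizer:
    """Hypothetical stand-in for an input quantizer with a scalar amax."""
    amax: float


@dataclass
class FakeLinear:
    """Hypothetical stand-in for a Linear; weight_ptr mimics weight.data_ptr()."""
    weight_ptr: int
    input_quantizer: FakeQuantizer


def sync_tied_input_amax_sketch(modules):
    """Max-merge input amax across modules that share the same weight storage."""
    groups = defaultdict(list)
    for module in modules.values():
        groups[module.weight_ptr].append(module.input_quantizer)
    for quantizers in groups.values():
        if len(quantizers) < 2:
            continue  # untied module: nothing to merge
        merged = max(q.amax for q in quantizers)
        for q in quantizers:
            q.amax = merged


modules = {
    "encoder.q_proj": FakeLinear(100, FakeQuantizer(2.0)),  # tied with decoder.q_proj
    "decoder.q_proj": FakeLinear(100, FakeQuantizer(5.0)),
    "decoder.o_proj": FakeLinear(200, FakeQuantizer(3.0)),  # untied
}
sync_tied_input_amax_sketch(modules)
amaxes = {name: m.input_quantizer.amax for name, m in modules.items()}
```

Taking the maximum is the conservative choice here: a scale derived from the larger amax covers the activation range of both sides, at the cost of some resolution on the side with the smaller range.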

Test commit:

New test fixture (tests/_test_utils/torch/quantization/tied_modules.py) with three factories (make_tied_linear_pair, tie_fused_experts_3d_params, wrap_in_parent_with_tied_keys) + 10 unit tests covering commits 2–5 across tests/unit/torch/export/test_unified_export_hf.py (new) and tests/unit/torch/quantization/plugins/test_fused_experts.py (extended). Pure-Python, CPU-only, ~1s wall total.

Docs commit:

CHANGELOG entry under 0.46 New Features.

Usage

# Standard usage — no API change for non-tied models. Existing recipes still work:
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <ckpt-dir> \
    --qformat nvfp4_experts_only \
    --calib_size 32 \
    --trust_remote_code \
    --export_path <export-dir>

# For tied-weight models (e.g. DiffusionGemma), opt in to canonical-side
# naming by appending the flag below to the command above, so the exported
# state_dict uses the canonical (e.g. decoder-side) names per HF's
# _tied_weights_keys declaration:
    --canonical_tied_naming true

Testing

10 new unit tests covering commits 2–5. Pure-Python, CPU-only, ~1s wall total.
Full local unit suite: 2588 passed, 17 intentional skips. 9 pre-existing failures observed in unrelated test files (PEFT/LoRA, ONNX, speculative-decoding) — verified pre-existing via static analysis (none import any file modified by this PR).
End-to-end PTQ export validated on DiffusionGemma v10 via nvfp4_experts_only recipe + --canonical_tied_naming true — produces a 2-shard, ~18 GB safetensors checkpoint with decoder-canonical naming.

Before your PR is "Ready for review"

Is this change backward compatible?: ✅ — --canonical_tied_naming is opt-in default off; the dedup/alias logic is cache-based on data_ptr() and no-ops for non-tied models; sync_tied_input_amax no-ops when no two modules share a weight data_ptr; the *self_conditioning* wildcard is a no-op for models without matching module names.
If you copied code from any other sources or added a new PIP dependency: N/A
Did you write any new necessary tests?: ✅ 10 unit tests across two files; fixture helpers shared via _test_utils
Did you update CHANGELOG?: ✅ — new bullet under 0.46 → New Features
Did you get Claude approval on this PR?: pending /claude review

Additional Information

Validated locally against DiffusionGemma v10 (DiffusionGemmaForBlockDiffusion, model_type: diffusion_gemma). Substring patterns are also chosen to match the older DiffusionGemma4ModelForBlockDiffusion / diffusion_gemma4 spelling, so the same code path works on earlier and current model versions.

Summary by CodeRabbit

New Features
- Added support for tied-weight post-training quantization and Hugging Face checkpoint export for encoder-decoder language models
- Implemented buffer aliasing for quantized weight exports to enable downstream state-dict deduplication
- Added optional canonical tied-weight naming flag to align export behavior with Hugging Face standards
- Introduced new default_disabled_quantizers wildcard for self-conditioning model support
Improvements
- Enhanced generation decoding for encoder-decoder models to properly handle model outputs
Tests
- Added comprehensive test coverage for tied-weight quantization and export behavior

…ptq.py

DiffusionGemmaForBlockDiffusion is an encoder-decoder block-diffusion text LLM (Gemma4 MoE per-layer transformer wrapped in an encoder + iterative-decoder + self-conditioning + 48-step denoising loop). Four small additions make stock examples/llm_ptq/hf_ptq.py work end-to-end for it; no new entry points needed.

Substring patterns are chosen so they ALSO match the previous class name DiffusionGemma4ModelForBlockDiffusion (and previous model_type "diffusion_gemma4"). The model was renamed in transformers mid-development; using "diffusiongemma" / "DiffusionGemma" / "diffusion_gemma" matches both the current and the older class so we don't silently regress on either checkpoint generation.

1. modelopt/torch/utils/dataset_utils.py
   Add "diffusiongemma" to the substring list in model_type_is_enc_dec. This routes ModelOpt's built-in calibration forward_loop to model.generate() instead of a single model.forward(), so calibration exercises the full 48-step inner denoising loop and sees the entire noise->clean activation distribution.

2. modelopt/torch/export/model_utils.py
   Add "DiffusionGemma": "diffusion_gemma" to MODEL_NAME_TO_TYPE, *before* "Gemma". get_model_type does substring matching; "gemma" is a substring of "diffusiongemma" so without this entry the class is silently mis-classified as plain "gemma" -- which then mis-routes downstream model-type-dependent logic.

3. examples/llm_ptq/hf_ptq.py (output_decode)
   DiffusionGemma.generate() returns a DiffusionGemmaGenerationOutput ModelOutput dataclass, not a bare token tensor. The preview decode code's tensor-slicing crashes on ModelOutput. Unwrap to .sequences at the top of output_decode so both the enc-dec and AR slicing branches work uniformly. Generic shim -- helps any model whose .generate returns a ModelOutput with a .sequences attribute.

4. examples/llm_ptq/example_utils.py (is_enc_dec)
   Comment-only clarification on the semantics. is_enc_dec controls how hf_ptq.py:output_decode slices the preview result (whether to strip the prompt prefix). For T5/BART/Whisper generate returns only new tokens. For DiffusionGemma it returns prompt+canvas concatenated, so it belongs with the AR slicing path and stays out of this list. No behavioral change; documents the prior intent so future contributors don't re-add diffusion_gemma here.

End-to-end smoke test:

    python hf_ptq.py --pyt_ckpt_path <local northbloom checkpoint> \
        --qformat nvfp4_experts_only \
        --calib_size 32 --trust_remote_code \
        --export_path <somewhere>

produces a working NVFP4 checkpoint with coherent pre/post-PTQ preview.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
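
The ordering concern in MODEL_NAME_TO_TYPE can be sketched as follows. This assumes get_model_type returns the first entry whose key is a case-insensitive substring of the class name; the commit message only states that it does substring matching, so the helper and the two-entry tables below are hypothetical illustrations, not the ModelOpt code.

```python
def get_model_type_sketch(class_name, name_to_type):
    """Return the type of the first key that is a substring of class_name."""
    lowered = class_name.lower()
    for key, model_type in name_to_type.items():  # dicts iterate in insertion order
        if key.lower() in lowered:
            return model_type
    return None


# Hypothetical two-entry tables; the real mapping has many more entries.
WRONG_ORDER = {"Gemma": "gemma", "DiffusionGemma": "diffusion_gemma"}
RIGHT_ORDER = {"DiffusionGemma": "diffusion_gemma", "Gemma": "gemma"}

current_name = "DiffusionGemmaForBlockDiffusion"
older_name = "DiffusionGemma4ModelForBlockDiffusion"

wrong = get_model_type_sketch(current_name, WRONG_ORDER)          # "gemma"
right_current = get_model_type_sketch(current_name, RIGHT_ORDER)  # "diffusion_gemma"
right_older = get_model_type_sketch(older_name, RIGHT_ORDER)      # "diffusion_gemma"
```

Under that first-match assumption, the more specific key has to come first, and the same key covers both the current and the older class name.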

When a model has multiple fused-experts modules whose 3-D source params share storage via HF _tied_weights_keys (e.g. the encoder and decoder transformer stacks of a block-diffusion encoder-decoder LLM like DiffusionGemma4 / northbloom), the unpacking loop in _export_fused_experts ordinarily creates fresh per-expert tensors for each call -- destroying the tied identity and writing two full sets of expert weights + scales to disk.

This adds a function-local cache keyed by (gate_up_proj.data_ptr(), down_proj.data_ptr()). On a cache miss the existing unpacking path runs unchanged. On a cache hit, after the normal unpacking completes, the per-expert weight / weight_scale / weight_scale_2 buffers are re-pointed at the prior module's tensors so they share storage. The downstream postprocess_state_dict data_ptr()-based dedup then catches them and drops the duplicates from the saved checkpoint.

input_scale is intentionally NOT aliased: encoder and decoder paths have legitimately different activation distributions (verified across all 60 tied pairs of a 512-prompt calibration -- down_proj_input divergence median 2.26x, max 18.4x), so each side keeps its own per-side calibrated scale.

Model-agnostic: no name regex, no model-type lookup. Cache miss falls through to existing behavior, so non-tied models are unaffected.

Empirical result on DiffusionGemma4 26B at nvfp4_experts_only:
- safetensors: 28.43 GB -> 16.47 GB (-42%)
- shards: 4 -> 2
- decoder weight/weight_scale/weight_scale_2 entries: 3840 -> 0 each
- decoder input_scale entries: 3840 -> 3840 (kept per-side, as intended)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
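
A minimal sketch of the cache-hit aliasing described in this commit, written without PyTorch so it runs anywhere. FakeTensor, FakeFusedExperts, and the flattened buffer names are hypothetical stand-ins; the cache is passed in explicitly for clarity, whereas the commit describes a function-local cache.

```python
import itertools

_ptr_counter = itertools.count(1000)


class FakeTensor:
    """Hypothetical stand-in for a tensor; data_ptr() returns a storage id."""

    def __init__(self):
        self._ptr = next(_ptr_counter)

    def data_ptr(self):
        return self._ptr


class FakeFusedExperts:
    """Hypothetical stand-in for a fused-experts module with 3-D source params."""

    def __init__(self, gate_up_proj, down_proj, num_experts):
        self.gate_up_proj = gate_up_proj
        self.down_proj = down_proj
        self.num_experts = num_experts
        self.unpacked = {}


def export_fused_experts_sketch(module, cache):
    # Capture the identity of the source params before unpacking.
    key = (module.gate_up_proj.data_ptr(), module.down_proj.data_ptr())
    # Normal path: unpacking always creates fresh per-expert buffers.
    module.unpacked = {
        f"experts.{i}.{buf}": FakeTensor()
        for i in range(module.num_experts)
        for buf in ("weight", "weight_scale", "weight_scale_2")
    }
    prior = cache.get(key)
    if prior is None:
        cache[key] = module  # cache miss: this module owns the buffers
    else:
        # Cache hit: re-point at the prior module's buffers to share storage.
        module.unpacked = dict(prior.unpacked)


def buffer_ptrs(module):
    return {name: t.data_ptr() for name, t in module.unpacked.items()}


shared_gate_up, shared_down = FakeTensor(), FakeTensor()
encoder = FakeFusedExperts(shared_gate_up, shared_down, num_experts=2)
decoder = FakeFusedExperts(shared_gate_up, shared_down, num_experts=2)
untied = FakeFusedExperts(FakeTensor(), FakeTensor(), num_experts=2)

cache = {}
for m in (encoder, decoder, untied):
    export_fused_experts_sketch(m, cache)

tied_share = buffer_ptrs(encoder) == buffer_ptrs(decoder)
untied_independent = set(buffer_ptrs(untied).values()).isdisjoint(
    buffer_ptrs(encoder).values()
)
```

Because aliasing only happens when the source params already shared storage, an untied module always takes the cache-miss branch and keeps its own buffers.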

Symmetric companion to the data_ptr-cache alias added to _export_fused_experts in the prior commit. Catches plain nn.Linear modules whose .weight Parameters are tied via HF tie_weights() (e.g. encoder and decoder attention QKV/O, MLP gate/up/down, router proj of an encoder-decoder LLM like DiffusionGemma4 / northbloom) when they are quantized under recipes that route through _export_quantized_weight.

Mechanism: capture weight.data_ptr() at the top of the function, before the setattr further down wraps the packed bytes in a fresh nn.Parameter (which destroys the tie). At the end of the function, consult a function-local cache keyed by that captured data_ptr.
- On cache miss: register this sub_module as the canonical owner.
- On cache hit (a previously-processed module shared the same source weight memory): alias .weight, weight_scale, weight_scale_2 to the prior module's tensors so downstream data_ptr-based dedup in postprocess_state_dict drops the duplicates.

input_scale is intentionally NOT aliased -- calibration legitimately diverges per-side (verified in Q2 analysis: down_proj_input ratio up to 18x across 60 tied pairs on the northbloom DiffusionGemma4 model).

Recipe-agnostic: under nvfp4_experts_only this is a true no-op (dense Linears early-return at QUANTIZATION_NONE; per-expert wrappers reach this function but have fresh data_ptrs from upstream slice+contiguous, so the cache always misses). Under full nvfp4 it fires for every tied dense Linear pair.

Same safety guarantees as the existing data_ptr-based dedup: cannot false-positive because it only aliases when source memory was already shared by the model author (via tie_weights or equivalent). No name regex, no _tied_weights_keys lookup, no model introspection.

Empirical results on DiffusionGemma4 26B (calib_size=8 smoke):
- nvfp4_experts_only: 16.47 GB -> 16.47 GB (byte-identical with prior experts-only export; this patch is a no-op as expected)
- nvfp4 (full): ~27 GB est -> 14.24 GB (-12.7 GB; both patches firing on disjoint module sets; per-name dedup verified in index)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
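
The "capture before re-wrap" step is the subtle part of this mechanism, so here is a small sketch of it. It uses plain Python objects: FakeLinear is a hypothetical stand-in for nn.Linear, and id() stands in for weight.data_ptr().

```python
class FakeLinear:
    """Hypothetical stand-in for nn.Linear; id(weight) mimics weight.data_ptr()."""

    def __init__(self, weight):
        self.weight = weight
        self.weight_scale = None


def export_quantized_weight_sketch(module, cache):
    # Capture the source identity BEFORE the weight attribute is replaced.
    source_key = id(module.weight)
    # Packing stores a fresh object on the module, which breaks the tie.
    module.weight = bytearray(b"packed")
    module.weight_scale = [0.5]
    owner = cache.get(source_key)
    if owner is None:
        cache[source_key] = module  # cache miss: canonical owner
    else:
        # Cache hit: alias the packed buffers to the owner's objects.
        module.weight = owner.weight
        module.weight_scale = owner.weight_scale


shared_weight = object()  # tied source weight, kept alive for the whole run
other_weight = object()   # untied source weight
enc = FakeLinear(shared_weight)
dec = FakeLinear(shared_weight)
other = FakeLinear(other_weight)

cache = {}
for m in (enc, dec, other):
    export_quantized_weight_sketch(m, cache)

tied_aliased = enc.weight is dec.weight and enc.weight_scale is dec.weight_scale
untied_separate = other.weight is not enc.weight
```

Had source_key been computed after the packing step, enc and dec would each hold a distinct fresh object and the cache would never hit.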

Adds an opt-in pass (--canonical_tied_naming, default off) that reorders the state_dict before postprocess_state_dict so that keys matching the canonical side of HF's _tied_weights_keys declaration iterate before their aliases. The existing first-wins data_ptr dedup at quant_utils.py:1148-1163 then drops the alias names, leaving the canonical names in the exported safetensors.

Motivation: for models like DiffusionGemma4 (northbloom), HF declares {alias: canonical} via _tied_weights_keys, where the encoder side is the alias and the decoder side is canonical. The original HF safetensors index uses decoder-prefixed names for all tied weights (661/691 keys vs 30/691 encoder-only layer_scalar keys). The single-backbone vLLM mockup loader strips both prefixes to model.* and relies on this canonical naming.

The default modelopt export today walks the model in registration order (encoder before decoder, per the model's __init__ order), so encoder names win in the first-wins dedup. The exported checkpoint thus uses 46 677 encoder-prefixed keys for tied tensors -- backwards relative to the upstream HF naming and to what downstream consumers expect.

Implementation. _tied_weights_keys is declared per model class with paths relative to that class. In nested models (e.g. DiffusionGemma4) multiple submodules declare their own ties: the outer wrapper at DiffusionGemma4ModelForBlockDiffusion ties lm_head.weight to model.decoder.embed_tokens.weight, while the inner DiffusionGemma4Model at model.model declares the much larger encoder<->decoder dict with paths relative to itself.

_collect_canonical_tied_patterns walks model.named_modules() and collects every dict-style _tied_weights_keys declaration, prefixing each pattern with the submodule's qualified path so the regexes match against root-level state_dict keys. Without the prefix, the inner dict's patterns (which lack a "model." prefix) silently fail to match keys like "model.decoder.layers.0.self_attn.q_proj.weight" -- a bug that would cause only the outer dict's single entry (embed_tokens) to flip in the dedup.

_reorder_canonical_first then partitions the state_dict into head (canonical-pattern matches) and tail (everything else), preserving original order within each partition. head.update(tail) yields a single dict with canonical keys first. The downstream dedup loop iterates this in insertion order and records the canonical names in seen_tensors; alias names then arrive as duplicates and are dropped.

Scope of behavior change:
- Models with no _tied_weights_keys, or only legacy list-of-strings declarations: _collect returns an empty pattern list, helper short-circuits, state_dict returned unchanged. Zero effect on any existing modelopt user.
- Models with dict-style declarations (e.g. DiffusionGemma4): when the flag is set, canonical-side names win. When the flag is unset (default), behavior is identical to before this commit.

No changes to dedup logic itself, to the existing tied-weight alias patches in _export_quantized_weight and _export_fused_experts, or to _process_quantized_modules iteration. Strictly additive.

Verified on DiffusionGemma4 26B / nvfp4_experts_only / v4 / calib_size 32: 35 127 encoder keys removed (decoder kept), 0 decoder keys removed; layer_scalar (30 encoder-only keys) and per-side input_scale (11 520 keys, intentionally not deduped) unaffected; total safetensors bytes 17.68 GB matching the prior export to within 500 KB of safetensors-metadata-ordering noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
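
The interaction between the reorder pass and first-wins dedup can be sketched in a few lines. The helper names below are hypothetical, id() stands in for data_ptr(), and the pattern-qualification rule (escape the submodule path and anchor at the start) is an assumption about how the prefixing might work rather than the actual implementation.

```python
import re


def qualify(submodule_path, relative_pattern):
    """Turn a submodule-relative regex into one that matches root-level keys."""
    prefix = re.escape(submodule_path + ".") if submodule_path else ""
    return "^" + prefix + relative_pattern


def reorder_canonical_first_sketch(state_dict, canonical_patterns):
    """Move canonical matches to the front, preserving order within each part."""
    compiled = [re.compile(p) for p in canonical_patterns]
    head, tail = {}, {}
    for key, value in state_dict.items():
        target = head if any(c.search(key) for c in compiled) else tail
        target[key] = value
    head.update(tail)
    return head


def first_wins_dedup_sketch(state_dict):
    """Keep only the first key seen for each underlying object."""
    seen, kept = set(), {}
    for key, value in state_dict.items():
        if id(value) in seen:
            continue  # later name for an already-seen tensor: dropped
        seen.add(id(value))
        kept[key] = value
    return kept


shared_q = object()      # tied between encoder and decoder
layer_scalar = object()  # encoder-only
state_dict = {
    "model.encoder.layers.0.self_attn.q_proj.weight": shared_q,
    "model.encoder.layers.0.layer_scalar": layer_scalar,
    "model.decoder.layers.0.self_attn.q_proj.weight": shared_q,
}

qualified = [qualify("model", r"decoder\..*")]
unqualified = [qualify("", r"decoder\..*")]

default_keys = list(first_wins_dedup_sketch(state_dict))
canonical_keys = list(
    first_wins_dedup_sketch(reorder_canonical_first_sketch(state_dict, qualified))
)
unqualified_keys = list(
    first_wins_dedup_sketch(reorder_canonical_first_sketch(state_dict, unqualified))
)
```

In this toy state_dict, the default order keeps the encoder name for the tied weight, the qualified pattern flips it to the decoder name, and the unqualified pattern matches nothing, so the result is the same as the default.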

Companion piece to the existing tied-weight alias patches in _export_fused_experts (commit 8b00e85) and _export_quantized_weight (commit e8c36b0), which already alias bit-identical weight / weight_scale / weight_scale_2 between tied modules but leave input_scale per-side. This commit closes the loop on input_scale so consumers that load a single canonical scale per Linear (e.g. vLLM's single-backbone DiffusionGemma4 mockup) see a value consistent across all tied sides.

Implementation has two parts.

1. New sync_tied_input_amax(model) helper. Walks named_modules(), groups by source weight data_ptr (same signature our existing dedup patches use), and max-merges input_quantizer.amax across each group. Uses the canonical 4-line idiom shared with preprocess_linear_fusion (quant_utils.py:1394-1401) and sync_moe_gate_up_amax (layer_utils.py:1197):

       merged = torch.max(torch.stack([q.amax for q in qs]))
       for q in qs:
           q.amax = merged.clone()

   Handles both dense Linears (keyed by weight.data_ptr) and fused MoE modules (keyed by (gate_up_proj, down_proj) data_ptr tuple, merging gate_up_proj_input_quantizer and down_proj_input_quantizer independently across the group). Scalar-only, matching preprocess_linear_fusion's contract.

   Called unconditionally from _export_transformers_checkpoint after sync_moe_gate_up_amax, BEFORE _process_quantized_modules so the merged amax flows into _export_quantized_weight's input_scale derivation. Mirrors sync_moe_gate_up_amax's "no-flag, fires when applicable, no-op otherwise" convention.

2. Extend the existing tied-weight alias loops in _export_quantized_weight and _export_fused_experts to include input_scale alongside weight_scale / weight_scale_2. Before this commit those loops intentionally skipped input_scale because encoder/decoder amaxes legitimately differed (Q2 analysis showed up to 18x divergence for down_proj_input on v1). With sync_tied_input_amax in place, both sides now derive bit-identical input_scale values; aliasing the buffers is safe and lets the existing data_ptr dedup in postprocess_state_dict collapse them so only one canonical entry per Linear survives in the exported safetensors.

Also extends the Q-B canonical-side reorder pass added in commit 837768f with an auto-derived side-substring matcher. HF's _tied_weights_keys regex patterns target the pre-export module structure (fused gate_up_proj), but after _export_fused_experts unpacks them into per-expert gate_proj/up_proj/down_proj submodules, post-export keys like ...experts.Y.gate_proj.input_scale are not covered by HF's regex. Without the substring fallback, those keys fell through Q-B to the "alias-first" partition, so when the new input_scale alias step shared data_ptrs, the encoder name won the dedup instead of the decoder name.

_collect_canonical_tied_patterns now returns (patterns, side_substrings). The side_substrings list is auto-derived from each _tied_weights_keys entry as the set of dot-separated tokens that appear in canonical patterns but not in alias patterns. For DiffusionGemma4 this resolves to ["decoder"]: every canonical pattern contains "decoder", no alias pattern does. _reorder_canonical_first treats a key as canonical if it matches a regex pattern OR contains a side substring as a proper path component (bordered by "." or at start/end). The path-component requirement avoids false positives from accidental name collisions.

Net effect for DiffusionGemma4 nvfp4_experts_only / v4 / calib_size 32: the 11 520 encoder.X.gate_proj/up_proj/down_proj.input_scale entries that the prior export carried are removed; the 11 520 decoder-side entries remain with the merged amax-derived value. Total bytes drops by ~1 MB (scalar entries). Other tied-tensor entries (weight, weight_scale, weight_scale_2) and encoder-only entries (layer_scalar, 30 keys) are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
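
The side-substring fallback described in this commit might look roughly like the sketch below. It covers only the substring half of the matcher (the regex half is omitted), uses plain dotted names instead of real regex patterns for the {alias: canonical} entries, and all helper and key names are hypothetical.

```python
def derive_side_substrings(tied_keys):
    """Collect dot-separated tokens present in canonical names but not in aliases.

    tied_keys maps alias -> canonical, following the dict-style declaration
    described in the commit message.
    """
    side = set()
    for alias, canonical in tied_keys.items():
        alias_tokens = set(alias.split("."))
        side.update(tok for tok in canonical.split(".") if tok not in alias_tokens)
    return sorted(side)


def is_canonical_key(key, side_substrings):
    """True if any side substring appears as a whole path component of key."""
    components = key.split(".")
    return any(side in components for side in side_substrings)


# Hypothetical declaration with a single tied pair.
tied_keys = {"encoder.layers.0.q_proj.weight": "decoder.layers.0.q_proj.weight"}
side_substrings = derive_side_substrings(tied_keys)

post_export_key = "model.decoder.layers.0.experts.3.gate_proj.input_scale"
alias_key = "model.encoder.layers.0.experts.3.gate_proj.input_scale"
lookalike_key = "model.decoder_norm.weight"  # hypothetical name collision

matches = {
    key: is_canonical_key(key, side_substrings)
    for key in (post_export_key, alias_key, lookalike_key)
}
```

Matching on whole path components rather than raw substrings is what keeps a name such as decoder_norm from being treated as decoder-side.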

The diffusion self-conditioning network (block-diffusion models like DiffusionGemma) is text-only and not exercised by typical calibration data. Without exclusion its TensorQuantizers never see input, never set _amax, and export crashes at _export_quantized_weight:

    AttributeError: 'TensorQuantizer' object has no attribute '_amax'

Companion to the upstream vision-tower / visual / embed_vision excludes already in this unit (PR NVIDIA#1691). Pattern is a no-op for non-diffusion models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
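
To show why the wildcard is a no-op for models without matching module names, here is a small sketch using glob-style matching from the standard library. It assumes the pattern is matched against full module or quantizer names in glob fashion; how ModelOpt actually applies the entries in default_disabled_quantizers.yaml is not shown in this PR, and the module names below are hypothetical.

```python
from fnmatch import fnmatchcase

# The wildcard added by this commit; the surrounding list is illustrative only.
DISABLED_PATTERNS = ["*self_conditioning*"]


def is_quantizer_disabled(name, patterns=DISABLED_PATTERNS):
    """Return True if the name matches any disabled-quantizer wildcard."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Hypothetical quantizer names for illustration.
names = [
    "model.self_conditioning.proj.input_quantizer",
    "model.decoder.layers.0.mlp.down_proj.input_quantizer",
]
disabled = {name: is_quantizer_disabled(name) for name in names}
```

Only the first name contains self_conditioning, so a model without such a module is left untouched by the new pattern.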

Adds the tied-modules test fixture plus 10 unit tests covering the tied-weight machinery introduced earlier in this series.

tests/_test_utils/torch/quantization/tied_modules.py (new):
Three small factory helpers shared by the unit tests:
- make_tied_linear_pair() -- two nn.Linears whose .weight Parameter is shared via setattr (mimics HF tie_weights() after __init__).
- tie_fused_experts_3d_params(enc, dec) -- in-place tie of gate_up_proj / down_proj between two fused-experts modules (paired with the existing _SyntheticFusedExperts fixture).
- wrap_in_parent_with_tied_keys(enc, dec, ...) -- builds a parent nn.Module with HF-style _tied_weights_keys (dict-style for the canonical case, list-style for the legacy negative case).
Each factory asserts post-conditions on the tie so a misuse fails loudly at construction.

tests/unit/torch/export/test_unified_export_hf.py (new): 8 tests
Commit f3e9543ab -- canonical-side reorder:
- dict-style _tied_weights_keys yields patterns + canonical substrings
- list-style yields no canonical info (reorder becomes a no-op)
- _reorder_canonical_first puts decoder-side keys ahead of encoder-side keys
Commit 3fb3ba053 -- sync_tied_input_amax:
- tied Linears with divergent amaxes (2.0 vs 5.0) get both sides overwritten with the elementwise max (5.0)
- untied Linears keep per-side amaxes (no-op when there's no tie)
Commit 29674a7e1 -- dense Linear tied-weight dedup:
- tied Linears share data_ptr for packed .weight + scale buffers
- untied Linears keep independent data_ptrs
- asymmetric quant: unquantized side early-returns at QUANTIZATION_NONE, stays at the original shared Parameter

tests/unit/torch/quantization/plugins/test_fused_experts.py (extended): 2 tests
Commit 10a8fdbd5 -- MoE experts dedup:
- two _SyntheticSparseMoeBlock instances with tied 3-D source params share data_ptr across every per-expert buffer
- untied counterparts keep independent per-expert data_ptrs

Pure-Python; CPU-only; ~1s wall total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>

Adds a New Features bullet under 0.46 covering the tied-weight dedup, canonical-side reorder, sync_tied_input_amax helper, and the *self_conditioning* default exclude introduced earlier in this series. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

coderabbitai · 2026-06-13T01:04:17Z

📝 Walkthrough

This PR adds post-training quantization (PTQ) and HuggingFace checkpoint export support for tied-weight encoder-decoder diffusion models. It recognizes DiffusionGemma, introduces optional canonical-side tied-weight naming, implements tensor aliasing for quantized weight deduplication, synchronizes tied input quantizer scales, and provides comprehensive test coverage.

Changes

Tied-weight PTQ and HF export for encoder-decoder models

Layer / File(s)	Summary
DiffusionGemma model type recognition `modelopt/torch/export/model_utils.py`, `modelopt/torch/utils/dataset_utils.py`, `examples/llm_ptq/example_utils.py`	DiffusionGemma is added to model-type mappings and encoder-decoder model lists with substring-matching order comments to avoid false matches.
CLI integration and generation output handling `examples/llm_ptq/hf_ptq.py`	Adds `--canonical_tied_naming` CLI flag to opt into HF canonical-side tied-weight ordering; unwraps `ModelOutput` objects from `.generate()` so diffusion preview decoding operates on token tensors consistently.
Test utilities for tied-weight scenarios `tests/_test_utils/torch/quantization/tied_modules.py`	Introduces factory functions to construct tied linear pairs, fused-expert tied parameters, and parent modules with HF-style `_tied_weights_keys` declarations for repeatable test setup.
Core tied-weight quantized export logic `modelopt/torch/export/unified_export_hf.py`	Implements tied-weight deduplication via source tensor tracking and cached parameter aliasing; adds canonical tied-pattern detection and optional state-dict reordering; introduces `sync_tied_input_amax()` to max-merge input quantizer scales across tied modules.
Fused experts tied-weight export deduplication `modelopt/torch/export/moe_utils.py`	Extends `_export_fused_experts` to capture source fused-expert identities before unpacking and alias per-expert projection weights/scales across tied fused-expert modules via module-level cache.
Test coverage for tied-weight unified export `tests/unit/torch/export/test_unified_export_hf.py`	Verifies canonical pattern detection, state-dict reordering, input amax synchronization, and quantized-weight aliasing for tied and untied scenarios.
Test coverage for fused experts tied deduplication `tests/unit/torch/quantization/plugins/test_fused_experts.py`	Validates that tied fused-expert per-expert projections and scales share storage across modules while untied experts remain independent.
Configuration and documentation updates `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml`, `CHANGELOG.rst`	Adds `self_conditioning` disabled quantizer rule for diffusion models and documents all tied-weight features.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1327: Both PRs modify the HF quantized-weight export logic in modelopt/torch/export/unified_export_hf.py—main PR adds tied-weight canonical/aliasing dedup in _export_quantized_weight, while retrieved PR extends _export_quantized_weight for nvfp4_w4a16 and embedding export eligibility—so they're code-level related via overlapping changes in the same function.

Suggested reviewers

realAsma
meenchen
sugunav14

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 79.07% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately reflects the main change: adding tied-weight PTQ export support for encoder-decoder block-diffusion models like DiffusionGemma, which is the primary focus of all file changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	PR diff has no CRITICAL issues: no torch.load(weights_only=False), numpy.load(allow_pickle=True), trust_remote_code=True, eval/exec, or '# nosec'.

codecov · 2026-06-13T01:12:59Z

Codecov Report

❌ Patch coverage is 86.60714% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.27%. Comparing base (2640551) to head (0543907).

Files with missing lines	Patch %	Lines
modelopt/torch/export/unified_export_hf.py	87.20%	11 Missing ⚠️
modelopt/torch/export/moe_utils.py	84.00%	4 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (2640551) and HEAD (0543907): HEAD has 2 uploads less than BASE.

Flag	BASE (2640551)	HEAD (0543907)
unit	2	1
gpu	4	3

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1707      +/-   ##
==========================================
- Coverage   76.41%   68.27%   -8.14%     
==========================================
  Files         511      511              
  Lines       56236    56346     +110     
==========================================
- Hits        42970    38473    -4497     
- Misses      13266    17873    +4607

Flag	Coverage Δ
examples	`41.82% <33.03%> (+2.35%)`	⬆️
gpu	`31.58% <2.67%> (-26.80%)`	⬇️
regression	`14.67% <2.67%> (+0.04%)`	⬆️
unit	`54.49% <80.35%> (+0.09%)`	⬆️

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 3

🧹 Nitpick comments (1)

tests/unit/torch/quantization/plugins/test_fused_experts.py (1)

561-568: 💤 Low value

Move import to top of file or add justification comment.

The import at line 566 is inside a function without a comment explaining why. Per CONTRIBUTING.md, imports belong at the top so errors surface at collection time. _export_quantized_weight is from a core modelopt module with no circular-import or optional-dependency concern.

♻️ Suggested fix

Add the import near the other modelopt.torch.export imports at the top of the file:

 from modelopt.torch.export.moe_utils import _export_fused_experts
 from modelopt.torch.export.quant_utils import get_quant_config
+from modelopt.torch.export.unified_export_hf import _export_quantized_weight

Then simplify the helper:

 def _clear_fused_experts_caches():
     """Clear function-static alias caches in both export entry points."""
     _export_fused_experts.__dict__.pop("_tied_unpacked_cache", None)
-    # _export_fused_experts internally calls _export_quantized_weight per per-expert
-    # wrapper; clear that cache too so each test sees a pristine state.
-    from modelopt.torch.export.unified_export_hf import _export_quantized_weight
-
     _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
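
The helper above pops caches that live on function objects. A minimal sketch of why that matters: a cache stored as a function attribute survives between calls until it is explicitly cleared. The names export_sketch and _alias_cache are hypothetical and only mimic the pattern.

```python
def export_sketch(weight_key):
    """Return True on a cache hit; the cache is stored on the function object."""
    cache = export_sketch.__dict__.setdefault("_alias_cache", {})
    hit = weight_key in cache
    cache[weight_key] = True
    return hit


def clear_export_cache():
    """Drop the function-attribute cache so the next call starts fresh."""
    export_sketch.__dict__.pop("_alias_cache", None)


first = export_sketch(100)        # miss: cache is created and populated
second = export_sketch(100)       # hit: state persisted across calls
clear_export_cache()
after_clear = export_sketch(100)  # miss again after the reset
```

Without the reset, a second test (or a second export in the same process) would see entries left behind by the first one.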

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 561
- 568, Move the local import of _export_quantized_weight out of
_clear_fused_experts_caches and place it with the other modelopt.torch.export
imports at the top of the test file (or, if there is a deliberate reason to
import inside the function, add a one-line justification comment explaining
why); then simplify _clear_fused_experts_caches to directly reference
_export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
alongside the existing _export_fused_experts cache pop. Ensure you reference the
symbols _clear_fused_experts_caches, _export_quantized_weight, and
_export_fused_experts when making the change.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Around line 45-51: The docstring for the tied-experts dedup block contradicts
the implementation by saying "input_scale is left per-side" while the code in
moe_utils.py aliases input_scale alongside weight_scale and weight_scale_2 (due
to sync_tied_input_amax running earlier); update the docstring to state that
input_scale is aliased too and mention the reason (sync_tied_input_amax runs
prior), keeping the rest of the explanation about caching data_ptr() and
aliasing behavior intact so docstring matches the implementation.

In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1545-1547: The call to _export_transformers_checkpoint is dropping
**kwargs (so options like accelerator are ignored); update the call in
unified_export_hf.py where post_state_dict, hf_quant_config =
_export_transformers_checkpoint(model, dtype,
canonical_tied_naming=canonical_tied_naming) to forward any incoming **kwargs
(e.g., pass **kwargs alongside the existing named args) so that options consumed
inside _export_transformers_checkpoint (such as accelerator) are preserved
during export; keep the canonical_tied_naming arg as-is when forwarding
**kwargs.
- Around line 721-744: The tied-weight alias cache (_tied_weight_alias_cache) is
stored on the function object of _export_quantized_weight and persists across
exports; reset it at the start of each export invocation to avoid stale aliases:
in the _export_quantized_weight function (or the outer export entrypoint that
calls it) initialize or clear
_export_quantized_weight.__dict__['_tied_weight_alias_cache'] (or replace with a
fresh dict) before using _tied_source_data_ptr so each export gets a fresh cache
and no stale module references are reused.

---

Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 561-568: Move the local import of _export_quantized_weight out of
_clear_fused_experts_caches and place it with the other modelopt.torch.export
imports at the top of the test file (or, if there is a deliberate reason to
import inside the function, add a one-line justification comment explaining
why); then simplify _clear_fused_experts_caches to directly reference
_export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
alongside the existing _export_fused_experts cache pop. Ensure you reference the
symbols _clear_fused_experts_caches, _export_quantized_weight, and
_export_fused_experts when making the change.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f4075f4-f09e-491d-b111-ffe09bfe2a7a

📥 Commits

Reviewing files that changed from the base of the PR and between 2640551 and 0543907.

📒 Files selected for processing (11)

CHANGELOG.rst
examples/llm_ptq/example_utils.py
examples/llm_ptq/hf_ptq.py
modelopt/torch/export/model_utils.py
modelopt/torch/export/moe_utils.py
modelopt/torch/export/unified_export_hf.py
modelopt/torch/utils/dataset_utils.py
modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
tests/_test_utils/torch/quantization/tied_modules.py
tests/unit/torch/export/test_unified_export_hf.py
tests/unit/torch/quantization/plugins/test_fused_experts.py

coderabbitai · 2026-06-13T01:15:00Z

+
+    Tied-experts dedup: when multiple fused-expert modules share their 3-D
+    source params via HF ``_tied_weights_keys``, the unpacking creates fresh
+    per-expert tensors that break the tie. We cache the source ``data_ptr()``
+    at entry and on a later cache hit alias the per-expert ``weight`` /
+    ``weight_scale`` / ``weight_scale_2`` back to the prior module so
+    downstream dedup catches them. ``input_scale`` is left per-side.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Docstring contradicts implementation for input_scale aliasing.

Line 51 states input_scale is left per-side, but line 221 explicitly aliases input_scale along with weight_scale and weight_scale_2. The implementation comment at lines 195-198 correctly explains that input_scale IS aliased because sync_tied_input_amax runs earlier.
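
The interaction between the amax sync and the aliasing can be illustrated with a minimal pure-Python sketch. Everything here is hypothetical: `FakeModule`, `sync_input_amax`, and `export_with_alias` are stand-ins rather than modelopt APIs, `id(source)` stands in for the source tensor's `data_ptr()`, and plain lists stand in for tensors.

```python
from dataclasses import dataclass, field


@dataclass
class FakeModule:
    """Hypothetical stand-in for one side of a tied pair (not a modelopt class)."""

    source: list  # shared source params; the same list object on both sides
    input_amax: float  # per-side calibrated activation amax
    buffers: dict = field(default_factory=dict)


def sync_input_amax(modules):
    """Max-merge input_amax across modules that share one source object."""
    groups = {}
    for module in modules:
        groups.setdefault(id(module.source), []).append(module)
    for group in groups.values():
        merged = max(module.input_amax for module in group)
        for module in group:
            module.input_amax = merged


def export_with_alias(modules):
    """Pack each module; on a cache hit, alias buffers to the earlier module."""
    cache = {}  # local to this call, so state never leaks between exports
    for module in modules:
        key = id(module.source)  # stand-in for the source tensor's data_ptr()
        prior = cache.get(key)
        if prior is not None and prior is not module:
            # Reuse the very same buffer objects so identity-based dedup sees one copy.
            for name, buf in prior.buffers.items():
                module.buffers[name] = buf
        else:
            # "Packing" creates fresh objects, which is what breaks the tie.
            module.buffers["weight"] = list(module.source)
            module.buffers["input_scale"] = [module.input_amax]
            cache[key] = module
```

In this sketch, running `sync_input_amax` first means the aliased `input_scale` already reflects the larger of the two per-side amaxes, so neither side would clip when a consumer loads a single scale per parameter.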

📝 Suggested docstring fix

     Tied-experts dedup: when multiple fused-expert modules share their 3-D
     source params via HF ``_tied_weights_keys``, the unpacking creates fresh
     per-expert tensors that break the tie. We cache the source ``data_ptr()``
     at entry and on a later cache hit alias the per-expert ``weight`` /
-    ``weight_scale`` / ``weight_scale_2`` back to the prior module so
-    downstream dedup catches them. ``input_scale`` is left per-side.
+    ``weight_scale`` / ``weight_scale_2`` / ``input_scale`` back to the prior
+    module so downstream dedup catches them. ``input_scale`` aliasing is safe
+    because ``sync_tied_input_amax`` runs earlier and max-merges the shared
+    input_quantizer amaxes, so both sides derive bit-identical values.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/moe_utils.py` around lines 45-51, the docstring for the tied-experts dedup block contradicts the implementation by saying "input_scale is left per-side" while the code in moe_utils.py aliases input_scale alongside weight_scale and weight_scale_2 (due to sync_tied_input_amax running earlier); update the docstring to state that input_scale is aliased too and mention the reason (sync_tied_input_amax runs prior), keeping the rest of the explanation about caching data_ptr() and aliasing behavior intact so docstring matches the implementation.

coderabbitai · 2026-06-13T01:15:00Z

+    # Tied-weight dedup: if a previously-processed module shared the same
+    # source weight memory, alias the packed weight + scale buffers so the
+    # downstream data_ptr dedup in postprocess_state_dict can collapse them.
+    # input_scale is safe to alias because sync_tied_input_amax (earlier in
+    # this export) already max-merged the per-side amaxes.
+    _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {})
+    _prior = _cache.get(_tied_source_data_ptr)
+    if _prior is not None and _prior is not sub_module:
+        if hasattr(_prior, weight_name):
+            setattr(sub_module, weight_name, getattr(_prior, weight_name))
+        for _attr in (
+            quantizer_attrs.weight_scale,
+            quantizer_attrs.weight_scale_2,
+            quantizer_attrs.input_scale,
+        ):
+            if _attr is None or not hasattr(_prior, _attr):
+                continue
+            if _attr in sub_module._buffers:
+                del sub_module._buffers[_attr]
+            elif hasattr(sub_module, _attr):
+                delattr(sub_module, _attr)
+            sub_module.register_buffer(_attr, getattr(_prior, _attr))
+    else:
+        _cache[_tied_source_data_ptr] = sub_module


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset tied-weight alias cache per export invocation.

_tied_weight_alias_cache is function-static and never cleared in the export path. Across multiple exports in one process, recycled data_ptr() values can hit stale entries and alias to modules from a prior export, which can corrupt checkpoint contents and retain unnecessary module references.
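
The failure mode can be reproduced in isolation with a small sketch. The names `export_weight` and `run_export` are hypothetical stand-ins, and an integer key plays the role of a recycled `data_ptr()` value; only the `__dict__.setdefault` / `__dict__.pop` pattern mirrors the code under review.

```python
def export_weight(module, source_key):
    """Hypothetical export step that keeps its alias cache on the function object."""
    cache = export_weight.__dict__.setdefault("_alias_cache", {})
    prior = cache.get(source_key)
    if prior is not None and prior is not module:
        return prior  # cache hit: the real code would alias `module` to `prior`
    cache[source_key] = module
    return module


def run_export(modules, reset_cache):
    """One export invocation; optionally isolates the cache to this run."""
    if reset_cache:
        export_weight.__dict__.pop("_alias_cache", None)
    return [export_weight(module, key) for module, key in modules]


# First export registers a module under key 100.
first = run_export([("old_module", 100)], reset_cache=True)

# A later export reuses key 100 (as a recycled pointer might) without a reset,
# so the new module is matched against the module from the previous export.
stale = run_export([("new_module", 100)], reset_cache=False)

# Clearing the cache at the start of the run removes the stale entry.
fresh = run_export([("new_module", 100)], reset_cache=True)
```

Here `stale` resolves to the module from the first export, while `fresh` resolves to the new module, which is the behavior the suggested reset is meant to guarantee.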

💡 Suggested fix

 def _process_quantized_modules(
     model: nn.Module,
     dtype: torch.dtype,
     is_modelopt_qlora: bool = False,
 ) -> None:
+    # Isolate aliasing state to a single export run.
+    _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
+
     """Process all quantized modules in model, export weights in-place.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/unified_export_hf.py` around lines 721-744, the tied-weight alias cache (_tied_weight_alias_cache) is stored on the function object of _export_quantized_weight and persists across exports; reset it at the start of each export invocation to avoid stale aliases: in the _export_quantized_weight function (or the outer export entrypoint that calls it) initialize or clear _export_quantized_weight.__dict__['_tied_weight_alias_cache'] (or replace with a fresh dict) before using _tied_source_data_ptr so each export gets a fresh cache and no stale module references are reused.

coderabbitai · 2026-06-13T01:15:00Z

+        post_state_dict, hf_quant_config = _export_transformers_checkpoint(
+            model, dtype, canonical_tied_naming=canonical_tied_naming
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Forward **kwargs to preserve transformer export contract.

The new call drops **kwargs, so options consumed inside _export_transformers_checkpoint (e.g., accelerator) are silently ignored. That can break distributed state-dict gathering and produce incomplete exports.
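
A minimal sketch of the difference, using hypothetical stand-in functions rather than the real export entrypoints; only the `accelerator` option name comes from the comment above.

```python
def export_checkpoint(model, dtype, canonical_tied_naming=False, **kwargs):
    """Hypothetical inner function that reads optional settings from kwargs."""
    return {
        "canonical_tied_naming": canonical_tied_naming,
        "accelerator": kwargs.get("accelerator"),
    }


def export_dropping(model, dtype, canonical_tied_naming=False, **kwargs):
    """Accepts **kwargs but does not pass them on, so they are silently lost."""
    return export_checkpoint(model, dtype, canonical_tied_naming=canonical_tied_naming)


def export_forwarding(model, dtype, canonical_tied_naming=False, **kwargs):
    """Passes the named argument and forwards the remaining options unchanged."""
    return export_checkpoint(
        model,
        dtype,
        canonical_tied_naming=canonical_tied_naming,
        **kwargs,
    )


dropped = export_dropping("model", "bf16", canonical_tied_naming=True, accelerator="acc")
forwarded = export_forwarding("model", "bf16", canonical_tied_naming=True, accelerator="acc")
```

In the dropping variant the inner function never sees `accelerator`, and no error is raised, which is why this kind of regression tends to go unnoticed until a distributed export comes out incomplete.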

💡 Suggested fix

-        post_state_dict, hf_quant_config = _export_transformers_checkpoint(
-            model, dtype, canonical_tied_naming=canonical_tied_naming
-        )
+        post_state_dict, hf_quant_config = _export_transformers_checkpoint(
+            model,
+            dtype,
+            canonical_tied_naming=canonical_tied_naming,
+            **kwargs,
+        )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/unified_export_hf.py` around lines 1545-1547, the call to _export_transformers_checkpoint is dropping **kwargs (so options like accelerator are ignored); update the call in unified_export_hf.py where post_state_dict, hf_quant_config = _export_transformers_checkpoint(model, dtype, canonical_tied_naming=canonical_tied_naming) to forward any incoming **kwargs (e.g., pass **kwargs alongside the existing named args) so that options consumed inside _export_transformers_checkpoint (such as accelerator) are preserved during export; keep the canonical_tied_naming arg as-is when forwarding **kwargs.

juhi10071998 and others added 8 commits June 13, 2026 00:46

juhi10071998 requested review from a team as code owners June 13, 2026 01:04

juhi10071998 requested review from realAsma and sugunav14 June 13, 2026 01:04

juhi10071998 changed the title ~~Add tied-weight PTQ export support for block-diffusion encoder-decoder models (DiffusionGemma)~~ Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support ] Jun 13, 2026

coderabbitai Bot reviewed Jun 13, 2026


#1707: juhi10071998 wants to merge 8 commits into NVIDIA:main from juhi10071998:northbloom
