Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support ] #1707

Open
juhi10071998 wants to merge 8 commits into NVIDIA:main from juhi10071998:northbloom
@juhi10071998 juhi10071998 commented Jun 13, 2026

Contributor

What does this PR do?

Type of change: new feature

Adds end-to-end PTQ + HF-checkpoint export support for block-diffusion encoder-decoder LLMs (e.g. DiffusionGemma) whose encoder/decoder stacks share parameters via HF _tied_weights_keys. Six source commits + one test commit + one CHANGELOG entry — purely additive for existing modelopt users (non-tied models see no behavioral change).

Source commits:

  1. Onboard DiffusionGemma to hf_ptq.py — substring-list additions in model_type_is_enc_dec and MODEL_NAME_TO_TYPE (so calibration routes through .generate()), and a .sequences unwrap in the preview decode for ModelOutput-returning .generate()s.

  2. MoE experts dedup in _export_fused_experts — when two fused-expert modules share their 3-D source params, alias the per-expert packed weight + scales on cache hit so downstream postprocess_state_dict dedup catches them. ~42% storage reduction on nvfp4_experts_only for tied 26B MoE checkpoints.

  3. Dense Linear dedup in _export_quantized_weight — symmetric to (2) for dense Linears; no-op for nvfp4_experts_only (dense early-returns at QUANTIZATION_NONE).

  4. Opt-in canonical-side reorder (--canonical_tied_naming, default off) — partitions state_dict so canonical-side tied keys iterate before alias-side, letting first-wins dedup keep canonical names.

  5. sync_tied_input_amax — max-merges per-side input_quantizer.amax across tied modules BEFORE export, so single-backbone consumers (vLLM) that load one input_scale per parameter don't clip on either side. Extends commits 2+3's alias loops to include input_scale.

  6. Default *self_conditioning* exclude — adds the diffusion-model self-conditioning wildcard to default_disabled_quantizers.yaml. Companion to PR #1691 ("Exclude multimodal vision branch from quantization by default", NVBug 6293731, 6293762), which already added the vision-module excludes.
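
A rough illustration of the .sequences unwrap mentioned in item 1. This is a sketch only: FakeGenerationOutput and unwrap_sequences are hypothetical names used for illustration, and plain lists stand in for token tensors. The actual code in hf_ptq.py may differ.

```python
from dataclasses import dataclass


@dataclass
class FakeGenerationOutput:
    """Hypothetical stand-in for a ModelOutput-style return value."""

    sequences: list


def unwrap_sequences(generated):
    """Return generated.sequences when present, else the object unchanged."""
    return getattr(generated, "sequences", generated)


bare = [[1, 2, 3, 4]]  # plain token ids; a list stands in for a tensor
wrapped = FakeGenerationOutput(sequences=bare)

# After unwrapping, both forms can be sliced the same way downstream.
prompt_len = 2
new_tokens_bare = [row[prompt_len:] for row in unwrap_sequences(bare)]
new_tokens_wrapped = [row[prompt_len:] for row in unwrap_sequences(wrapped)]
```

Because the unwrap falls back to the object itself, callers whose .generate() already returns a bare tensor are unaffected.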

Test commit:

  1. New test fixture (tests/_test_utils/torch/quantization/tied_modules.py) with three factories (make_tied_linear_pair, tie_fused_experts_3d_params, wrap_in_parent_with_tied_keys) + 10 unit tests covering commits 2–5 across tests/unit/torch/export/test_unified_export_hf.py (new) and tests/unit/torch/quantization/plugins/test_fused_experts.py (extended). Pure-Python, CPU-only, ~1s wall total.

Docs commit:

  1. CHANGELOG entry under 0.46 New Features.

Usage

# Standard usage — no API change for non-tied models. Existing recipes still work:
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <ckpt-dir> \
    --qformat nvfp4_experts_only \
    --calib_size 32 \
    --trust_remote_code \
    --export_path <export-dir>

# For tied-weight models (e.g. DiffusionGemma), opt in to canonical-side
# naming by appending this flag to the command above, so the exported
# state_dict uses the canonical (e.g. decoder-side) names per HF's
# _tied_weights_keys declaration:
    --canonical_tied_naming true

Testing

  • 10 new unit tests covering commits 2–5. Pure-Python, CPU-only, ~1s wall total.
  • Full local unit suite: 2588 passed, 17 intentional skips. 9 pre-existing failures observed in unrelated test files (PEFT/LoRA, ONNX, speculative-decoding) — verified pre-existing via static analysis (none import any file modified by this PR).
  • End-to-end PTQ export validated on DiffusionGemma v10 via nvfp4_experts_only recipe + --canonical_tied_naming true — produces a 2-shard, ~18 GB safetensors checkpoint with decoder-canonical naming.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ — --canonical_tied_naming is opt-in default off; the dedup/alias logic is cache-based on data_ptr() and no-ops for non-tied models; sync_tied_input_amax no-ops when no two modules share a weight data_ptr; the *self_conditioning* wildcard is a no-op for models without matching module names.
  • If you copied code from any other sources or added a new PIP dependency: N/A
  • Did you write any new necessary tests?: ✅ 10 unit tests across two files; fixture helpers shared via _test_utils
  • Did you update CHANGELOG?: ✅ — new bullet under 0.46 → New Features
  • Did you get Claude approval on this PR?: pending /claude review

Additional Information

Validated locally against DiffusionGemma v10 (DiffusionGemmaForBlockDiffusion, model_type: diffusion_gemma). Substring patterns are also chosen to match the older DiffusionGemma4ModelForBlockDiffusion / diffusion_gemma4 spelling, so the same code path works on earlier and current model versions.

Summary by CodeRabbit

  • New Features

    • Added support for tied-weight post-training quantization and Hugging Face checkpoint export for encoder-decoder language models
    • Implemented buffer aliasing for quantized weight exports to enable downstream state-dict deduplication
    • Added optional canonical tied-weight naming flag to align export behavior with Hugging Face standards
    • Introduced new default_disabled_quantizers wildcard for self-conditioning model support
  • Improvements

    • Enhanced generation decoding for encoder-decoder models to properly handle model outputs
  • Tests

    • Added comprehensive test coverage for tied-weight quantization and export behavior

juhi10071998 and others added 8 commits June 13, 2026 00:46
…ptq.py

DiffusionGemmaForBlockDiffusion is an encoder-decoder block-diffusion
text LLM (Gemma4 MoE per-layer transformer wrapped in an encoder +
iterative-decoder + self-conditioning + 48-step denoising loop). Four
small additions make stock examples/llm_ptq/hf_ptq.py work end-to-end
for it; no new entry points needed.

Substring patterns are chosen so they ALSO match the previous class
name DiffusionGemma4ModelForBlockDiffusion (and previous model_type
"diffusion_gemma4"). The model was renamed in transformers
mid-development; using "diffusiongemma" / "DiffusionGemma" /
"diffusion_gemma" matches both the current and the older class so we
don't silently regress on either checkpoint generation.

1. modelopt/torch/utils/dataset_utils.py
   Add "diffusiongemma" to the substring list in model_type_is_enc_dec.
   This routes ModelOpt's built-in calibration forward_loop to
   model.generate() instead of a single model.forward(), so calibration
   exercises the full 48-step inner denoising loop and sees the entire
   noise->clean activation distribution.

2. modelopt/torch/export/model_utils.py
   Add "DiffusionGemma": "diffusion_gemma" to MODEL_NAME_TO_TYPE,
   *before* "Gemma". get_model_type does substring matching; "gemma" is
   a substring of "diffusiongemma" so without this entry the class is
   silently mis-classified as plain "gemma" -- which then mis-routes
   downstream model-type-dependent logic.

3. examples/llm_ptq/hf_ptq.py (output_decode)
   DiffusionGemma.generate() returns a DiffusionGemmaGenerationOutput
   ModelOutput dataclass, not a bare token tensor. The preview decode
   code's tensor-slicing crashes on ModelOutput. Unwrap to .sequences
   at the top of output_decode so both the enc-dec and AR slicing
   branches work uniformly. Generic shim -- helps any model whose
   .generate returns a ModelOutput with a .sequences attribute.

4. examples/llm_ptq/example_utils.py (is_enc_dec)
   Comment-only clarification on the semantics. is_enc_dec controls
   how hf_ptq.py:output_decode slices the preview result (whether to
   strip the prompt prefix). For T5/BART/Whisper generate returns only
   new tokens. For DiffusionGemma it returns prompt+canvas
   concatenated, so it belongs with the AR slicing path and stays out
   of this list. No behavioral change; documents the prior intent so
   future contributors don't re-add diffusion_gemma here.
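
Sketch of the ordering hazard described in item 2. The table and lookup below are a hypothetical miniature, not ModelOpt's real MODEL_NAME_TO_TYPE or get_model_type; they only assume first-match, case-insensitive substring lookup over an insertion-ordered dict.

```python
# Hypothetical miniature of an ordered name -> type table.
MODEL_NAME_TO_TYPE = {
    "DiffusionGemma": "diffusion_gemma",  # must come before "Gemma"
    "Gemma": "gemma",
    "Llama": "llama",
}


def get_model_type(class_name, table=MODEL_NAME_TO_TYPE):
    """Return the type of the first key that is a case-insensitive substring."""
    lowered = class_name.lower()
    for key, model_type in table.items():
        if key.lower() in lowered:
            return model_type
    return None


# Same entries, but with the more general key listed first.
wrong_order = {"Gemma": "gemma", "DiffusionGemma": "diffusion_gemma"}

current = get_model_type("DiffusionGemmaForBlockDiffusion")
older = get_model_type("DiffusionGemma4ModelForBlockDiffusion")
misrouted = get_model_type("DiffusionGemmaForBlockDiffusion", wrong_order)
```

With the general key first, the lookup returns "gemma" without any error, which is the silent mis-classification the commit message warns about.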

End-to-end smoke test:
  python hf_ptq.py --pyt_ckpt_path <local northbloom checkpoint> \
                   --qformat nvfp4_experts_only \
                   --calib_size 32 --trust_remote_code \
                   --export_path <somewhere>
produces a working NVFP4 checkpoint with coherent pre/post-PTQ preview.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
When a model has multiple fused-experts modules whose 3-D source params
share storage via HF _tied_weights_keys (e.g. the encoder and decoder
transformer stacks of a block-diffusion encoder-decoder LLM like
DiffusionGemma4 / northbloom), the unpacking loop in
_export_fused_experts ordinarily creates fresh per-expert tensors for
each call — destroying the tied identity and writing two full sets of
expert weights + scales to disk.

This adds a function-local cache keyed by
(gate_up_proj.data_ptr(), down_proj.data_ptr()). On a cache miss the
existing unpacking path runs unchanged. On a cache hit, after the
normal unpacking completes, the per-expert weight / weight_scale /
weight_scale_2 buffers are re-pointed at the prior module's tensors
so they share storage. The downstream postprocess_state_dict
data_ptr()-based dedup then catches them and drops the duplicates
from the saved checkpoint.
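
A minimal sketch of the cache-then-alias flow described above, under loud assumptions: FakeFusedExperts and export_fused_experts_sketch are hypothetical, Python lists stand in for tensors, and id() stands in for tensor.data_ptr(). It is not the real _export_fused_experts.

```python
class FakeFusedExperts:
    """Hypothetical stand-in for a fused-experts module with 3-D source params."""

    def __init__(self, gate_up_proj, down_proj):
        self.gate_up_proj = gate_up_proj
        self.down_proj = down_proj
        self.exported = {}


def export_fused_experts_sketch(module):
    """Unpack as usual, then alias to a prior module that shared the same source."""
    cache = export_fused_experts_sketch.__dict__.setdefault("_tied_cache", {})
    # id() stands in for data_ptr(); the key is captured before unpacking.
    key = (id(module.gate_up_proj), id(module.down_proj))

    # "Normal" unpacking: always produces fresh per-module buffers.
    module.exported = {
        "weight": list(module.gate_up_proj),
        "weight_scale": [1.0],
        "weight_scale_2": [0.5],
    }

    prior = cache.get(key)
    if prior is None:
        cache[key] = module  # cache miss: this module becomes the owner
    else:
        for name in list(module.exported):  # cache hit: re-point at the owner's buffers
            module.exported[name] = prior.exported[name]


shared_gate_up, shared_down = [1, 2, 3], [4, 5, 6]
encoder = FakeFusedExperts(shared_gate_up, shared_down)
decoder = FakeFusedExperts(shared_gate_up, shared_down)  # tied to encoder
untied = FakeFusedExperts([1, 2, 3], [4, 5, 6])  # equal values, separate storage

for m in (encoder, decoder, untied):
    export_fused_experts_sketch(m)
```

The untied module has equal values but different source identity, so it takes the cache-miss path and keeps its own buffers; only storage that was already shared gets aliased.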

input_scale is intentionally NOT aliased: encoder and decoder paths
have legitimately different activation distributions (verified across
all 60 tied pairs of a 512-prompt calibration — down_proj_input
divergence median 2.26x, max 18.4x), so each side keeps its own
per-side calibrated scale.

Model-agnostic: no name regex, no model-type lookup. Cache miss falls
through to existing behavior, so non-tied models are unaffected.

Empirical result on DiffusionGemma4 26B at nvfp4_experts_only:
  safetensors    28.43 GB -> 16.47 GB  (-42%)
  shards               4 -> 2
  decoder weight/weight_scale/weight_scale_2 entries: 3840 -> 0 each
  decoder input_scale entries: 3840 -> 3840 (kept per-side, as
  intended)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Symmetric companion to the data_ptr-cache alias added to
_export_fused_experts in the prior commit. Catches plain nn.Linear
modules whose .weight Parameters are tied via HF tie_weights() (e.g.
encoder and decoder attention QKV/O, MLP gate/up/down, router proj of
an encoder-decoder LLM like DiffusionGemma4 / northbloom) when they
are quantized under recipes that route through _export_quantized_weight.

Mechanism: capture weight.data_ptr() at the top of the function,
before the setattr further down wraps the packed bytes in a fresh
nn.Parameter (which destroys the tie). At the end of the function,
consult a function-local cache keyed by that captured data_ptr. On
cache miss: register this sub_module as the canonical owner. On cache
hit (a previously-processed module shared the same source weight
memory): alias .weight, weight_scale, weight_scale_2 to the prior
module's tensors so downstream data_ptr-based dedup in
postprocess_state_dict drops the duplicates. input_scale is
intentionally NOT aliased — calibration legitimately diverges per-side
(verified in Q2 analysis: down_proj_input ratio up to 18x across 60
tied pairs on the northbloom DiffusionGemma4 model).

Recipe-agnostic: under nvfp4_experts_only this is a true no-op (dense
Linears early-return at QUANTIZATION_NONE; per-expert wrappers reach
this function but have fresh data_ptrs from upstream slice+contiguous
so cache misses always). Under full nvfp4 it fires for every tied
dense Linear pair.

Same safety guarantees as the existing data_ptr-based dedup: cannot
false-positive because it only aliases when source memory was already
shared by the model author (via tie_weights or equivalent). No name
regex, no _tied_weights_keys lookup, no model introspection.

Empirical results on DiffusionGemma4 26B (calib_size=8 smoke):
  nvfp4_experts_only: 16.47 GB -> 16.47 GB (byte-identical with prior
                                            experts-only export; this
                                            patch a no-op as expected)
  nvfp4 (full):       ~27 GB est -> 14.24 GB (-12.7 GB; both patches
                                              firing on disjoint
                                              module sets; per-name
                                              dedup verified in index)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Adds an opt-in pass (--canonical_tied_naming, default off) that reorders
the state_dict before postprocess_state_dict so that keys matching the
canonical side of HF's _tied_weights_keys declaration iterate before
their aliases. The existing first-wins data_ptr dedup at
quant_utils.py:1148-1163 then drops the alias names, leaving the
canonical names in the exported safetensors.

Motivation: for models like DiffusionGemma4 (northbloom), HF declares
{alias: canonical} via _tied_weights_keys, where the encoder side is
the alias and the decoder side is canonical. The original HF
safetensors index uses decoder-prefixed names for all tied weights
(661/691 keys vs 30/691 encoder-only layer_scalar keys). The
single-backbone vLLM mockup loader strips both prefixes to model.* and
relies on this canonical naming.

The default modelopt export today walks the model in registration
order (encoder before decoder, per the model's __init__ order), so
encoder names win in the first-wins dedup. The exported checkpoint
thus uses 46 677 encoder-prefixed keys for tied tensors -- backwards
relative to the upstream HF naming and to what downstream consumers
expect.

Implementation. _tied_weights_keys is declared per model class with
paths relative to that class. In nested models (e.g. DiffusionGemma4)
multiple submodules declare their own ties: the outer wrapper at
DiffusionGemma4ModelForBlockDiffusion ties lm_head.weight to
model.decoder.embed_tokens.weight, while the inner DiffusionGemma4Model
at model.model declares the much larger encoder<->decoder dict with
paths relative to itself.

_collect_canonical_tied_patterns walks model.named_modules() and
collects every dict-style _tied_weights_keys declaration, prefixing
each pattern with the submodule's qualified path so the regexes match
against root-level state_dict keys. Without the prefix, the inner
dict's patterns (which lack a "model." prefix) silently fail to match
keys like "model.decoder.layers.0.self_attn.q_proj.weight" -- a bug
that would cause only the outer dict's single entry (embed_tokens) to
flip in the dedup.
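
To make the prefixing issue concrete, here is a hedged sketch. collect_canonical_patterns_sketch is a hypothetical simplification that takes (qualified_path, tied_keys) pairs instead of walking a real model; the declarations are illustrative, and start-anchored matching via re.match is an assumption.

```python
import re


def collect_canonical_patterns_sketch(named_modules):
    """Collect canonical-side regexes, prefixed with each declaring module's path.

    named_modules: iterable of (qualified_path, tied_keys) pairs, a hypothetical
    stand-in for walking model.named_modules() and reading _tied_weights_keys.
    """
    patterns = []
    for path, tied_keys in named_modules:
        if not isinstance(tied_keys, dict):
            continue  # legacy list-style (or missing) declarations carry no sides
        prefix = re.escape(path + ".") if path else ""
        for _alias, canonical in tied_keys.items():
            patterns.append(prefix + canonical)
    return patterns


# Illustrative declarations only; the real patterns may differ.
declarations = [
    ("", {r"lm_head\.weight": r"model\.decoder\.embed_tokens\.weight"}),
    ("model", {r"encoder\.layers\.(.+)": r"decoder\.layers\.(.+)"}),
    ("model.legacy", ["some.weight"]),
]
patterns = collect_canonical_patterns_sketch(declarations)
key = "model.decoder.layers.0.self_attn.q_proj.weight"

matches_with_prefix = any(re.match(p, key) for p in patterns)
matches_without_prefix = re.match(r"decoder\.layers\.(.+)", key) is not None
```

Under these assumptions the inner declaration only matches root-level keys once its declaring module's path is prepended.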

_reorder_canonical_first then partitions the state_dict into head
(canonical-pattern matches) and tail (everything else), preserving
original order within each partition. head.update(tail) yields a
single dict with canonical keys first. The downstream dedup loop
iterates this in insertion order and records the canonical names in
seen_tensors; alias names then arrive as duplicates and are dropped.
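
The head/tail partition and its interaction with first-wins dedup can be sketched as follows. Both functions are simplified, hypothetical stand-ins (not the code in unified_export_hf.py or quant_utils.py); bytearray objects stand in for tensors and id() stands in for data_ptr().

```python
import re


def reorder_canonical_first_sketch(state_dict, canonical_patterns):
    """Partition into canonical matches (head) and the rest (tail), order-preserving."""
    regexes = [re.compile(p) for p in canonical_patterns]
    head, tail = {}, {}
    for key, value in state_dict.items():
        bucket = head if any(r.match(key) for r in regexes) else tail
        bucket[key] = value
    head.update(tail)
    return head


def first_wins_dedup_sketch(state_dict):
    """Keep the first key seen for each distinct storage; drop later duplicates."""
    seen, kept = set(), {}
    for key, value in state_dict.items():
        if id(value) in seen:
            continue
        seen.add(id(value))
        kept[key] = value
    return kept


tied = bytearray(b"\x01\x02")
state_dict = {
    "model.encoder.layers.0.q_proj.weight": tied,
    "model.encoder.layers.0.layer_scalar": bytearray(b"\x03"),
    "model.decoder.layers.0.q_proj.weight": tied,
}

default_keys = list(first_wins_dedup_sketch(state_dict))
canonical_keys = list(
    first_wins_dedup_sketch(
        reorder_canonical_first_sketch(state_dict, [r"model\.decoder\."])
    )
)
```

In registration order the encoder name survives; after the reorder the decoder name survives, while the encoder-only layer_scalar key is kept either way.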

Scope of behavior change:
- Models with no _tied_weights_keys, or only legacy list-of-strings
  declarations: _collect returns an empty pattern list, helper
  short-circuits, state_dict returned unchanged. Zero effect on any
  existing modelopt user.
- Models with dict-style declarations (e.g. DiffusionGemma4): when
  the flag is set, canonical-side names win. When the flag is unset
  (default), behavior is identical to before this commit.

No changes to dedup logic itself, to the existing tied-weight alias
patches in _export_quantized_weight and _export_fused_experts, or to
_process_quantized_modules iteration. Strictly additive.

Verified on DiffusionGemma4 26B / nvfp4_experts_only / v4 / calib_size
32: 35 127 encoder keys removed (decoder kept), 0 decoder keys
removed; layer_scalar (30 encoder-only keys) and per-side input_scale
(11 520 keys, intentionally not deduped) unaffected; total safetensors
bytes 17.68 GB matching the prior export to within 500 KB of
safetensors-metadata-ordering noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Companion piece to the existing tied-weight alias patches in
_export_fused_experts (commit 8b00e85) and _export_quantized_weight
(commit e8c36b0), which already alias bit-identical weight /
weight_scale / weight_scale_2 between tied modules but leave input_scale
per-side. This commit closes the loop on input_scale so consumers that
load a single canonical scale per Linear (e.g. vLLM's single-backbone
DiffusionGemma4 mockup) see a value consistent across all tied sides.

Implementation has two parts.

1. New sync_tied_input_amax(model) helper. Walks named_modules(),
   groups by source weight data_ptr (same signature our existing dedup
   patches use), and max-merges input_quantizer.amax across each
   group. Uses the canonical 4-line idiom shared with
   preprocess_linear_fusion (quant_utils.py:1394-1401) and
   sync_moe_gate_up_amax (layer_utils.py:1197):

     merged = torch.max(torch.stack([q.amax for q in qs]))
     for q in qs:
         q.amax = merged.clone()

   Handles both dense Linears (keyed by weight.data_ptr) and fused MoE
   modules (keyed by (gate_up_proj, down_proj) data_ptr tuple, merging
   gate_up_proj_input_quantizer and down_proj_input_quantizer
   independently across the group). Scalar-only, matching
   preprocess_linear_fusion's contract.

   Called unconditionally from _export_transformers_checkpoint after
   sync_moe_gate_up_amax, BEFORE _process_quantized_modules so the
   merged amax flows into _export_quantized_weight's input_scale
   derivation. Mirrors sync_moe_gate_up_amax's "no-flag, fires when
   applicable, no-op otherwise" convention.

2. Extend the existing tied-weight alias loops in
   _export_quantized_weight and _export_fused_experts to include
   input_scale alongside weight_scale / weight_scale_2. Before this
   commit those loops intentionally skipped input_scale because
   encoder/decoder amaxes legitimately differed (Q2 analysis showed up
   to 18x divergence for down_proj_input on v1). With
   sync_tied_input_amax in place, both sides now derive bit-identical
   input_scale values; aliasing the buffers is safe and lets the
   existing data_ptr dedup in postprocess_state_dict collapse them so
   only one canonical entry per Linear survives in the exported
   safetensors.
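
A pure-Python sketch of the max-merge in part 1, for dense Linears only. FakeQuantizer, FakeLinear, and sync_tied_input_amax_sketch are hypothetical; floats stand in for scalar amax tensors and id() stands in for weight.data_ptr(). The real helper works on torch modules and also handles fused MoE modules.

```python
from collections import defaultdict


class FakeQuantizer:
    """Hypothetical stand-in for an input quantizer holding a scalar amax."""

    def __init__(self, amax):
        self.amax = amax


class FakeLinear:
    """Hypothetical stand-in for a quantized Linear."""

    def __init__(self, weight, amax):
        self.weight = weight
        self.input_quantizer = FakeQuantizer(amax)


def sync_tied_input_amax_sketch(modules):
    """Max-merge input amax across modules that share the same weight storage."""
    groups = defaultdict(list)
    for module in modules:
        groups[id(module.weight)].append(module)  # id() stands in for data_ptr()

    merged_groups = 0
    for group in groups.values():
        if len(group) < 2:
            continue  # untied module: nothing to merge
        merged = max(m.input_quantizer.amax for m in group)
        for m in group:
            m.input_quantizer.amax = merged
        merged_groups += 1
    return merged_groups


shared_weight = [0.1, 0.2]
enc = FakeLinear(shared_weight, amax=2.0)
dec = FakeLinear(shared_weight, amax=5.0)  # tied to enc
solo = FakeLinear([0.1, 0.2], amax=3.0)  # equal values, separate storage

n_merged = sync_tied_input_amax_sketch([enc, dec, solo])
```

Taking the max rather than either side's own value means neither side's activation range is clipped by the shared scale, at the cost of some resolution on the side with the smaller range.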

Also extends the Q-B canonical-side reorder pass added in commit
837768f with an auto-derived side-substring matcher. HF's
_tied_weights_keys regex patterns target the pre-export module
structure (fused gate_up_proj), but after _export_fused_experts
unpacks them into per-expert gate_proj/up_proj/down_proj submodules,
post-export keys like ...experts.Y.gate_proj.input_scale are not
covered by HF's regex. Without the substring fallback, those keys
fell through Q-B to the "alias-first" partition, so when the new
input_scale alias step shared data_ptrs, the encoder name won the
dedup instead of the decoder name.

_collect_canonical_tied_patterns now returns (patterns,
side_substrings). The side_substrings list is auto-derived from each
_tied_weights_keys entry as the set of dot-separated tokens that
appear in canonical patterns but not in alias patterns. For
DiffusionGemma4 this resolves to ["decoder"]: every canonical pattern
contains "decoder", no alias pattern does. _reorder_canonical_first
treats a key as canonical if it matches a regex pattern OR contains a
side substring as a proper path component (bordered by "." or at
start/end). The path-component requirement avoids false positives
from accidental name collisions.
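
One possible reading of the side-substring derivation, sketched with plain dotted paths rather than regex patterns. Both helpers are hypothetical, and the rule used here (token present in every canonical path and absent from every alias path) is an assumption about the actual implementation.

```python
def derive_side_substrings_sketch(tied_keys):
    """Tokens present in every canonical path and absent from every alias path."""
    alias_tokens = set()
    canonical_common = None
    for alias, canonical in tied_keys.items():
        alias_tokens.update(alias.split("."))
        tokens = set(canonical.split("."))
        canonical_common = tokens if canonical_common is None else canonical_common & tokens
    return sorted((canonical_common or set()) - alias_tokens)


def has_path_component(key, substring):
    """True only if substring is a whole dot-separated component of key."""
    return substring in key.split(".")


# Illustrative {alias: canonical} declaration; real entries are regex patterns.
tied_keys = {
    "encoder.layers.0.mlp.weight": "decoder.layers.0.mlp.weight",
    "encoder.layers.1.attn.weight": "decoder.layers.1.attn.weight",
}
sides = derive_side_substrings_sketch(tied_keys)

post_export_key = "model.decoder.layers.0.experts.3.gate_proj.input_scale"
collision_key = "model.encoder.layers.0.pre_decoder_norm.weight"
is_canonical = any(has_path_component(post_export_key, s) for s in sides)
is_false_positive = any(has_path_component(collision_key, s) for s in sides)
```

Splitting on "." is what enforces the path-component requirement: a name that merely contains "decoder" inside a longer token is not treated as canonical.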

Net effect for DiffusionGemma4 nvfp4_experts_only / v4 / calib_size 32:
the 11 520 encoder.X.gate_proj/up_proj/down_proj.input_scale entries
that the prior export carried are removed; the 11 520 decoder-side
entries remain with the merged amax-derived value. Total bytes drops
by ~1 MB (scalar entries). Other tied-tensor entries (weight,
weight_scale, weight_scale_2) and encoder-only entries (layer_scalar,
30 keys) are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
The diffusion self-conditioning network (block-diffusion models like
DiffusionGemma) is text-only and not exercised by typical calibration
data. Without exclusion its TensorQuantizers never see input, never set
_amax, and export crashes at _export_quantized_weight:

  AttributeError: 'TensorQuantizer' object has no attribute '_amax'

Companion to the upstream vision-tower / visual / embed_vision excludes
already in this unit (PR NVIDIA#1691). Pattern is a no-op for non-diffusion
models.
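
For illustration, assuming the wildcard is applied with glob-style matching against quantizer names (the module names below are made up for the example):

```python
from fnmatch import fnmatch

PATTERN = "*self_conditioning*"

# Hypothetical quantizer names; real module paths may differ.
names = [
    "model.self_conditioning.proj.input_quantizer",
    "model.decoder.layers.0.mlp.down_proj.input_quantizer",
    "model.encoder.layers.0.self_attn.q_proj.weight_quantizer",
]

disabled = [n for n in names if fnmatch(n, PATTERN)]
untouched = [n for n in names if not fnmatch(n, PATTERN)]
```

Names without a self_conditioning component never match, which is why the pattern has no effect on non-diffusion models.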

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Adds the tied-modules test fixture plus 10 unit tests covering the
tied-weight machinery introduced earlier in this series.

tests/_test_utils/torch/quantization/tied_modules.py (new):
  Three small factory helpers shared by the unit tests:
  - make_tied_linear_pair() -- two nn.Linears whose .weight Parameter is
    shared via setattr (mimics HF tie_weights() after __init__).
  - tie_fused_experts_3d_params(enc, dec) -- in-place tie of
    gate_up_proj / down_proj between two fused-experts modules (paired
    with the existing _SyntheticFusedExperts fixture).
  - wrap_in_parent_with_tied_keys(enc, dec, ...) -- builds a parent
    nn.Module with HF-style _tied_weights_keys (dict-style for the
    canonical case, list-style for the legacy negative case).
  Each factory asserts post-conditions on the tie so a misuse fails
  loudly at construction.

tests/unit/torch/export/test_unified_export_hf.py (new): 8 tests
  Commit f3e9543ab -- canonical-side reorder:
    - dict-style _tied_weights_keys yields patterns + canonical
      substrings
    - list-style yields no canonical info (reorder becomes a no-op)
    - _reorder_canonical_first puts decoder-side keys ahead of
      encoder-side keys

  Commit 3fb3ba053 -- sync_tied_input_amax:
    - tied Linears with divergent amaxes (2.0 vs 5.0) get both sides
      overwritten with the elementwise max (5.0)
    - untied Linears keep per-side amaxes (no-op when there's no tie)

  Commit 29674a7e1 -- dense Linear tied-weight dedup:
    - tied Linears share data_ptr for packed .weight + scale buffers
    - untied Linears keep independent data_ptrs
    - asymmetric quant: unquantized side early-returns at
      QUANTIZATION_NONE, stays at the original shared Parameter

tests/unit/torch/quantization/plugins/test_fused_experts.py (extended):
  2 tests
  Commit 10a8fdbd5 -- MoE experts dedup:
    - two _SyntheticSparseMoeBlock instances with tied 3-D source
      params share data_ptr across every per-expert buffer
    - untied counterparts keep independent per-expert data_ptrs

Pure-Python; CPU-only; ~1s wall total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Adds a New Features bullet under 0.46 covering the tied-weight dedup,
canonical-side reorder, sync_tied_input_amax helper, and the
*self_conditioning* default exclude introduced earlier in this series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
@juhi10071998 juhi10071998 requested review from a team as code owners June 13, 2026 01:04
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Contributor

Walkthrough

This PR adds post-training quantization (PTQ) and Hugging Face checkpoint export support for tied-weight encoder-decoder diffusion models. It recognizes DiffusionGemma, introduces optional canonical-side tied-weight naming, implements tensor aliasing for quantized weight deduplication, synchronizes tied input quantizer scales, and provides comprehensive test coverage.

Changes

Tied-weight PTQ and HF export for encoder-decoder models

| Layer | File(s) | Summary |
| --- | --- | --- |
| DiffusionGemma model type recognition | modelopt/torch/export/model_utils.py, modelopt/torch/utils/dataset_utils.py, examples/llm_ptq/example_utils.py | DiffusionGemma is added to model-type mappings and encoder-decoder model lists with substring-matching order comments to avoid false matches. |
| CLI integration and generation output handling | examples/llm_ptq/hf_ptq.py | Adds --canonical_tied_naming CLI flag to opt into HF canonical-side tied-weight ordering; unwraps ModelOutput objects from .generate() so diffusion preview decoding operates on token tensors consistently. |
| Test utilities for tied-weight scenarios | tests/_test_utils/torch/quantization/tied_modules.py | Introduces factory functions to construct tied linear pairs, fused-expert tied parameters, and parent modules with HF-style _tied_weights_keys declarations for repeatable test setup. |
| Core tied-weight quantized export logic | modelopt/torch/export/unified_export_hf.py | Implements tied-weight deduplication via source tensor tracking and cached parameter aliasing; adds canonical tied-pattern detection and optional state-dict reordering; introduces sync_tied_input_amax() to max-merge input quantizer scales across tied modules. |
| Fused experts tied-weight export deduplication | modelopt/torch/export/moe_utils.py | Extends _export_fused_experts to capture source fused-expert identities before unpacking and alias per-expert projection weights/scales across tied fused-expert modules via module-level cache. |
| Test coverage for tied-weight unified export | tests/unit/torch/export/test_unified_export_hf.py | Verifies canonical pattern detection, state-dict reordering, input amax synchronization, and quantized-weight aliasing for tied and untied scenarios. |
| Test coverage for fused experts tied deduplication | tests/unit/torch/quantization/plugins/test_fused_experts.py | Validates that tied fused-expert per-expert projections and scales share storage across modules while untied experts remain independent. |
| Configuration and documentation updates | modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml, CHANGELOG.rst | Adds *self_conditioning* disabled quantizer rule for diffusion models and documents all tied-weight features. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1327: Both PRs modify the HF quantized-weight export logic in modelopt/torch/export/unified_export_hf.py. This PR adds tied-weight canonical/aliasing dedup in _export_quantized_weight, while #1327 extends _export_quantized_weight for nvfp4_w4a16 and embedding export eligibility, so the two overlap at the code level in the same function.

Suggested reviewers

  • realAsma
  • meenchen
  • sugunav14
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 79.07% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately reflects the main change: adding tied-weight PTQ export support for encoder-decoder block-diffusion models like DiffusionGemma, which is the primary focus of all file changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed PR diff has no CRITICAL issues: no torch.load(weights_only=False), numpy.load(allow_pickle=True), trust_remote_code=True, eval/exec, or '# nosec'.


@juhi10071998 juhi10071998 changed the title Add tied-weight PTQ export support for block-diffusion encoder-decoder models (DiffusionGemma) Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support ] Jun 13, 2026
@codecov

codecov Bot commented Jun 13, 2026


Codecov Report

❌ Patch coverage is 86.60714% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.27%. Comparing base (2640551) to head (0543907).

Files with missing lines Patch % Lines
modelopt/torch/export/unified_export_hf.py 87.20% 11 Missing ⚠️
modelopt/torch/export/moe_utils.py 84.00% 4 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (2640551) and HEAD (0543907). Click for more details.

HEAD has 2 uploads less than BASE
Flag BASE (2640551) HEAD (0543907)
unit 2 1
gpu 4 3
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1707      +/-   ##
==========================================
- Coverage   76.41%   68.27%   -8.14%     
==========================================
  Files         511      511              
  Lines       56236    56346     +110     
==========================================
- Hits        42970    38473    -4497     
- Misses      13266    17873    +4607     
Flag Coverage Δ
examples 41.82% <33.03%> (+2.35%) ⬆️
gpu 31.58% <2.67%> (-26.80%) ⬇️
regression 14.67% <2.67%> (+0.04%) ⬆️
unit 54.49% <80.35%> (+0.09%) ⬆️


@coderabbitai coderabbitai Bot left a comment

Contributor


Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_fused_experts.py (1)

561-568: 💤 Low value

Move import to top of file or add justification comment.

The import at line 566 is inside a function without a comment explaining why. Per CONTRIBUTING.md, imports belong at the top so errors surface at collection time. _export_quantized_weight is from a core modelopt module with no circular-import or optional-dependency concern.

♻️ Suggested fix

Add the import near the other modelopt.torch.export imports at the top of the file:

 from modelopt.torch.export.moe_utils import _export_fused_experts
 from modelopt.torch.export.quant_utils import get_quant_config
+from modelopt.torch.export.unified_export_hf import _export_quantized_weight

Then simplify the helper:

 def _clear_fused_experts_caches():
     """Clear function-static alias caches in both export entry points."""
     _export_fused_experts.__dict__.pop("_tied_unpacked_cache", None)
-    # _export_fused_experts internally calls _export_quantized_weight per per-expert
-    # wrapper; clear that cache too so each test sees a pristine state.
-    from modelopt.torch.export.unified_export_hf import _export_quantized_weight
-
     _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 561
- 568, Move the local import of _export_quantized_weight out of
_clear_fused_experts_caches and place it with the other modelopt.torch.export
imports at the top of the test file (or, if there is a deliberate reason to
import inside the function, add a one-line justification comment explaining
why); then simplify _clear_fused_experts_caches to directly reference
_export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
alongside the existing _export_fused_experts cache pop. Ensure you reference the
symbols _clear_fused_experts_caches, _export_quantized_weight, and
_export_fused_experts when making the change.

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f4075f4-f09e-491d-b111-ffe09bfe2a7a

📥 Commits

Reviewing files that changed from the base of the PR and between 2640551 and 0543907.

📒 Files selected for processing (11)
  • CHANGELOG.rst
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/export/model_utils.py
  • modelopt/torch/export/moe_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/utils/dataset_utils.py
  • modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
  • tests/_test_utils/torch/quantization/tied_modules.py
  • tests/unit/torch/export/test_unified_export_hf.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py

Comment on lines +45 to +51

Tied-experts dedup: when multiple fused-expert modules share their 3-D
source params via HF ``_tied_weights_keys``, the unpacking creates fresh
per-expert tensors that break the tie. We cache the source ``data_ptr()``
at entry and on a later cache hit alias the per-expert ``weight`` /
``weight_scale`` / ``weight_scale_2`` back to the prior module so
downstream dedup catches them. ``input_scale`` is left per-side.

Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Docstring contradicts implementation for input_scale aliasing.

Line 51 states input_scale is left per-side, but line 221 explicitly aliases input_scale along with weight_scale and weight_scale_2. The implementation comment at lines 195-198 correctly explains that input_scale IS aliased because sync_tied_input_amax runs earlier.

📝 Suggested docstring fix
     Tied-experts dedup: when multiple fused-expert modules share their 3-D
     source params via HF ``_tied_weights_keys``, the unpacking creates fresh
     per-expert tensors that break the tie. We cache the source ``data_ptr()``
     at entry and on a later cache hit alias the per-expert ``weight`` /
-    ``weight_scale`` / ``weight_scale_2`` back to the prior module so
-    downstream dedup catches them. ``input_scale`` is left per-side.
+    ``weight_scale`` / ``weight_scale_2`` / ``input_scale`` back to the prior
+    module so downstream dedup catches them. ``input_scale`` aliasing is safe
+    because ``sync_tied_input_amax`` runs earlier and max-merges the shared
+    input_quantizer amaxes, so both sides derive bit-identical values.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/moe_utils.py` around lines 45 - 51, The docstring for
the tied-experts dedup block contradicts the implementation by saying
"input_scale is left per-side" while the code in moe_utils.py aliases
input_scale alongside weight_scale and weight_scale_2 (due to
sync_tied_input_amax running earlier); update the docstring to state that
input_scale is aliased too and mention the reason (sync_tied_input_amax runs
prior), keeping the rest of the explanation about caching data_ptr() and
aliasing behavior intact so docstring matches the implementation.
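The reasoning behind aliasing `input_scale` can be sketched in a few lines. This is a toy illustration, not the modelopt implementation: `ToyQuantizer`, `max_merge_amax`, and the divisor `QMAX` are hypothetical stand-ins, and plain floats replace tensors. It shows only that once the per-side amax values are max-merged, both sides derive the same scale, so sharing one buffer loses nothing.

```python
from dataclasses import dataclass

QMAX = 448.0  # hypothetical quantization range, chosen only for illustration


@dataclass
class ToyQuantizer:
    """Stand-in for an input quantizer holding a calibrated amax."""

    amax: float

    def input_scale(self) -> float:
        # Scale derived from the calibrated absolute maximum.
        return self.amax / QMAX


def max_merge_amax(quantizers):
    """Set every quantizer's amax to the maximum seen across the tied group."""
    merged = max(q.amax for q in quantizers)
    for q in quantizers:
        q.amax = merged
    return merged


encoder_q = ToyQuantizer(amax=3.5)  # calibrated on the encoder side
decoder_q = ToyQuantizer(amax=6.0)  # calibrated on the decoder side

# Before merging, the two sides would export different input_scale values.
scales_before = (encoder_q.input_scale(), decoder_q.input_scale())

max_merge_amax([encoder_q, decoder_q])

# After merging, both sides derive the same value, so one buffer can serve both.
scales_after = (encoder_q.input_scale(), decoder_q.input_scale())
```

Taking the maximum rather than, say, the mean means neither side's activations exceed the shared range, which matches the PR description's stated goal of not clipping on either side.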

Comment on lines +721 to +744
# Tied-weight dedup: if a previously-processed module shared the same
# source weight memory, alias the packed weight + scale buffers so the
# downstream data_ptr dedup in postprocess_state_dict can collapse them.
# input_scale is safe to alias because sync_tied_input_amax (earlier in
# this export) already max-merged the per-side amaxes.
_cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {})
_prior = _cache.get(_tied_source_data_ptr)
if _prior is not None and _prior is not sub_module:
    if hasattr(_prior, weight_name):
        setattr(sub_module, weight_name, getattr(_prior, weight_name))
    for _attr in (
        quantizer_attrs.weight_scale,
        quantizer_attrs.weight_scale_2,
        quantizer_attrs.input_scale,
    ):
        if _attr is None or not hasattr(_prior, _attr):
            continue
        if _attr in sub_module._buffers:
            del sub_module._buffers[_attr]
        elif hasattr(sub_module, _attr):
            delattr(sub_module, _attr)
        sub_module.register_buffer(_attr, getattr(_prior, _attr))
else:
    _cache[_tied_source_data_ptr] = sub_module

Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset tied-weight alias cache per export invocation.

_tied_weight_alias_cache is function-static and never cleared in the export path. Across multiple exports in one process, recycled data_ptr() values can hit stale entries and alias to modules from a prior export, which can corrupt checkpoint contents and retain unnecessary module references.

💡 Suggested fix
 def _process_quantized_modules(
     model: nn.Module,
     dtype: torch.dtype,
     is_modelopt_qlora: bool = False,
 ) -> None:
+    # Isolate aliasing state to a single export run.
+    _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
+
     """Process all quantized modules in model, export weights in-place.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/unified_export_hf.py` around lines 721 - 744, The
tied-weight alias cache (_tied_weight_alias_cache) is stored on the function
object of _export_quantized_weight and persists across exports; reset it at the
start of each export invocation to avoid stale aliases: in the
_export_quantized_weight function (or the outer export entrypoint that calls it)
initialize or clear
_export_quantized_weight.__dict__['_tied_weight_alias_cache'] (or replace with a
fresh dict) before using _tied_source_data_ptr so each export gets a fresh cache
and no stale module references are reused.
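The stale-entry hazard described in this comment can be reproduced with a small, torch-free sketch. Everything here is hypothetical: `ToyModule`, `export_module`, and `export_model` are illustrative names, and an explicit integer `ptr` field stands in for `data_ptr()` so that address reuse is deterministic. It is meant only to show why a function-static cache should be cleared once per export run.

```python
class ToyModule:
    """Stand-in for a module; `ptr` mimics the source weight's data_ptr()."""

    def __init__(self, name, ptr):
        self.name = name
        self.ptr = ptr
        self.alias_of = None  # name of the module whose buffers get reused


def export_module(module):
    # Function-static cache stored on the function object.
    cache = export_module.__dict__.setdefault("_alias_cache", {})
    prior = cache.get(module.ptr)
    if prior is not None and prior is not module:
        module.alias_of = prior.name  # reuse the earlier module's buffers
    else:
        cache[module.ptr] = module


def export_model(modules, reset_cache):
    """Export every module, optionally starting from an empty alias cache."""
    if reset_cache:
        # Isolate aliasing state to this export run.
        export_module.__dict__.pop("_alias_cache", None)
    for m in modules:
        export_module(m)


# Run 1: a tied encoder/decoder pair sharing address 0x1000.
run1 = [ToyModule("enc_a", 0x1000), ToyModule("dec_a", 0x1000)]
export_model(run1, reset_cache=True)

# Run 2 without a reset: an unrelated module whose weight reuses address 0x1000.
stale = [ToyModule("enc_b", 0x1000)]
export_model(stale, reset_cache=False)

# Run 2 with a reset: same situation, but the cache starts empty.
fresh = [ToyModule("enc_c", 0x1000)]
export_model(fresh, reset_cache=True)
```

In the sketch, `stale[0]` ends up aliased to `enc_a` from the earlier run, while `fresh[0]` is not aliased at all. One detail to watch when applying the suggested diff: a statement placed before a function's docstring turns that string into an ordinary expression, so the reset probably belongs just after the docstring.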

Comment on lines +1545 to +1547
post_state_dict, hf_quant_config = _export_transformers_checkpoint(
    model, dtype, canonical_tied_naming=canonical_tied_naming
)

Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Forward **kwargs to preserve transformer export contract.

The new call drops **kwargs, so options consumed inside _export_transformers_checkpoint (e.g., accelerator) are silently ignored. That can break distributed state-dict gathering and produce incomplete exports.

💡 Suggested fix
-        post_state_dict, hf_quant_config = _export_transformers_checkpoint(
-            model, dtype, canonical_tied_naming=canonical_tied_naming
-        )
+        post_state_dict, hf_quant_config = _export_transformers_checkpoint(
+            model,
+            dtype,
+            canonical_tied_naming=canonical_tied_naming,
+            **kwargs,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/unified_export_hf.py` around lines 1545 - 1547, The
call to _export_transformers_checkpoint is dropping **kwargs (so options like
accelerator are ignored); update the call in unified_export_hf.py where
post_state_dict, hf_quant_config = _export_transformers_checkpoint(model, dtype,
canonical_tied_naming=canonical_tied_naming) to forward any incoming **kwargs
(e.g., pass **kwargs alongside the existing named args) so that options consumed
inside _export_transformers_checkpoint (such as accelerator) are preserved
during export; keep the canonical_tied_naming arg as-is when forwarding
**kwargs.
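A minimal sketch of the failure mode, using hypothetical functions rather than the real export API: `inner_export` stands in for a callee that reads an optional `accelerator` keyword, and the two wrappers differ only in whether they forward `**kwargs`. The string arguments are placeholders, not real model or dtype objects.

```python
def inner_export(model, dtype, canonical_tied_naming=False, **kwargs):
    """Hypothetical callee that consumes an optional `accelerator` option."""
    accelerator = kwargs.get("accelerator")
    return {"gathered_with": accelerator, "canonical": canonical_tied_naming}


def export_dropping_kwargs(model, dtype, canonical_tied_naming=False, **kwargs):
    # kwargs are accepted but never passed on, so `accelerator` is silently lost.
    return inner_export(model, dtype, canonical_tied_naming=canonical_tied_naming)


def export_forwarding_kwargs(model, dtype, canonical_tied_naming=False, **kwargs):
    # Forwarding keeps every caller-supplied option visible to the callee.
    return inner_export(
        model, dtype, canonical_tied_naming=canonical_tied_naming, **kwargs
    )


dropped = export_dropping_kwargs("model", "bf16", accelerator="acc")
forwarded = export_forwarding_kwargs("model", "bf16", accelerator="acc")
```

No exception is raised in the dropping variant, which is why this kind of regression tends to surface only as an incomplete or incorrect export rather than as an error.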

