Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support] #1707
juhi10071998 wants to merge 8 commits into
…ptq.py
DiffusionGemmaForBlockDiffusion is an encoder-decoder block-diffusion
text LLM (Gemma4 MoE per-layer transformer wrapped in an encoder +
iterative-decoder + self-conditioning + 48-step denoising loop). Four
small additions make stock examples/llm_ptq/hf_ptq.py work end-to-end
for it; no new entry points needed.
Substring patterns are chosen so they ALSO match the previous class
name DiffusionGemma4ModelForBlockDiffusion (and previous model_type
"diffusion_gemma4"). The model was renamed in transformers
mid-development; using "diffusiongemma" / "DiffusionGemma" /
"diffusion_gemma" matches both the current and the older class so we
don't silently regress on either checkpoint generation.
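A minimal sketch of the substring rule described above, checking that the chosen patterns cover both spellings. The helper `matches_any` and the `PATTERNS` tuple are hypothetical names used only for illustration; they are not ModelOpt code, and the real lists live in the files touched below.

```python
# Hypothetical illustration of the substring rule; not ModelOpt code.
PATTERNS = ("diffusiongemma", "diffusion_gemma")


def matches_any(name: str, patterns=PATTERNS) -> bool:
    """Case-insensitive substring test against every pattern."""
    lowered = name.lower()
    return any(pattern in lowered for pattern in patterns)


NAMES = {
    "current class": "DiffusionGemmaForBlockDiffusion",
    "previous class": "DiffusionGemma4ModelForBlockDiffusion",
    "current model_type": "diffusion_gemma",
    "previous model_type": "diffusion_gemma4",
    "unrelated class": "T5ForConditionalGeneration",
}

# Every DiffusionGemma spelling should match; the unrelated class should not.
results = {label: matches_any(name) for label, name in NAMES.items()}
for label, matched in results.items():
    print(f"{label}: {matched}")
```

Because the older names only append a "4" or insert "Model", any pattern that stops before those characters matches both generations.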
1. modelopt/torch/utils/dataset_utils.py
Add "diffusiongemma" to the substring list in model_type_is_enc_dec.
This routes ModelOpt's built-in calibration forward_loop to
model.generate() instead of a single model.forward(), so calibration
exercises the full 48-step inner denoising loop and sees the entire
noise->clean activation distribution.
2. modelopt/torch/export/model_utils.py
Add "DiffusionGemma": "diffusion_gemma" to MODEL_NAME_TO_TYPE,
*before* "Gemma". get_model_type does substring matching; "gemma" is
a substring of "diffusiongemma" so without this entry the class is
silently mis-classified as plain "gemma" -- which then mis-routes
downstream model-type-dependent logic.
3. examples/llm_ptq/hf_ptq.py (output_decode)
DiffusionGemma.generate() returns a DiffusionGemmaGenerationOutput
ModelOutput dataclass, not a bare token tensor. The preview decode
code's tensor-slicing crashes on ModelOutput. Unwrap to .sequences
at the top of output_decode so both the enc-dec and AR slicing
branches work uniformly. Generic shim -- helps any model whose
.generate returns a ModelOutput with a .sequences attribute.
4. examples/llm_ptq/example_utils.py (is_enc_dec)
Comment-only clarification on the semantics. is_enc_dec controls
how hf_ptq.py:output_decode slices the preview result (whether to
strip the prompt prefix). For T5/BART/Whisper generate returns only
new tokens. For DiffusionGemma it returns prompt+canvas
concatenated, so it belongs with the AR slicing path and stays out
of this list. No behavioral change; documents the prior intent so
future contributors don't re-add diffusion_gemma here.
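The ordering hazard in item 2 can be sketched with a first-match substring lookup. The function `lookup_model_type` below is a simplified, hypothetical stand-in for get_model_type (the real function may differ in detail), and the two tables are toy versions of MODEL_NAME_TO_TYPE.

```python
from typing import Optional


def lookup_model_type(class_name: str, table: dict) -> Optional[str]:
    """Return the value of the first key that is a substring of class_name."""
    lowered = class_name.lower()
    for key, model_type in table.items():  # dicts iterate in insertion order
        if key.lower() in lowered:
            return model_type
    return None


# Toy tables: same entries, different insertion order.
gemma_first = {"Gemma": "gemma", "DiffusionGemma": "diffusion_gemma"}
diffusion_first = {"DiffusionGemma": "diffusion_gemma", "Gemma": "gemma"}

name = "DiffusionGemmaForBlockDiffusion"
misclassified = lookup_model_type(name, gemma_first)
classified = lookup_model_type(name, diffusion_first)
plain_gemma = lookup_model_type("Gemma3ForCausalLM", diffusion_first)

print(misclassified, classified, plain_gemma)
```

With the more specific key first, plain Gemma classes still fall through to the "Gemma" entry, so the reordering does not change their classification.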
End-to-end smoke test:
    python hf_ptq.py --pyt_ckpt_path <local northbloom checkpoint> \
        --qformat nvfp4_experts_only \
        --calib_size 32 --trust_remote_code \
        --export_path <somewhere>
produces a working NVFP4 checkpoint with coherent pre/post-PTQ preview.
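The preview decode depends on the `.sequences` unwrap from item 3. A rough sketch of that shim, assuming plain Python lists in place of token tensors: `FakeGenerationOutput`, `unwrap_sequences`, and `strip_prompt` are hypothetical names, and the actual check in output_decode is not shown in this commit message.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class FakeGenerationOutput:
    """Hypothetical stand-in for a ModelOutput-style object with .sequences."""

    sequences: List[List[int]]


def unwrap_sequences(generated: Any) -> Any:
    """Return generated.sequences when present, else the object unchanged."""
    return getattr(generated, "sequences", generated)


def strip_prompt(generated: Any, prompt_len: int) -> List[List[int]]:
    """AR-style slicing: drop the prompt prefix from every row."""
    token_rows = unwrap_sequences(generated)
    return [row[prompt_len:] for row in token_rows]


bare = [[1, 2, 3, 40, 41]]  # generate() returned raw tokens
wrapped = FakeGenerationOutput(sequences=[[1, 2, 3, 40, 41]])  # wrapped output

from_bare = strip_prompt(bare, prompt_len=3)
from_wrapped = strip_prompt(wrapped, prompt_len=3)
print(from_bare, from_wrapped)
```

Unwrapping once at the top means neither slicing branch needs to know which kind of object generate() returned.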
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
When a model has multiple fused-experts modules whose 3-D source params share storage via HF _tied_weights_keys (e.g. the encoder and decoder transformer stacks of a block-diffusion encoder-decoder LLM like DiffusionGemma4 / northbloom), the unpacking loop in _export_fused_experts ordinarily creates fresh per-expert tensors for each call -- destroying the tied identity and writing two full sets of expert weights + scales to disk.

This adds a function-local cache keyed by (gate_up_proj.data_ptr(), down_proj.data_ptr()). On a cache miss the existing unpacking path runs unchanged. On a cache hit, after the normal unpacking completes, the per-expert weight / weight_scale / weight_scale_2 buffers are re-pointed at the prior module's tensors so they share storage. The downstream postprocess_state_dict data_ptr()-based dedup then catches them and drops the duplicates from the saved checkpoint.

input_scale is intentionally NOT aliased: encoder and decoder paths have legitimately different activation distributions (verified across all 60 tied pairs of a 512-prompt calibration: down_proj_input divergence median 2.26x, max 18.4x), so each side keeps its own per-side calibrated scale.

Model-agnostic: no name regex, no model-type lookup. Cache miss falls through to existing behavior, so non-tied models are unaffected.

Empirical result on DiffusionGemma4 26B at nvfp4_experts_only:
  safetensors: 28.43 GB -> 16.47 GB (-42%)
  shards: 4 -> 2
  decoder weight/weight_scale/weight_scale_2 entries: 3840 -> 0 each
  decoder input_scale entries: 3840 -> 3840 (kept per-side, as intended)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
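The cache described in this commit can be sketched as follows. This is a toy under stated assumptions: `unpack_experts` is a hypothetical stand-in for the unpacking loop, `clone()` stands in for the quantize-and-pack step that allocates fresh storage, and the cache is passed in explicitly rather than stored on the function.

```python
import torch


def unpack_experts(gate_up: torch.Tensor, down: torch.Tensor, cache: dict) -> dict:
    """Split fused 3-D expert params per expert; alias to a prior result on a hit."""
    key = (gate_up.data_ptr(), down.data_ptr())  # identifies the source storage
    unpacked = {}
    for i in range(gate_up.shape[0]):
        # clone() stands in for quantize + pack, which allocates new memory.
        unpacked[f"experts.{i}.gate_up_proj.weight"] = gate_up[i].clone()
        unpacked[f"experts.{i}.down_proj.weight"] = down[i].clone()
    prior = cache.get(key)
    if prior is None:
        cache[key] = unpacked  # cache miss: this module owns the tensors
        return unpacked
    # Cache hit: re-point every entry at the prior module's tensors.
    return {name: prior[name] for name in unpacked}


shared_gate_up = torch.randn(4, 8, 6)  # tied between encoder and decoder
shared_down = torch.randn(4, 6, 8)
other_gate_up = torch.randn(4, 8, 6)   # an untied module's params
other_down = torch.randn(4, 6, 8)

cache = {}
encoder = unpack_experts(shared_gate_up, shared_down, cache)
decoder = unpack_experts(shared_gate_up, shared_down, cache)
untied = unpack_experts(other_gate_up, other_down, cache)

tied_share = all(encoder[k].data_ptr() == decoder[k].data_ptr() for k in encoder)
untied_differs = all(encoder[k].data_ptr() != untied[k].data_ptr() for k in encoder)
print(tied_share, untied_differs)
```

Once the per-expert tensors share storage again, a pointer-based dedup over the state dict can drop the second copy.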
Symmetric companion to the data_ptr-cache alias added to
_export_fused_experts in the prior commit. Catches plain nn.Linear
modules whose .weight Parameters are tied via HF tie_weights() (e.g.
encoder and decoder attention QKV/O, MLP gate/up/down, router proj of
an encoder-decoder LLM like DiffusionGemma4 / northbloom) when they
are quantized under recipes that route through _export_quantized_weight.
Mechanism: capture weight.data_ptr() at the top of the function,
before the setattr further down wraps the packed bytes in a fresh
nn.Parameter (which destroys the tie). At the end of the function,
consult a function-local cache keyed by that captured data_ptr. On
cache miss: register this sub_module as the canonical owner. On cache
hit (a previously-processed module shared the same source weight
memory): alias .weight, weight_scale, weight_scale_2 to the prior
module's tensors so downstream data_ptr-based dedup in
postprocess_state_dict drops the duplicates. input_scale is
intentionally NOT aliased — calibration legitimately diverges per-side
(verified in Q2 analysis: down_proj_input ratio up to 18x across 60
tied pairs on the northbloom DiffusionGemma4 model).
Recipe-agnostic: under nvfp4_experts_only this is a true no-op (dense
Linears early-return at QUANTIZATION_NONE; per-expert wrappers reach
this function but have fresh data_ptrs from upstream slice+contiguous,
so they always miss the cache). Under full nvfp4 it fires for every
tied dense Linear pair.
Same safety guarantees as the existing data_ptr-based dedup: cannot
false-positive because it only aliases when source memory was already
shared by the model author (via tie_weights or equivalent). No name
regex, no _tied_weights_keys lookup, no model introspection.
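A small sketch of the mechanism, assuming a float16 cast as a stand-in for quantize-and-pack. `export_weight` and `alias_cache` are hypothetical names; the point is that the pointer has to be captured before the weight is re-wrapped, because re-wrapping breaks the tie.

```python
import torch
import torch.nn as nn

encoder_proj = nn.Linear(8, 8, bias=False)
decoder_proj = nn.Linear(8, 8, bias=False)
decoder_proj.weight = encoder_proj.weight  # tie: one Parameter, two modules

alias_cache = {}  # hypothetical: source data_ptr -> first module exported


def export_weight(module: nn.Linear) -> None:
    source_ptr = module.weight.data_ptr()  # capture BEFORE replacing the weight
    # Stand-in for quantize + pack: a new tensor in fresh storage.
    packed = module.weight.detach().to(torch.float16)
    module.weight = nn.Parameter(packed, requires_grad=False)  # tie broken here
    prior = alias_cache.get(source_ptr)
    if prior is None:
        alias_cache[source_ptr] = module  # cache miss: canonical owner
    elif prior is not module:
        module.weight = prior.weight  # cache hit: alias to the owner's tensor


tied_before = encoder_proj.weight.data_ptr() == decoder_proj.weight.data_ptr()
export_weight(encoder_proj)
tie_broken = encoder_proj.weight.data_ptr() != decoder_proj.weight.data_ptr()
export_weight(decoder_proj)
tie_restored = encoder_proj.weight.data_ptr() == decoder_proj.weight.data_ptr()
print(tied_before, tie_broken, tie_restored)
```

The alias only fires when two modules already pointed at the same source memory, which is why an untied pair can never be merged by this path.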
Empirical results on DiffusionGemma4 26B (calib_size=8 smoke):
  nvfp4_experts_only: 16.47 GB -> 16.47 GB (byte-identical with prior
    experts-only export; this patch is a no-op, as expected)
  nvfp4 (full): ~27 GB est -> 14.24 GB (-12.7 GB; both patches firing
    on disjoint module sets; per-name dedup verified in index)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Adds an opt-in pass (--canonical_tied_naming, default off) that reorders
the state_dict before postprocess_state_dict so that keys matching the
canonical side of HF's _tied_weights_keys declaration iterate before
their aliases. The existing first-wins data_ptr dedup at
quant_utils.py:1148-1163 then drops the alias names, leaving the
canonical names in the exported safetensors.
Motivation: for models like DiffusionGemma4 (northbloom), HF declares
{alias: canonical} via _tied_weights_keys, where the encoder side is
the alias and the decoder side is canonical. The original HF
safetensors index uses decoder-prefixed names for all tied weights
(661/691 keys vs 30/691 encoder-only layer_scalar keys). The
single-backbone vLLM mockup loader strips both prefixes to model.* and
relies on this canonical naming.
The default modelopt export today walks the model in registration
order (encoder before decoder, per the model's __init__ order), so
encoder names win in the first-wins dedup. The exported checkpoint
thus uses 46 677 encoder-prefixed keys for tied tensors -- backwards
relative to the upstream HF naming and to what downstream consumers
expect.
Implementation. _tied_weights_keys is declared per model class with
paths relative to that class. In nested models (e.g. DiffusionGemma4)
multiple submodules declare their own ties: the outer wrapper at
DiffusionGemma4ModelForBlockDiffusion ties lm_head.weight to
model.decoder.embed_tokens.weight, while the inner DiffusionGemma4Model
at model.model declares the much larger encoder<->decoder dict with
paths relative to itself.
_collect_canonical_tied_patterns walks model.named_modules() and
collects every dict-style _tied_weights_keys declaration, prefixing
each pattern with the submodule's qualified path so the regexes match
against root-level state_dict keys. Without the prefix, the inner
dict's patterns (which lack a "model." prefix) silently fail to match
keys like "model.decoder.layers.0.self_attn.q_proj.weight" -- a bug
that would cause only the outer dict's single entry (embed_tokens) to
flip in the dedup.
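The prefixing step can be sketched with plain regexes. This is an illustration under assumptions: `collect_patterns` is a hypothetical helper, the declarations list imitates what walking named_modules() would yield, the regex strings are invented examples of dict-style {alias: canonical} entries, and matching is assumed to be anchored at the start of the key.

```python
import re


def collect_patterns(declarations, add_prefix=True):
    """Compile canonical-side patterns, optionally prefixed with the module path."""
    patterns = []
    for qualified_name, tied in declarations:
        if not isinstance(tied, dict):  # legacy list-style: no canonical side
            continue
        use_prefix = add_prefix and qualified_name
        prefix = re.escape(qualified_name) + r"\." if use_prefix else ""
        for canonical in tied.values():  # dict is {alias: canonical}
            patterns.append(re.compile(prefix + canonical))
    return patterns


# (qualified module path, _tied_weights_keys) pairs; contents are hypothetical.
declarations = [
    ("", {r"lm_head\.weight": r"model\.decoder\.embed_tokens\.weight"}),
    ("model", {
        r"encoder\.layers\.\d+\.self_attn\.q_proj\.weight":
            r"decoder\.layers\.\d+\.self_attn\.q_proj\.weight",
    }),
    ("legacy", ["lm_head.weight"]),
]

key = "model.decoder.layers.0.self_attn.q_proj.weight"
prefixed = collect_patterns(declarations, add_prefix=True)
unprefixed = collect_patterns(declarations, add_prefix=False)

hit_with_prefix = any(p.match(key) for p in prefixed)
hit_without_prefix = any(p.match(key) for p in unprefixed)
print(hit_with_prefix, hit_without_prefix)
```

The outer wrapper's pattern needs no prefix because its paths are already relative to the root, so it keeps matching either way.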
_reorder_canonical_first then partitions the state_dict into head
(canonical-pattern matches) and tail (everything else), preserving
original order within each partition. head.update(tail) yields a
single dict with canonical keys first. The downstream dedup loop
iterates this in insertion order and records the canonical names in
seen_tensors; alias names then arrive as duplicates and are dropped.
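The reorder and the first-wins dedup interact as sketched below. This is a toy: `reorder_canonical_first` and `first_wins_dedup` are simplified stand-ins for the helpers named above, bytearray objects stand in for tensors, and `id()` stands in for data_ptr().

```python
import re


def reorder_canonical_first(state_dict: dict, patterns) -> dict:
    """Move canonical-pattern keys to the front, keeping order within each part."""
    head = {k: v for k, v in state_dict.items() if any(p.match(k) for p in patterns)}
    tail = {k: v for k, v in state_dict.items() if k not in head}
    head.update(tail)
    return head


def first_wins_dedup(state_dict: dict) -> dict:
    """Keep only the first key seen for each underlying storage."""
    seen, kept = set(), {}
    for key, storage in state_dict.items():
        if id(storage) in seen:  # id() stands in for tensor.data_ptr()
            continue
        seen.add(id(storage))
        kept[key] = storage
    return kept


shared_q = bytearray(4)      # tied between encoder and decoder
encoder_only = bytearray(4)  # e.g. an encoder-only layer_scalar
state_dict = {
    "model.encoder.layers.0.q_proj.weight": shared_q,
    "model.encoder.layers.0.layer_scalar": encoder_only,
    "model.decoder.layers.0.q_proj.weight": shared_q,
}
patterns = [re.compile(r"model\.decoder\.layers\.\d+\.q_proj\.weight")]

default_keys = list(first_wins_dedup(state_dict))
canonical_keys = list(first_wins_dedup(reorder_canonical_first(state_dict, patterns)))
print(default_keys)
print(canonical_keys)
```

The dedup itself is untouched; only the iteration order changes, which is what decides the surviving name.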
Scope of behavior change:
- Models with no _tied_weights_keys, or only legacy list-of-strings
declarations: _collect returns an empty pattern list, helper
short-circuits, state_dict returned unchanged. Zero effect on any
existing modelopt user.
- Models with dict-style declarations (e.g. DiffusionGemma4): when
the flag is set, canonical-side names win. When the flag is unset
(default), behavior is identical to before this commit.
No changes to dedup logic itself, to the existing tied-weight alias
patches in _export_quantized_weight and _export_fused_experts, or to
_process_quantized_modules iteration. Strictly additive.
Verified on DiffusionGemma4 26B / nvfp4_experts_only / v4 / calib_size
32: 35 127 encoder keys removed (decoder kept), 0 decoder keys
removed; layer_scalar (30 encoder-only keys) and per-side input_scale
(11 520 keys, intentionally not deduped) unaffected; total safetensors
bytes 17.68 GB matching the prior export to within 500 KB of
safetensors-metadata-ordering noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Companion piece to the existing tied-weight alias patches in _export_fused_experts (commit 8b00e85) and _export_quantized_weight (commit e8c36b0), which already alias bit-identical weight / weight_scale / weight_scale_2 between tied modules but leave input_scale per-side. This commit closes the loop on input_scale so consumers that load a single canonical scale per Linear (e.g. vLLM's single-backbone DiffusionGemma4 mockup) see a value consistent across all tied sides.

Implementation has two parts.

1. New sync_tied_input_amax(model) helper. Walks named_modules(), groups by source weight data_ptr (same signature our existing dedup patches use), and max-merges input_quantizer.amax across each group. Uses the canonical 4-line idiom shared with preprocess_linear_fusion (quant_utils.py:1394-1401) and sync_moe_gate_up_amax (layer_utils.py:1197):

       merged = torch.max(torch.stack([q.amax for q in qs]))
       for q in qs:
           q.amax = merged.clone()

   Handles both dense Linears (keyed by weight.data_ptr) and fused MoE modules (keyed by (gate_up_proj, down_proj) data_ptr tuple, merging gate_up_proj_input_quantizer and down_proj_input_quantizer independently across the group). Scalar-only, matching preprocess_linear_fusion's contract. Called unconditionally from _export_transformers_checkpoint after sync_moe_gate_up_amax, BEFORE _process_quantized_modules so the merged amax flows into _export_quantized_weight's input_scale derivation. Mirrors sync_moe_gate_up_amax's "no-flag, fires when applicable, no-op otherwise" convention.

2. Extend the existing tied-weight alias loops in _export_quantized_weight and _export_fused_experts to include input_scale alongside weight_scale / weight_scale_2. Before this commit those loops intentionally skipped input_scale because encoder/decoder amaxes legitimately differed (Q2 analysis showed up to 18x divergence for down_proj_input on v1). With sync_tied_input_amax in place, both sides now derive bit-identical input_scale values; aliasing the buffers is safe and lets the existing data_ptr dedup in postprocess_state_dict collapse them so only one canonical entry per Linear survives in the exported safetensors.

Also extends the Q-B canonical-side reorder pass added in commit 837768f with an auto-derived side-substring matcher. HF's _tied_weights_keys regex patterns target the pre-export module structure (fused gate_up_proj), but after _export_fused_experts unpacks them into per-expert gate_proj/up_proj/down_proj submodules, post-export keys like ...experts.Y.gate_proj.input_scale are not covered by HF's regex. Without the substring fallback, those keys fell through Q-B to the "alias-first" partition, so when the new input_scale alias step shared data_ptrs, the encoder name won the dedup instead of the decoder name.

_collect_canonical_tied_patterns now returns (patterns, side_substrings). The side_substrings list is auto-derived from each _tied_weights_keys entry as the set of dot-separated tokens that appear in canonical patterns but not in alias patterns. For DiffusionGemma4 this resolves to ["decoder"]: every canonical pattern contains "decoder", no alias pattern does. _reorder_canonical_first treats a key as canonical if it matches a regex pattern OR contains a side substring as a proper path component (bordered by "." or at start/end). The path-component requirement avoids false positives from accidental name collisions.

Net effect for DiffusionGemma4 nvfp4_experts_only / v4 / calib_size 32: the 11 520 encoder.X.gate_proj/up_proj/down_proj.input_scale entries that the prior export carried are removed; the 11 520 decoder-side entries remain with the merged amax-derived value. Total bytes drops by ~1 MB (scalar entries). Other tied-tensor entries (weight, weight_scale, weight_scale_2) and encoder-only entries (layer_scalar, 30 keys) are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
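The side-substring fallback from this commit can be sketched as below. This is an approximation under assumptions: `derive_side_substrings` and `has_path_component` are hypothetical helpers, the tied dict uses plain dotted paths instead of regexes (real patterns would need their regex syntax stripped before tokenizing), and the derivation is a simple set difference.

```python
def derive_side_substrings(tied: dict) -> list:
    """Tokens that occur in canonical paths but in no alias path."""
    alias_tokens, canonical_tokens = set(), set()
    for alias, canonical in tied.items():  # dict is {alias: canonical}
        alias_tokens.update(alias.split("."))
        canonical_tokens.update(canonical.split("."))
    return sorted(canonical_tokens - alias_tokens)


def has_path_component(key: str, substrings) -> bool:
    """True only if a substring is a whole dot-separated component of key."""
    components = key.split(".")
    return any(s in components for s in substrings)


# Hypothetical plain-path declaration for illustration.
tied = {
    "encoder.layers.0.experts.gate_up_proj": "decoder.layers.0.experts.gate_up_proj",
    "encoder.embed_tokens.weight": "decoder.embed_tokens.weight",
}
sides = derive_side_substrings(tied)

decoder_key = "model.decoder.layers.0.experts.3.gate_proj.input_scale"
encoder_key = "model.encoder.layers.0.experts.3.gate_proj.input_scale"
collision_key = "model.predecoder.layers.0.gate_proj.input_scale"

checks = {k: has_path_component(k, sides) for k in (decoder_key, encoder_key, collision_key)}
print(sides, checks)
```

Splitting on "." rather than using a raw substring test is what keeps a name like "predecoder" from being treated as the canonical side.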
The diffusion self-conditioning network (block-diffusion models like DiffusionGemma) is text-only and not exercised by typical calibration data. Without exclusion its TensorQuantizers never see input, never set _amax, and export crashes at _export_quantized_weight:

    AttributeError: 'TensorQuantizer' object has no attribute '_amax'

Companion to the upstream vision-tower / visual / embed_vision excludes already in this unit (PR NVIDIA#1691). Pattern is a no-op for non-diffusion models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
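A quick sketch of how a `*self_conditioning*` wildcard would select quantizers, assuming glob-style matching. The module names are invented for illustration, and whether the YAML unit is evaluated with exactly these semantics is an assumption, not something this commit message states.

```python
from fnmatch import fnmatchcase

PATTERN = "*self_conditioning*"

# Hypothetical quantizer names; real module paths may differ.
quantizer_names = [
    "model.self_conditioning.proj.weight_quantizer",
    "model.self_conditioning.proj.input_quantizer",
    "model.decoder.layers.0.mlp.down_proj.input_quantizer",
    "model.encoder.layers.0.self_attn.q_proj.weight_quantizer",
]

disabled = [name for name in quantizer_names if fnmatchcase(name, PATTERN)]
still_enabled = [name for name in quantizer_names if name not in disabled]
print(disabled)
print(still_enabled)
```

A model with no module named like this simply produces an empty match list, which is the no-op behaviour described above.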
Adds the tied-modules test fixture plus 10 unit tests covering the
tied-weight machinery introduced earlier in this series.
tests/_test_utils/torch/quantization/tied_modules.py (new):
Three small factory helpers shared by the unit tests:
- make_tied_linear_pair() -- two nn.Linears whose .weight Parameter is
shared via setattr (mimics HF tie_weights() after __init__).
- tie_fused_experts_3d_params(enc, dec) -- in-place tie of
gate_up_proj / down_proj between two fused-experts modules (paired
with the existing _SyntheticFusedExperts fixture).
- wrap_in_parent_with_tied_keys(enc, dec, ...) -- builds a parent
nn.Module with HF-style _tied_weights_keys (dict-style for the
canonical case, list-style for the legacy negative case).
Each factory asserts post-conditions on the tie so a misuse fails
loudly at construction.
tests/unit/torch/export/test_unified_export_hf.py (new): 8 tests
Commit f3e9543ab -- canonical-side reorder:
- dict-style _tied_weights_keys yields patterns + canonical
substrings
- list-style yields no canonical info (reorder becomes a no-op)
- _reorder_canonical_first puts decoder-side keys ahead of
encoder-side keys
Commit 3fb3ba053 -- sync_tied_input_amax:
- tied Linears with divergent amaxes (2.0 vs 5.0) get both sides
overwritten with the elementwise max (5.0)
- untied Linears keep per-side amaxes (no-op when there's no tie)
Commit 29674a7e1 -- dense Linear tied-weight dedup:
- tied Linears share data_ptr for packed .weight + scale buffers
- untied Linears keep independent data_ptrs
- asymmetric quant: unquantized side early-returns at
QUANTIZATION_NONE, stays at the original shared Parameter
tests/unit/torch/quantization/plugins/test_fused_experts.py (extended):
2 tests
Commit 10a8fdbd5 -- MoE experts dedup:
- two _SyntheticSparseMoeBlock instances with tied 3-D source
params share data_ptr across every per-expert buffer
- untied counterparts keep independent per-expert data_ptrs
Pure-Python; CPU-only; ~1s wall total.
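For orientation, the divergent-amax case (2.0 vs 5.0) can be reproduced with a toy model. This is a sketch, not the shipped helper: `ToyQuantizer`, `ToyQuantLinear`, and `sync_tied_input_amax_sketch` are hypothetical names, and only the dense-Linear grouping is shown.

```python
from collections import defaultdict

import torch
import torch.nn as nn


class ToyQuantizer:
    """Hypothetical stand-in for an input quantizer holding a scalar amax."""

    def __init__(self, amax: float):
        self.amax = torch.tensor(amax)


class ToyQuantLinear(nn.Linear):
    def __init__(self, amax: float):
        super().__init__(4, 4, bias=False)
        self.input_quantizer = ToyQuantizer(amax)


def sync_tied_input_amax_sketch(model: nn.Module) -> None:
    """Max-merge input amax across modules whose weights share storage."""
    groups = defaultdict(list)
    for _, module in model.named_modules():
        if isinstance(module, ToyQuantLinear):
            groups[module.weight.data_ptr()].append(module.input_quantizer)
    for quantizers in groups.values():
        if len(quantizers) < 2:
            continue  # no tie: leave the calibrated value alone
        merged = torch.max(torch.stack([q.amax for q in quantizers]))
        for q in quantizers:
            q.amax = merged.clone()


model = nn.Module()
model.encoder = ToyQuantLinear(2.0)
model.decoder = ToyQuantLinear(5.0)
model.decoder.weight = model.encoder.weight  # tie the two Linears
model.untied = ToyQuantLinear(1.0)

sync_tied_input_amax_sketch(model)
amaxes = {
    name: getattr(model, name).input_quantizer.amax.item()
    for name in ("encoder", "decoder", "untied")
}
print(amaxes)
```

Taking the maximum means neither side's activations are clipped by the shared scale, at the cost of some resolution on the side with the smaller range.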
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Adds a New Features bullet under 0.46 covering the tied-weight dedup, canonical-side reorder, sync_tied_input_amax helper, and the *self_conditioning* default exclude introduced earlier in this series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
📝 Walkthrough
This PR adds post-training quantization (PTQ) and HuggingFace checkpoint export support for tied-weight encoder-decoder diffusion models. It recognizes DiffusionGemma, introduces optional canonical-side tied-weight naming, implements tensor aliasing for quantized weight deduplication, synchronizes tied input quantizer scales, and provides comprehensive test coverage.
Changes: Tied-weight PTQ and HF export for encoder-decoder models
Codecov Report
❌ Patch coverage is
Additional details and impacted files
    @@            Coverage Diff             @@
    ##             main    #1707      +/-   ##
    ==========================================
    - Coverage   76.41%   68.27%   -8.14%
    ==========================================
      Files         511      511
      Lines       56236    56346     +110
    ==========================================
    - Hits        42970    38473    -4497
    - Misses      13266    17873    +4607
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_fused_experts.py (1)
561-568: 💤 Low value. Move import to top of file or add justification comment.
The import at line 566 is inside a function without a comment explaining why. Per CONTRIBUTING.md, imports belong at the top so errors surface at collection time. `_export_quantized_weight` is from a core modelopt module with no circular-import or optional-dependency concern.
♻️ Suggested fix
Add the import near the other `modelopt.torch.export` imports at the top of the file:
     from modelopt.torch.export.moe_utils import _export_fused_experts
     from modelopt.torch.export.quant_utils import get_quant_config
    +from modelopt.torch.export.unified_export_hf import _export_quantized_weight
Then simplify the helper:
     def _clear_fused_experts_caches():
         """Clear function-static alias caches in both export entry points."""
         _export_fused_experts.__dict__.pop("_tied_unpacked_cache", None)
    -    # _export_fused_experts internally calls _export_quantized_weight per per-expert
    -    # wrapper; clear that cache too so each test sees a pristine state.
    -    from modelopt.torch.export.unified_export_hf import _export_quantized_weight
    -
         _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 561 - 568, Move the local import of _export_quantized_weight out of _clear_fused_experts_caches and place it with the other modelopt.torch.export imports at the top of the test file (or, if there is a deliberate reason to import inside the function, add a one-line justification comment explaining why); then simplify _clear_fused_experts_caches to directly reference _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None) alongside the existing _export_fused_experts cache pop. Ensure you reference the symbols _clear_fused_experts_caches, _export_quantized_weight, and _export_fused_experts when making the change.
Source: Coding guidelines
📒 Files selected for processing (11)
- CHANGELOG.rst
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/model_utils.py
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/utils/dataset_utils.py
- modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
- tests/_test_utils/torch/quantization/tied_modules.py
- tests/unit/torch/export/test_unified_export_hf.py
- tests/unit/torch/quantization/plugins/test_fused_experts.py
    Tied-experts dedup: when multiple fused-expert modules share their 3-D
    source params via HF ``_tied_weights_keys``, the unpacking creates fresh
    per-expert tensors that break the tie. We cache the source ``data_ptr()``
    at entry and on a later cache hit alias the per-expert ``weight`` /
    ``weight_scale`` / ``weight_scale_2`` back to the prior module so
    downstream dedup catches them. ``input_scale`` is left per-side.
Docstring contradicts implementation for input_scale aliasing.
Line 51 states input_scale is left per-side, but line 221 explicitly aliases input_scale along with weight_scale and weight_scale_2. The implementation comment at lines 195-198 correctly explains that input_scale IS aliased because sync_tied_input_amax runs earlier.
📝 Suggested docstring fix
     Tied-experts dedup: when multiple fused-expert modules share their 3-D
     source params via HF ``_tied_weights_keys``, the unpacking creates fresh
     per-expert tensors that break the tie. We cache the source ``data_ptr()``
     at entry and on a later cache hit alias the per-expert ``weight`` /
    -``weight_scale`` / ``weight_scale_2`` back to the prior module so
    -downstream dedup catches them. ``input_scale`` is left per-side.
    +``weight_scale`` / ``weight_scale_2`` / ``input_scale`` back to the prior
    +module so downstream dedup catches them. ``input_scale`` aliasing is safe
    +because ``sync_tied_input_amax`` runs earlier and max-merges the shared
    +input_quantizer amaxes, so both sides derive bit-identical values.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/export/moe_utils.py` around lines 45 - 51, The docstring for
the tied-experts dedup block contradicts the implementation by saying
"input_scale is left per-side" while the code in moe_utils.py aliases
input_scale alongside weight_scale and weight_scale_2 (due to
sync_tied_input_amax running earlier); update the docstring to state that
input_scale is aliased too and mention the reason (sync_tied_input_amax runs
prior), keeping the rest of the explanation about caching data_ptr() and
aliasing behavior intact so docstring matches the implementation.
    # Tied-weight dedup: if a previously-processed module shared the same
    # source weight memory, alias the packed weight + scale buffers so the
    # downstream data_ptr dedup in postprocess_state_dict can collapse them.
    # input_scale is safe to alias because sync_tied_input_amax (earlier in
    # this export) already max-merged the per-side amaxes.
    _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {})
    _prior = _cache.get(_tied_source_data_ptr)
    if _prior is not None and _prior is not sub_module:
        if hasattr(_prior, weight_name):
            setattr(sub_module, weight_name, getattr(_prior, weight_name))
        for _attr in (
            quantizer_attrs.weight_scale,
            quantizer_attrs.weight_scale_2,
            quantizer_attrs.input_scale,
        ):
            if _attr is None or not hasattr(_prior, _attr):
                continue
            if _attr in sub_module._buffers:
                del sub_module._buffers[_attr]
            elif hasattr(sub_module, _attr):
                delattr(sub_module, _attr)
            sub_module.register_buffer(_attr, getattr(_prior, _attr))
    else:
        _cache[_tied_source_data_ptr] = sub_module
Reset tied-weight alias cache per export invocation.
_tied_weight_alias_cache is function-static and never cleared in the export path. Across multiple exports in one process, recycled data_ptr() values can hit stale entries and alias to modules from a prior export, which can corrupt checkpoint contents and retain unnecessary module references.
💡 Suggested fix
     def _process_quantized_modules(
         model: nn.Module,
         dtype: torch.dtype,
         is_modelopt_qlora: bool = False,
     ) -> None:
    +    # Isolate aliasing state to a single export run.
    +    _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
    +
         """Process all quantized modules in model, export weights in-place.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/export/unified_export_hf.py` around lines 721 - 744, The
tied-weight alias cache (_tied_weight_alias_cache) is stored on the function
object of _export_quantized_weight and persists across exports; reset it at the
start of each export invocation to avoid stale aliases: in the
_export_quantized_weight function (or the outer export entrypoint that calls it)
initialize or clear
_export_quantized_weight.__dict__['_tied_weight_alias_cache'] (or replace with a
fresh dict) before using _tied_source_data_ptr so each export gets a fresh cache
and no stale module references are reused.
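The stale-cache hazard raised in this review comment can be reproduced without any tensors. In the sketch below, `export_one` is a hypothetical function whose cache lives on the function object, and integer addresses stand in for data_ptr() values that an allocator might hand out again in a later export.

```python
def export_one(source_ptr: int, module_name: str) -> str:
    """Return the name of the module that owns source_ptr in the cache."""
    cache = export_one.__dict__.setdefault("_alias_cache", {})
    # setdefault registers module_name on a miss and returns the owner on a hit.
    return cache.setdefault(source_ptr, module_name)


# First export run registers an owner for address 0x1000.
first_owner = export_one(0x1000, "run1.encoder")

# Second run in the same process: the address was recycled for an unrelated
# tensor, but the cache still points at the module from the first run.
stale_owner = export_one(0x1000, "run2.encoder")

# Clearing the function attribute at the start of a run isolates the runs.
export_one.__dict__.pop("_alias_cache", None)
fresh_owner = export_one(0x1000, "run2.encoder")

print(first_owner, stale_owner, fresh_owner)
```

Passing the cache in as an argument, or creating it inside the export entry point, would avoid the reset entirely; the pop-based reset is shown because it mirrors the suggestion above.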
    post_state_dict, hf_quant_config = _export_transformers_checkpoint(
        model, dtype, canonical_tied_naming=canonical_tied_naming
    )
Forward **kwargs to preserve transformer export contract.
The new call drops **kwargs, so options consumed inside _export_transformers_checkpoint (e.g., accelerator) are silently ignored. That can break distributed state-dict gathering and produce incomplete exports.
💡 Suggested fix
    -post_state_dict, hf_quant_config = _export_transformers_checkpoint(
    -    model, dtype, canonical_tied_naming=canonical_tied_naming
    -)
    +post_state_dict, hf_quant_config = _export_transformers_checkpoint(
    +    model,
    +    dtype,
    +    canonical_tied_naming=canonical_tied_naming,
    +    **kwargs,
    +)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/export/unified_export_hf.py` around lines 1545 - 1547, The
call to _export_transformers_checkpoint is dropping **kwargs (so options like
accelerator are ignored); update the call in unified_export_hf.py where
post_state_dict, hf_quant_config = _export_transformers_checkpoint(model, dtype,
canonical_tied_naming=canonical_tied_naming) to forward any incoming **kwargs
(e.g., pass **kwargs alongside the existing named args) so that options consumed
inside _export_transformers_checkpoint (such as accelerator) are preserved
during export; keep the canonical_tied_naming arg as-is when forwarding
**kwargs.
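The effect of dropping `**kwargs` is easy to see in isolation. The three functions below are hypothetical stand-ins for the export entry point and the inner checkpoint function; only the forwarding pattern is the point, and the `accelerator` option is taken from the review comment above.

```python
def inner_export(model, dtype, canonical_tied_naming=False, **kwargs):
    """Stand-in for the inner function; reports which options it received."""
    return {
        "canonical_tied_naming": canonical_tied_naming,
        "accelerator": kwargs.get("accelerator"),
    }


def outer_dropping(model, dtype, canonical_tied_naming=False, **kwargs):
    # kwargs is accepted but never passed on, so the option is silently lost.
    return inner_export(model, dtype, canonical_tied_naming=canonical_tied_naming)


def outer_forwarding(model, dtype, canonical_tied_naming=False, **kwargs):
    # Forwarding kwargs keeps the option visible to the inner function.
    return inner_export(
        model, dtype, canonical_tied_naming=canonical_tied_naming, **kwargs
    )


dropped = outer_dropping("model", "fp16", canonical_tied_naming=True, accelerator="acc")
forwarded = outer_forwarding("model", "fp16", canonical_tied_naming=True, accelerator="acc")
print(dropped)
print(forwarded)
```

No error is raised in the dropping variant, which is why this kind of regression tends to surface only as incomplete output.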
What does this PR do?
Type of change: new feature
Adds end-to-end PTQ + HF-checkpoint export support for block-diffusion encoder-decoder LLMs (e.g. DiffusionGemma) whose encoder/decoder stacks share parameters via HF `_tied_weights_keys`. Six source commits + one test commit + one CHANGELOG entry; purely additive for existing modelopt users (non-tied models see no behavioral change).
Source commits:
1. Onboard DiffusionGemma to `hf_ptq.py`: substring-list additions in `model_type_is_enc_dec` and `MODEL_NAME_TO_TYPE` (so calibration routes through `.generate()`), and a `.sequences` unwrap in the preview decode for `ModelOutput`-returning `.generate()`s.
2. MoE experts dedup in `_export_fused_experts`: when two fused-expert modules share their 3-D source params, alias the per-expert packed weight + scales on cache hit so downstream `postprocess_state_dict` dedup catches them. ~42% storage reduction on `nvfp4_experts_only` for tied 26B MoE checkpoints.
3. Dense Linear dedup in `_export_quantized_weight`: symmetric to (2) for dense Linears; no-op for `nvfp4_experts_only` (dense early-returns at `QUANTIZATION_NONE`).
4. Opt-in canonical-side reorder (`--canonical_tied_naming`, default off): partitions state_dict so canonical-side tied keys iterate before alias-side, letting first-wins dedup keep canonical names.
5. `sync_tied_input_amax`: max-merges per-side `input_quantizer.amax` across tied modules BEFORE export, so single-backbone consumers (vLLM) that load one `input_scale` per parameter don't clip on either side. Extends commits 2+3's alias loops to include `input_scale`.
6. Default `*self_conditioning*` exclude: adds the diffusion-model self-conditioning wildcard to `default_disabled_quantizers.yaml`. Companion to PR "Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762)" #1691, which already added the vision-module excludes.
Test commit:
Test fixture (`tests/_test_utils/torch/quantization/tied_modules.py`) with three factories (`make_tied_linear_pair`, `tie_fused_experts_3d_params`, `wrap_in_parent_with_tied_keys`) + 10 unit tests covering commits 2-5 across `tests/unit/torch/export/test_unified_export_hf.py` (new) and `tests/unit/torch/quantization/plugins/test_fused_experts.py` (extended). Pure-Python, CPU-only, ~1s wall total.
Docs commit:
Usage
Testing
`nvfp4_experts_only` recipe + `--canonical_tied_naming true`: produces a 2-shard, ~18 GB safetensors checkpoint with decoder-canonical naming.
Before your PR is "Ready for review"
`--canonical_tied_naming` is opt-in, default off; the dedup/alias logic is cache-based on `data_ptr()` and no-ops for non-tied models; `sync_tied_input_amax` no-ops when no two modules share a weight `data_ptr`; the `*self_conditioning*` wildcard is a no-op for models without matching module names.
_test_utils/claude review
Additional Information
Validated locally against DiffusionGemma v10 (`DiffusionGemmaForBlockDiffusion`, `model_type: diffusion_gemma`). Substring patterns are also chosen to match the older `DiffusionGemma4ModelForBlockDiffusion` / `diffusion_gemma4` spelling, so the same code path works on earlier and current model versions.