From 47d4ab639bf816a9728496203005848a7584466d Mon Sep 17 00:00:00 2001
From: Juhi Mittal <juhim@nvidia.com>
Date: Thu, 28 May 2026 00:32:19 +0000
Subject: [PATCH 1/8] Onboard DiffusionGemma (northbloom) block-diffusion text
 model to hf_ptq.py

DiffusionGemmaForBlockDiffusion is an encoder-decoder block-diffusion
text LLM (Gemma4 MoE per-layer transformer wrapped in an encoder +
iterative-decoder + self-conditioning + 48-step denoising loop). Four
small additions make stock examples/llm_ptq/hf_ptq.py work end-to-end
for it; no new entry points needed.

Substring patterns are chosen so they ALSO match the previous class
name DiffusionGemma4ModelForBlockDiffusion (and previous model_type
"diffusion_gemma4"). The model was renamed in transformers
mid-development; using "diffusiongemma" / "DiffusionGemma" /
"diffusion_gemma" matches both the current and the older class so we
don't silently regress on either checkpoint generation.

1. modelopt/torch/utils/dataset_utils.py
   Add "diffusiongemma" to the substring list in model_type_is_enc_dec.
   This routes ModelOpt's built-in calibration forward_loop to
   model.generate() instead of a single model.forward(), so calibration
   exercises the full 48-step inner denoising loop and sees the entire
   noise->clean activation distribution.

2. modelopt/torch/export/model_utils.py
   Add "DiffusionGemma": "diffusion_gemma" to MODEL_NAME_TO_TYPE,
   *before* "Gemma". get_model_type does substring matching; "gemma" is
   a substring of "diffusiongemma" so without this entry the class is
   silently mis-classified as plain "gemma" -- which then mis-routes
   downstream model-type-dependent logic.

3. examples/llm_ptq/hf_ptq.py (output_decode)
   DiffusionGemma.generate() returns a DiffusionGemmaGenerationOutput
   ModelOutput dataclass, not a bare token tensor. The preview decode
   code's tensor-slicing crashes on ModelOutput. Unwrap to .sequences
   at the top of output_decode so both the enc-dec and AR slicing
   branches work uniformly. Generic shim -- helps any model whose
   .generate returns a ModelOutput with a .sequences attribute.

4. examples/llm_ptq/example_utils.py (is_enc_dec)
   Comment-only clarification on the semantics. is_enc_dec controls
   how hf_ptq.py:output_decode slices the preview result (whether to
   strip the prompt prefix). For T5/BART/Whisper, generate() returns only
   new tokens. For DiffusionGemma, it returns prompt+canvas
   concatenated, so it belongs with the AR slicing path and stays out
   of this list. No behavioral change; documents the prior intent so
   future contributors don't re-add diffusion_gemma here.
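
The ordering constraint in item 2 can be illustrated with a minimal
sketch. It assumes the lookup is a first-match, case-insensitive
substring scan over an ordered dict, as described above; the table and
the lookup_model_type helper are hypothetical stand-ins, not the real
MODEL_NAME_TO_TYPE or get_model_type.

```python
# Hypothetical stand-in for the real table; only the relative order matters.
NAME_TO_TYPE = {
    "RecurrentGemma": "recurrentgemma",
    "DiffusionGemma": "diffusion_gemma",  # must precede the generic "Gemma" key
    "Gemma": "gemma",
}


def lookup_model_type(class_name, table):
    """Return the type of the first key that is a substring of class_name.

    Case-insensitive comparison is an assumption made for this sketch.
    """
    lowered = class_name.lower()
    for key, model_type in table.items():
        if key.lower() in lowered:
            return model_type
    return None


current = "DiffusionGemmaForBlockDiffusion"
previous = "DiffusionGemma4ModelForBlockDiffusion"
print(lookup_model_type(current, NAME_TO_TYPE))   # diffusion_gemma
print(lookup_model_type(previous, NAME_TO_TYPE))  # diffusion_gemma

# Without the dedicated entry, the generic "Gemma" key wins the scan.
without_entry = {k: v for k, v in NAME_TO_TYPE.items() if k != "DiffusionGemma"}
print(lookup_model_type(current, without_entry))  # gemma
```

Both the current and the previous class names resolve to the same type
because the key is a prefix common to both.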

End-to-end smoke test:
  python hf_ptq.py --pyt_ckpt_path <local northbloom checkpoint> \
                   --qformat nvfp4_experts_only \
                   --calib_size 32 --trust_remote_code \
                   --export_path <somewhere>
produces a working NVFP4 checkpoint with coherent pre/post-PTQ preview.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
---
 examples/llm_ptq/example_utils.py     | 8 +++++++-
 examples/llm_ptq/hf_ptq.py            | 5 +++++
 modelopt/torch/export/model_utils.py  | 3 +++
 modelopt/torch/utils/dataset_utils.py | 5 ++++-
 4 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/examples/llm_ptq/example_utils.py b/examples/llm_ptq/example_utils.py
index d36754a8d42..1d3c196d428 100755
--- a/examples/llm_ptq/example_utils.py
+++ b/examples/llm_ptq/example_utils.py
@@ -806,7 +806,13 @@ def is_model_on_gpu(model) -> bool:
 
 
 def is_enc_dec(model_type) -> bool:
-    """Return if the model is a encoder-decoder model."""
+    """Return whether the model_type uses encoder-decoder-style preview decode.
+
+    Controls whether ``hf_ptq.py`` slices off the prompt prefix from
+    ``.generate()`` output. ``diffusion_gemma`` is structurally encoder-decoder
+    but returns prompt+canvas concatenated, so it stays OFF this list (AR-style
+    decode applies).
+    """
     return model_type in ["t5", "bart", "whisper"]
 
 
diff --git a/examples/llm_ptq/hf_ptq.py b/examples/llm_ptq/hf_ptq.py
index afb725988c8..6b7a5d773a6 100755
--- a/examples/llm_ptq/hf_ptq.py
+++ b/examples/llm_ptq/hf_ptq.py
@@ -941,6 +941,11 @@ def input_decode(input_ids):
             raise ValueError("The processor or tokenizer must be set")
 
     def output_decode(generated_ids, input_shape):
+        # Some `.generate()` implementations return a ModelOutput (e.g. DiffusionGemma);
+        # unwrap to the token tensor so downstream slicing works uniformly.
+        if hasattr(generated_ids, "sequences"):
+            generated_ids = generated_ids.sequences
+
         if is_enc_dec(model_type):
             if processor is not None and isinstance(processor, WhisperProcessor):
                 return processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
diff --git a/modelopt/torch/export/model_utils.py b/modelopt/torch/export/model_utils.py
index 3bd72d9de91..9c49cae0cf1 100755
--- a/modelopt/torch/export/model_utils.py
+++ b/modelopt/torch/export/model_utils.py
@@ -33,6 +33,9 @@
     "Qwen3Next": "qwen3next",
     "QWen": "qwen",
     "RecurrentGemma": "recurrentgemma",
+    # DiffusionGemma must come before "Gemma" — get_model_type substring-matches
+    # in order, and "gemma" is a substring of "diffusiongemma".
+    "DiffusionGemma": "diffusion_gemma",
     "Gemma3": "gemma3",
     "Gemma2": "gemma2",
     "Gemma": "gemma",
diff --git a/modelopt/torch/utils/dataset_utils.py b/modelopt/torch/utils/dataset_utils.py
index e9b53897fee..9dd608fee9e 100644
--- a/modelopt/torch/utils/dataset_utils.py
+++ b/modelopt/torch/utils/dataset_utils.py
@@ -1208,7 +1208,10 @@ def create_forward_loop(
 
 
 def model_type_is_enc_dec(model):
-    enc_dec_model_list = ["t5", "bart", "whisper"]
+    # Substring match against `model.__class__.__name__.lower()` — entries are
+    # the lowercased class-name form (no underscores). Calibration then uses
+    # `model.generate` to run the full denoising loop.
+    enc_dec_model_list = ["t5", "bart", "whisper", "diffusiongemma"]
     return any(model_name in model.__class__.__name__.lower() for model_name in enc_dec_model_list)
 
 

From 60d4ebb26b8def23dd9d19aca7a60c0740a2d9ec Mon Sep 17 00:00:00 2001
From: Juhi Mittal <juhim@nvidia.com>
Date: Thu, 28 May 2026 00:47:36 +0000
Subject: [PATCH 2/8] moe export: alias bit-identical per-expert buffers
 between tied modules
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When a model has multiple fused-experts modules whose 3-D source params
share storage via HF _tied_weights_keys (e.g. the encoder and decoder
transformer stacks of a block-diffusion encoder-decoder LLM like
DiffusionGemma4 / northbloom), the unpacking loop in
_export_fused_experts ordinarily creates fresh per-expert tensors for
each call — destroying the tied identity and writing two full sets of
expert weights + scales to disk.

This adds a function-local cache keyed by
(gate_up_proj.data_ptr(), down_proj.data_ptr()). On a cache miss the
existing unpacking path runs unchanged. On a cache hit, after the
normal unpacking completes, the per-expert weight / weight_scale /
weight_scale_2 buffers are re-pointed at the prior module's tensors
so they share storage. The downstream postprocess_state_dict
data_ptr()-based dedup then catches them and drops the duplicates
from the saved checkpoint.

input_scale is intentionally NOT aliased: encoder and decoder paths
have legitimately different activation distributions (verified across
all 60 tied pairs of a 512-prompt calibration — down_proj_input
divergence median 2.26x, max 18.4x), so each side keeps its own
per-side calibrated scale.

Model-agnostic: no name regex, no model-type lookup. Cache miss falls
through to existing behavior, so non-tied models are unaffected.

Empirical result on DiffusionGemma4 26B at nvfp4_experts_only:
  safetensors    28.43 GB -> 16.47 GB  (-42%)
  shards               4 -> 2
  decoder weight/weight_scale/weight_scale_2 entries: 3840 -> 0 each
  decoder input_scale entries: 3840 -> 3840 (kept per-side, as
  intended)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
---
 modelopt/torch/export/moe_utils.py | 50 ++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/modelopt/torch/export/moe_utils.py b/modelopt/torch/export/moe_utils.py
index e325e5346f1..059955f81c6 100644
--- a/modelopt/torch/export/moe_utils.py
+++ b/modelopt/torch/export/moe_utils.py
@@ -42,6 +42,13 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None:
            {E}.gate_proj.weight, {E}.gate_proj.weight_scale, ...
            {E}.up_proj.weight, {E}.up_proj.weight_scale, ...
            {E}.down_proj.weight, {E}.down_proj.weight_scale, ...
+
+    Tied-experts dedup: when multiple fused-expert modules share their 3-D
+    source params via HF ``_tied_weights_keys``, the unpacking creates fresh
+    per-expert tensors that break the tie. We cache the source ``data_ptr()``
+    at entry and on a later cache hit alias the per-expert ``weight`` /
+    ``weight_scale`` / ``weight_scale_2`` back to the prior module so
+    downstream dedup catches them. ``input_scale`` is left per-side.
     """
     from modelopt.torch.export.unified_export_hf import _export_quantized_weight
     from modelopt.torch.quantization.plugins.huggingface import _get_fused_expert_intermediate_dim
@@ -49,6 +56,10 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None:
     n = module.num_experts
     expert_dim = _get_fused_expert_intermediate_dim(module)
 
+    # Capture source tensor identities BEFORE unpacking (the source
+    # attrs are deleted at the end of this function).
+    _source_key = (module.gate_up_proj.data_ptr(), module.down_proj.data_ptr())
+
     # 1. Shared input quantizers — one per projection type, shared across all experts.
     gate_up_input_q = module.gate_up_proj_input_quantizer
     down_input_q = module.down_proj_input_quantizer
@@ -178,6 +189,45 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None:
         if hasattr(module, attr):
             delattr(module, attr)
 
+    # 5. Tied-experts dedup: if this module's source params have been seen
+    # before, alias the bit-identical per-expert buffers (weight,
+    # weight_scale, weight_scale_2) to the previously-unpacked module.
+    # input_scale is left per-side so encoder/decoder calibration stays
+    # accurate where their activation distributions diverge.
+    _cache = _export_fused_experts.__dict__.setdefault("_tied_unpacked_cache", {})
+    _prior = _cache.get(_source_key)
+    if _prior is not None and _prior is not module:
+        for _idx in range(n):
+            _cur_expert = getattr(module, str(_idx), None)
+            _prior_expert = getattr(_prior, str(_idx), None)
+            if _cur_expert is None or _prior_expert is None:
+                continue
+            for _proj_name in ("gate_proj", "up_proj", "down_proj"):
+                _cur_proj = getattr(_cur_expert, _proj_name, None)
+                _prior_proj = getattr(_prior_expert, _proj_name, None)
+                if _cur_proj is None or _prior_proj is None:
+                    continue
+                # Alias the weight (Parameter) so both sides reference the
+                # same nn.Parameter → same data_ptr() → existing dedup
+                # in postprocess_state_dict will drop the duplicate.
+                if hasattr(_prior_proj, "weight"):
+                    _cur_proj.weight = _prior_proj.weight
+                # Alias the bit-identical scale buffers. Re-register to
+                # ensure data_ptr() matches the prior side's tensor.
+                for _attr in ("weight_scale", "weight_scale_2"):
+                    if not hasattr(_prior_proj, _attr):
+                        continue
+                    if _attr in _cur_proj._buffers:
+                        del _cur_proj._buffers[_attr]
+                    elif hasattr(_cur_proj, _attr):
+                        delattr(_cur_proj, _attr)
+                    _cur_proj.register_buffer(_attr, getattr(_prior_proj, _attr))
+                # input_scale intentionally NOT aliased — per-side amaxes
+                # are legitimately different (encoder vs decoder activation
+                # distributions diverge, sometimes >10x — see Q2 analysis).
+    else:
+        _cache[_source_key] = module
+
 
 def save_expert_token_count_table(model: nn.Module, output_dir: str | Path | None = None):
     """Collect expert_token_count from all quantized MoE layers and save as an HTML table.

From 225072b98528b7e2f82dd641ccbfa2ad62e07fb7 Mon Sep 17 00:00:00 2001
From: Juhi Mittal <juhim@nvidia.com>
Date: Thu, 28 May 2026 04:25:13 +0000
Subject: [PATCH 3/8] export: alias bit-identical buffers between tied dense
 Linears
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Symmetric companion to the data_ptr-cache alias added to
_export_fused_experts in the prior commit. Catches plain nn.Linear
modules whose .weight Parameters are tied via HF tie_weights() (e.g.
encoder and decoder attention QKV/O, MLP gate/up/down, router proj of
an encoder-decoder LLM like DiffusionGemma4 / northbloom) when they
are quantized under recipes that route through _export_quantized_weight.

Mechanism: capture weight.data_ptr() at the top of the function,
before the setattr further down wraps the packed bytes in a fresh
nn.Parameter (which destroys the tie). At the end of the function,
consult a function-local cache keyed by that captured data_ptr. On
cache miss: register this sub_module as the canonical owner. On cache
hit (a previously-processed module shared the same source weight
memory): alias .weight, weight_scale, weight_scale_2 to the prior
module's tensors so downstream data_ptr-based dedup in
postprocess_state_dict drops the duplicates. input_scale is
intentionally NOT aliased — calibration legitimately diverges per-side
(verified in Q2 analysis: down_proj_input ratio up to 18x across 60
tied pairs on the northbloom DiffusionGemma4 model).

Recipe-agnostic: under nvfp4_experts_only this is a true no-op (dense
Linears early-return at QUANTIZATION_NONE; per-expert wrappers reach
this function but carry fresh data_ptrs from the upstream
slice+contiguous, so they always miss the cache). Under full nvfp4 it
fires for every tied dense Linear pair.

Same safety guarantees as the existing data_ptr-based dedup: cannot
false-positive because it only aliases when source memory was already
shared by the model author (via tie_weights or equivalent). No name
regex, no _tied_weights_keys lookup, no model introspection.

Empirical results on DiffusionGemma4 26B (calib_size=8 smoke):
  nvfp4_experts_only: 16.47 GB -> 16.47 GB (byte-identical with prior
                                            experts-only export; this
                                            patch is a no-op, as expected)
  nvfp4 (full):       ~27 GB est -> 14.24 GB (-12.7 GB; both patches
                                              firing on disjoint
                                              module sets; per-name
                                              dedup verified in index)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
---
 modelopt/torch/export/unified_export_hf.py | 43 ++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/modelopt/torch/export/unified_export_hf.py b/modelopt/torch/export/unified_export_hf.py
index ef5757aa0cb..1b0e0cb95db 100644
--- a/modelopt/torch/export/unified_export_hf.py
+++ b/modelopt/torch/export/unified_export_hf.py
@@ -520,6 +520,14 @@ def _export_quantized_weight(
 
     The export includes converting weight tensor to correct quantized values and quantized dtype,
     and registering scaling factors.
+
+    Tied-weight dedup: the setattr below replaces ``.weight`` with a fresh
+    ``nn.Parameter`` wrapping packed bytes, breaking any HF-level tie.
+    We capture ``weight.data_ptr()`` before the replacement and consult a
+    function-local cache at the end; on cache hit, ``weight`` / ``weight_scale`` /
+    ``weight_scale_2`` are re-pointed at the previously-processed module so the
+    downstream data_ptr dedup catches them. Uses memory identity only — no
+    ``_tied_weights_keys`` lookup, no-op for non-tied modules.
     """
     quantization_format = get_quantization_format(sub_module)
     if quantization_format == QUANTIZATION_NONE:
@@ -528,6 +536,13 @@ def _export_quantized_weight(
     block_size = get_weight_block_size(sub_module, weight_name)
     quantizer_attrs = quantizer_attr_names(weight_name)
     weight: nn.Parameter = getattr(sub_module, weight_name)
+
+    # Capture source identity BEFORE any tensor-creating operation below.
+    # For HF-tied weights this matches across all modules sharing the
+    # underlying Parameter; the cache lookup at the end of this function
+    # uses it to detect ties whose Python identity is about to be broken
+    # by the setattr on `weight_name` further down.
+    _tied_source_data_ptr = weight.data_ptr()
     weight_quantizer: TensorQuantizer | SequentialQuantizer = getattr(
         sub_module, quantizer_attrs.weight_quantizer
     )
@@ -703,6 +718,34 @@ def _export_quantized_weight(
     if weight_scale is not None:
         sub_module.register_buffer(quantizer_attrs.weight_scale, weight_scale)
 
+    # Tied-weight dedup: if a previously-processed module shared the same
+    # source weight memory, alias the bit-identical packed weight and scale
+    # buffers from that prior module so downstream data_ptr-based dedup in
+    # postprocess_state_dict can collapse the duplicates. Per-side
+    # input_scale is intentionally NOT aliased — activation amaxes
+    # legitimately differ across tied modules whose forward paths see
+    # different distributions.
+    _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {})
+    _prior = _cache.get(_tied_source_data_ptr)
+    if _prior is not None and _prior is not sub_module:
+        # Alias the packed weight (same nn.Parameter -> same data_ptr).
+        if hasattr(_prior, weight_name):
+            setattr(sub_module, weight_name, getattr(_prior, weight_name))
+        # Alias bit-identical scale buffers (NOT input_scale).
+        for _attr in (
+            quantizer_attrs.weight_scale,
+            quantizer_attrs.weight_scale_2,
+        ):
+            if _attr is None or not hasattr(_prior, _attr):
+                continue
+            if _attr in sub_module._buffers:
+                del sub_module._buffers[_attr]
+            elif hasattr(sub_module, _attr):
+                delattr(sub_module, _attr)
+            sub_module.register_buffer(_attr, getattr(_prior, _attr))
+    else:
+        _cache[_tied_source_data_ptr] = sub_module
+
     torch.cuda.empty_cache()
 
 

From a0d9b650686694bb6a8c64cbe06e52155b15d06a Mon Sep 17 00:00:00 2001
From: Juhi Mittal <juhim@nvidia.com>
Date: Fri, 29 May 2026 00:37:25 +0000
Subject: [PATCH 4/8] export: opt-in reorder of tied-weight aliases to
 canonical-side names

Adds an opt-in pass (--canonical_tied_naming, default off) that reorders
the state_dict before postprocess_state_dict so that keys matching the
canonical side of HF's _tied_weights_keys declaration iterate before
their aliases. The existing first-wins data_ptr dedup at
quant_utils.py:1148-1163 then drops the alias names, leaving the
canonical names in the exported safetensors.

Motivation: for models like DiffusionGemma4 (northbloom), HF declares
{alias: canonical} via _tied_weights_keys, where the encoder side is
the alias and the decoder side is canonical. The original HF
safetensors index uses decoder-prefixed names for all tied weights
(661/691 keys vs 30/691 encoder-only layer_scalar keys). The
single-backbone vLLM mockup loader strips both prefixes to model.* and
relies on this canonical naming.

The default modelopt export today walks the model in registration
order (encoder before decoder, per the model's __init__ order), so
encoder names win in the first-wins dedup. The exported checkpoint
thus uses 46 677 encoder-prefixed keys for tied tensors -- backwards
relative to the upstream HF naming and to what downstream consumers
expect.

Implementation. _tied_weights_keys is declared per model class with
paths relative to that class. In nested models (e.g. DiffusionGemma4)
multiple submodules declare their own ties: the outer wrapper at
DiffusionGemma4ModelForBlockDiffusion ties lm_head.weight to
model.decoder.embed_tokens.weight, while the inner DiffusionGemma4Model
at model.model declares the much larger encoder<->decoder dict with
paths relative to itself.

_collect_canonical_tied_patterns walks model.named_modules() and
collects every dict-style _tied_weights_keys declaration, prefixing
each pattern with the submodule's qualified path so the regexes match
against root-level state_dict keys. Without the prefix, the inner
dict's patterns (which lack a "model." prefix) silently fail to match
keys like "model.decoder.layers.0.self_attn.q_proj.weight" -- a bug
that would cause only the outer dict's single entry (embed_tokens) to
flip in the dedup.

_reorder_canonical_first then partitions the state_dict into head
(canonical-pattern matches) and tail (everything else), preserving
original order within each partition. head.update(tail) yields a
single dict with canonical keys first. The downstream dedup loop
iterates this in insertion order and records the canonical names in
seen_tensors; alias names then arrive as duplicates and are dropped.
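
A minimal sketch of how the reorder interacts with a first-wins dedup.
The key names and the regex are hypothetical, and first_wins_dedup is
a toy that keys on id() rather than data_ptr(); it is not the dedup in
postprocess_state_dict.

```python
import re


def reorder_canonical_first(state_dict, canonical_patterns):
    """Move keys matching any canonical pattern to the front, order-preserving."""
    head, tail = {}, {}
    for key, value in state_dict.items():
        target = head if any(p.search(key) for p in canonical_patterns) else tail
        target[key] = value
    head.update(tail)
    return head


def first_wins_dedup(state_dict):
    """Toy first-wins dedup: keep only the first key seen per shared object."""
    seen, kept = set(), {}
    for key, value in state_dict.items():
        if id(value) in seen:
            continue
        seen.add(id(value))
        kept[key] = value
    return kept


tied = [0.1, 0.2]  # one object reachable under two names
state = {
    "model.encoder.layers.0.q_proj.weight": tied,  # alias, registered first
    "model.encoder.layers.0.layer_scalar": [1.0],  # encoder-only, never tied
    "model.decoder.layers.0.q_proj.weight": tied,  # canonical side
}
patterns = [re.compile(r"model\.decoder\.layers\.\d+\.q_proj\.weight")]

print(list(first_wins_dedup(state)))
# ['model.encoder.layers.0.q_proj.weight', 'model.encoder.layers.0.layer_scalar']
print(list(first_wins_dedup(reorder_canonical_first(state, patterns))))
# ['model.decoder.layers.0.q_proj.weight', 'model.encoder.layers.0.layer_scalar']
```

The untied encoder-only key survives in both cases; only the winner of
the tied pair changes.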

Scope of behavior change:
- Models with no _tied_weights_keys, or only legacy list-of-strings
  declarations: _collect returns an empty pattern list, helper
  short-circuits, state_dict returned unchanged. Zero effect on any
  existing modelopt user.
- Models with dict-style declarations (e.g. DiffusionGemma4): when
  the flag is set, canonical-side names win. When the flag is unset
  (default), behavior is identical to before this commit.

No changes to dedup logic itself, to the existing tied-weight alias
patches in _export_quantized_weight and _export_fused_experts, or to
_process_quantized_modules iteration. Strictly additive.

Verified on DiffusionGemma4 26B / nvfp4_experts_only / v4 / calib_size
32: 35 127 encoder keys removed (decoder kept), 0 decoder keys
removed; layer_scalar (30 encoder-only keys) and per-side input_scale
(11 520 keys, intentionally not deduped) unaffected; total safetensors
bytes 17.68 GB matching the prior export to within 500 KB of
safetensors-metadata-ordering noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
---
 examples/llm_ptq/hf_ptq.py                 | 14 ++++
 modelopt/torch/export/unified_export_hf.py | 91 +++++++++++++++++++++-
 2 files changed, 103 insertions(+), 2 deletions(-)

diff --git a/examples/llm_ptq/hf_ptq.py b/examples/llm_ptq/hf_ptq.py
index 6b7a5d773a6..14215853eaa 100755
--- a/examples/llm_ptq/hf_ptq.py
+++ b/examples/llm_ptq/hf_ptq.py
@@ -774,6 +774,7 @@ def export_quantized(
                     full_model,
                     export_dir=export_path,
                     extra_state_dict=mtp_state_dict,
+                    canonical_tied_naming=args.canonical_tied_naming,
                 )
 
                 if args.qformat == "w4a16_nvfp4":
@@ -1257,6 +1258,19 @@ def parse_args() -> argparse.Namespace:
         default=512,
     )
     parser.add_argument("--export_path", default="exported_model")
+    parser.add_argument(
+        "--canonical_tied_naming",
+        type=lambda s: s.lower() in ("1", "true", "yes"),
+        default=False,
+        help=(
+            "If True, reorder the exported state_dict so tied-weight aliases "
+            "dedup to the canonical side declared in the model's HF "
+            "_tied_weights_keys (e.g. decoder-side for DiffusionGemma4). Off "
+            "by default to avoid renaming exported keys for models whose "
+            "downstream consumers expect the legacy (registration-order) "
+            "winner."
+        ),
+    )
     parser.add_argument(
         "--dataset",
         help=(
diff --git a/modelopt/torch/export/unified_export_hf.py b/modelopt/torch/export/unified_export_hf.py
index 1b0e0cb95db..72cf2dce4f3 100644
--- a/modelopt/torch/export/unified_export_hf.py
+++ b/modelopt/torch/export/unified_export_hf.py
@@ -749,6 +749,71 @@ def _export_quantized_weight(
     torch.cuda.empty_cache()
 
 
+def _collect_canonical_tied_patterns(model: nn.Module) -> list[re.Pattern]:
+    """Walk the model and collect canonical-side tied-weight patterns.
+
+    HF's ``_tied_weights_keys`` is declared per model class with paths
+    relative to that class. In nested models, each submodule may declare
+    its own ties (e.g. an outer wrapper ties ``lm_head.weight`` to
+    ``model.decoder.embed_tokens.weight``, while the inner
+    ``DiffusionGemma4Model`` at ``model.model`` declares the much larger
+    encoder↔decoder dict, with paths relative to itself such as
+    ``encoder.language_model.layers...weight`` ↔ ``decoder.layers...weight``).
+
+    To match against the root model's state_dict keys we must prefix each
+    submodule's patterns with its qualified path (``model.``). Without this
+    prefix, the inner dict's patterns (which lack the ``model.`` prefix)
+    silently fail to match real keys like
+    ``model.decoder.layers.0.self_attn.q_proj.weight``.
+
+    Returns a list of compiled regex patterns for the canonical side of
+    every dict-style ``_tied_weights_keys`` declaration found anywhere in
+    the module tree. List-style (legacy) declarations are skipped — they
+    carry no canonical/alias distinction.
+    """
+    patterns: list[re.Pattern] = []
+    for name, submodule in model.named_modules():
+        tied = getattr(submodule, "_tied_weights_keys", None)
+        if not isinstance(tied, dict) or not tied:
+            continue
+        prefix = f"{name}." if name else ""
+        patterns.extend(re.compile(re.escape(prefix) + p) for p in tied.values())
+    return patterns
+
+
+def _reorder_canonical_first(state_dict: dict, model: nn.Module) -> dict:
+    """Reorder ``state_dict`` so canonical-side tied keys iterate first.
+
+    For models that declare ``_tied_weights_keys`` as a ``{alias_pattern:
+    canonical_pattern}`` dict (newer HF style, e.g. ``DiffusionGemma4``),
+    HF designates one side of each tied pair as canonical and the other
+    as an alias. The downstream data_ptr dedup in
+    :func:`postprocess_state_dict` keeps whichever key it sees first per
+    ``data_ptr``, which by default is registration order — and that is
+    often the alias side, not the canonical side declared by HF.
+
+    This helper rebuilds the dict with canonical-pattern-matching keys
+    moved to the front (preserving original order within each partition),
+    so the existing first-wins dedup picks the canonical side.
+
+    No-op when the model declares no dict-style ``_tied_weights_keys``
+    anywhere in its module tree (i.e. only legacy list-of-strings
+    declarations, or no ties at all).
+    """
+    canonical_patterns = _collect_canonical_tied_patterns(model)
+    if not canonical_patterns:
+        return state_dict
+    head: dict = {}
+    tail: dict = {}
+    for k, v in state_dict.items():
+        if any(p.search(k) for p in canonical_patterns):
+            head[k] = v
+        else:
+            tail[k] = v
+    head.update(tail)
+    return head
+
+
 def _process_quantized_modules(
     model: nn.Module,
     dtype: torch.dtype,
@@ -858,7 +923,11 @@ def _process_quantized_modules(
 
 
 def _export_transformers_checkpoint(
-    model: nn.Module, dtype: torch.dtype | None = None, is_modelopt_qlora: bool = False, **kwargs
+    model: nn.Module,
+    dtype: torch.dtype | None = None,
+    is_modelopt_qlora: bool = False,
+    canonical_tied_naming: bool = False,
+    **kwargs,
 ) -> tuple[dict[str, Any], dict[str, Any]]:
     """Exports the torch model to the packed checkpoint with original HF naming.
 
@@ -995,6 +1064,16 @@ def _export_transformers_checkpoint(
     # We define kv cache scale as amax / 448 for both FP8 and NVFP4 KV cache quantization.
     kv_cache_max_bound = 448
     kv_cache_format = quant_config["quantization"]["kv_cache_quant_algo"]
+
+    # Optionally reorder so canonical-side tied keys (per HF's
+    # _tied_weights_keys) iterate first into postprocess_state_dict's
+    # first-wins data_ptr dedup. Off by default to avoid renaming exported
+    # keys for models whose downstream consumers expect the legacy
+    # (registration-order) winner; opt in for models where matching HF's
+    # own naming convention matters (e.g. DiffusionGemma4 → decoder names).
+    if canonical_tied_naming:
+        quantized_state_dict = _reorder_canonical_first(quantized_state_dict, model)
+
     quantized_state_dict = postprocess_state_dict(
         quantized_state_dict, kv_cache_max_bound, kv_cache_format, is_modelopt_qlora
     )
@@ -1332,6 +1411,7 @@ def export_hf_checkpoint(
     components: list[str] | None = None,
     extra_state_dict: dict[str, torch.Tensor] | None = None,
     max_shard_size: int | str = "10GB",
+    canonical_tied_naming: bool = False,
     **kwargs,
 ):
     """Export quantized HuggingFace model checkpoint (transformers or diffusers).
@@ -1351,6 +1431,11 @@ def export_hf_checkpoint(
             to export. If None, all quantized components are exported.
         extra_state_dict: Extra state dictionary to add to the exported model.
         max_shard_size: Maximum size of each safetensors shard file. Defaults to "10GB".
+        canonical_tied_naming: If True, reorder the state_dict so tied-weight
+            aliases dedup to the canonical side declared in the model's HF
+            ``_tied_weights_keys`` (e.g. decoder-side for DiffusionGemma4).
+            Off by default to avoid renaming exported keys for models whose
+            downstream consumers expect the legacy (registration-order) winner.
         **kwargs: Runtime-specific post-processing options forwarded to
             :func:`_postprocess_safetensors` for diffusion model exports.
             See its docstring for supported keys.
@@ -1373,7 +1458,9 @@ def export_hf_checkpoint(
         return
 
     try:
-        post_state_dict, hf_quant_config = _export_transformers_checkpoint(model, dtype)
+        post_state_dict, hf_quant_config = _export_transformers_checkpoint(
+            model, dtype, canonical_tied_naming=canonical_tied_naming
+        )
 
         # Only treat the export as quantized when at least one quant_algo field is set.
         # get_quant_config always returns a dict (even for sparsity-only or unmodified models),

From e351c0f2e99f7f688920295a3482f5a6edad05b7 Mon Sep 17 00:00:00 2001
From: Juhi Mittal <juhim@nvidia.com>
Date: Fri, 29 May 2026 01:16:36 +0000
Subject: [PATCH 5/8] export: max-merge tied input_quantizer amaxes; alias
 input_scale buffers

Companion piece to the existing tied-weight alias patches in
_export_fused_experts (commit 8b00e85bb) and _export_quantized_weight
(commit e8c36b024), which already alias bit-identical weight /
weight_scale / weight_scale_2 between tied modules but leave input_scale
per-side. This commit closes the loop on input_scale so consumers that
load a single canonical scale per Linear (e.g. vLLM's single-backbone
DiffusionGemma4 mockup) see a value consistent across all tied sides.

Implementation has two parts.

1. New sync_tied_input_amax(model) helper. Walks named_modules(),
   groups by source weight data_ptr (same signature our existing dedup
   patches use), and max-merges input_quantizer.amax across each
   group. Uses the canonical max-merge idiom shared with
   preprocess_linear_fusion (quant_utils.py:1394-1401) and
   sync_moe_gate_up_amax (layer_utils.py:1197):

     merged = torch.max(torch.stack([q.amax for q in qs]))
     for q in qs:
         q.amax = merged.clone()

   Handles both dense Linears (keyed by weight.data_ptr) and fused MoE
   modules (keyed by (gate_up_proj, down_proj) data_ptr tuple, merging
   gate_up_proj_input_quantizer and down_proj_input_quantizer
   independently across the group). Scalar-only, matching
   preprocess_linear_fusion's contract.

   Called unconditionally from _export_transformers_checkpoint after
   sync_moe_gate_up_amax, BEFORE _process_quantized_modules so the
   merged amax flows into _export_quantized_weight's input_scale
   derivation. Mirrors sync_moe_gate_up_amax's "no-flag, fires when
   applicable, no-op otherwise" convention.

2. Extend the existing tied-weight alias loops in
   _export_quantized_weight and _export_fused_experts to include
   input_scale alongside weight_scale / weight_scale_2. Before this
   commit those loops intentionally skipped input_scale because
   encoder/decoder amaxes legitimately differed (Q2 analysis showed up
   to 18x divergence for down_proj_input on v1). With
   sync_tied_input_amax in place, both sides now derive bit-identical
   input_scale values; aliasing the buffers is safe and lets the
   existing data_ptr dedup in postprocess_state_dict collapse them so
   only one canonical entry per Linear survives in the exported
   safetensors.
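
The grouping and merge in part 1 can be sketched without torch, as a
rough illustration only. FakeQuantizer, FakeLinear and
sync_tied_amax_sketch are hypothetical names, id() stands in for
weight.data_ptr(), and plain floats stand in for scalar amax tensors:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeQuantizer:
    """Hypothetical stand-in for an input quantizer holding a scalar amax."""

    amax: Optional[float] = None


@dataclass
class FakeLinear:
    """Hypothetical stand-in for a quantized Linear; weight is any shared object."""

    weight: object
    input_quantizer: FakeQuantizer


def sync_tied_amax_sketch(modules):
    """Max-merge amax across modules that share the same weight object."""
    groups = defaultdict(list)
    for m in modules:
        groups[id(m.weight)].append(m)  # id() stands in for weight.data_ptr()

    merged_groups = 0
    for group in groups.values():
        calibrated = [m.input_quantizer for m in group if m.input_quantizer.amax is not None]
        if len(calibrated) < 2:
            continue  # untied or uncalibrated: nothing to merge
        merged = max(q.amax for q in calibrated)
        for q in calibrated:
            q.amax = merged
        merged_groups += 1
    return merged_groups


shared = object()
enc = FakeLinear(shared, FakeQuantizer(2.0))
dec = FakeLinear(shared, FakeQuantizer(5.0))
solo = FakeLinear(object(), FakeQuantizer(3.0))
print(sync_tied_amax_sketch([enc, dec, solo]))  # number of tied groups merged
print(enc.input_quantizer.amax, dec.input_quantizer.amax, solo.input_quantizer.amax)
```

Taking the max rather than the mean keeps the larger side's activations
from clipping, at the cost of coarser resolution on the smaller side.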

Also extends the Q-B canonical-side reorder pass added in commit
837768fe3 with an auto-derived side-substring matcher. HF's
_tied_weights_keys regex patterns target the pre-export module
structure (fused gate_up_proj), but after _export_fused_experts
unpacks them into per-expert gate_proj/up_proj/down_proj submodules,
post-export keys like ...experts.Y.gate_proj.input_scale are not
covered by HF's regex. Without the substring fallback, those keys
fell through Q-B to the "alias-first" partition, so when the new
input_scale alias step shared data_ptrs, the encoder name won the
dedup instead of the decoder name.

_collect_canonical_tied_patterns now returns (patterns,
side_substrings). The side_substrings list is auto-derived across all
_tied_weights_keys entries as the set of identifier tokens (pattern
strings split on any non-identifier character) that appear in
canonical patterns but not in alias patterns. For DiffusionGemma4 this
resolves to ["decoder"]: every canonical pattern contains "decoder",
no alias pattern does. _reorder_canonical_first
treats a key as canonical if it matches a regex pattern OR contains a
side substring as a proper path component (bordered by "." or at
start/end). The path-component requirement avoids false positives
from accidental name collisions.
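
A minimal sketch of the derivation and the path-component check,
assuming a hypothetical single-entry declaration shaped like the
unit-test fixture in this series. The helper names are illustrative,
and key.split(".") is used here as an equivalent formulation of the
four boundary checks for tokens that contain no dots:

```python
import re


def tokens(pattern: str) -> set[str]:
    """Identifier tokens in a pattern string; regex specials act as separators."""
    return {tok for tok in re.split(r"[^A-Za-z0-9_]+", pattern) if tok}


def derive_side_substrings(tied: dict[str, str]) -> list[str]:
    """Tokens that occur on the canonical side but never on the alias side."""
    alias_tokens: set[str] = set()
    canonical_tokens: set[str] = set()
    for alias_pat, canonical_pat in tied.items():
        alias_tokens |= tokens(alias_pat)
        canonical_tokens |= tokens(canonical_pat)
    return sorted(canonical_tokens - alias_tokens)


def is_path_component(key: str, tok: str) -> bool:
    """True when tok is a whole dot-separated component of key."""
    return tok in key.split(".")


def canonical_first(keys: list[str], side_substrings: list[str]) -> list[str]:
    """Stable partition: keys carrying a side substring first, the rest after."""
    head = [k for k in keys if any(is_path_component(k, t) for t in side_substrings)]
    tail = [k for k in keys if k not in head]
    return head + tail


# Hypothetical declaration: {alias_regex: canonical_path}.
tied = {r"^encoder\.weight$": "decoder.weight"}
sides = derive_side_substrings(tied)
keys = [
    "encoder.experts.0.gate_proj.input_scale",
    "predecoder_norm.weight",  # contains "decoder" only inside another identifier
    "decoder.experts.0.gate_proj.input_scale",
]
print(sides)
print(canonical_first(keys, sides))
```

With the decoder key moved to the front, a first-wins dedup over shared
data_ptrs keeps the decoder name for a tied input_scale.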

Net effect for DiffusionGemma4 nvfp4_experts_only / v4 / calib_size 32:
the 11 520 encoder.X.gate_proj/up_proj/down_proj.input_scale entries
that the prior export carried are removed; the 11 520 decoder-side
entries remain with the merged amax-derived value. Total size drops
by ~1 MB (the removed entries are scalars). Other tied-tensor entries (weight,
weight_scale, weight_scale_2) and encoder-only entries (layer_scalar,
30 keys) are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
---
 modelopt/torch/export/moe_utils.py         |  19 ++-
 modelopt/torch/export/unified_export_hf.py | 186 +++++++++++++++------
 2 files changed, 145 insertions(+), 60 deletions(-)

diff --git a/modelopt/torch/export/moe_utils.py b/modelopt/torch/export/moe_utils.py
index 059955f81c6..3b0e49fe54a 100644
--- a/modelopt/torch/export/moe_utils.py
+++ b/modelopt/torch/export/moe_utils.py
@@ -191,9 +191,11 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None:
 
     # 5. Tied-experts dedup: if this module's source params have been seen
     # before, alias the bit-identical per-expert buffers (weight,
-    # weight_scale, weight_scale_2) to the previously-unpacked module.
-    # input_scale is left per-side so encoder/decoder calibration stays
-    # accurate where their activation distributions diverge.
+    # weight_scale, weight_scale_2, input_scale) to the previously-unpacked
+    # module. input_scale is safe to alias because sync_tied_input_amax
+    # runs earlier in _export_transformers_checkpoint and max-merges the
+    # shared input_quantizer amaxes across tied fused-experts modules, so
+    # both sides now derive bit-identical input_scale values.
     _cache = _export_fused_experts.__dict__.setdefault("_tied_unpacked_cache", {})
     _prior = _cache.get(_source_key)
     if _prior is not None and _prior is not module:
@@ -212,9 +214,11 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None:
                 # in postprocess_state_dict will drop the duplicate.
                 if hasattr(_prior_proj, "weight"):
                     _cur_proj.weight = _prior_proj.weight
-                # Alias the bit-identical scale buffers. Re-register to
-                # ensure data_ptr() matches the prior side's tensor.
-                for _attr in ("weight_scale", "weight_scale_2"):
+                # Alias the bit-identical scale buffers (including
+                # input_scale, made safe by sync_tied_input_amax pre-export
+                # merging). Re-register to ensure data_ptr() matches the
+                # prior side's tensor.
+                for _attr in ("weight_scale", "weight_scale_2", "input_scale"):
                     if not hasattr(_prior_proj, _attr):
                         continue
                     if _attr in _cur_proj._buffers:
@@ -222,9 +226,6 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None:
                     elif hasattr(_cur_proj, _attr):
                         delattr(_cur_proj, _attr)
                     _cur_proj.register_buffer(_attr, getattr(_prior_proj, _attr))
-                # input_scale intentionally NOT aliased — per-side amaxes
-                # are legitimately different (encoder vs decoder activation
-                # distributions diverge, sometimes >10x — see Q2 analysis).
     else:
         _cache[_source_key] = module
 
diff --git a/modelopt/torch/export/unified_export_hf.py b/modelopt/torch/export/unified_export_hf.py
index 72cf2dce4f3..89973facce1 100644
--- a/modelopt/torch/export/unified_export_hf.py
+++ b/modelopt/torch/export/unified_export_hf.py
@@ -719,22 +719,19 @@ def _export_quantized_weight(
         sub_module.register_buffer(quantizer_attrs.weight_scale, weight_scale)
 
     # Tied-weight dedup: if a previously-processed module shared the same
-    # source weight memory, alias the bit-identical packed weight and scale
-    # buffers from that prior module so downstream data_ptr-based dedup in
-    # postprocess_state_dict can collapse the duplicates. Per-side
-    # input_scale is intentionally NOT aliased — activation amaxes
-    # legitimately differ across tied modules whose forward paths see
-    # different distributions.
+    # source weight memory, alias the packed weight + scale buffers so the
+    # downstream data_ptr dedup in postprocess_state_dict can collapse them.
+    # input_scale is safe to alias because sync_tied_input_amax (earlier in
+    # this export) already max-merged the per-side amaxes.
     _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {})
     _prior = _cache.get(_tied_source_data_ptr)
     if _prior is not None and _prior is not sub_module:
-        # Alias the packed weight (same nn.Parameter -> same data_ptr).
         if hasattr(_prior, weight_name):
             setattr(sub_module, weight_name, getattr(_prior, weight_name))
-        # Alias bit-identical scale buffers (NOT input_scale).
         for _attr in (
             quantizer_attrs.weight_scale,
             quantizer_attrs.weight_scale_2,
+            quantizer_attrs.input_scale,
         ):
             if _attr is None or not hasattr(_prior, _attr):
                 continue
@@ -749,64 +746,72 @@ def _export_quantized_weight(
     torch.cuda.empty_cache()
 
 
-def _collect_canonical_tied_patterns(model: nn.Module) -> list[re.Pattern]:
-    """Walk the model and collect canonical-side tied-weight patterns.
-
-    HF's ``_tied_weights_keys`` is declared per model class with paths
-    relative to that class. In nested models, each submodule may declare
-    its own ties (e.g. an outer wrapper ties ``lm_head.weight`` to
-    ``model.decoder.embed_tokens.weight``, while the inner
-    ``DiffusionGemma4Model`` at ``model.model`` declares the much larger
-    encoder↔decoder dict, with paths relative to itself such as
-    ``encoder.language_model.layers...weight`` ↔ ``decoder.layers...weight``).
-
-    To match against the root model's state_dict keys we must prefix each
-    submodule's patterns with its qualified path (``model.``). Without this
-    prefix, the inner dict's patterns (which lack the ``model.`` prefix)
-    silently fail to match real keys like
-    ``model.decoder.layers.0.self_attn.q_proj.weight``.
-
-    Returns a list of compiled regex patterns for the canonical side of
-    every dict-style ``_tied_weights_keys`` declaration found anywhere in
-    the module tree. List-style (legacy) declarations are skipped — they
-    carry no canonical/alias distinction.
+def _collect_canonical_tied_patterns(
+    model: nn.Module,
+) -> tuple[list[re.Pattern], list[str]]:
+    """Walk the model and collect canonical-side tied-weight matchers.
+
+    Patterns are submodule-prefixed regexes from each module's
+    ``_tied_weights_keys`` dict-style declaration (the prefix matters
+    for nested models where the dict lives on an inner submodule).
+    Side substrings are identifier tokens (split on non-identifier
+    characters) that appear only on the canonical side of those
+    declarations. They are needed because modelopt's per-expert unpacking
+    creates post-export keys (e.g. ``…experts.Y.gate_proj.input_scale``)
+    that HF's regexes never cover. List-style (legacy) declarations are skipped.
     """
     patterns: list[re.Pattern] = []
+    alias_token_set: set[str] = set()
+    canonical_token_set: set[str] = set()
+
+    def _tokens(s: str) -> set[str]:
+        """Identifiers in a regex string, with regex specials as separators."""
+        return {tok for tok in re.split(r"[^A-Za-z0-9_]+", s) if tok}
+
     for name, submodule in model.named_modules():
         tied = getattr(submodule, "_tied_weights_keys", None)
         if not isinstance(tied, dict) or not tied:
             continue
         prefix = f"{name}." if name else ""
-        patterns.extend(re.compile(prefix + p) for p in tied.values())
-    return patterns
+        for alias_pat, canonical_pat in tied.items():
+            patterns.append(re.compile(prefix + canonical_pat))
+            alias_token_set.update(_tokens(prefix + alias_pat))
+            canonical_token_set.update(_tokens(prefix + canonical_pat))
+
+    # Tokens unique to the canonical side become substring matchers.
+    side_substrings = sorted(canonical_token_set - alias_token_set)
+    return patterns, side_substrings
 
 
 def _reorder_canonical_first(state_dict: dict, model: nn.Module) -> dict:
-    """Reorder ``state_dict`` so canonical-side tied keys iterate first.
-
-    For models that declare ``_tied_weights_keys`` as a ``{alias_pattern:
-    canonical_pattern}`` dict (newer HF style, e.g. ``DiffusionGemma4``),
-    HF designates one side of each tied pair as canonical and the other
-    as an alias. The downstream data_ptr dedup in
-    :func:`postprocess_state_dict` keeps whichever key it sees first per
-    ``data_ptr``, which by default is registration order — and that is
-    often the alias side, not the canonical side declared by HF.
-
-    This helper rebuilds the dict with canonical-pattern-matching keys
-    moved to the front (preserving original order within each partition),
-    so the existing first-wins dedup picks the canonical side.
-
-    No-op when the model declares no dict-style ``_tied_weights_keys``
-    anywhere in its module tree (i.e. only legacy list-of-strings
-    declarations, or no ties at all).
+    r"""Reorder ``state_dict`` so canonical-side tied keys iterate first.
+
+    Lets the downstream first-wins data_ptr dedup keep canonical names.
+    Uses both regex patterns and substring matchers from
+    :func:`_collect_canonical_tied_patterns`. No-op when the model
+    declares no dict-style ``_tied_weights_keys``.
     """
-    canonical_patterns = _collect_canonical_tied_patterns(model)
-    if not canonical_patterns:
+    canonical_patterns, side_substrings = _collect_canonical_tied_patterns(model)
+    if not canonical_patterns and not side_substrings:
         return state_dict
+
+    def _has_side_substring(key: str) -> bool:
+        # Require the token to appear as a proper dot-separated path
+        # component, not just as a substring of an unrelated identifier.
+        for tok in side_substrings:
+            if (
+                f".{tok}." in key
+                or key.startswith(f"{tok}.")
+                or key.endswith(f".{tok}")
+                or key == tok
+            ):
+                return True
+        return False
+
     head: dict = {}
     tail: dict = {}
     for k, v in state_dict.items():
-        if any(p.search(k) for p in canonical_patterns):
+        if any(p.search(k) for p in canonical_patterns) or _has_side_substring(k):
             head[k] = v
         else:
             tail[k] = v
@@ -814,6 +819,76 @@ def _reorder_canonical_first(state_dict: dict, model: nn.Module) -> dict:
     return head
 
 
+def sync_tied_input_amax(model: nn.Module) -> int:
+    """Max-merge input_quantizer amaxes across modules sharing a weight ``data_ptr``.
+
+    Closes the loop on ``input_scale`` for HF-tied modules whose forward
+    paths see different activation distributions (encoder vs decoder in
+    YOCO-style models). Must run BEFORE per-module export so the merged
+    amax flows into ``input_scale`` derivation. Handles both dense
+    Linears (keyed by ``weight.data_ptr()``) and fused MoE (keyed by
+    ``(gate_up_proj, down_proj)`` data_ptr tuple). Returns the number of
+    tied groups merged.
+    """
+    from collections import defaultdict
+
+    by_dp: dict = defaultdict(list)
+    for _, m in model.named_modules():
+        # Fused MoE: 3-D source tensors with shared input quantizers
+        if (
+            hasattr(m, "gate_up_proj_input_quantizer")
+            and hasattr(m, "gate_up_proj")
+            and hasattr(m, "down_proj")
+            and m.gate_up_proj.dim() == 3
+        ):
+            key = ("moe", m.gate_up_proj.data_ptr(), m.down_proj.data_ptr())
+            by_dp[key].append(m)
+        # Dense quantized Linear with an input_quantizer
+        elif (
+            hasattr(m, "input_quantizer")
+            and hasattr(m, "weight")
+            and isinstance(m.weight, torch.nn.Parameter)
+        ):
+            by_dp[("dense", m.weight.data_ptr())].append(m)
+
+    def _merge(quantizers: list) -> bool:
+        """Max-merge amaxes across the quantizer list. Returns True on merge."""
+        valid = [
+            q
+            for q in quantizers
+            if q is not None
+            and getattr(q, "is_enabled", False)
+            and getattr(q, "_amax", None) is not None
+            and not q._amax.is_meta
+        ]
+        if len(valid) < 2:
+            return False
+        # Require scalar (per-tensor) amax — matches preprocess_linear_fusion.
+        if any(q._amax.numel() != 1 for q in valid):
+            warnings.warn(
+                "sync_tied_input_amax: non-scalar input_quantizer amax encountered "
+                "in a tied group; skipping. Only per-tensor input quantizers are "
+                "supported for tied-modules merging."
+            )
+            return False
+        merged = torch.max(torch.stack([q.amax for q in valid]))
+        for q in valid:
+            q.amax = merged.clone()
+        return True
+
+    synced = 0
+    for key, modules in by_dp.items():
+        if len(modules) < 2:
+            continue
+        if key[0] == "moe":
+            for q_name in ("gate_up_proj_input_quantizer", "down_proj_input_quantizer"):
+                if _merge([getattr(m, q_name, None) for m in modules]):
+                    synced += 1
+        elif _merge([m.input_quantizer for m in modules]):
+            synced += 1
+    return synced
+
+
 def _process_quantized_modules(
     model: nn.Module,
     dtype: torch.dtype,
@@ -1047,6 +1122,15 @@ def _export_transformers_checkpoint(
             f"Taking element-wise max of amaxes for serving-engine fusion."
         )
 
+    # Merge per-side input_quantizer amaxes BEFORE _process_quantized_modules,
+    # so the merged value flows into input_scale derivation downstream.
+    synced_input = sync_tied_input_amax(model)
+    if synced_input:
+        print(
+            f"sync_tied_input_amax: max-merged input_quantizer amaxes across "
+            f"{synced_input} tied module group(s)"
+        )
+
     # Process all quantized modules and export weights
     _process_quantized_modules(model, dtype, is_modelopt_qlora)
 

From d684477e877f5831f0c4825d60aba757c797a698 Mon Sep 17 00:00:00 2001
From: Juhi Mittal <juhim@nvidia.com>
Date: Fri, 5 Jun 2026 21:55:41 +0000
Subject: [PATCH 6/8] quantization: add self_conditioning to default
 disabled_quantizers

The diffusion self-conditioning network (block-diffusion models like
DiffusionGemma) is text-only and not exercised by typical calibration
data. Without exclusion its TensorQuantizers never see input, never set
_amax, and export crashes at _export_quantized_weight:

  AttributeError: 'TensorQuantizer' object has no attribute '_amax'

Companion to the upstream vision-tower / visual / embed_vision excludes
already in this unit (PR #1691). Pattern is a no-op for non-diffusion
models.
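
As a rough illustration of why the pattern is a no-op elsewhere, assume
the wildcard is matched glob-style against full quantizer names (the
fnmatch call below is that assumption, and the module names are
hypothetical):

```python
from fnmatch import fnmatch

PATTERN = "*self_conditioning*"

# Hypothetical quantizer names; only the first belongs to the
# self-conditioning network.
names = [
    "model.self_conditioning.proj.input_quantizer",
    "model.decoder.layers.0.mlp.down_proj.input_quantizer",
    "lm_head.weight_quantizer",
]

disabled = [n for n in names if fnmatch(n, PATTERN)]
print(disabled)
```

Names without the self_conditioning substring never match, so models
that lack such a submodule keep their existing quantizer settings.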

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
---
 .../configs/ptq/units/default_disabled_quantizers.yaml      | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml b/modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
index e2efcb5142d..776ceeb9c72 100644
--- a/modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
+++ b/modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
@@ -73,3 +73,9 @@
   - parent_class: 'nn.Embedding'
     quantizer_name: '*'
     enable: false
+  # Diffusion self-conditioning network: text-only and not exercised by
+  # typical calibration; without exclusion its TensorQuantizers never see
+  # input and export crashes with "AttributeError: '...' has no attribute
+  # '_amax'". Companion to the vision excludes above.
+  - quantizer_name: '*self_conditioning*'
+    enable: false

From d0a735ef0b241fbacbffb77c71e9800fa7a9fbe2 Mon Sep 17 00:00:00 2001
From: Juhi Mittal <juhim@nvidia.com>
Date: Sat, 13 Jun 2026 00:59:20 +0000
Subject: [PATCH 7/8] tests: cover tied-weight dedup, canonical reorder, and
 input-amax sync

Adds the tied-modules test fixture plus 10 unit tests covering the
tied-weight machinery introduced earlier in this series.

tests/_test_utils/torch/quantization/tied_modules.py (new):
  Three small factory helpers shared by the unit tests:
  - make_tied_linear_pair() -- two nn.Linears whose .weight Parameter is
    shared via setattr (mimics HF tie_weights() after __init__).
  - tie_fused_experts_3d_params(enc, dec) -- in-place tie of
    gate_up_proj / down_proj between two fused-experts modules (paired
    with the existing _SyntheticFusedExperts fixture).
  - wrap_in_parent_with_tied_keys(enc, dec, ...) -- builds a parent
    nn.Module with HF-style _tied_weights_keys (dict-style for the
    canonical case, list-style for the legacy negative case).
  The two tying factories assert post-conditions on the tie so a misuse
  fails loudly at construction.

tests/unit/torch/export/test_unified_export_hf.py (new): 8 tests
  Commit f3e9543ab -- canonical-side reorder:
    - dict-style _tied_weights_keys yields patterns + canonical
      substrings
    - list-style yields no canonical info (reorder becomes a no-op)
    - _reorder_canonical_first puts decoder-side keys ahead of
      encoder-side keys

  Commit 3fb3ba053 -- sync_tied_input_amax:
    - tied Linears with divergent amaxes (2.0 vs 5.0) get both sides
      overwritten with the max (5.0)
    - untied Linears keep per-side amaxes (no-op when there's no tie)

  Commit 29674a7e1 -- dense Linear tied-weight dedup:
    - tied Linears share data_ptr for packed .weight + scale buffers
    - untied Linears keep independent data_ptrs
    - asymmetric quant: unquantized side early-returns at
      QUANTIZATION_NONE, stays at the original shared Parameter

tests/unit/torch/quantization/plugins/test_fused_experts.py (extended):
  2 tests
  Commit 10a8fdbd5 -- MoE experts dedup:
    - two _SyntheticSparseMoeBlock instances with tied 3-D source
      params share data_ptr across every per-expert buffer
    - untied counterparts keep independent per-expert data_ptrs

Pure-Python; CPU-only; ~1s wall total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
---
 .../torch/quantization/tied_modules.py        | 115 +++++++++++
 .../torch/export/test_unified_export_hf.py    | 184 ++++++++++++++++++
 .../plugins/test_fused_experts.py             | 111 +++++++++++
 3 files changed, 410 insertions(+)
 create mode 100644 tests/_test_utils/torch/quantization/tied_modules.py
 create mode 100644 tests/unit/torch/export/test_unified_export_hf.py

diff --git a/tests/_test_utils/torch/quantization/tied_modules.py b/tests/_test_utils/torch/quantization/tied_modules.py
new file mode 100644
index 00000000000..8ea76d2d459
--- /dev/null
+++ b/tests/_test_utils/torch/quantization/tied_modules.py
@@ -0,0 +1,115 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Factories for tied-weight test scenarios.
+
+These build small synthetic modules whose ``.weight`` :class:`nn.Parameter` is
+shared between two sibling modules — mimicking HuggingFace's
+``_tied_weights_keys`` machinery — for unit-testing the export-time dedup,
+canonical-side naming, and per-side ``input_quantizer.amax`` merge logic in
+the HF export path.
+
+Every factory builds CPU-resident, float32-default modules; no GPU required.
+The two tying factories assert their own post-conditions before returning, so
+a broken tie surfaces as a clear factory-side error rather than as a downstream
+test failure with an ambiguous cause.
+"""
+
+import re
+
+import torch.nn as nn
+
+
+def make_tied_linear_pair(
+    in_features: int = 16,
+    out_features: int = 32,
+    bias: bool = False,
+) -> tuple[nn.Linear, nn.Linear]:
+    """Two :class:`nn.Linear` modules whose ``.weight`` Parameter is shared.
+
+    Mimics what HuggingFace's :meth:`PreTrainedModel.tie_weights` does after
+    ``__init__``: one extra ``setattr`` so that both modules' ``.weight``
+    attributes resolve to the same :class:`nn.Parameter` and therefore the
+    same underlying storage. The modules are otherwise independent — separate
+    biases (if requested), separate forward/training state, separate
+    quantizer slots when ``mtq.quantize`` inserts them later.
+    """
+    enc = nn.Linear(in_features, out_features, bias=bias)
+    dec = nn.Linear(in_features, out_features, bias=bias)
+    dec.weight = enc.weight  # mimics HF tie_weights()
+
+    # Post-conditions — fail loudly if the tie was somehow lost.
+    assert enc.weight is dec.weight, "Linear weights not tied (object identity)"
+    assert enc.weight.data_ptr() == dec.weight.data_ptr(), (
+        "Linear weights tied at object level but storage diverged"
+    )
+    return enc, dec
+
+
+def tie_fused_experts_3d_params(enc: nn.Module, dec: nn.Module) -> None:
+    """Tie ``gate_up_proj`` and ``down_proj`` between two fused-experts modules.
+
+    Mutates ``dec`` in place. After calling, ``dec.gate_up_proj`` IS
+    ``enc.gate_up_proj`` (same :class:`nn.Parameter`) and likewise for
+    ``down_proj``. Used by MoE-dedup tests together with the
+    ``_SyntheticFusedExperts`` fixture defined in
+    ``tests/unit/torch/quantization/plugins/test_fused_experts.py``.
+    """
+    dec.gate_up_proj = enc.gate_up_proj
+    dec.down_proj = enc.down_proj
+
+    assert enc.gate_up_proj is dec.gate_up_proj, "gate_up_proj not tied"
+    assert enc.down_proj is dec.down_proj, "down_proj not tied"
+    assert enc.gate_up_proj.data_ptr() == dec.gate_up_proj.data_ptr()
+    assert enc.down_proj.data_ptr() == dec.down_proj.data_ptr()
+
+
+def wrap_in_parent_with_tied_keys(
+    enc: nn.Module,
+    dec: nn.Module,
+    *,
+    decoder_canonical: bool = True,
+    weight_attr: str = "weight",
+) -> nn.Module:
+    """Wrap two tied modules in a parent that declares HF ``_tied_weights_keys``.
+
+    Returns a parent :class:`nn.Module` with:
+
+    - ``parent.encoder = enc`` — registered as a submodule (alias side).
+    - ``parent.decoder = dec`` — registered as a submodule (canonical side
+      when ``decoder_canonical=True``, the default and DiffusionGemma-like case).
+    - ``parent._tied_weights_keys``: dict-style ``{alias_regex: canonical}``
+      when ``decoder_canonical=True``, list-style (legacy, no canonical/alias
+      distinction) when ``decoder_canonical=False``.
+
+    Used by tests for :func:`_collect_canonical_tied_patterns` and
+    :func:`_reorder_canonical_first`. The legacy list-style branch exercises
+    the "no patterns extracted" negative case.
+    """
+    parent = nn.Module()
+    parent.encoder = enc
+    parent.decoder = dec
+
+    if decoder_canonical:
+        # Dict-style: regex pattern → canonical path. Mimics HF's per-class
+        # ``_tied_weights_keys`` declaration for an encoder/decoder model.
+        parent._tied_weights_keys = {
+            rf"^encoder\.{re.escape(weight_attr)}$": f"decoder.{weight_attr}",
+        }
+    else:
+        # Legacy list-style: just a list of tied paths, no canonical info.
+        parent._tied_weights_keys = [f"encoder.{weight_attr}"]
+
+    return parent
diff --git a/tests/unit/torch/export/test_unified_export_hf.py b/tests/unit/torch/export/test_unified_export_hf.py
new file mode 100644
index 00000000000..1d08a8ef620
--- /dev/null
+++ b/tests/unit/torch/export/test_unified_export_hf.py
@@ -0,0 +1,184 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for tied-weight helpers in unified_export_hf."""
+
+from collections import OrderedDict
+
+import torch
+from _test_utils.torch.quantization.tied_modules import (
+    make_tied_linear_pair,
+    wrap_in_parent_with_tied_keys,
+)
+
+import modelopt.torch.quantization as mtq
+from modelopt.torch.export.unified_export_hf import (
+    _collect_canonical_tied_patterns,
+    _export_quantized_weight,
+    _reorder_canonical_first,
+    sync_tied_input_amax,
+)
+
+
+def test_collect_canonical_tied_patterns_dict_style():
+    """Dict-style _tied_weights_keys yields regex patterns + canonical-side substrings."""
+    enc, dec = make_tied_linear_pair()
+    parent = wrap_in_parent_with_tied_keys(enc, dec, decoder_canonical=True)
+
+    patterns, side_substrings = _collect_canonical_tied_patterns(parent)
+
+    assert len(patterns) >= 1
+    # "decoder" is in the canonical RHS but not the alias LHS — must auto-derive.
+    # "encoder" is alias-only and must NOT be returned as canonical (would invert dedup).
+    assert "decoder" in side_substrings
+    assert "encoder" not in side_substrings
+
+
+def test_collect_canonical_tied_patterns_list_style_yields_no_canonical_info():
+    """Legacy list-style _tied_weights_keys carries no canonical/alias info — returns empty."""
+    enc, dec = make_tied_linear_pair()
+    parent = wrap_in_parent_with_tied_keys(enc, dec, decoder_canonical=False)
+
+    patterns, side_substrings = _collect_canonical_tied_patterns(parent)
+
+    assert patterns == []
+    assert side_substrings == []
+
+
+def test_reorder_canonical_first_puts_decoder_keys_before_encoder_keys():
+    """_reorder_canonical_first moves canonical-side state_dict keys ahead of alias-side keys."""
+    enc, dec = make_tied_linear_pair()
+    parent = wrap_in_parent_with_tied_keys(enc, dec, decoder_canonical=True)
+
+    sd = OrderedDict(
+        [
+            ("encoder.weight", torch.zeros(1)),
+            ("unrelated.foo", torch.zeros(1)),
+            ("decoder.weight", torch.zeros(1)),
+        ]
+    )
+
+    reordered = _reorder_canonical_first(sd, parent)
+    keys = list(reordered.keys())
+
+    assert keys.index("decoder.weight") < keys.index("encoder.weight")
+    assert set(reordered) == set(sd)  # no drops or additions
+
+
+def _quantize_and_get_input_quantizers(parent):
+    """Insert FP8 quantizers via no-op forward_loop and return both input_quantizers."""
+    mtq.quantize(parent, mtq.FP8_DEFAULT_CFG, forward_loop=lambda m: None)
+    return parent.encoder.input_quantizer, parent.decoder.input_quantizer
+
+
+def test_sync_tied_input_amax_max_merges_tied_module_amaxes_in_place():
+    """Tied Linears with divergent input_quantizer.amax get both sides overwritten with the max."""
+    enc, dec = make_tied_linear_pair()
+    parent = wrap_in_parent_with_tied_keys(enc, dec, decoder_canonical=True)
+    enc_q, dec_q = _quantize_and_get_input_quantizers(parent)
+
+    enc_q.amax = torch.tensor(2.0)
+    dec_q.amax = torch.tensor(5.0)
+
+    sync_tied_input_amax(parent)
+
+    expected = torch.tensor(5.0)
+    assert torch.allclose(enc_q.amax, expected)
+    assert torch.allclose(dec_q.amax, expected)
+
+
+def test_sync_tied_input_amax_no_op_for_untied_modules():
+    """Untied Linears keep their per-side amaxes — the helper is a no-op when there's no tie."""
+    parent = torch.nn.Module()
+    parent.encoder = torch.nn.Linear(16, 32, bias=False)
+    parent.decoder = torch.nn.Linear(16, 32, bias=False)
+    enc_q, dec_q = _quantize_and_get_input_quantizers(parent)
+
+    enc_q.amax = torch.tensor(2.0)
+    dec_q.amax = torch.tensor(5.0)
+
+    sync_tied_input_amax(parent)
+
+    assert torch.allclose(enc_q.amax, torch.tensor(2.0))
+    assert torch.allclose(dec_q.amax, torch.tensor(5.0))
+
+
+def _clear_export_quantized_weight_cache():
+    """Clear the function-static alias cache; isolates each test from prior session state."""
+    _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
+
+
+def _calibrate_through_both_children(parent):
+    """Insert NVFP4 quantizers and run a one-shot forward through both children for calibration."""
+
+    def forward_loop(m):
+        x = torch.randn(2, 16)
+        m.encoder(x)
+        m.decoder(x)
+
+    mtq.quantize(parent, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
+
+
+def test_export_quantized_weight_aliases_packed_weight_for_tied_linears():
+    """Tied Linears share data_ptr for packed .weight and scale buffers after export."""
+    _clear_export_quantized_weight_cache()
+
+    enc, dec = make_tied_linear_pair()
+    parent = wrap_in_parent_with_tied_keys(enc, dec)
+    _calibrate_through_both_children(parent)
+
+    _export_quantized_weight(enc, torch.float16, "weight")
+    _export_quantized_weight(dec, torch.float16, "weight")
+
+    assert enc.weight.data_ptr() == dec.weight.data_ptr()
+    for scale_attr in ("weight_scale", "weight_scale_2"):
+        if hasattr(enc, scale_attr) and hasattr(dec, scale_attr):
+            assert getattr(enc, scale_attr).data_ptr() == getattr(dec, scale_attr).data_ptr()
+
+
+def test_export_quantized_weight_no_alias_for_untied_linears():
+    """Untied Linears keep independent data_ptrs after export — no false-positive aliasing."""
+    _clear_export_quantized_weight_cache()
+
+    parent = torch.nn.Module()
+    parent.encoder = torch.nn.Linear(16, 32, bias=False)
+    parent.decoder = torch.nn.Linear(16, 32, bias=False)
+    assert parent.encoder.weight.data_ptr() != parent.decoder.weight.data_ptr()
+    _calibrate_through_both_children(parent)
+
+    _export_quantized_weight(parent.encoder, torch.float16, "weight")
+    _export_quantized_weight(parent.decoder, torch.float16, "weight")
+
+    assert parent.encoder.weight.data_ptr() != parent.decoder.weight.data_ptr()
+
+
+def test_export_quantized_weight_skips_alias_when_one_tied_side_is_unquantized():
+    """Unquantized side early-returns; its .weight stays at the original shared Parameter."""
+    _clear_export_quantized_weight_cache()
+
+    enc, dec = make_tied_linear_pair()
+    parent = wrap_in_parent_with_tied_keys(enc, dec)
+    original_shared_data_ptr = enc.weight.data_ptr()
+
+    _calibrate_through_both_children(parent)
+    # is_enabled is a read-only property; .disable() is the canonical bypass.
+    dec.weight_quantizer.disable()
+
+    _export_quantized_weight(enc, torch.float16, "weight")
+    _export_quantized_weight(dec, torch.float16, "weight")
+
+    assert enc.weight.data_ptr() != original_shared_data_ptr  # encoder got a fresh packed weight
+    assert dec.weight.data_ptr() == original_shared_data_ptr  # decoder untouched
+    assert enc.weight.data_ptr() != dec.weight.data_ptr()
diff --git a/tests/unit/torch/quantization/plugins/test_fused_experts.py b/tests/unit/torch/quantization/plugins/test_fused_experts.py
index ce23f7a51d5..ef2df36090e 100644
--- a/tests/unit/torch/quantization/plugins/test_fused_experts.py
+++ b/tests/unit/torch/quantization/plugins/test_fused_experts.py
@@ -22,6 +22,8 @@
 
 pytest.importorskip("transformers")
 
+from _test_utils.torch.quantization.tied_modules import tie_fused_experts_3d_params
+
 import modelopt.torch.quantization as mtq
 from modelopt.torch.export.moe_utils import _export_fused_experts
 from modelopt.torch.export.quant_utils import get_quant_config
@@ -514,6 +516,115 @@ def _spy_export(wrapper, dtype):
                 QuantModuleRegistry.unregister(expert_type)
 
 
+# ---------------------------------------------------------------------------
+# Tests for tied-experts dedup in _export_fused_experts
+# ---------------------------------------------------------------------------
+def _build_two_moe_blocks(tie: bool) -> nn.Module:
+    """Build a parent with two _SyntheticSparseMoeBlock children, optionally with tied 3-D params."""
+    parent = nn.Module()
+    parent.encoder = _SyntheticSparseMoeBlock()
+    parent.decoder = _SyntheticSparseMoeBlock()
+    if tie:
+        tie_fused_experts_3d_params(parent.encoder.experts, parent.decoder.experts)
+    return parent
+
+
+def _moe_fp8_quant_cfg():
+    """Custom inline FP8 cfg targeting the MoE-specific quantizer names."""
+    return {
+        "quant_cfg": [
+            {"quantizer_name": "*", "enable": False},
+            {
+                "quantizer_name": "*gate_up_proj_input_quantizer",
+                "cfg": {"num_bits": 8, "axis": None},
+            },
+            {"quantizer_name": "*down_proj_input_quantizer", "cfg": {"num_bits": 8, "axis": None}},
+            {"quantizer_name": "*gate_up_proj_weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}},
+            {"quantizer_name": "*down_proj_weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}},
+        ],
+        "algorithm": "max",
+    }
+
+
+def _calibrate_two_moe_blocks(parent):
+    """Fire one calibration batch through both encoder.experts and decoder.experts."""
+
+    def forward_loop(m):
+        torch.manual_seed(0)
+        x = torch.randn(1, 4, HIDDEN_DIM)
+        m.encoder(x)
+        m.decoder(x)
+
+    mtq.quantize(parent, _moe_fp8_quant_cfg(), forward_loop=forward_loop)
+
+
+def _clear_fused_experts_caches():
+    """Clear function-static alias caches in both export entry points."""
+    _export_fused_experts.__dict__.pop("_tied_unpacked_cache", None)
+    # _export_fused_experts internally calls _export_quantized_weight for each
+    # per-expert wrapper; clear that cache too so each test starts from a pristine state.
+    from modelopt.torch.export.unified_export_hf import _export_quantized_weight
+
+    _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
+
+
+class TestExportFusedExpertsTiedDedup:
+    @staticmethod
+    def _cleanup_registry(mod_type):
+        if QuantModuleRegistry.get(mod_type) is not None:
+            QuantModuleRegistry.unregister(mod_type)
+
+    def test_per_expert_buffers_share_data_ptr_for_tied_fused_experts(self):
+        """Two tied FusedExperts modules: every per-expert .weight + scale buffer shares data_ptr."""
+        _clear_fused_experts_caches()
+        parent = _build_two_moe_blocks(tie=True)
+        expert_type = type(parent.encoder.experts)
+        self._cleanup_registry(expert_type)
+        try:
+            _calibrate_two_moe_blocks(parent)
+
+            _export_fused_experts(parent.encoder.experts, torch.float16)
+            _export_fused_experts(parent.decoder.experts, torch.float16)
+
+            for idx in range(NUM_EXPERTS):
+                enc_expert = getattr(parent.encoder.experts, str(idx))
+                dec_expert = getattr(parent.decoder.experts, str(idx))
+                for proj_name in ("gate_proj", "up_proj", "down_proj"):
+                    enc_proj = getattr(enc_expert, proj_name)
+                    dec_proj = getattr(dec_expert, proj_name)
+                    assert enc_proj.weight.data_ptr() == dec_proj.weight.data_ptr()
+                    for scale_attr in ("weight_scale", "weight_scale_2"):
+                        if hasattr(enc_proj, scale_attr) and hasattr(dec_proj, scale_attr):
+                            assert (
+                                getattr(enc_proj, scale_attr).data_ptr()
+                                == getattr(dec_proj, scale_attr).data_ptr()
+                            )
+        finally:
+            self._cleanup_registry(expert_type)
+
+    def test_per_expert_buffers_have_independent_data_ptrs_for_untied_fused_experts(self):
+        """Two untied FusedExperts modules: per-expert buffers stay independent (no false-positive alias)."""
+        _clear_fused_experts_caches()
+        parent = _build_two_moe_blocks(tie=False)
+        expert_type = type(parent.encoder.experts)
+        self._cleanup_registry(expert_type)
+        try:
+            _calibrate_two_moe_blocks(parent)
+
+            _export_fused_experts(parent.encoder.experts, torch.float16)
+            _export_fused_experts(parent.decoder.experts, torch.float16)
+
+            for idx in range(NUM_EXPERTS):
+                enc_expert = getattr(parent.encoder.experts, str(idx))
+                dec_expert = getattr(parent.decoder.experts, str(idx))
+                for proj_name in ("gate_proj", "up_proj", "down_proj"):
+                    enc_proj = getattr(enc_expert, proj_name)
+                    dec_proj = getattr(dec_expert, proj_name)
+                    assert enc_proj.weight.data_ptr() != dec_proj.weight.data_ptr()
+        finally:
+            self._cleanup_registry(expert_type)
+
+
 # ---------------------------------------------------------------------------
 # Tests for force_eager_experts_impl_on_the_fly
 # ---------------------------------------------------------------------------

From 0543907bec98a4cd9a20fb19f9c6e8f1dc6bb9d0 Mon Sep 17 00:00:00 2001
From: Juhi Mittal <juhim@nvidia.com>
Date: Sat, 13 Jun 2026 01:00:01 +0000
Subject: [PATCH 8/8] docs: add CHANGELOG entry for tied-weight PTQ support

Adds a New Features bullet under 0.46 covering the tied-weight dedup,
canonical-side reorder, sync_tied_input_amax helper, and the
*self_conditioning* default exclude introduced earlier in this series.
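
The max-merge behind sync_tied_input_amax can be pictured with a small
pure-Python sketch. It is illustrative only and is not the ModelOpt
implementation: FakeLinear and max_merge_tied_amax are hypothetical names,
and id() of a shared Python object stands in for weight.data_ptr() on real
tensors.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeLinear:
    """Hypothetical stand-in for a quantized Linear: a weight object plus an input amax."""

    weight: list
    input_amax: float


def max_merge_tied_amax(modules):
    """Give every group of modules sharing one weight object the group's largest amax."""
    groups = defaultdict(list)
    for mod in modules:
        # Assumption: id() plays the role that weight.data_ptr() plays for real tensors.
        groups[id(mod.weight)].append(mod)
    for tied in groups.values():
        merged = max(m.input_amax for m in tied)
        for m in tied:
            # A singleton group is "merged" with itself, so untied modules are unchanged.
            m.input_amax = merged


shared = [0.0] * 4
enc = FakeLinear(shared, 2.0)        # tied to dec through the shared weight
dec = FakeLinear(shared, 5.0)
other = FakeLinear([0.0] * 4, 1.0)   # untied: owns its own weight object

max_merge_tied_amax([enc, dec, other])
print(enc.input_amax, dec.input_amax, other.input_amax)  # prints "5.0 5.0 1.0"
```

Taking the maximum rather than the mean keeps the larger activation range
representable on both tied sides, which matches the "don't clip either
side" goal stated in the changelog entry.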

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
---
 CHANGELOG.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 49c58586674..b218cdbb5bb 100755
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -8,6 +8,7 @@ Changelog
 
 - Add the ``day0-release`` agent skill (``.agents/skills/day0-release/``), a deterministic end-to-end driver that chains the PTQ → evaluation → comparison skills (the evaluation stage deploys the checkpoint itself) with an enforced gate after each stage and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Ships three GPU-free, unit-tested gate scripts (``gate_ptq.py``, ``gate_run.py``, ``gate_compare.py``) that validate checkpoint coverage, evaluation-run completeness, and baseline-vs-candidate accuracy threshold. v1 reports and stops on regression; the recipe-search loop is deferred.
 - Add **streaming** speculative-decoding training (EAGLE3 / DFlash): the draft trains on base-model hidden states produced on the fly by a co-located ``vllm serve`` (no disk dump), moved trainer-side over NIXL RDMA, scaling to multi-node (dedicated serve replicas + DDP trainers). New launcher examples for NVFP4 Kimi-K2.5 / K2.6 on GB200/aarch64 under ``tools/launcher/examples/moonshotai/``.
+- Add tied-weight PTQ and HF-checkpoint export support for block-diffusion encoder-decoder LLMs (e.g. DiffusionGemma) whose encoder/decoder stacks share parameters via HF ``_tied_weights_keys``. ``_export_quantized_weight`` and ``_export_fused_experts`` now alias bit-identical packed ``weight`` / ``weight_scale`` / ``weight_scale_2`` buffers across modules sharing a source weight ``data_ptr()`` so the downstream ``postprocess_state_dict`` dedup catches them (~42% storage reduction on ``nvfp4_experts_only`` for tied 26B MoE checkpoints). New ``sync_tied_input_amax`` helper max-merges per-side ``input_quantizer.amax`` across tied modules before export so single-backbone consumers that load one ``input_scale`` per parameter don't clip either side. Opt-in ``--canonical_tied_naming`` flag (default off) reorders the ``state_dict`` so canonical-side keys per HF's ``_tied_weights_keys`` declaration win the ``data_ptr()`` dedup. ``default_disabled_quantizers`` gains a ``*self_conditioning*`` wildcard companion to the upstream vision excludes (PR #1691). ``hf_ptq.py`` also unwraps ``ModelOutput`` dataclasses from ``.generate()`` so the preview decode works on diffusion models. Non-tied models see no behavioral change.
 
 0.45 (2026-06-xx)
 ^^^^^^^^^^^^^^^^^