From 47d4ab639bf816a9728496203005848a7584466d Mon Sep 17 00:00:00 2001 From: Juhi Mittal Date: Thu, 28 May 2026 00:32:19 +0000 Subject: [PATCH 1/8] Onboard DiffusionGemma (northbloom) block-diffusion text model to hf_ptq.py DiffusionGemmaForBlockDiffusion is an encoder-decoder block-diffusion text LLM (Gemma4 MoE per-layer transformer wrapped in an encoder + iterative-decoder + self-conditioning + 48-step denoising loop). Four small additions make stock examples/llm_ptq/hf_ptq.py work end-to-end for it; no new entry points needed. Substring patterns are chosen so they ALSO match the previous class name DiffusionGemma4ModelForBlockDiffusion (and previous model_type "diffusion_gemma4"). The model was renamed in transformers mid-development; using "diffusiongemma" / "DiffusionGemma" / "diffusion_gemma" matches both the current and the older class so we don't silently regress on either checkpoint generation. 1. modelopt/torch/utils/dataset_utils.py Add "diffusiongemma" to the substring list in model_type_is_enc_dec. This routes ModelOpt's built-in calibration forward_loop to model.generate() instead of a single model.forward(), so calibration exercises the full 48-step inner denoising loop and sees the entire noise->clean activation distribution. 2. modelopt/torch/export/model_utils.py Add "DiffusionGemma": "diffusion_gemma" to MODEL_NAME_TO_TYPE, *before* "Gemma". get_model_type does substring matching; "gemma" is a substring of "diffusiongemma" so without this entry the class is silently mis-classified as plain "gemma" -- which then mis-routes downstream model-type-dependent logic. 3. examples/llm_ptq/hf_ptq.py (output_decode) DiffusionGemma.generate() returns a DiffusionGemmaGenerationOutput ModelOutput dataclass, not a bare token tensor. The preview decode code's tensor-slicing crashes on ModelOutput. Unwrap to .sequences at the top of output_decode so both the enc-dec and AR slicing branches work uniformly. 
Generic shim -- helps any model whose .generate returns a ModelOutput with a .sequences attribute. 4. examples/llm_ptq/example_utils.py (is_enc_dec) Comment-only clarification on the semantics. is_enc_dec controls how hf_ptq.py:output_decode slices the preview result (whether to strip the prompt prefix). For T5/BART/Whisper generate returns only new tokens. For DiffusionGemma it returns prompt+canvas concatenated, so it belongs with the AR slicing path and stays out of this list. No behavioral change; documents the prior intent so future contributors don't re-add diffusion_gemma here. End-to-end smoke test: python hf_ptq.py --pyt_ckpt_path \ --qformat nvfp4_experts_only \ --calib_size 32 --trust_remote_code \ --export_path produces a working NVFP4 checkpoint with coherent pre/post-PTQ preview. Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: Juhi Mittal --- examples/llm_ptq/example_utils.py | 8 +++++++- examples/llm_ptq/hf_ptq.py | 5 +++++ modelopt/torch/export/model_utils.py | 3 +++ modelopt/torch/utils/dataset_utils.py | 5 ++++- 4 files changed, 19 insertions(+), 2 deletions(-) diff --git a/examples/llm_ptq/example_utils.py b/examples/llm_ptq/example_utils.py index d36754a8d42..1d3c196d428 100755 --- a/examples/llm_ptq/example_utils.py +++ b/examples/llm_ptq/example_utils.py @@ -806,7 +806,13 @@ def is_model_on_gpu(model) -> bool: def is_enc_dec(model_type) -> bool: - """Return if the model is a encoder-decoder model.""" + """Return whether the model_type uses encoder-decoder-style preview decode. + + Controls whether ``hf_ptq.py`` slices off the prompt prefix from + ``.generate()`` output. ``diffusion_gemma`` is structurally encoder-decoder + but returns prompt+canvas concatenated, so it stays OFF this list (AR-style + decode applies). 
+ """ return model_type in ["t5", "bart", "whisper"] diff --git a/examples/llm_ptq/hf_ptq.py b/examples/llm_ptq/hf_ptq.py index afb725988c8..6b7a5d773a6 100755 --- a/examples/llm_ptq/hf_ptq.py +++ b/examples/llm_ptq/hf_ptq.py @@ -941,6 +941,11 @@ def input_decode(input_ids): raise ValueError("The processor or tokenizer must be set") def output_decode(generated_ids, input_shape): + # Some `.generate()` returns a ModelOutput dataclass (e.g. DiffusionGemma); + # unwrap to the token tensor so downstream slicing works uniformly. + if hasattr(generated_ids, "sequences"): + generated_ids = generated_ids.sequences + if is_enc_dec(model_type): if processor is not None and isinstance(processor, WhisperProcessor): return processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] diff --git a/modelopt/torch/export/model_utils.py b/modelopt/torch/export/model_utils.py index 3bd72d9de91..9c49cae0cf1 100755 --- a/modelopt/torch/export/model_utils.py +++ b/modelopt/torch/export/model_utils.py @@ -33,6 +33,9 @@ "Qwen3Next": "qwen3next", "QWen": "qwen", "RecurrentGemma": "recurrentgemma", + # DiffusionGemma must come before "Gemma" — get_model_type substring-matches + # in order, and "gemma" is a substring of "diffusiongemma". + "DiffusionGemma": "diffusion_gemma", "Gemma3": "gemma3", "Gemma2": "gemma2", "Gemma": "gemma", diff --git a/modelopt/torch/utils/dataset_utils.py b/modelopt/torch/utils/dataset_utils.py index e9b53897fee..9dd608fee9e 100644 --- a/modelopt/torch/utils/dataset_utils.py +++ b/modelopt/torch/utils/dataset_utils.py @@ -1208,7 +1208,10 @@ def create_forward_loop( def model_type_is_enc_dec(model): - enc_dec_model_list = ["t5", "bart", "whisper"] + # Substring match against `model.__class__.__name__.lower()` — entries are + # the lowercased class-name form (no underscores). Calibration then uses + # `model.generate` to run the full denoising loop. 
+ enc_dec_model_list = ["t5", "bart", "whisper", "diffusiongemma"] return any(model_name in model.__class__.__name__.lower() for model_name in enc_dec_model_list) From 60d4ebb26b8def23dd9d19aca7a60c0740a2d9ec Mon Sep 17 00:00:00 2001 From: Juhi Mittal Date: Thu, 28 May 2026 00:47:36 +0000 Subject: [PATCH 2/8] moe export: alias bit-identical per-expert buffers between tied modules MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When a model has multiple fused-experts modules whose 3-D source params share storage via HF _tied_weights_keys (e.g. the encoder and decoder transformer stacks of a block-diffusion encoder-decoder LLM like DiffusionGemma4 / northbloom), the unpacking loop in _export_fused_experts ordinarily creates fresh per-expert tensors for each call — destroying the tied identity and writing two full sets of expert weights + scales to disk. This adds a function-local cache keyed by (gate_up_proj.data_ptr(), down_proj.data_ptr()). On a cache miss the existing unpacking path runs unchanged. On a cache hit, after the normal unpacking completes, the per-expert weight / weight_scale / weight_scale_2 buffers are re-pointed at the prior module's tensors so they share storage. The downstream postprocess_state_dict data_ptr()-based dedup then catches them and drops the duplicates from the saved checkpoint. input_scale is intentionally NOT aliased: encoder and decoder paths have legitimately different activation distributions (verified across all 60 tied pairs of a 512-prompt calibration — down_proj_input divergence median 2.26x, max 18.4x), so each side keeps its own per-side calibrated scale. Model-agnostic: no name regex, no model-type lookup. Cache miss falls through to existing behavior, so non-tied models are unaffected. 
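The caching idea in this message can be illustrated without torch. This is a minimal sketch, not the ModelOpt implementation: id() of a shared Python list stands in for data_ptr(), FakeExperts is a hypothetical stand-in for a fused-experts module, and unpack() mimics the per-expert slicing that breaks the tie.

```python
class FakeExperts:
    """Hypothetical stand-in for a fused-experts module."""

    def __init__(self, fused):
        self.fused = fused      # source storage, shared between tied modules
        self.unpacked = None    # per-expert copies, filled in by unpack()


_cache = {}  # source identity -> first module unpacked from that source


def unpack(module):
    # Capture the source identity before unpacking (assumption: id() plays
    # the role that data_ptr() plays for real tensors).
    key = id(module.fused)
    # Copying each row creates fresh objects, so the tie is lost here.
    module.unpacked = [list(row) for row in module.fused]
    prior = _cache.get(key)
    if prior is not None and prior is not module:
        # Cache hit: re-point at the prior module's unpacked data.
        module.unpacked = prior.unpacked
    else:
        # Cache miss: this module becomes the owner for this source.
        _cache[key] = module


fused = [[1, 2], [3, 4]]
encoder, decoder = FakeExperts(fused), FakeExperts(fused)
untied = FakeExperts([[1, 2], [3, 4]])  # equal values, separate storage
for m in (encoder, decoder, untied):
    unpack(m)
```

Only modules whose source storage was already shared end up aliased. The untied module has equal values but its own storage, so it misses the cache and keeps its own copies.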
Empirical result on DiffusionGemma4 26B at nvfp4_experts_only: safetensors 28.43 GB -> 16.47 GB (-42%) shards 4 -> 2 decoder weight/weight_scale/weight_scale_2 entries: 3840 -> 0 each decoder input_scale entries: 3840 -> 3840 (kept per-side, as intended) Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: Juhi Mittal --- modelopt/torch/export/moe_utils.py | 50 ++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+) diff --git a/modelopt/torch/export/moe_utils.py b/modelopt/torch/export/moe_utils.py index e325e5346f1..059955f81c6 100644 --- a/modelopt/torch/export/moe_utils.py +++ b/modelopt/torch/export/moe_utils.py @@ -42,6 +42,13 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None: {E}.gate_proj.weight, {E}.gate_proj.weight_scale, ... {E}.up_proj.weight, {E}.up_proj.weight_scale, ... {E}.down_proj.weight, {E}.down_proj.weight_scale, ... + + Tied-experts dedup: when multiple fused-expert modules share their 3-D + source params via HF ``_tied_weights_keys``, the unpacking creates fresh + per-expert tensors that break the tie. We cache the source ``data_ptr()`` + at entry and on a later cache hit alias the per-expert ``weight`` / + ``weight_scale`` / ``weight_scale_2`` back to the prior module so + downstream dedup catches them. ``input_scale`` is left per-side. """ from modelopt.torch.export.unified_export_hf import _export_quantized_weight from modelopt.torch.quantization.plugins.huggingface import _get_fused_expert_intermediate_dim @@ -49,6 +56,10 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None: n = module.num_experts expert_dim = _get_fused_expert_intermediate_dim(module) + # Capture source tensor identities BEFORE unpacking (the source + # attrs are deleted at the end of this function). + _source_key = (module.gate_up_proj.data_ptr(), module.down_proj.data_ptr()) + # 1. Shared input quantizers — one per projection type, shared across all experts. 
gate_up_input_q = module.gate_up_proj_input_quantizer down_input_q = module.down_proj_input_quantizer @@ -178,6 +189,45 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None: if hasattr(module, attr): delattr(module, attr) + # 5. Tied-experts dedup: if this module's source params have been seen + # before, alias the bit-identical per-expert buffers (weight, + # weight_scale, weight_scale_2) to the previously-unpacked module. + # input_scale is left per-side so encoder/decoder calibration stays + # accurate where their activation distributions diverge. + _cache = _export_fused_experts.__dict__.setdefault("_tied_unpacked_cache", {}) + _prior = _cache.get(_source_key) + if _prior is not None and _prior is not module: + for _idx in range(n): + _cur_expert = getattr(module, str(_idx), None) + _prior_expert = getattr(_prior, str(_idx), None) + if _cur_expert is None or _prior_expert is None: + continue + for _proj_name in ("gate_proj", "up_proj", "down_proj"): + _cur_proj = getattr(_cur_expert, _proj_name, None) + _prior_proj = getattr(_prior_expert, _proj_name, None) + if _cur_proj is None or _prior_proj is None: + continue + # Alias the weight (Parameter) so both sides reference the + # same nn.Parameter → same data_ptr() → existing dedup + # in postprocess_state_dict will drop the duplicate. + if hasattr(_prior_proj, "weight"): + _cur_proj.weight = _prior_proj.weight + # Alias the bit-identical scale buffers. Re-register to + # ensure data_ptr() matches the prior side's tensor. + for _attr in ("weight_scale", "weight_scale_2"): + if not hasattr(_prior_proj, _attr): + continue + if _attr in _cur_proj._buffers: + del _cur_proj._buffers[_attr] + elif hasattr(_cur_proj, _attr): + delattr(_cur_proj, _attr) + _cur_proj.register_buffer(_attr, getattr(_prior_proj, _attr)) + # input_scale intentionally NOT aliased — per-side amaxes + # are legitimately different (encoder vs decoder activation + # distributions diverge, sometimes >10x — see Q2 analysis). 
+ else: + _cache[_source_key] = module + def save_expert_token_count_table(model: nn.Module, output_dir: str | Path | None = None): """Collect expert_token_count from all quantized MoE layers and save as an HTML table. From 225072b98528b7e2f82dd641ccbfa2ad62e07fb7 Mon Sep 17 00:00:00 2001 From: Juhi Mittal Date: Thu, 28 May 2026 04:25:13 +0000 Subject: [PATCH 3/8] export: alias bit-identical buffers between tied dense Linears MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Symmetric companion to the data_ptr-cache alias added to _export_fused_experts in the prior commit. Catches plain nn.Linear modules whose .weight Parameters are tied via HF tie_weights() (e.g. encoder and decoder attention QKV/O, MLP gate/up/down, router proj of an encoder-decoder LLM like DiffusionGemma4 / northbloom) when they are quantized under recipes that route through _export_quantized_weight. Mechanism: capture weight.data_ptr() at the top of the function, before the setattr further down wraps the packed bytes in a fresh nn.Parameter (which destroys the tie). At the end of the function, consult a function-local cache keyed by that captured data_ptr. On cache miss: register this sub_module as the canonical owner. On cache hit (a previously-processed module shared the same source weight memory): alias .weight, weight_scale, weight_scale_2 to the prior module's tensors so downstream data_ptr-based dedup in postprocess_state_dict drops the duplicates. input_scale is intentionally NOT aliased — calibration legitimately diverges per-side (verified in Q2 analysis: down_proj_input ratio up to 18x across 60 tied pairs on the northbloom DiffusionGemma4 model). Recipe-agnostic: under nvfp4_experts_only this is a true no-op (dense Linears early-return at QUANTIZATION_NONE; per-expert wrappers reach this function but have fresh data_ptrs from upstream slice+contiguous so cache misses always). Under full nvfp4 it fires for every tied dense Linear pair. 
Same safety guarantees as the existing data_ptr-based dedup: cannot false-positive because it only aliases when source memory was already shared by the model author (via tie_weights or equivalent). No name regex, no _tied_weights_keys lookup, no model introspection. Empirical results on DiffusionGemma4 26B (calib_size=8 smoke): nvfp4_experts_only: 16.47 GB -> 16.47 GB (byte-identical with prior experts-only export; this patch a no-op as expected) nvfp4 (full): ~27 GB est -> 14.24 GB (-12.7 GB; both patches firing on disjoint module sets; per-name dedup verified in index) Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: Juhi Mittal --- modelopt/torch/export/unified_export_hf.py | 43 ++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/modelopt/torch/export/unified_export_hf.py b/modelopt/torch/export/unified_export_hf.py index ef5757aa0cb..1b0e0cb95db 100644 --- a/modelopt/torch/export/unified_export_hf.py +++ b/modelopt/torch/export/unified_export_hf.py @@ -520,6 +520,14 @@ def _export_quantized_weight( The export includes converting weight tensor to correct quantized values and quantized dtype, and registering scaling factors. + + Tied-weight dedup: the setattr below replaces ``.weight`` with a fresh + ``nn.Parameter`` wrapping packed bytes, breaking any HF-level tie. + We capture ``weight.data_ptr()`` before the replacement and consult a + function-local cache at the end; on cache hit, ``weight`` / ``weight_scale`` / + ``weight_scale_2`` are re-pointed at the previously-processed module so the + downstream data_ptr dedup catches them. Uses memory identity only — no + ``_tied_weights_keys`` lookup, no-op for non-tied modules. 
""" quantization_format = get_quantization_format(sub_module) if quantization_format == QUANTIZATION_NONE: @@ -528,6 +536,13 @@ def _export_quantized_weight( block_size = get_weight_block_size(sub_module, weight_name) quantizer_attrs = quantizer_attr_names(weight_name) weight: nn.Parameter = getattr(sub_module, weight_name) + + # Capture source identity BEFORE any tensor-creating operation below. + # For HF-tied weights this matches across all modules sharing the + # underlying Parameter; the cache lookup at the end of this function + # uses it to detect ties whose Python identity is about to be broken + # by the setattr on `weight_name` further down. + _tied_source_data_ptr = weight.data_ptr() weight_quantizer: TensorQuantizer | SequentialQuantizer = getattr( sub_module, quantizer_attrs.weight_quantizer ) @@ -703,6 +718,34 @@ def _export_quantized_weight( if weight_scale is not None: sub_module.register_buffer(quantizer_attrs.weight_scale, weight_scale) + # Tied-weight dedup: if a previously-processed module shared the same + # source weight memory, alias the bit-identical packed weight and scale + # buffers from that prior module so downstream data_ptr-based dedup in + # postprocess_state_dict can collapse the duplicates. Per-side + # input_scale is intentionally NOT aliased — activation amaxes + # legitimately differ across tied modules whose forward paths see + # different distributions. + _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {}) + _prior = _cache.get(_tied_source_data_ptr) + if _prior is not None and _prior is not sub_module: + # Alias the packed weight (same nn.Parameter -> same data_ptr). + if hasattr(_prior, weight_name): + setattr(sub_module, weight_name, getattr(_prior, weight_name)) + # Alias bit-identical scale buffers (NOT input_scale). 
+ for _attr in ( + quantizer_attrs.weight_scale, + quantizer_attrs.weight_scale_2, + ): + if _attr is None or not hasattr(_prior, _attr): + continue + if _attr in sub_module._buffers: + del sub_module._buffers[_attr] + elif hasattr(sub_module, _attr): + delattr(sub_module, _attr) + sub_module.register_buffer(_attr, getattr(_prior, _attr)) + else: + _cache[_tied_source_data_ptr] = sub_module + torch.cuda.empty_cache() From a0d9b650686694bb6a8c64cbe06e52155b15d06a Mon Sep 17 00:00:00 2001 From: Juhi Mittal Date: Fri, 29 May 2026 00:37:25 +0000 Subject: [PATCH 4/8] export: opt-in reorder of tied-weight aliases to canonical-side names Adds an opt-in pass (--canonical_tied_naming, default off) that reorders the state_dict before postprocess_state_dict so that keys matching the canonical side of HF's _tied_weights_keys declaration iterate before their aliases. The existing first-wins data_ptr dedup at quant_utils.py:1148-1163 then drops the alias names, leaving the canonical names in the exported safetensors. Motivation: for models like DiffusionGemma4 (northbloom), HF declares {alias: canonical} via _tied_weights_keys, where the encoder side is the alias and the decoder side is canonical. The original HF safetensors index uses decoder-prefixed names for all tied weights (661/691 keys vs 30/691 encoder-only layer_scalar keys). The single-backbone vLLM mockup loader strips both prefixes to model.* and relies on this canonical naming. The default modelopt export today walks the model in registration order (encoder before decoder, per the model's __init__ order), so encoder names win in the first-wins dedup. The exported checkpoint thus uses 46 677 encoder-prefixed keys for tied tensors -- backwards relative to the upstream HF naming and to what downstream consumers expect. Implementation. _tied_weights_keys is declared per model class with paths relative to that class. In nested models (e.g. 
DiffusionGemma4) multiple submodules declare their own ties: the outer wrapper at DiffusionGemma4ModelForBlockDiffusion ties lm_head.weight to model.decoder.embed_tokens.weight, while the inner DiffusionGemma4Model at model.model declares the much larger encoder<->decoder dict with paths relative to itself. _collect_canonical_tied_patterns walks model.named_modules() and collects every dict-style _tied_weights_keys declaration, prefixing each pattern with the submodule's qualified path so the regexes match against root-level state_dict keys. Without the prefix, the inner dict's patterns (which lack a "model." prefix) silently fail to match keys like "model.decoder.layers.0.self_attn.q_proj.weight" -- a bug that would cause only the outer dict's single entry (embed_tokens) to flip in the dedup. _reorder_canonical_first then partitions the state_dict into head (canonical-pattern matches) and tail (everything else), preserving original order within each partition. head.update(tail) yields a single dict with canonical keys first. The downstream dedup loop iterates this in insertion order and records the canonical names in seen_tensors; alias names then arrive as duplicates and are dropped. Scope of behavior change: - Models with no _tied_weights_keys, or only legacy list-of-strings declarations: _collect returns an empty pattern list, helper short-circuits, state_dict returned unchanged. Zero effect on any existing modelopt user. - Models with dict-style declarations (e.g. DiffusionGemma4): when the flag is set, canonical-side names win. When the flag is unset (default), behavior is identical to before this commit. No changes to dedup logic itself, to the existing tied-weight alias patches in _export_quantized_weight and _export_fused_experts, or to _process_quantized_modules iteration. Strictly additive. 
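The head/tail partition described above can be sketched with plain dicts and the standard re module. The key names and the single canonical pattern are illustrative assumptions, and first_wins_dedup is a hypothetical stand-in for the data_ptr dedup in postprocess_state_dict (it compares object identity instead of data_ptr).

```python
import re


def reorder_canonical_first(state_dict, canonical_patterns):
    """Move keys matching any canonical pattern to the front, keeping relative order."""
    if not canonical_patterns:
        return state_dict
    head, tail = {}, {}
    for key, value in state_dict.items():
        if any(p.search(key) for p in canonical_patterns):
            head[key] = value
        else:
            tail[key] = value
    head.update(tail)
    return head


def first_wins_dedup(state_dict):
    """Hypothetical stand-in: keep only the first key seen per shared object."""
    seen = set()
    kept = {}
    for key, value in state_dict.items():
        if id(value) in seen:
            continue
        seen.add(id(value))
        kept[key] = value
    return kept


shared = [1.0, 2.0]  # stands in for one tied tensor
state_dict = {
    "model.encoder.layers.0.q_proj.weight": shared,
    "model.encoder.layer_scalar": [0.5],
    "model.decoder.layers.0.q_proj.weight": shared,
}
patterns = [re.compile(r"model\.decoder\.layers\.\d+\.q_proj\.weight")]

default_keys = list(first_wins_dedup(state_dict))
canonical_keys = list(first_wins_dedup(reorder_canonical_first(state_dict, patterns)))
```

In this toy input the default order keeps the encoder name for the shared entry, while the reordered dict keeps the decoder name. The encoder-only entry survives in both cases.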
Verified on DiffusionGemma4 26B / nvfp4_experts_only / v4 / calib_size 32: 35 127 encoder keys removed (decoder kept), 0 decoder keys removed; layer_scalar (30 encoder-only keys) and per-side input_scale (11 520 keys, intentionally not deduped) unaffected; total safetensors bytes 17.68 GB matching the prior export to within 500 KB of safetensors-metadata-ordering noise. Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: Juhi Mittal --- examples/llm_ptq/hf_ptq.py | 14 ++++ modelopt/torch/export/unified_export_hf.py | 91 +++++++++++++++++++++- 2 files changed, 103 insertions(+), 2 deletions(-) diff --git a/examples/llm_ptq/hf_ptq.py b/examples/llm_ptq/hf_ptq.py index 6b7a5d773a6..14215853eaa 100755 --- a/examples/llm_ptq/hf_ptq.py +++ b/examples/llm_ptq/hf_ptq.py @@ -774,6 +774,7 @@ def export_quantized( full_model, export_dir=export_path, extra_state_dict=mtp_state_dict, + canonical_tied_naming=args.canonical_tied_naming, ) if args.qformat == "w4a16_nvfp4": @@ -1257,6 +1258,19 @@ def parse_args() -> argparse.Namespace: default=512, ) parser.add_argument("--export_path", default="exported_model") + parser.add_argument( + "--canonical_tied_naming", + type=lambda s: s.lower() in ("1", "true", "yes"), + default=False, + help=( + "If True, reorder the exported state_dict so tied-weight aliases " + "dedup to the canonical side declared in the model's HF " + "_tied_weights_keys (e.g. decoder-side for DiffusionGemma4). Off " + "by default to avoid renaming exported keys for models whose " + "downstream consumers expect the legacy (registration-order) " + "winner." 
+ ), + ) parser.add_argument( "--dataset", help=( diff --git a/modelopt/torch/export/unified_export_hf.py b/modelopt/torch/export/unified_export_hf.py index 1b0e0cb95db..72cf2dce4f3 100644 --- a/modelopt/torch/export/unified_export_hf.py +++ b/modelopt/torch/export/unified_export_hf.py @@ -749,6 +749,71 @@ def _export_quantized_weight( torch.cuda.empty_cache() +def _collect_canonical_tied_patterns(model: nn.Module) -> list[re.Pattern]: + """Walk the model and collect canonical-side tied-weight patterns. + + HF's ``_tied_weights_keys`` is declared per model class with paths + relative to that class. In nested models, each submodule may declare + its own ties (e.g. an outer wrapper ties ``lm_head.weight`` to + ``model.decoder.embed_tokens.weight``, while the inner + ``DiffusionGemma4Model`` at ``model.model`` declares the much larger + encoder↔decoder dict, with paths relative to itself such as + ``encoder.language_model.layers...weight`` ↔ ``decoder.layers...weight``). + + To match against the root model's state_dict keys we must prefix each + submodule's patterns with its qualified path (``model.``). Without this + prefix, the inner dict's patterns (which lack the ``model.`` prefix) + silently fail to match real keys like + ``model.decoder.layers.0.self_attn.q_proj.weight``. + + Returns a list of compiled regex patterns for the canonical side of + every dict-style ``_tied_weights_keys`` declaration found anywhere in + the module tree. List-style (legacy) declarations are skipped — they + carry no canonical/alias distinction. + """ + patterns: list[re.Pattern] = [] + for name, submodule in model.named_modules(): + tied = getattr(submodule, "_tied_weights_keys", None) + if not isinstance(tied, dict) or not tied: + continue + prefix = f"{name}." 
if name else "" + patterns.extend(re.compile(prefix + p) for p in tied.values()) + return patterns + + +def _reorder_canonical_first(state_dict: dict, model: nn.Module) -> dict: + """Reorder ``state_dict`` so canonical-side tied keys iterate first. + + For models that declare ``_tied_weights_keys`` as a ``{alias_pattern: + canonical_pattern}`` dict (newer HF style, e.g. ``DiffusionGemma4``), + HF designates one side of each tied pair as canonical and the other + as an alias. The downstream data_ptr dedup in + :func:`postprocess_state_dict` keeps whichever key it sees first per + ``data_ptr``, which by default is registration order — and that is + often the alias side, not the canonical side declared by HF. + + This helper rebuilds the dict with canonical-pattern-matching keys + moved to the front (preserving original order within each partition), + so the existing first-wins dedup picks the canonical side. + + No-op when the model declares no dict-style ``_tied_weights_keys`` + anywhere in its module tree (i.e. only legacy list-of-strings + declarations, or no ties at all). + """ + canonical_patterns = _collect_canonical_tied_patterns(model) + if not canonical_patterns: + return state_dict + head: dict = {} + tail: dict = {} + for k, v in state_dict.items(): + if any(p.search(k) for p in canonical_patterns): + head[k] = v + else: + tail[k] = v + head.update(tail) + return head + + def _process_quantized_modules( model: nn.Module, dtype: torch.dtype, @@ -858,7 +923,11 @@ def _process_quantized_modules( def _export_transformers_checkpoint( - model: nn.Module, dtype: torch.dtype | None = None, is_modelopt_qlora: bool = False, **kwargs + model: nn.Module, + dtype: torch.dtype | None = None, + is_modelopt_qlora: bool = False, + canonical_tied_naming: bool = False, + **kwargs, ) -> tuple[dict[str, Any], dict[str, Any]]: """Exports the torch model to the packed checkpoint with original HF naming. 
@@ -995,6 +1064,16 @@ def _export_transformers_checkpoint( # We define kv cache scale as amax / 448 for both FP8 and NVFP4 KV cache quantization. kv_cache_max_bound = 448 kv_cache_format = quant_config["quantization"]["kv_cache_quant_algo"] + + # Optionally reorder so canonical-side tied keys (per HF's + # _tied_weights_keys) iterate first into postprocess_state_dict's + # first-wins data_ptr dedup. Off by default to avoid renaming exported + # keys for models whose downstream consumers expect the legacy + # (registration-order) winner; opt in for models where matching HF's + # own naming convention matters (e.g. DiffusionGemma4 → decoder names). + if canonical_tied_naming: + quantized_state_dict = _reorder_canonical_first(quantized_state_dict, model) + quantized_state_dict = postprocess_state_dict( quantized_state_dict, kv_cache_max_bound, kv_cache_format, is_modelopt_qlora ) @@ -1332,6 +1411,7 @@ def export_hf_checkpoint( components: list[str] | None = None, extra_state_dict: dict[str, torch.Tensor] | None = None, max_shard_size: int | str = "10GB", + canonical_tied_naming: bool = False, **kwargs, ): """Export quantized HuggingFace model checkpoint (transformers or diffusers). @@ -1351,6 +1431,11 @@ def export_hf_checkpoint( to export. If None, all quantized components are exported. extra_state_dict: Extra state dictionary to add to the exported model. max_shard_size: Maximum size of each safetensors shard file. Defaults to "10GB". + canonical_tied_naming: If True, reorder the state_dict so tied-weight + aliases dedup to the canonical side declared in the model's HF + ``_tied_weights_keys`` (e.g. decoder-side for DiffusionGemma4). + Off by default to avoid renaming exported keys for models whose + downstream consumers expect the legacy (registration-order) winner. **kwargs: Runtime-specific post-processing options forwarded to :func:`_postprocess_safetensors` for diffusion model exports. See its docstring for supported keys. 
@@ -1373,7 +1458,9 @@ def export_hf_checkpoint( return try: - post_state_dict, hf_quant_config = _export_transformers_checkpoint(model, dtype) + post_state_dict, hf_quant_config = _export_transformers_checkpoint( + model, dtype, canonical_tied_naming=canonical_tied_naming + ) # Only treat the export as quantized when at least one quant_algo field is set. # get_quant_config always returns a dict (even for sparsity-only or unmodified models), From e351c0f2e99f7f688920295a3482f5a6edad05b7 Mon Sep 17 00:00:00 2001 From: Juhi Mittal Date: Fri, 29 May 2026 01:16:36 +0000 Subject: [PATCH 5/8] export: max-merge tied input_quantizer amaxes; alias input_scale buffers Companion piece to the existing tied-weight alias patches in _export_fused_experts (commit 8b00e85bb) and _export_quantized_weight (commit e8c36b024), which already alias bit-identical weight / weight_scale / weight_scale_2 between tied modules but leave input_scale per-side. This commit closes the loop on input_scale so consumers that load a single canonical scale per Linear (e.g. vLLM's single-backbone DiffusionGemma4 mockup) see a value consistent across all tied sides. Implementation has two parts. 1. New sync_tied_input_amax(model) helper. Walks named_modules(), groups by source weight data_ptr (same signature our existing dedup patches use), and max-merges input_quantizer.amax across each group. Uses the canonical 4-line idiom shared with preprocess_linear_fusion (quant_utils.py:1394-1401) and sync_moe_gate_up_amax (layer_utils.py:1197): merged = torch.max(torch.stack([q.amax for q in qs])) for q in qs: q.amax = merged.clone() Handles both dense Linears (keyed by weight.data_ptr) and fused MoE modules (keyed by (gate_up_proj, down_proj) data_ptr tuple, merging gate_up_proj_input_quantizer and down_proj_input_quantizer independently across the group). Scalar-only, matching preprocess_linear_fusion's contract. 
Called unconditionally from _export_transformers_checkpoint after sync_moe_gate_up_amax, BEFORE _process_quantized_modules so the merged amax flows into _export_quantized_weight's input_scale derivation. Mirrors sync_moe_gate_up_amax's "no-flag, fires when applicable, no-op otherwise" convention. 2. Extend the existing tied-weight alias loops in _export_quantized_weight and _export_fused_experts to include input_scale alongside weight_scale / weight_scale_2. Before this commit those loops intentionally skipped input_scale because encoder/decoder amaxes legitimately differed (Q2 analysis showed up to 18x divergence for down_proj_input on v1). With sync_tied_input_amax in place, both sides now derive bit-identical input_scale values; aliasing the buffers is safe and lets the existing data_ptr dedup in postprocess_state_dict collapse them so only one canonical entry per Linear survives in the exported safetensors. Also extends the Q-B canonical-side reorder pass added in commit 837768fe3 with an auto-derived side-substring matcher. HF's _tied_weights_keys regex patterns target the pre-export module structure (fused gate_up_proj), but after _export_fused_experts unpacks them into per-expert gate_proj/up_proj/down_proj submodules, post-export keys like ...experts.Y.gate_proj.input_scale are not covered by HF's regex. Without the substring fallback, those keys fell through Q-B to the "alias-first" partition, so when the new input_scale alias step shared data_ptrs, the encoder name won the dedup instead of the decoder name. _collect_canonical_tied_patterns now returns (patterns, side_substrings). The side_substrings list is auto-derived from each _tied_weights_keys entry as the set of dot-separated tokens that appear in canonical patterns but not in alias patterns. For DiffusionGemma4 this resolves to ["decoder"]: every canonical pattern contains "decoder", no alias pattern does. 
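One possible reading of the side-substring derivation, as a sketch only: keep tokens that occur in every canonical pattern and in no alias pattern. The helper names and the two example patterns are hypothetical, and the actual matcher in this commit may differ in detail.

```python
import re


def split_tokens(pattern):
    """Split a regex-style key pattern on literal or escaped dots; keep identifier-like parts."""
    return {tok for tok in re.split(r"\\?\.", pattern) if tok.isidentifier()}


def derive_side_substrings(tied):
    """tied maps alias pattern -> canonical pattern (dict-style declaration)."""
    if not tied:
        return []
    alias_tokens = set().union(*(split_tokens(a) for a in tied))
    canonical_sets = [split_tokens(c) for c in tied.values()]
    # Assumption: a side token must be present in every canonical pattern.
    in_every_canonical = set.intersection(*canonical_sets)
    return sorted(in_every_canonical - alias_tokens)


# Illustrative patterns, not copied from the model's real declaration.
tied = {
    r"encoder\.language_model\.layers\.(\d+)\.self_attn\.q_proj\.weight":
        r"decoder\.layers\.(\d+)\.self_attn\.q_proj\.weight",
    r"encoder\.language_model\.embed_tokens\.weight":
        r"decoder\.embed_tokens\.weight",
}
side_substrings = derive_side_substrings(tied)
```

Tokens shared by both sides, such as "weight" or "layers", drop out, which leaves only the token that distinguishes the canonical side.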
_reorder_canonical_first treats a key as canonical if it matches a regex pattern OR contains a side substring as a proper path component (bordered by "." or at start/end). The path-component requirement avoids false positives from accidental name collisions. Net effect for DiffusionGemma4 nvfp4_experts_only / v4 / calib_size 32: the 11 520 encoder.X.gate_proj/up_proj/down_proj.input_scale entries that the prior export carried are removed; the 11 520 decoder-side entries remain with the merged amax-derived value. Total bytes drops by ~1 MB (scalar entries). Other tied-tensor entries (weight, weight_scale, weight_scale_2) and encoder-only entries (layer_scalar, 30 keys) are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: Juhi Mittal --- modelopt/torch/export/moe_utils.py | 19 ++- modelopt/torch/export/unified_export_hf.py | 186 +++++++++++++++------ 2 files changed, 145 insertions(+), 60 deletions(-) diff --git a/modelopt/torch/export/moe_utils.py b/modelopt/torch/export/moe_utils.py index 059955f81c6..3b0e49fe54a 100644 --- a/modelopt/torch/export/moe_utils.py +++ b/modelopt/torch/export/moe_utils.py @@ -191,9 +191,11 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None: # 5. Tied-experts dedup: if this module's source params have been seen # before, alias the bit-identical per-expert buffers (weight, - # weight_scale, weight_scale_2) to the previously-unpacked module. - # input_scale is left per-side so encoder/decoder calibration stays - # accurate where their activation distributions diverge. + # weight_scale, weight_scale_2, input_scale) to the previously-unpacked + # module. input_scale is safe to alias because sync_tied_input_amax + # runs earlier in _export_transformers_checkpoint and max-merges the + # shared input_quantizer amaxes across tied fused-experts modules, so + # both sides now derive bit-identical input_scale values. 
_cache = _export_fused_experts.__dict__.setdefault("_tied_unpacked_cache", {}) _prior = _cache.get(_source_key) if _prior is not None and _prior is not module: @@ -212,9 +214,11 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None: # in postprocess_state_dict will drop the duplicate. if hasattr(_prior_proj, "weight"): _cur_proj.weight = _prior_proj.weight - # Alias the bit-identical scale buffers. Re-register to - # ensure data_ptr() matches the prior side's tensor. - for _attr in ("weight_scale", "weight_scale_2"): + # Alias the bit-identical scale buffers (including + # input_scale, made safe by sync_tied_input_amax pre-export + # merging). Re-register to ensure data_ptr() matches the + # prior side's tensor. + for _attr in ("weight_scale", "weight_scale_2", "input_scale"): if not hasattr(_prior_proj, _attr): continue if _attr in _cur_proj._buffers: @@ -222,9 +226,6 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None: elif hasattr(_cur_proj, _attr): delattr(_cur_proj, _attr) _cur_proj.register_buffer(_attr, getattr(_prior_proj, _attr)) - # input_scale intentionally NOT aliased — per-side amaxes - # are legitimately different (encoder vs decoder activation - # distributions diverge, sometimes >10x — see Q2 analysis). else: _cache[_source_key] = module diff --git a/modelopt/torch/export/unified_export_hf.py b/modelopt/torch/export/unified_export_hf.py index 72cf2dce4f3..89973facce1 100644 --- a/modelopt/torch/export/unified_export_hf.py +++ b/modelopt/torch/export/unified_export_hf.py @@ -719,22 +719,19 @@ def _export_quantized_weight( sub_module.register_buffer(quantizer_attrs.weight_scale, weight_scale) # Tied-weight dedup: if a previously-processed module shared the same - # source weight memory, alias the bit-identical packed weight and scale - # buffers from that prior module so downstream data_ptr-based dedup in - # postprocess_state_dict can collapse the duplicates. 
Per-side - # input_scale is intentionally NOT aliased — activation amaxes - # legitimately differ across tied modules whose forward paths see - # different distributions. + # source weight memory, alias the packed weight + scale buffers so the + # downstream data_ptr dedup in postprocess_state_dict can collapse them. + # input_scale is safe to alias because sync_tied_input_amax (earlier in + # this export) already max-merged the per-side amaxes. _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {}) _prior = _cache.get(_tied_source_data_ptr) if _prior is not None and _prior is not sub_module: - # Alias the packed weight (same nn.Parameter -> same data_ptr). if hasattr(_prior, weight_name): setattr(sub_module, weight_name, getattr(_prior, weight_name)) - # Alias bit-identical scale buffers (NOT input_scale). for _attr in ( quantizer_attrs.weight_scale, quantizer_attrs.weight_scale_2, + quantizer_attrs.input_scale, ): if _attr is None or not hasattr(_prior, _attr): continue @@ -749,64 +746,72 @@ def _export_quantized_weight( torch.cuda.empty_cache() -def _collect_canonical_tied_patterns(model: nn.Module) -> list[re.Pattern]: - """Walk the model and collect canonical-side tied-weight patterns. - - HF's ``_tied_weights_keys`` is declared per model class with paths - relative to that class. In nested models, each submodule may declare - its own ties (e.g. an outer wrapper ties ``lm_head.weight`` to - ``model.decoder.embed_tokens.weight``, while the inner - ``DiffusionGemma4Model`` at ``model.model`` declares the much larger - encoder↔decoder dict, with paths relative to itself such as - ``encoder.language_model.layers...weight`` ↔ ``decoder.layers...weight``). - - To match against the root model's state_dict keys we must prefix each - submodule's patterns with its qualified path (``model.``). 
Without this - prefix, the inner dict's patterns (which lack the ``model.`` prefix) - silently fail to match real keys like - ``model.decoder.layers.0.self_attn.q_proj.weight``. - - Returns a list of compiled regex patterns for the canonical side of - every dict-style ``_tied_weights_keys`` declaration found anywhere in - the module tree. List-style (legacy) declarations are skipped — they - carry no canonical/alias distinction. +def _collect_canonical_tied_patterns( + model: nn.Module, +) -> tuple[list[re.Pattern], list[str]]: + """Walk the model and collect canonical-side tied-weight matchers. + + Patterns are submodule-prefixed regexes from each module's + ``_tied_weights_keys`` dict-style declaration (the prefix matters + for nested models where the dict lives on an inner submodule). + Side substrings are dot-separated tokens that appear only on the + canonical side of those declarations — needed because modelopt's + per-expert unpacking creates post-export keys (e.g. + ``…experts.Y.gate_proj.input_scale``) that HF's regexes never knew + about. List-style (legacy) declarations are skipped. """ patterns: list[re.Pattern] = [] + alias_token_set: set[str] = set() + canonical_token_set: set[str] = set() + + def _tokens(s: str) -> set[str]: + """Identifiers in a regex string, with regex specials as separators.""" + return {tok for tok in re.split(r"[^A-Za-z0-9_]+", s) if tok} + for name, submodule in model.named_modules(): tied = getattr(submodule, "_tied_weights_keys", None) if not isinstance(tied, dict) or not tied: continue prefix = f"{name}." if name else "" - patterns.extend(re.compile(prefix + p) for p in tied.values()) - return patterns + for alias_pat, canonical_pat in tied.items(): + patterns.append(re.compile(prefix + canonical_pat)) + alias_token_set.update(_tokens(prefix + alias_pat)) + canonical_token_set.update(_tokens(prefix + canonical_pat)) + + # Tokens unique to the canonical side become substring matchers. 
+ side_substrings = sorted(canonical_token_set - alias_token_set) + return patterns, side_substrings def _reorder_canonical_first(state_dict: dict, model: nn.Module) -> dict: - """Reorder ``state_dict`` so canonical-side tied keys iterate first. - - For models that declare ``_tied_weights_keys`` as a ``{alias_pattern: - canonical_pattern}`` dict (newer HF style, e.g. ``DiffusionGemma4``), - HF designates one side of each tied pair as canonical and the other - as an alias. The downstream data_ptr dedup in - :func:`postprocess_state_dict` keeps whichever key it sees first per - ``data_ptr``, which by default is registration order — and that is - often the alias side, not the canonical side declared by HF. - - This helper rebuilds the dict with canonical-pattern-matching keys - moved to the front (preserving original order within each partition), - so the existing first-wins dedup picks the canonical side. - - No-op when the model declares no dict-style ``_tied_weights_keys`` - anywhere in its module tree (i.e. only legacy list-of-strings - declarations, or no ties at all). + r"""Reorder ``state_dict`` so canonical-side tied keys iterate first. + + Lets the downstream first-wins data_ptr dedup keep canonical names. + Uses both regex patterns and substring matchers from + :func:`_collect_canonical_tied_patterns`. No-op when the model + declares no dict-style ``_tied_weights_keys``. """ - canonical_patterns = _collect_canonical_tied_patterns(model) - if not canonical_patterns: + canonical_patterns, side_substrings = _collect_canonical_tied_patterns(model) + if not canonical_patterns and not side_substrings: return state_dict + + def _has_side_substring(key: str) -> bool: + # Require the token to appear as a proper dot-separated path + # component, not just as a substring of an unrelated identifier. + for tok in side_substrings: + if ( + f".{tok}." 
in key + or key.startswith(f"{tok}.") + or key.endswith(f".{tok}") + or key == tok + ): + return True + return False + head: dict = {} tail: dict = {} for k, v in state_dict.items(): - if any(p.search(k) for p in canonical_patterns): + if any(p.search(k) for p in canonical_patterns) or _has_side_substring(k): head[k] = v else: tail[k] = v @@ -814,6 +819,76 @@ def _reorder_canonical_first(state_dict: dict, model: nn.Module) -> dict: return head +def sync_tied_input_amax(model: nn.Module) -> int: + """Max-merge input_quantizer amaxes across modules sharing a weight ``data_ptr``. + + Closes the loop on ``input_scale`` for HF-tied modules whose forward + paths see different activation distributions (encoder vs decoder in + YOCO-style models). Must run BEFORE per-module export so the merged + amax flows into ``input_scale`` derivation. Handles both dense + Linears (keyed by ``weight.data_ptr()``) and fused MoE (keyed by + ``(gate_up_proj, down_proj)`` data_ptr tuple). Returns the number of + tied groups merged. + """ + from collections import defaultdict + + by_dp: dict = defaultdict(list) + for _, m in model.named_modules(): + # Fused MoE: 3-D source tensors with shared input quantizers + if ( + hasattr(m, "gate_up_proj_input_quantizer") + and hasattr(m, "gate_up_proj") + and hasattr(m, "down_proj") + and m.gate_up_proj.dim() == 3 + ): + key = ("moe", m.gate_up_proj.data_ptr(), m.down_proj.data_ptr()) + by_dp[key].append(m) + # Dense quantized Linear with an input_quantizer + elif ( + hasattr(m, "input_quantizer") + and hasattr(m, "weight") + and isinstance(m.weight, torch.nn.Parameter) + ): + by_dp[("dense", m.weight.data_ptr())].append(m) + + def _merge(quantizers: list) -> bool: + """Max-merge amaxes across the quantizer list. 
Returns True on merge.""" + valid = [ + q + for q in quantizers + if q is not None + and getattr(q, "is_enabled", False) + and getattr(q, "_amax", None) is not None + and not q._amax.is_meta + ] + if len(valid) < 2: + return False + # Require scalar (per-tensor) amax — matches preprocess_linear_fusion. + if any(q._amax.numel() != 1 for q in valid): + warnings.warn( + "sync_tied_input_amax: non-scalar input_quantizer amax encountered " + "in a tied group; skipping. Only per-tensor input quantizers are " + "supported for tied-modules merging." + ) + return False + merged = torch.max(torch.stack([q.amax for q in valid])) + for q in valid: + q.amax = merged.clone() + return True + + synced = 0 + for key, modules in by_dp.items(): + if len(modules) < 2: + continue + if key[0] == "moe": + for q_name in ("gate_up_proj_input_quantizer", "down_proj_input_quantizer"): + if _merge([getattr(m, q_name, None) for m in modules]): + synced += 1 + elif _merge([m.input_quantizer for m in modules]): + synced += 1 + return synced + + def _process_quantized_modules( model: nn.Module, dtype: torch.dtype, @@ -1047,6 +1122,15 @@ def _export_transformers_checkpoint( f"Taking element-wise max of amaxes for serving-engine fusion." ) + # Merge per-side input_quantizer amaxes BEFORE _process_quantized_modules, + # so the merged value flows into input_scale derivation downstream. 
+ synced_input = sync_tied_input_amax(model) + if synced_input: + print( + f"sync_tied_input_amax: max-merged input_quantizer amaxes across " + f"{synced_input} tied module group(s)" + ) + # Process all quantized modules and export weights _process_quantized_modules(model, dtype, is_modelopt_qlora) From d684477e877f5831f0c4825d60aba757c797a698 Mon Sep 17 00:00:00 2001 From: Juhi Mittal Date: Fri, 5 Jun 2026 21:55:41 +0000 Subject: [PATCH 6/8] quantization: exclude self_conditioning from default disabled_quantizers The diffusion self-conditioning network (block-diffusion models like DiffusionGemma) is text-only and not exercised by typical calibration data. Without exclusion its TensorQuantizers never see input, never set _amax, and export crashes at _export_quantized_weight: AttributeError: 'TensorQuantizer' object has no attribute '_amax' Companion to the upstream vision-tower / visual / embed_vision excludes already in this unit (PR #1691). Pattern is a no-op for non-diffusion models. Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: Juhi Mittal --- .../configs/ptq/units/default_disabled_quantizers.yaml | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml b/modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml index e2efcb5142d..776ceeb9c72 100644 --- a/modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml +++ b/modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml @@ -73,3 +73,9 @@ - parent_class: 'nn.Embedding' quantizer_name: '*' enable: false + # Diffusion self-conditioning network: text-only and not exercised by + # typical calibration; without exclusion its TensorQuantizers never see + # input and export crashes with "AttributeError: '...' has no attribute + # '_amax'". Companion to the vision excludes above. 
+ - quantizer_name: '*self_conditioning*' + enable: false From d0a735ef0b241fbacbffb77c71e9800fa7a9fbe2 Mon Sep 17 00:00:00 2001 From: Juhi Mittal Date: Sat, 13 Jun 2026 00:59:20 +0000 Subject: [PATCH 7/8] tests: cover tied-weight dedup, canonical reorder, and input-amax sync Adds the tied-modules test fixture plus 10 unit tests covering the tied-weight machinery introduced earlier in this series. tests/_test_utils/torch/quantization/tied_modules.py (new): Three small factory helpers shared by the unit tests: - make_tied_linear_pair() -- two nn.Linears whose .weight Parameter is shared via setattr (mimics HF tie_weights() after __init__). - tie_fused_experts_3d_params(enc, dec) -- in-place tie of gate_up_proj / down_proj between two fused-experts modules (paired with the existing _SyntheticFusedExperts fixture). - wrap_in_parent_with_tied_keys(enc, dec, ...) -- builds a parent nn.Module with HF-style _tied_weights_keys (dict-style for the canonical case, list-style for the legacy negative case). Each factory asserts post-conditions on the tie so a misuse fails loudly at construction. 
tests/unit/torch/export/test_unified_export_hf.py (new): 8 tests
  Commit f3e9543ab -- canonical-side reorder:
    - dict-style _tied_weights_keys yields patterns + canonical substrings
    - list-style yields no canonical info (reorder becomes a no-op)
    - _reorder_canonical_first puts decoder-side keys ahead of encoder-side keys
  Commit 3fb3ba053 -- sync_tied_input_amax:
    - tied Linears with divergent amaxes (2.0 vs 5.0) get both sides overwritten with the elementwise max (5.0)
    - untied Linears keep per-side amaxes (no-op when there's no tie)
  Commit 29674a7e1 -- dense Linear tied-weight dedup:
    - tied Linears share data_ptr for packed .weight + scale buffers
    - untied Linears keep independent data_ptrs
    - asymmetric quant: unquantized side early-returns at QUANTIZATION_NONE, stays at the original shared Parameter

tests/unit/torch/quantization/plugins/test_fused_experts.py (extended): 2 tests
  Commit 10a8fdbd5 -- MoE experts dedup:
    - two _SyntheticSparseMoeBlock instances with tied 3-D source params share data_ptr across every per-expert buffer
    - untied counterparts keep independent per-expert data_ptrs

Pure-Python; CPU-only; ~1s wall total.

Co-Authored-By: Claude Opus 4.7 (1M context)
Signed-off-by: Juhi Mittal
---
 .../torch/quantization/tied_modules.py | 115 +++++++++++
 .../torch/export/test_unified_export_hf.py | 184 ++++++++++++++++++
 .../plugins/test_fused_experts.py | 111 +++++++++++
 3 files changed, 410 insertions(+)
 create mode 100644 tests/_test_utils/torch/quantization/tied_modules.py
 create mode 100644 tests/unit/torch/export/test_unified_export_hf.py

diff --git a/tests/_test_utils/torch/quantization/tied_modules.py b/tests/_test_utils/torch/quantization/tied_modules.py
new file mode 100644
index 00000000000..8ea76d2d459
--- /dev/null
+++ b/tests/_test_utils/torch/quantization/tied_modules.py
@@ -0,0 +1,115 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Factories for tied-weight test scenarios. + +These build small synthetic modules whose ``.weight`` :class:`nn.Parameter` is +shared between two sibling modules — mimicking HuggingFace's +``_tied_weights_keys`` machinery — for unit-testing the export-time dedup, +canonical-side naming, and per-side ``input_quantizer.amax`` merge logic in +the HF export path. + +Every factory returns CPU-resident, float32-default modules; no GPU required. +Each factory asserts its own post-conditions before returning, so a broken +tie surfaces as a clear factory-side error rather than as a downstream test +failure with an ambiguous cause. +""" + +import re + +import torch.nn as nn + + +def make_tied_linear_pair( + in_features: int = 16, + out_features: int = 32, + bias: bool = False, +) -> tuple[nn.Linear, nn.Linear]: + """Two :class:`nn.Linear` modules whose ``.weight`` Parameter is shared. + + Mimics what HuggingFace's :meth:`PreTrainedModel.tie_weights` does after + ``__init__``: one extra ``setattr`` so that both modules' ``.weight`` + attributes resolve to the same :class:`nn.Parameter` and therefore the + same underlying storage. The modules are otherwise independent — separate + biases (if requested), separate forward/training state, separate + quantizer slots when ``mtq.quantize`` inserts them later. 
+ """ + enc = nn.Linear(in_features, out_features, bias=bias) + dec = nn.Linear(in_features, out_features, bias=bias) + dec.weight = enc.weight # mimics HF tie_weights() + + # Post-conditions — fail loudly if the tie was somehow lost. + assert enc.weight is dec.weight, "Linear weights not tied (object identity)" + assert enc.weight.data_ptr() == dec.weight.data_ptr(), ( + "Linear weights tied at object level but storage diverged" + ) + return enc, dec + + +def tie_fused_experts_3d_params(enc: nn.Module, dec: nn.Module) -> None: + """Tie ``gate_up_proj`` and ``down_proj`` between two fused-experts modules. + + Mutates ``dec`` in place. After calling, ``dec.gate_up_proj`` IS + ``enc.gate_up_proj`` (same :class:`nn.Parameter`) and likewise for + ``down_proj``. Used by MoE-dedup tests together with the + ``_SyntheticFusedExperts`` fixture defined in + ``tests/unit/torch/quantization/plugins/test_fused_experts.py``. + """ + dec.gate_up_proj = enc.gate_up_proj + dec.down_proj = enc.down_proj + + assert enc.gate_up_proj is dec.gate_up_proj, "gate_up_proj not tied" + assert enc.down_proj is dec.down_proj, "down_proj not tied" + assert enc.gate_up_proj.data_ptr() == dec.gate_up_proj.data_ptr() + assert enc.down_proj.data_ptr() == dec.down_proj.data_ptr() + + +def wrap_in_parent_with_tied_keys( + enc: nn.Module, + dec: nn.Module, + *, + decoder_canonical: bool = True, + weight_attr: str = "weight", +) -> nn.Module: + """Wrap two tied modules in a parent that declares HF ``_tied_weights_keys``. + + Returns a parent :class:`nn.Module` with: + + - ``parent.encoder = enc`` — registered as a submodule (alias side). + - ``parent.decoder = dec`` — registered as a submodule (canonical side + when ``decoder_canonical=True``, the default and DiffusionGemma-like case). + - ``parent._tied_weights_keys``: dict-style ``{alias_regex: canonical}`` + when ``decoder_canonical=True``, list-style (legacy, no canonical/alias + distinction) when ``decoder_canonical=False``. 
+ + Used by tests for :func:`_collect_canonical_tied_patterns` and + :func:`_reorder_canonical_first`. The legacy list-style branch exercises + the "no patterns extracted" negative case. + """ + parent = nn.Module() + parent.encoder = enc + parent.decoder = dec + + if decoder_canonical: + # Dict-style: regex pattern → canonical path. Mimics HF's per-class + # ``_tied_weights_keys`` declaration for an encoder/decoder model. + parent._tied_weights_keys = { + rf"^encoder\.{re.escape(weight_attr)}$": f"decoder.{weight_attr}", + } + else: + # Legacy list-style: just a list of tied paths, no canonical info. + parent._tied_weights_keys = [f"encoder.{weight_attr}"] + + return parent diff --git a/tests/unit/torch/export/test_unified_export_hf.py b/tests/unit/torch/export/test_unified_export_hf.py new file mode 100644 index 00000000000..1d08a8ef620 --- /dev/null +++ b/tests/unit/torch/export/test_unified_export_hf.py @@ -0,0 +1,184 @@ +# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Tests for tied-weight helpers in unified_export_hf.""" + +from collections import OrderedDict + +import torch +from _test_utils.torch.quantization.tied_modules import ( + make_tied_linear_pair, + wrap_in_parent_with_tied_keys, +) + +import modelopt.torch.quantization as mtq +from modelopt.torch.export.unified_export_hf import ( + _collect_canonical_tied_patterns, + _export_quantized_weight, + _reorder_canonical_first, + sync_tied_input_amax, +) + + +def test_collect_canonical_tied_patterns_dict_style(): + """Dict-style _tied_weights_keys yields regex patterns + canonical-side substrings.""" + enc, dec = make_tied_linear_pair() + parent = wrap_in_parent_with_tied_keys(enc, dec, decoder_canonical=True) + + patterns, side_substrings = _collect_canonical_tied_patterns(parent) + + assert len(patterns) >= 1 + # "decoder" is in the canonical RHS but not the alias LHS — must auto-derive. + # "encoder" is alias-only and must NOT be returned as canonical (would invert dedup). + assert "decoder" in side_substrings + assert "encoder" not in side_substrings + + +def test_collect_canonical_tied_patterns_list_style_yields_no_canonical_info(): + """Legacy list-style _tied_weights_keys carries no canonical/alias info — returns empty.""" + enc, dec = make_tied_linear_pair() + parent = wrap_in_parent_with_tied_keys(enc, dec, decoder_canonical=False) + + patterns, side_substrings = _collect_canonical_tied_patterns(parent) + + assert patterns == [] + assert side_substrings == [] + + +def test_reorder_canonical_first_puts_decoder_keys_before_encoder_keys(): + """_reorder_canonical_first moves canonical-side state_dict keys ahead of alias-side keys.""" + enc, dec = make_tied_linear_pair() + parent = wrap_in_parent_with_tied_keys(enc, dec, decoder_canonical=True) + + sd = OrderedDict( + [ + ("encoder.weight", torch.zeros(1)), + ("unrelated.foo", torch.zeros(1)), + ("decoder.weight", torch.zeros(1)), + ] + ) + + reordered = _reorder_canonical_first(sd, parent) + keys = 
list(reordered.keys()) + + assert keys.index("decoder.weight") < keys.index("encoder.weight") + assert set(reordered) == set(sd) # no drops or additions + + +def _quantize_and_get_input_quantizers(parent): + """Insert FP8 quantizers via no-op forward_loop and return both input_quantizers.""" + mtq.quantize(parent, mtq.FP8_DEFAULT_CFG, forward_loop=lambda m: None) + return parent.encoder.input_quantizer, parent.decoder.input_quantizer + + +def test_sync_tied_input_amax_max_merges_tied_module_amaxes_in_place(): + """Tied Linears with divergent input_quantizer.amax get both sides overwritten with the max.""" + enc, dec = make_tied_linear_pair() + parent = wrap_in_parent_with_tied_keys(enc, dec, decoder_canonical=True) + enc_q, dec_q = _quantize_and_get_input_quantizers(parent) + + enc_q.amax = torch.tensor(2.0) + dec_q.amax = torch.tensor(5.0) + + sync_tied_input_amax(parent) + + expected = torch.tensor(5.0) + assert torch.allclose(enc_q.amax, expected) + assert torch.allclose(dec_q.amax, expected) + + +def test_sync_tied_input_amax_no_op_for_untied_modules(): + """Untied Linears keep their per-side amaxes — the helper is a no-op when there's no tie.""" + parent = torch.nn.Module() + parent.encoder = torch.nn.Linear(16, 32, bias=False) + parent.decoder = torch.nn.Linear(16, 32, bias=False) + enc_q, dec_q = _quantize_and_get_input_quantizers(parent) + + enc_q.amax = torch.tensor(2.0) + dec_q.amax = torch.tensor(5.0) + + sync_tied_input_amax(parent) + + assert torch.allclose(enc_q.amax, torch.tensor(2.0)) + assert torch.allclose(dec_q.amax, torch.tensor(5.0)) + + +def _clear_export_quantized_weight_cache(): + """Clear the function-static alias cache; isolates each test from prior session state.""" + _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None) + + +def _calibrate_through_both_children(parent): + """Insert NVFP4 quantizers and run a one-shot forward through both children for calibration.""" + + def forward_loop(m): + x = torch.randn(2, 16) + 
m.encoder(x) + m.decoder(x) + + mtq.quantize(parent, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop) + + +def test_export_quantized_weight_aliases_packed_weight_for_tied_linears(): + """Tied Linears share data_ptr for packed .weight and scale buffers after export.""" + _clear_export_quantized_weight_cache() + + enc, dec = make_tied_linear_pair() + parent = wrap_in_parent_with_tied_keys(enc, dec) + _calibrate_through_both_children(parent) + + _export_quantized_weight(enc, torch.float16, "weight") + _export_quantized_weight(dec, torch.float16, "weight") + + assert enc.weight.data_ptr() == dec.weight.data_ptr() + for scale_attr in ("weight_scale", "weight_scale_2"): + if hasattr(enc, scale_attr) and hasattr(dec, scale_attr): + assert getattr(enc, scale_attr).data_ptr() == getattr(dec, scale_attr).data_ptr() + + +def test_export_quantized_weight_no_alias_for_untied_linears(): + """Untied Linears keep independent data_ptrs after export — no false-positive aliasing.""" + _clear_export_quantized_weight_cache() + + parent = torch.nn.Module() + parent.encoder = torch.nn.Linear(16, 32, bias=False) + parent.decoder = torch.nn.Linear(16, 32, bias=False) + assert parent.encoder.weight.data_ptr() != parent.decoder.weight.data_ptr() + _calibrate_through_both_children(parent) + + _export_quantized_weight(parent.encoder, torch.float16, "weight") + _export_quantized_weight(parent.decoder, torch.float16, "weight") + + assert parent.encoder.weight.data_ptr() != parent.decoder.weight.data_ptr() + + +def test_export_quantized_weight_skips_alias_when_one_tied_side_is_unquantized(): + """Unquantized side early-returns; its .weight stays at the original shared Parameter.""" + _clear_export_quantized_weight_cache() + + enc, dec = make_tied_linear_pair() + parent = wrap_in_parent_with_tied_keys(enc, dec) + original_shared_data_ptr = enc.weight.data_ptr() + + _calibrate_through_both_children(parent) + # is_enabled is a read-only property; .disable() is the canonical bypass. 
+ dec.weight_quantizer.disable() + + _export_quantized_weight(enc, torch.float16, "weight") + _export_quantized_weight(dec, torch.float16, "weight") + + assert enc.weight.data_ptr() != original_shared_data_ptr # encoder got fresh packed + assert dec.weight.data_ptr() == original_shared_data_ptr # decoder untouched + assert enc.weight.data_ptr() != dec.weight.data_ptr() diff --git a/tests/unit/torch/quantization/plugins/test_fused_experts.py b/tests/unit/torch/quantization/plugins/test_fused_experts.py index ce23f7a51d5..ef2df36090e 100644 --- a/tests/unit/torch/quantization/plugins/test_fused_experts.py +++ b/tests/unit/torch/quantization/plugins/test_fused_experts.py @@ -22,6 +22,8 @@ pytest.importorskip("transformers") +from _test_utils.torch.quantization.tied_modules import tie_fused_experts_3d_params + import modelopt.torch.quantization as mtq from modelopt.torch.export.moe_utils import _export_fused_experts from modelopt.torch.export.quant_utils import get_quant_config @@ -514,6 +516,115 @@ def _spy_export(wrapper, dtype): QuantModuleRegistry.unregister(expert_type) +# --------------------------------------------------------------------------- +# Tests for tied-experts dedup in _export_fused_experts +# --------------------------------------------------------------------------- +def _build_two_moe_blocks(tie: bool) -> nn.Module: + """Build a parent with two _SyntheticSparseMoeBlock children, optionally with tied 3-D params.""" + parent = nn.Module() + parent.encoder = _SyntheticSparseMoeBlock() + parent.decoder = _SyntheticSparseMoeBlock() + if tie: + tie_fused_experts_3d_params(parent.encoder.experts, parent.decoder.experts) + return parent + + +def _moe_fp8_quant_cfg(): + """Custom inline FP8 cfg targeting the MoE-specific quantizer names.""" + return { + "quant_cfg": [ + {"quantizer_name": "*", "enable": False}, + { + "quantizer_name": "*gate_up_proj_input_quantizer", + "cfg": {"num_bits": 8, "axis": None}, + }, + {"quantizer_name": 
"*down_proj_input_quantizer", "cfg": {"num_bits": 8, "axis": None}}, + {"quantizer_name": "*gate_up_proj_weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}}, + {"quantizer_name": "*down_proj_weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}}, + ], + "algorithm": "max", + } + + +def _calibrate_two_moe_blocks(parent): + """Fire one calibration batch through both encoder.experts and decoder.experts.""" + + def forward_loop(m): + torch.manual_seed(0) + x = torch.randn(1, 4, HIDDEN_DIM) + m.encoder(x) + m.decoder(x) + + mtq.quantize(parent, _moe_fp8_quant_cfg(), forward_loop=forward_loop) + + +def _clear_fused_experts_caches(): + """Clear function-static alias caches in both export entry points.""" + _export_fused_experts.__dict__.pop("_tied_unpacked_cache", None) + # _export_fused_experts internally calls _export_quantized_weight per per-expert + # wrapper; clear that cache too so each test sees a pristine state. + from modelopt.torch.export.unified_export_hf import _export_quantized_weight + + _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None) + + +class TestExportFusedExpertsTiedDedup: + @staticmethod + def _cleanup_registry(mod_type): + if QuantModuleRegistry.get(mod_type) is not None: + QuantModuleRegistry.unregister(mod_type) + + def test_per_expert_buffers_share_data_ptr_for_tied_fused_experts(self): + """Two tied FusedExperts modules: every per-expert .weight + scale buffer shares data_ptr.""" + _clear_fused_experts_caches() + parent = _build_two_moe_blocks(tie=True) + expert_type = type(parent.encoder.experts) + self._cleanup_registry(expert_type) + try: + _calibrate_two_moe_blocks(parent) + + _export_fused_experts(parent.encoder.experts, torch.float16) + _export_fused_experts(parent.decoder.experts, torch.float16) + + for idx in range(NUM_EXPERTS): + enc_expert = getattr(parent.encoder.experts, str(idx)) + dec_expert = getattr(parent.decoder.experts, str(idx)) + for proj_name in ("gate_proj", "up_proj", "down_proj"): + enc_proj = 
getattr(enc_expert, proj_name) + dec_proj = getattr(dec_expert, proj_name) + assert enc_proj.weight.data_ptr() == dec_proj.weight.data_ptr() + for scale_attr in ("weight_scale", "weight_scale_2"): + if hasattr(enc_proj, scale_attr) and hasattr(dec_proj, scale_attr): + assert ( + getattr(enc_proj, scale_attr).data_ptr() + == getattr(dec_proj, scale_attr).data_ptr() + ) + finally: + self._cleanup_registry(expert_type) + + def test_per_expert_buffers_have_independent_data_ptrs_for_untied_fused_experts(self): + """Two untied FusedExperts modules: per-expert buffers stay independent (no false-positive alias).""" + _clear_fused_experts_caches() + parent = _build_two_moe_blocks(tie=False) + expert_type = type(parent.encoder.experts) + self._cleanup_registry(expert_type) + try: + _calibrate_two_moe_blocks(parent) + + _export_fused_experts(parent.encoder.experts, torch.float16) + _export_fused_experts(parent.decoder.experts, torch.float16) + + for idx in range(NUM_EXPERTS): + enc_expert = getattr(parent.encoder.experts, str(idx)) + dec_expert = getattr(parent.decoder.experts, str(idx)) + for proj_name in ("gate_proj", "up_proj", "down_proj"): + enc_proj = getattr(enc_expert, proj_name) + dec_proj = getattr(dec_expert, proj_name) + assert enc_proj.weight.data_ptr() != dec_proj.weight.data_ptr() + finally: + self._cleanup_registry(expert_type) + + # --------------------------------------------------------------------------- # Tests for force_eager_experts_impl_on_the_fly # --------------------------------------------------------------------------- From 0543907bec98a4cd9a20fb19f9c6e8f1dc6bb9d0 Mon Sep 17 00:00:00 2001 From: Juhi Mittal Date: Sat, 13 Jun 2026 01:00:01 +0000 Subject: [PATCH 8/8] docs: add CHANGELOG entry for tied-weight PTQ support Adds a New Features bullet under 0.46 covering the tied-weight dedup, canonical-side reorder, sync_tied_input_amax helper, and the *self_conditioning* default exclude introduced earlier in this series. 

Co-Authored-By: Claude Opus 4.7 (1M context)
Signed-off-by: Juhi Mittal
---
 CHANGELOG.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 49c58586674..b218cdbb5bb 100755
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -8,6 +8,7 @@ Changelog
 - Add the ``day0-release`` agent skill (``.agents/skills/day0-release/``), a deterministic end-to-end driver that chains the PTQ → evaluation → comparison skills (the evaluation stage deploys the checkpoint itself) with an enforced gate after each stage and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Ships three GPU-free, unit-tested gate scripts (``gate_ptq.py``, ``gate_run.py``, ``gate_compare.py``) that validate checkpoint coverage, evaluation-run completeness, and baseline-vs-candidate accuracy threshold. v1 reports and stops on regression; the recipe-search loop is deferred.
 - Add **streaming** speculative-decoding training (EAGLE3 / DFlash): the draft trains on base-model hidden states produced on the fly by a co-located ``vllm serve`` (no disk dump), moved trainer-side over NIXL RDMA, scaling to multi-node (dedicated serve replicas + DDP trainers). New launcher examples for NVFP4 Kimi-K2.5 / K2.6 on GB200/aarch64 under ``tools/launcher/examples/moonshotai/``.
+- Add tied-weight PTQ and HF-checkpoint export support for block-diffusion encoder-decoder LLMs (e.g. DiffusionGemma) whose encoder/decoder stacks share parameters via HF ``_tied_weights_keys``. ``_export_quantized_weight`` and ``_export_fused_experts`` now alias bit-identical packed ``weight`` / ``weight_scale`` / ``weight_scale_2`` buffers across modules sharing a source weight ``data_ptr()`` so the downstream ``postprocess_state_dict`` dedup catches them (~42% storage reduction on ``nvfp4_experts_only`` for tied 26B MoE checkpoints). New ``sync_tied_input_amax`` helper max-merges per-side ``input_quantizer.amax`` across tied modules before export so single-backbone consumers that load one ``input_scale`` per parameter don't clip either side. Opt-in ``--canonical_tied_naming`` flag (default off) reorders the state_dict so canonical-side keys per HF's ``_tied_weights_keys`` declaration win the data_ptr dedup. ``default_disabled_quantizers`` gains a ``*self_conditioning*`` wildcard companion to the upstream vision excludes (PR #1691). ``hf_ptq.py`` also unwraps ``ModelOutput`` dataclasses from ``.generate()`` so the preview decode works on diffusion models. Non-tied models see no behavioral change.
 
 0.45 (2026-06-xx)
 ^^^^^^^^^^^^^^^^^