fix(quantization): detect fused MoE experts without act_fn (MiniMax-M3)#1711

Open
Edwardf0t1 wants to merge 1 commit into main from fix/fused-experts-no-act-fn

@Edwardf0t1 Edwardf0t1 commented Jun 14, 2026

Contributor

What does this PR do?

Type of change: Bug fix

Fused MoE expert auto-detection (register_fused_experts_on_the_fly, via its _is_fused_experts_module check) required every fused-expert container to expose an act_fn attribute. MiniMaxM3VLExperts (transformers 5.12.0) does not expose act_fn. Instead, it applies a custom GPT-OSS-style gated activation (_apply_gate, swiglu with clamp/alpha) between its two F.linear calls, so it failed detection and was never wrapped as _QuantFusedExperts. Consequences:

  • routed experts stayed unquantized — an experts-only recipe (*mlp.experts*) matched nothing (only KV-cache quant applied), and
  • HF export raised NotImplementedError: MoE model with experts type 'MiniMaxM3VLExperts' is not supported in export.

_QuantFusedExperts is activation-agnostic — it only intercepts the two F.linear calls (gate_up then down, in strict alternation) and never touches act_fn. So the act_fn requirement was unnecessary. This PR drops it (keeping the num_experts + 3-D gate_up_proj/down_proj checks), which enables NVFP4/FP8 PTQ and export for MiniMax-M2 / MiniMax-M3.
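
To illustrate why intercepting only the two linear calls makes the wrapper independent of the activation, here is a minimal, framework-free sketch. It is not the _QuantFusedExperts implementation: `linear`, `AlternatingLinearInterceptor`, `fake_quantize`, and both activation functions are hypothetical stand-ins, the gate/up split by halves is a simplification, and the clamped activation is only loosely modeled on the GPT-OSS-style gate mentioned above.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weight: Matrix) -> Vector:
    """Plain stand-in for F.linear without bias: y = W @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fake_quantize(row: Vector) -> Vector:
    """Placeholder weight quantizer (rounding); not NVFP4 or FP8."""
    return [round(w, 2) for w in row]


class AlternatingLinearInterceptor:
    """Treats calls as gate_up, down, gate_up, down, ... in strict alternation."""

    def __init__(self, fn: Callable[[Vector, Matrix], Vector]) -> None:
        self._fn = fn
        self._calls = 0
        self.log: List[str] = []

    def __call__(self, x: Vector, weight: Matrix) -> Vector:
        # Even-numbered calls are gate_up, odd-numbered calls are down.
        self.log.append(("gate_up", "down")[self._calls % 2])
        self._calls += 1
        return self._fn(x, [fake_quantize(row) for row in weight])


def silu_gate(gate: Vector, up: Vector) -> Vector:
    """Conventional SwiGLU-style gate: silu(gate) * up."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


def clamped_gate(gate: Vector, up: Vector, limit: float = 7.0, alpha: float = 1.702) -> Vector:
    """Clamped variant with an alpha-scaled sigmoid (assumed shape of a custom gate)."""
    out = []
    for g, u in zip(gate, up):
        g = min(g, limit)
        u = max(-limit, min(u, limit))
        out.append(g / (1.0 + math.exp(-alpha * g)) * (u + 1.0))
    return out


def experts_forward(x: Vector, gate_up_w: Matrix, down_w: Matrix, linear_fn, activation) -> Vector:
    hidden = linear_fn(x, gate_up_w)               # first linear call: gate_up
    half = len(hidden) // 2
    act = activation(hidden[:half], hidden[half:])  # never seen by the interceptor
    return linear_fn(act, down_w)                  # second linear call: down


GATE_UP_W = [[0.11, 0.2], [0.3, -0.4], [0.5, 0.6], [-0.7, 0.8]]
DOWN_W = [[1.0, 0.5], [-0.25, 0.75]]


def run_once(activation) -> List[str]:
    tap = AlternatingLinearInterceptor(linear)
    experts_forward([1.0, -2.0], GATE_UP_W, DOWN_W, tap, activation)
    return tap.log
```

In this sketch the interceptor records the same gate_up/down sequence for both activations, since the activation runs entirely between the two intercepted calls.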

Usage

# MiniMax-M3 experts-only NVFP4 + FP8 KV now quantizes and exports:
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path MiniMaxAI/MiniMax-M3 \
    --recipe general/ptq/nvfp4_experts_only_mse-kv_fp8_cast \
    --calib_size 512 --export_path ./m3-nvfp4

Testing

  • Updated tests/unit/torch/quantization/plugins/test_fused_experts.py: the previous test_module_missing_act_fn_not_detected (which asserted the old, now-incorrect behavior) is replaced by test_module_missing_act_fn_still_detected, asserting that a fused-expert module without act_fn is detected. The negative cases (2-D gate_up, plain nn.Linear) are still rejected.
  • Verified end-to-end on MiniMaxAI/MiniMax-M3 (~428B-A23B, transformers 5.12.0, 8×B300): detection logs Detected fused MoE experts ... of type MiniMaxM3VLExperts, all 57 MoE layers' experts quantize to NVFP4 (21,888 expert weights with scales; 0 on attention/shared-experts/vision tower), KV cache to FP8, and the HF checkpoint exports successfully (854 GB → 260 GB).
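
The detection cases listed under Testing can be sketched without PyTorch as follows. This is an illustrative approximation, not the code in huggingface.py or the actual unit test: `FakeParam` stands in for a tensor parameter, `is_fused_experts_module` is a hypothetical reimplementation of the relaxed check (num_experts plus 3-D gate_up_proj/down_proj, no act_fn), and the shapes are made up.

```python
from types import SimpleNamespace


class FakeParam:
    """Stand-in for a tensor parameter; only its number of dimensions matters here."""

    def __init__(self, *shape: int) -> None:
        self.shape = shape

    @property
    def ndim(self) -> int:
        return len(self.shape)


def is_fused_experts_module(module) -> bool:
    """Relaxed check (assumed form): no act_fn requirement."""
    gate_up = getattr(module, "gate_up_proj", None)
    down = getattr(module, "down_proj", None)
    return (
        hasattr(module, "num_experts")
        and getattr(gate_up, "ndim", None) == 3
        and getattr(down, "ndim", None) == 3
    )


# Fused experts without act_fn: should now be detected.
fused_no_act_fn = SimpleNamespace(
    num_experts=4,
    gate_up_proj=FakeParam(4, 16, 8),
    down_proj=FakeParam(4, 8, 8),
)

# Negative case: gate_up_proj is 2-D, so this is not a fused container.
two_dim_gate_up = SimpleNamespace(
    num_experts=4,
    gate_up_proj=FakeParam(16, 8),
    down_proj=FakeParam(4, 8, 8),
)

# Negative case: a plain linear-like module with only a 2-D weight.
plain_linear = SimpleNamespace(weight=FakeParam(8, 8))
```

Checking `hasattr(fused_no_act_fn, "act_fn")` confirms that the positive case really lacks the attribute the old check required.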

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅ (updated the inverted detection test)
  • Did you update Changelog?: ✅ (0.46 Bug Fixes)
  • Did you get Claude approval on this PR?: ❌ (pending)

Additional Information

conversion.py::_normalize_fused_experts_quantizer_name already maps the per-expert gate_up_proj_weight_quantizers.N names to the singular *weight_quantizer form, so existing stock configs/recipes match the newly-detected experts with no recipe changes.
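
As a rough illustration of that name mapping, the sketch below strips the per-expert index and pluralization so that a wildcard such as *weight_quantizer can match. It is an assumption about the behavior, not the body of _normalize_fused_experts_quantizer_name; the helper name `normalize_quantizer_name`, the regular expression, and the example module path are hypothetical.

```python
import fnmatch
import re

# Matches a trailing per-expert suffix such as "_weight_quantizers.7".
_PER_EXPERT_SUFFIX = re.compile(r"_weight_quantizers\.\d+$")


def normalize_quantizer_name(name: str) -> str:
    """Map a per-expert quantizer name to the singular form (assumed behavior)."""
    return _PER_EXPERT_SUFFIX.sub("_weight_quantizer", name)


# Hypothetical fully qualified name of one expert's weight quantizer.
raw_name = "model.layers.3.mlp.experts.gate_up_proj_weight_quantizers.7"
normalized = normalize_quantizer_name(raw_name)

# A stock-style wildcard matches only the normalized name.
matches_raw = fnmatch.fnmatch(raw_name, "*weight_quantizer")
matches_normalized = fnmatch.fnmatch(normalized, "*weight_quantizer")
```

Names that do not end in a per-expert suffix pass through unchanged, so ordinary weight_quantizer names would be unaffected in this sketch.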

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed MoE expert module detection to support additional configurations, enabling NVFP4/FP8 quantization and HuggingFace export for expert module variants that previously failed to export.

register_fused_experts_on_the_fly skipped fused-expert modules lacking an
act_fn attribute. MiniMaxM3VLExperts (transformers 5.12.0) uses a custom
GPT-OSS-style gated activation between its two F.linear calls instead of an
act_fn attribute, so it was never wrapped as _QuantFusedExperts: routed
experts stayed unquantized (an experts-only recipe matched nothing) and HF
export failed with NotImplementedError.

_QuantFusedExperts is activation-agnostic (it only intercepts the two
F.linear calls, gate_up then down), so act_fn is irrelevant to quantization,
calibration, and export. Drop the requirement from _is_fused_experts_module.
Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3.

Verified end-to-end: experts-only NVFP4 + FP8 KV PTQ of MiniMaxAI/MiniMax-M3
detects MiniMaxM3VLExperts, quantizes all 57 MoE layers, and exports a valid
HF checkpoint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 requested a review from a team as a code owner June 14, 2026 02:26
@Edwardf0t1 Edwardf0t1 requested a review from Fridah-nv June 14, 2026 02:26

coderabbitai Bot commented Jun 14, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7eb44013-3148-4fdf-bd26-9207d4ed7c4e

📥 Commits

Reviewing files that changed from the base of the PR and between 9f37fe1 and b4c9ce6.

📒 Files selected for processing (3)
  • CHANGELOG.rst
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py

Walkthrough

_is_fused_experts_module in huggingface.py is updated to drop the act_fn attribute requirement, so modules exposing only gate_up_proj, down_proj, and num_experts now qualify as fused experts. The unit test is updated to assert True for that case, and a changelog entry records the fix.

Changes

Fused expert detection fix

Layer / File(s) | Summary
_is_fused_experts_module predicate, test, and changelog (modelopt/torch/quantization/plugins/huggingface.py, tests/unit/torch/quantization/plugins/test_fused_experts.py, CHANGELOG.rst) | Removes act_fn from the boolean checks in _is_fused_experts_module and updates its docstring; replaces the test case that expected False when act_fn is absent with one asserting True using a synthetic module of only the required attributes; adds a changelog bug-fix entry.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested reviewers

  • realAsma
  • sychen52
  • meenchen

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly identifies the main fix: removing act_fn requirement for fused MoE expert detection, with the specific affected model (MiniMax-M3) noted.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns | ✅ Passed | PR contains no security anti-patterns: no unsafe torch.load/numpy.load, no hardcoded trust_remote_code=True in code, no eval/exec on external input, no #nosec comments, no new unsafe dependencies.

@github-actions

Contributor
PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1711/

Built to branch gh-pages at 2026-06-14 02:30 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

codecov Bot commented Jun 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.54%. Comparing base (9f37fe1) to head (b4c9ce6).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1711      +/-   ##
==========================================
- Coverage   77.12%   76.54%   -0.59%     
==========================================
  Files         511      511              
  Lines       56236    56236              
==========================================
- Hits        43374    43045     -329     
- Misses      12862    13191     +329     

Flag | Coverage Δ
examples | 41.83% <100.00%> (-0.13%) ⬇️
gpu | 57.75% <100.00%> (-0.63%) ⬇️
regression | 14.69% <0.00%> (+0.06%) ⬆️
unit | 54.40% <100.00%> (+<0.01%) ⬆️
