fix(quantization): detect fused MoE experts without act_fn (MiniMax-M3) #1711
Edwardf0t1 wants to merge 1 commit into
register_fused_experts_on_the_fly skipped fused-expert modules lacking an act_fn attribute. MiniMaxM3VLExperts (transformers 5.12.0) uses a custom GPT-OSS-style gated activation between its two F.linear calls instead of an act_fn attribute, so it was never wrapped as _QuantFusedExperts: routed experts stayed unquantized (an experts-only recipe matched nothing) and HF export failed with NotImplementedError.

_QuantFusedExperts is activation-agnostic (it only intercepts the two F.linear calls, gate_up then down), so act_fn is irrelevant to quantization, calibration, and export. Drop the requirement from _is_fused_experts_module. Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3.

Verified end-to-end: experts-only NVFP4 + FP8 KV PTQ of MiniMaxAI/MiniMax-M3 detects MiniMaxM3VLExperts, quantizes all 57 MoE layers, and exports a valid HF checkpoint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
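The activation-agnostic argument above can be illustrated with a small pure-Python sketch. It is not the `_QuantFusedExperts` implementation: `AlternatingLinearTap`, `expert_forward`, `silu_gate`, and `clamped_gate` are hypothetical names, `linear` is a list-based stand-in for `F.linear`, and the clamp/alpha constants are assumptions chosen only for illustration. The point is that a wrapper which labels successive linear calls as gate_up, then down, sees the same call pattern whatever activation runs between them.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix) -> Vector:
    """Plain-Python stand-in for a bias-free linear layer: y = W @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


class AlternatingLinearTap:
    """Hypothetical wrapper that labels linear calls gate_up, down, gate_up, ..."""

    def __init__(self, linear_fn: Callable[[Vector, Matrix], Vector]) -> None:
        self._linear_fn = linear_fn
        self._expect_gate_up = True
        self.roles: List[str] = []

    def __call__(self, x: Vector, weight: Matrix) -> Vector:
        role = "gate_up" if self._expect_gate_up else "down"
        self._expect_gate_up = not self._expect_gate_up
        # A real wrapper would apply the quantizers for this role here.
        self.roles.append(role)
        return self._linear_fn(x, weight)


def expert_forward(x, gate_up_w, down_w, activation, linear_fn) -> Vector:
    """One expert: linear, then an arbitrary gated activation, then linear."""
    hidden = linear_fn(x, gate_up_w)
    half = len(hidden) // 2
    gated = activation(hidden[:half], hidden[half:])
    return linear_fn(gated, down_w)


def silu_gate(gate: Vector, up: Vector) -> Vector:
    """Standard SiLU-gated product."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


def clamped_gate(gate: Vector, up: Vector, limit: float = 7.0, alpha: float = 1.702) -> Vector:
    """Clamp/alpha gated variant; the constants are illustrative assumptions."""
    out = []
    for g, u in zip(gate, up):
        g = min(g, limit)
        u = max(-limit, min(u, limit))
        out.append(g / (1.0 + math.exp(-alpha * g)) * (u + 1.0))
    return out


tap = AlternatingLinearTap(linear)
x = [1.0, -2.0]
gate_up_w = [[0.5, 0.1], [0.2, -0.3], [1.0, 0.0], [0.0, 1.0]]  # (2 * hidden, in)
down_w = [[1.0, 1.0], [0.5, -0.5]]  # (out, hidden)

outputs = [expert_forward(x, gate_up_w, down_w, act, tap) for act in (silu_gate, clamped_gate)]
print(tap.roles)  # ['gate_up', 'down', 'gate_up', 'down']
```

Both activations produce the same sequence of intercepted roles, which is why a wrapper of this shape has no need to inspect an `act_fn` attribute.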
No actionable comments were generated in the recent review.
Run configuration: path .coderabbit.yaml, review profile CHILL, plan Enterprise.
Files selected for processing: 3
Changes: Fused expert detection fix
Estimated code review effort: 1 (Trivial), ~5 minutes
Pre-merge checks: 5 passed, 1 failed (1 warning)
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #1711 +/- ##
==========================================
- Coverage 77.12% 76.54% -0.59%
==========================================
Files 511 511
Lines 56236 56236
==========================================
- Hits 43374 43045 -329
- Misses 12862 13191 +329
What does this PR do?
Type of change: Bug fix
Fused MoE expert auto-detection (`register_fused_experts_on_the_fly` → `_is_fused_experts_module`) required every fused-expert container to expose an `act_fn` attribute. `MiniMaxM3VLExperts` (transformers 5.12.0) applies a custom GPT-OSS-style gated activation (`_apply_gate`, swiglu with clamp/alpha) between its two `F.linear` calls instead of exposing `act_fn`, so it failed detection and was never wrapped as `_QuantFusedExperts`. Consequences:
- an experts-only recipe (`*mlp.experts*`) matched nothing (only KV-cache quant applied), and
- HF export failed with `NotImplementedError: MoE model with experts type 'MiniMaxM3VLExperts' is not supported in export.`

`_QuantFusedExperts` is activation-agnostic: it only intercepts the two `F.linear` calls (gate_up then down, in strict alternation) and never touches `act_fn`, so the `act_fn` requirement was unnecessary. This PR drops it (keeping the `num_experts` + 3-D `gate_up_proj`/`down_proj` checks), which enables NVFP4/FP8 PTQ and export for MiniMax-M2 / MiniMax-M3.

Usage
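As a rough illustration of how the relaxed detection rule behaves, here is a minimal sketch that assumes only what the description states: a module counts as fused experts if it has `num_experts` plus 3-D `gate_up_proj` and `down_proj`, with no `act_fn` check. The names `looks_like_fused_experts`, `FakeTensor`, and the three example classes are hypothetical stand-ins, not ModelOpt code, and the shapes are arbitrary.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class FakeTensor:
    """Stand-in for a tensor; only the number of dimensions matters here."""

    shape: Tuple[int, ...]

    @property
    def ndim(self) -> int:
        return len(self.shape)


def looks_like_fused_experts(module: Any) -> bool:
    """Hypothetical check: num_experts plus 3-D gate_up_proj and down_proj.

    Deliberately does not look at act_fn.
    """
    if not hasattr(module, "num_experts"):
        return False
    for name in ("gate_up_proj", "down_proj"):
        param = getattr(module, name, None)
        if getattr(param, "ndim", None) != 3:
            return False
    return True


class ExpertsWithoutActFn:
    """Fused-expert container with no act_fn attribute (illustrative shapes)."""

    def __init__(self) -> None:
        self.num_experts = 4
        self.gate_up_proj = FakeTensor((4, 16, 8))
        self.down_proj = FakeTensor((4, 8, 8))


class ExpertsWith2DGateUp:
    """Negative case: gate_up_proj is 2-D, so the weights are not fused per expert."""

    def __init__(self) -> None:
        self.num_experts = 4
        self.gate_up_proj = FakeTensor((16, 8))
        self.down_proj = FakeTensor((4, 8, 8))


class PlainLinear:
    """Negative case: an ordinary linear layer with a single 2-D weight."""

    def __init__(self) -> None:
        self.weight = FakeTensor((8, 8))


for cls in (ExpertsWithoutActFn, ExpertsWith2DGateUp, PlainLinear):
    print(cls.__name__, looks_like_fused_experts(cls()))
```

The first class is accepted even though it has no `act_fn`, while the 2-D and plain-linear cases are still rejected, mirroring the cases the unit test is described as covering.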
Testing
- `tests/unit/torch/quantization/plugins/test_fused_experts.py`: the previous `test_module_missing_act_fn_not_detected` (which asserted the old, now-incorrect behavior) is replaced by `test_module_missing_act_fn_still_detected`, asserting that a fused-expert module without `act_fn` is detected. Negative cases (2-D `gate_up`, plain `nn.Linear`) are still rejected.
- `MiniMaxAI/MiniMax-M3` (~428B-A23B, transformers 5.12.0, 8×B300): detection logs `Detected fused MoE experts ... of type MiniMaxM3VLExperts`, all 57 MoE layers' experts quantize to NVFP4 (21,888 expert weights with scales; 0 on attention/shared-experts/vision tower), KV cache to FP8, and the HF checkpoint exports successfully (854 GB → 260 GB).

Before your PR is "Ready for review"
`CONTRIBUTING.md`: N/A

Additional Information
`conversion.py::_normalize_fused_experts_quantizer_name` already maps the per-expert `gate_up_proj_weight_quantizers.N` names to the singular `*weight_quantizer` form, so existing stock configs/recipes match the newly-detected experts with no recipe changes.

🤖 Generated with Claude Code
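The name normalization mentioned under Additional Information can be sketched as follows. This is an assumption-laden illustration, not the body of `_normalize_fused_experts_quantizer_name`: the regex, the helper name `normalize_quantizer_name`, the example module path, and the use of glob-style matching via `fnmatchcase` are all hypothetical. It shows why a per-expert name ending in `.N` would miss a `*weight_quantizer` pattern until the index is folded away.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical pattern: a per-expert suffix such as "_weight_quantizers.7".
_PER_EXPERT_SUFFIX = re.compile(r"_weight_quantizers\.\d+$")


def normalize_quantizer_name(name: str) -> str:
    """Fold a per-expert quantizer name into the singular form."""
    return _PER_EXPERT_SUFFIX.sub("_weight_quantizer", name)


# Illustrative module path; the real layout may differ.
raw_name = "model.layers.0.mlp.experts.gate_up_proj_weight_quantizers.7"
normalized = normalize_quantizer_name(raw_name)

print(normalized)
print(fnmatchcase(raw_name, "*weight_quantizer"))    # per-expert name misses the glob
print(fnmatchcase(normalized, "*weight_quantizer"))  # singular name matches it
print(fnmatchcase(normalized, "*mlp.experts*"))      # experts-only glob also matches
```

Names that do not carry the per-expert suffix pass through unchanged, so a helper of this shape could be applied to every quantizer name before pattern matching.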