Add Vision Support for Minimax-M3 #25113
Text-only port that re-uses existing components: MiniMax-M2 style GQA with per-head QK-norm and partial rotary, DeepSeek-V3 style leading-dense and routed/shared experts, and swigluoai activation. Sparse attention is not yet supported (dense fallback); vision tower and MTP heads are dropped.
…ch per group block picking
4-way paths. Full debug harness remains at <8136a9c68ed7a5eb009aa67bba3fda8062f4648f> for reproducing the selection-parity validation.
Note: All GGUFs generated before this change will need to be regenerated.
Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>
ngxson left a comment
feels like > 90% of the code is AI-generated, do you really understand it?
    // Apply MiniMax-M3 3D RoPE using host-precomputed cos/sin (filled in set_input).
    //   x        : [d_head, n_head, n_pos]
    //   rope_cos : [rope_dim, 1, n_pos]  (rope_dim = 3*axis_dim = 78, broadcasts over heads)
    // First rope_dim dims are rotated (HF block rotate_half); the tail passes through.
    ggml_tensor * clip_graph_minimax_m3::apply_rope(
            ggml_tensor * x, ggml_tensor * rope_cos, ggml_tensor * rope_sin) {
this sounds like slop AI-generated code; unless you can explain with your own words how it's different from qwen-vl's mrope, I'm not convinced that we need a dedicated apply_rope here
As far as I can tell from the HF file, M3 isn't using qwen's mrope. The existing ggml_rope_multi with ggml_rope_type_vision is a 4-section split using ggml's own channel pairing, while M3's HF rope is 3 axes (T/H/W) on a shared axis_dim, laid out as cat([f,f]) and rotated split-half (dim i paired with i+39), over the first 78 of the 80 dims.
Using the existing native op would mean permuting the q/k weights at conversion to match ggml's pairing.
Also, while T is the temporal axis, and its coordinate is 0 and does nothing for still images, I believe it has to stay in the layout or H/W will be on different channels than the ones HF lands them on.
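To make the layout described above concrete, here is a minimal pure-Python sketch of a split-half 3-axis rotation on one head vector. The sizes (axis_dim 26, rope_dim 78, d_head 80, pairing i with i+39) come from this thread; the frequency base of 10000 and the exact `base ** (-2j / axis_dim)` schedule are assumptions for illustration, and the function names are hypothetical, not taken from the HF file or from ggml.

```python
import math

AXIS_DIM = 26                  # per-axis rotary width (78 / 3, from the thread)
N_AXES = 3                     # T, H, W
ROPE_DIM = N_AXES * AXIS_DIM   # 78 rotated dims
HALF = ROPE_DIM // 2           # 39: dim i is paired with dim i + 39
D_HEAD = 80                    # the last 2 dims pass through unrotated
BASE = 10000.0                 # ASSUMPTION: frequency base is not stated in the thread


def axis_freqs():
    """Per-axis frequency schedule keyed on AXIS_DIM (assumed form)."""
    return [BASE ** (-2.0 * j / AXIS_DIM) for j in range(AXIS_DIM // 2)]


def angles(pos_t, pos_h, pos_w):
    """Build f = [T band | H band | W band] (39 values), then cat([f, f])."""
    f = []
    for pos in (pos_t, pos_h, pos_w):
        f.extend(pos * w for w in axis_freqs())
    return f + f


def rotate_half(x):
    """Split-half rotation helper: (a, b) -> (-b, a) over the two 39-dim halves."""
    return [-v for v in x[HALF:]] + x[:HALF]


def apply_rope_hf(x, pos_t, pos_h, pos_w):
    """Rotate the first ROPE_DIM dims of one head vector; keep the tail as-is."""
    ang = angles(pos_t, pos_h, pos_w)
    rot, tail = x[:ROPE_DIM], x[ROPE_DIM:]
    rh = rotate_half(rot)
    out = [rot[i] * math.cos(ang[i]) + rh[i] * math.sin(ang[i])
           for i in range(ROPE_DIM)]
    return out + tail
```

With `pos_t = 0` the T band (dims 0..12 and 39..51) is left untouched, but it still occupies those channels, which is why dropping it would shift H and W onto different dims.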
I'm not convinced by your explanation.
I asked claude to explain the differences among 3 modes:
note that GGML_ROPE_TYPE_VISION is NOT 4 dim, only the first 2 dims are actually used.
unless you can prove me wrong: sin/cos grids are NOT necessary. this rope impl can be done by splitting the tensor into different sections, then applying rope independently to each part. look at gemma4v.cpp and build_rope_2d
2 possible implementations (I haven't tried):
- split tensor into 4 sections, use ggml_rope_ext for w and h parts, then ggml_concat back
- split into 2 sections (t, h) and (w, p), then GGML_ROPE_TYPE_VISION on each part, then ggml_concat back
I think the best way is to permute q/k at conversion, which lets the runtime use rope_ext normally. The HF layout is [Ta Ha Wa | Tb Hb Wb | pad], which is technically a 3-section Neox RoPE, but each axis's paired channels are not contiguous. To make the runtime graph use the native ggml_rope_ext, I permute the Q/K projection weights and biases into [Ta Tb | Ha Hb | Wa Wb | pad], which makes each T/H/W axis a contiguous 26-dim Neox block. Since the same permutation is applied to both q and k, it cancels in Q·K^T, so the attention output remains unchanged.
This way the graph can pass pos_t, pos_h, and pos_w directly to apply_rope(), slice Q/K into T/H/W/pad sections, apply ggml_rope_ext in Neox mode to each 26-dim section, and concatenate the result back.
Pushed it in 296f98b.
I also checked image understanding, including OCR and object positioning, and did not see any regressions versus the previous implementation.
Claude was used to help reason through the equivalence and draft the conversion side permutation code.
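The equivalence claimed in this comment can be checked numerically with a small pure-Python sketch. It permutes activations rather than weights (permuting the output rows of the Q/K projections has the same effect on the activations), and it models the Neox-mode rotation by hand instead of calling ggml. The frequency base of 10000 and all function names are assumptions for illustration only.

```python
import math
import random

AXIS_DIM, N_AXES, D_HEAD = 26, 3, 80
HALF_AXIS = AXIS_DIM // 2            # 13 frequencies per axis
ROPE_DIM = N_AXES * AXIS_DIM         # 78
HALF = ROPE_DIM // 2                 # 39
BASE = 10000.0                       # ASSUMPTION: not stated in the thread


def rope_hf(x, pos):
    """HF-style layout [Ta Ha Wa | Tb Hb Wb | pad], dim i paired with i + 39."""
    f = [p * BASE ** (-2.0 * j / AXIS_DIM) for p in pos for j in range(HALF_AXIS)]
    ang = f + f
    rot = x[:ROPE_DIM]
    rh = [-v for v in rot[HALF:]] + rot[:HALF]
    return [rot[i] * math.cos(ang[i]) + rh[i] * math.sin(ang[i])
            for i in range(ROPE_DIM)] + x[ROPE_DIM:]


def permutation():
    """Index map from HF layout to [Ta Tb | Ha Hb | Wa Wb | pad]."""
    perm = []
    for a in range(N_AXES):
        perm += range(a * HALF_AXIS, (a + 1) * HALF_AXIS)                # first halves
        perm += range(HALF + a * HALF_AXIS, HALF + (a + 1) * HALF_AXIS)  # second halves
    return perm + list(range(ROPE_DIM, D_HEAD))


def permute(x, perm):
    return [x[i] for i in perm]


def rope_neox_section(sec, pos):
    """Neox-style rotation of one contiguous block: dim i paired with i + n/2."""
    n, h = len(sec), len(sec) // 2
    out = [0.0] * n
    for i in range(h):
        theta = pos * BASE ** (-2.0 * i / n)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = sec[i] * c - sec[i + h] * s
        out[i + h] = sec[i] * s + sec[i + h] * c
    return out


def rope_sectioned(x, pos):
    """Slice into T/H/W/pad, rotate each 26-dim section, concatenate back."""
    out = []
    for a, p in enumerate(pos):
        out += rope_neox_section(x[a * AXIS_DIM:(a + 1) * AXIS_DIM], p)
    return out + x[ROPE_DIM:]


def score_gap(seed, pos_q, pos_k):
    """|q.k| difference between the HF path and the permuted, sectioned path."""
    rng = random.Random(seed)
    q = [rng.uniform(-1, 1) for _ in range(D_HEAD)]
    k = [rng.uniform(-1, 1) for _ in range(D_HEAD)]
    perm = permutation()
    hf = sum(a * b for a, b in zip(rope_hf(q, pos_q), rope_hf(k, pos_k)))
    pm = sum(a * b for a, b in zip(rope_sectioned(permute(q, perm), pos_q),
                                   rope_sectioned(permute(k, perm), pos_k)))
    return abs(hf - pm)
```

Calling `score_gap(0, (0, 3, 5), (0, 7, 2))` should return a value at floating-point noise level, since the sectioned output is just the permuted HF output and a dot product is invariant under a shared permutation.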
@timkhronos did you even read what AI generates here? and ask yourself how wrong it is? there is no "vanilla CLIP" in mtmd, this model is just qwen-vl with some subtle differences
Rewrite code comment based on feedback and to better reflect the actual architecture, and reuse existing build_vit
I refuse to proceed with this PR until you are honest about AI usage
@ngxson I expanded on the AI disclosure in the PR description, to cover precisely what AI was used to assist with.
For anyone testing, the latest changes require the mmproj file to be reconverted. I'll upload one later, but for the time being you can convert a fresh one with this pr, by using
Overview
Implement MiniMax-M3 vision support. The vision tower itself is a Qwen2.5-VL style ViT (now reuses build_vit). The major differences are that M3 uses a 3-axis (T/H/W) RoPE, a gate-less GELU-erf FFN and a two-stage patch-merge projector.
Stacked on #24908, so the full diff carries the MSA base until that merges. The vision-only changes are [here]
Additional information
The preprocessing matches Qwen2.5-VL's. Further, in the graph the summed-temporal-Conv2D patch embed, the 2×2 spatial-merge reorder, separate biased q/k/v attention, and pre-LN should also match.
Expanding a bit on the differences, the most substantial one is the 3-axis RoPE. The 3 bands are laid out as cat([f,f]) with a HF split half pairing and an axis_dim keyed frequency schedule. I don't think this can reuse the existing qwen ggml_rope_multi and the ggml_rope_type_vision, as the existing op can't express it without a q/k weight permute at conversion plus a vision mode that doesn't exist. The graph-level cos/sin matches HF directly, and uses the same approach build_rope_2d already uses for the 2-axis vision rope, generalized to 3 axes. T is the temporal axis, and for still images it's coordinate 0, but the layout should stay so H/W keep the same channels as HF.
Vision MLP is a plain GELU-erf, while qwen2.5vl uses a gated FFN.
The projector is two-stage: a per-patch MLP (mm.1 / mm.2), a 2×2 group concat, then a merge MLP (mm.merge.fc1 / fc2), both using GELU-erf, whereas qwen uses a single post-merge MLP.
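A rough pure-Python sketch of the data flow just described, using tiny made-up dimensions. It assumes each MLP is Linear, GELU-erf, Linear, and that patches arrive already reordered so that every 4 consecutive rows form one 2×2 spatial block; both points are assumptions here, and the helper names are hypothetical rather than taken from the PR.

```python
import math
import random


def gelu_erf(x):
    """Exact (erf-based) GELU, as opposed to the tanh approximation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, w, b):
    """y = W x + b with w as [n_out][n_in] and b as [n_out]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def mlp(x, fc1, fc2):
    """Gate-less two-layer MLP: Linear -> GELU-erf -> Linear (assumed ordering)."""
    hidden = [gelu_erf(v) for v in linear(x, *fc1)]
    return linear(hidden, *fc2)


def rand_linear(rng, n_in, n_out):
    """Random (weight, bias) pair, only for exercising the shapes."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-0.1, 0.1) for _ in range(n_out)]
    return (w, b)


def project(patches, per_patch, merge, group=4):
    """Stage 1: per-patch MLP. Stage 2: concat each 2x2 group, then merge MLP."""
    stage1 = [mlp(p, *per_patch) for p in patches]
    tokens = []
    for g in range(0, len(stage1), group):
        cat = [v for patch in stage1[g:g + group] for v in patch]
        tokens.append(mlp(cat, *merge))
    return tokens
```

In this sketch 16 input patches produce 4 output tokens, mirroring the 4:1 reduction of the 2×2 merge; a single post-merge MLP would skip `stage1` and feed the concatenated raw patches straight into the merge MLP.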
There is also no post-layer norm and no window attention, only pre_layernorm.
Validation
The metrics below predate the build_vit change. Will retest.
Generated vision embeddings vs the HF reference on an identical sample image:
    shape           : 256 tokens x 6144 embd
    overall cosine  : 0.999949
    per-token cosine: mean=0.999454 min=0.963887 (worst token 95)
    relative L2 err : 0.010137
    abs err         : mean=0.03815 max=15.16844

(The high max-abs is most likely a single high-magnitude channel; cosine and relative-L2 are the embedding-level metrics.)
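For anyone wanting to reproduce this kind of comparison, one plausible way to compute such metrics is sketched below in pure Python. The exact definitions used for the numbers above are not given in this PR, so the formulas here (flattened cosine, row-wise cosine, L2 of the difference divided by L2 of the reference) are assumptions, and `compare` is a hypothetical helper name.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def compare(test, ref):
    """Compare two [n_tokens][n_embd] embedding matrices (lists of lists)."""
    flat_t = [v for row in test for v in row]
    flat_r = [v for row in ref for v in row]
    per_tok = [cosine(t, r) for t, r in zip(test, ref)]
    worst = min(range(len(per_tok)), key=per_tok.__getitem__)
    diffs = [abs(t - r) for t, r in zip(flat_t, flat_r)]
    return {
        "shape": (len(test), len(test[0])),
        "overall_cosine": cosine(flat_t, flat_r),
        "per_token_mean": sum(per_tok) / len(per_tok),
        "per_token_min": per_tok[worst],
        "worst_token": worst,
        # ASSUMPTION: relative L2 = ||test - ref|| / ||ref||
        "rel_l2": math.sqrt(sum(d * d for d in diffs)) / math.sqrt(dot(flat_r, flat_r)),
        "abs_mean": sum(diffs) / len(diffs),
        "abs_max": max(diffs),
    }
```

Reporting the worst token index alongside the minimum per-token cosine makes it easy to check whether a low minimum comes from one outlier row or from a systematic drift.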
Requirements
AI assistance disclosure
AI assistance was used during development, but the code is not an unreviewed AI-generated code drop.
Scope of AI assistance:
If a stricter or differently formatted disclosure is preferred, please specify the exact wording/fields expected.