
Add Vision Support for Minimax-M3#25113

Draft
timkhronos wants to merge 50 commits into ggml-org:master from timkhronos:MSA-Vision

Conversation

@timkhronos

@timkhronos timkhronos commented Jun 28, 2026

Contributor

Overview

Implement MiniMax-M3 vision support. The vision tower itself is a Qwen2.5-VL style ViT (now reuses build_vit). The major differences are that M3 uses a 3-axis (T/H/W) RoPE, a gate-less GELU-erf FFN and a two-stage patch-merge projector.

Stacked on #24908, so the full diff carries the MSA base until that merges. The vision-only changes are [here]

Additional information

The preprocessing matches Qwen2.5-VL's. In the graph, the summed-temporal-Conv2D patch embed, the 2×2 spatial-merge reorder, the separate biased q/k/v attention, and the pre-LN should also match Qwen2.5-VL.
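The 2×2 spatial-merge reorder mentioned above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `raster_to_block_order` is hypothetical, and the row-major order inside each 2×2 block is an assumption based on the Qwen2-VL-style flatten.

```python
def raster_to_block_order(grid_h, grid_w, merge=2):
    """Return raster patch indices reordered so that every
    merge x merge block of patches becomes contiguous."""
    if grid_h % merge != 0 or grid_w % merge != 0:
        raise ValueError("grid must be divisible by the merge size")
    order = []
    for by in range(0, grid_h, merge):          # block row
        for bx in range(0, grid_w, merge):      # block column
            for dy in range(merge):             # row inside the block
                for dx in range(merge):         # column inside the block
                    order.append((by + dy) * grid_w + (bx + dx))
    return order


if __name__ == "__main__":
    # 4x4 patch grid: the first four entries form the top-left 2x2 block.
    print(raster_to_block_order(4, 4))
```

After this reorder, every run of four consecutive tokens is one 2×2 group, which is the form a group-of-4 projector would consume.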

Expanding a bit on the differences, the most substantial one is the 3-axis RoPE. The 3 bands are laid out as cat([f,f]) with HF split-half pairing and an axis_dim-keyed frequency schedule. I don't think this can reuse the existing qwen ggml_rope_multi with ggml_rope_type_vision, as the existing op can't express it without a q/k weight permute at conversion plus a vision mode that doesn't exist. The graph-level cos/sin matches HF directly, and uses the same approach build_rope_2d already uses for the 2-axis vision rope, generalized to 3 axes. T is the temporal axis, and for still images its coordinate is 0, but it should stay in the layout so H/W keep the same channels as in HF.
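As a rough sketch of the cat([f,f]) layout described above, the host-side angle table for one patch position could be built like this. The values axis_dim = 26 and rope_dim = 78 come from the review thread; the base `theta = 10000.0` and the exact per-axis frequency formula are assumptions made for illustration and are not confirmed by the HF code quoted here. All function names are hypothetical.

```python
import math

AXIS_DIM = 26              # per-axis rotary width (from the review thread)
HALF = AXIS_DIM // 2       # 13 frequencies per axis
ROPE_DIM = 3 * AXIS_DIM    # 78 rotated channels


def axis_inv_freq(theta=10000.0):
    # ASSUMPTION: standard RoPE schedule keyed on axis_dim.
    return [theta ** (-2.0 * k / AXIS_DIM) for k in range(HALF)]


def rope_angles(pos_t, pos_h, pos_w, theta=10000.0):
    """Angles for one position, laid out as cat([f, f]) with f = T | H | W."""
    inv = axis_inv_freq(theta)
    f = [p * w for p in (pos_t, pos_h, pos_w) for w in inv]  # 39 angles
    return f + f                                             # 78 angles


def rope_cos_sin(pos_t, pos_h, pos_w, theta=10000.0):
    angles = rope_angles(pos_t, pos_h, pos_w, theta)
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]
```

With pos_t = 0 (a still image), the T channels get cos = 1 and sin = 0, so they pass through unrotated while still occupying their slots, which keeps H and W on the same channels as in the layout above.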

Vision MLP is a plain GELU-erf, while qwen2.5vl uses a gated FFN.
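To make the FFN difference concrete, here is a small pure-Python sketch of a gate-less GELU-erf FFN next to a gated one. Biases are omitted, the helper names are hypothetical, and the SiLU gate used for the gated variant is an assumption about the Qwen2.5-VL side rather than something stated in this PR.

```python
import math


def gelu_erf(x):
    # Exact (erf-based) GELU, as opposed to the tanh approximation.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    # w is a list of rows; returns w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def ffn_plain(x, w_up, w_down):
    """Gate-less FFN: down(gelu_erf(up(x)))."""
    return matvec(w_down, [gelu_erf(v) for v in matvec(w_up, x)])


def ffn_gated(x, w_gate, w_up, w_down, act=silu):
    """Gated FFN: down(act(gate(x)) * up(x))."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return matvec(w_down, [act(g) * u for g, u in zip(gate, up)])
```

The plain variant needs two weight matrices per layer instead of three, which is why the gate tensor is simply absent rather than set to identity.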

The projector is two-stage: a per-patch MLP (mm.1 / mm.2), a 2×2 group concat, then a merge MLP (mm.merge.fc1 / fc2), with both MLPs using GELU-erf. Qwen uses a single post-merge MLP.
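The two-stage flow can be traced at the shape level with a short sketch. The function and parameter names are hypothetical, and all widths except the 6144-wide output (taken from the Validation section) are placeholders, since the intermediate widths are not stated in this description.

```python
def projector_shapes(n_patches, d_vit, d_patch, d_hidden, d_out, merge=2):
    """Return (stage, (tokens, width)) pairs for the two-stage projector."""
    group = merge * merge
    if n_patches % group != 0:
        raise ValueError("patch count must be a multiple of merge*merge")
    n_merged = n_patches // group
    return [
        ("vit output", (n_patches, d_vit)),
        ("per-patch MLP (mm.1 / mm.2)", (n_patches, d_patch)),
        ("2x2 group concat", (n_merged, group * d_patch)),
        ("merge MLP fc1", (n_merged, d_hidden)),
        ("merge MLP fc2", (n_merged, d_out)),
    ]


if __name__ == "__main__":
    # 1024 patches -> 256 tokens; 1280 and 5120 are made-up widths.
    for stage, shape in projector_shapes(1024, 1280, 1280, 5120, 6144):
        print(f"{stage:30s} {shape}")
```

The point of the sketch is only the ordering: the per-patch MLP runs before the token count drops by 4, and the merge MLP runs after.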

There is also no post-layer norm and no window attention, only pre_layernorm.

Validation

The metrics below predate the build_vit change; I will retest.
Generated vision embeddings vs the HF reference on an identical sample image:

shape : 256 tokens x 6144 embd
overall cosine : 0.999949
per-token cosine: mean=0.999454 min=0.963887 (worst token 95)
relative L2 err : 0.010137
abs err : mean=0.03815 max=15.16844
(The high max-abs is most likely a single high-magnitude channel; cosine and relative-L2 are the embedding-level metrics.)
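For readers who want to reproduce this kind of parity check, the metrics above can be computed along these lines. This is a sketch using common definitions, not the comparison script used for this PR; the function names and the exact definition of each metric are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def compare_embeddings(ours, ref):
    """ours, ref: lists of tokens, each token a list of floats."""
    flat_o = [v for tok in ours for v in tok]
    flat_r = [v for tok in ref for v in tok]
    per_token = [cosine(o, r) for o, r in zip(ours, ref)]
    abs_err = [abs(o - r) for o, r in zip(flat_o, flat_r)]
    worst = min(range(len(per_token)), key=per_token.__getitem__)
    return {
        "overall_cosine": cosine(flat_o, flat_r),
        "per_token_mean": sum(per_token) / len(per_token),
        "per_token_min": per_token[worst],
        "worst_token": worst,
        # ||ours - ref||_2 / ||ref||_2 over the flattened embeddings
        "relative_l2": math.sqrt(sum(e * e for e in abs_err))
        / math.sqrt(sum(r * r for r in flat_r)),
        "abs_err_mean": sum(abs_err) / len(abs_err),
        "abs_err_max": max(abs_err),
    }
```

Because cosine and relative L2 are normalized, one large-magnitude channel can push the maximum absolute error up without moving them much, which is consistent with the note above.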

Requirements

AI assistance disclosure

AI assistance was used during development, but the code is not an unreviewed AI-generated code drop.

Scope of AI assistance:

  • Helped write and debug a small local Python comparison script that was used to dump MiniMax-M3-VL vision embeddings from the Hugging Face implementation for parity checks.
  • Helped review and organize explanations of the implementation while I was debugging.
  • Helped reason through code comments and possible wording.
  • The submitted llama.cpp implementation was not accepted blindly from AI output.
  • I can explain the implemented code paths and the MiniMax/HF behavior they are matching.

If a stricter or differently formatted disclosure is preferred, please specify the exact wording/fields expected.

danielhanchen and others added 30 commits June 22, 2026 15:34
Text-only port that re-uses existing components: MiniMax-M2 style GQA with
per-head QK-norm and partial rotary, DeepSeek-V3 style leading-dense and
routed/shared experts, and swigluoai activation. Sparse attention is not
yet supported (dense fallback); vision tower and MTP heads are dropped.
4-way paths. Full debug harness remains at <8136a9c68ed7a5eb009aa67bba3fda8062f4648f> for reproducing the
selection-parity validation.
Note: All GGUFs generated before this change will need to be regenerated.
Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>
timkhronos and others added 8 commits June 28, 2026 21:41
Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>
@github-actions github-actions bot added the model, testing, mtmd, and conversion labels Jun 28, 2026
@ggml-gh-bot

This comment was marked as resolved.

@ngxson ngxson left a comment

Collaborator

feels like > 90% of the code is AI-generated, do you really understand it?

Comment thread tools/mtmd/models/minimax_m3.cpp Outdated
Comment thread tools/mtmd/models/minimax_m3.cpp Outdated
Comment on lines +8 to +13
// Apply MiniMax-M3 3D RoPE using host-precomputed cos/sin (filled in set_input).
// x : [d_head, n_head, n_pos]
// rope_cos : [rope_dim, 1, n_pos] (rope_dim = 3*axis_dim = 78, broadcasts over heads)
// First rope_dim dims are rotated (HF block rotate_half); the tail passes through.
ggml_tensor * clip_graph_minimax_m3::apply_rope(
ggml_tensor * x, ggml_tensor * rope_cos, ggml_tensor * rope_sin) {

Collaborator

this sounds like slop AI-generated code; unless you can explain with your own words how it's different from qwen-vl's mrope, I'm not convinced that we need a dedicated apply_rope here

Contributor Author

As far as I can tell from the HF file, M3 isn't using qwen's mrope. The existing ggml_rope_multi with ggml_rope_type_vision is a 4-section split using ggml's own channel pairing, while M3's HF rope is 3 axes (T/H/W) on a shared axis_dim, laid out as cat([f,f]) and rotated split-half (dim i paired with i+39), over the first 78 of the 80 dims.

Using the existing native op would mean permuting the q/k weights at conversion to match ggml's pairing.

Also, while T is the temporal axis, and its coordinate is 0 and does nothing for still images, I believe it has to stay in the layout or H/W will be on different channels than the ones HF puts them on.

Collaborator

I'm not convinced by your explanation.

I asked claude to explain the differences among 3 modes:

[image]

note that GGML_ROPE_TYPE_VISION is NOT 4 dim, only 2 first dims are actually used.

unless you can prove me wrong: sin/cos grids are NOT necessary. this rope impl can be done by splitting the tensor into different sections and then applying rope independently to each part. look at gemma4v.cpp and build_rope_2d

2 possible implementations (I haven't tried):

  • split tensor into 4 sections, use ggml_rope_ext for w and h parts, then ggml_concat back
  • split into 2 sections (t, h) and (w, p), then apply GGML_ROPE_TYPE_VISION on each part, then ggml_concat back

Contributor Author

I think the best way is to permute q/k at conversion, which lets the runtime use rope_ext normally. The HF layout is [Ta Ha Wa | Tb Hb Wb | pad], which is technically a 3-section Neox RoPE, but each axis's paired channels are not contiguous. To make the runtime graph use the native ggml_rope_ext, I permute the Q/K projection weights and biases into [Ta Tb | Ha Hb | Wa Wb | pad], which makes each T/H/W axis a contiguous 26-dim Neox block. Since the same permutation is applied to both q and k, it cancels in Q·K^T, so the attention output remains unchanged.

This way the graph can pass pos_t, pos_h, and pos_w directly to apply_rope(), slice Q/K into T/H/W/pad sections, apply ggml_rope_ext in Neox mode to each 26-dim section, and concatenate the result back.
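The equivalence claimed here can be checked numerically with a standalone sketch. It is not the conversion code from this PR: the names are hypothetical and the frequency schedule (`theta = 10000.0`) is an assumption, although the layout argument does not depend on which schedule is used as long as both paths share it.

```python
import math

AXIS_DIM, HALF, ROPE_DIM, D_HEAD = 26, 13, 78, 80


def hf_to_sectioned_perm():
    """perm[new] = old: [Ta Ha Wa | Tb Hb Wb | pad] -> [Ta Tb | Ha Hb | Wa Wb | pad]."""
    perm = []
    for axis in range(3):
        for half in range(2):
            perm.extend(half * (ROPE_DIM // 2) + axis * HALF + k for k in range(HALF))
    perm.extend(range(ROPE_DIM, D_HEAD))
    return perm


def inv_freq(theta=10000.0):
    return [theta ** (-2.0 * k / AXIS_DIM) for k in range(HALF)]  # ASSUMED schedule


def rope_hf(x, pos):
    """Split-half rotation on the HF layout; pos = (t, h, w)."""
    out = list(x)
    f = [p * w for p in pos for w in inv_freq()]      # 39 angles: T | H | W
    for i, a in enumerate(f):
        c, s, lo, hi = math.cos(a), math.sin(a), x[i], x[i + 39]
        out[i], out[i + 39] = lo * c - hi * s, hi * c + lo * s
    return out


def rope_sectioned(x, pos):
    """Neox-style rotation applied per contiguous 26-dim section."""
    out = list(x)
    for axis, p in enumerate(pos):
        base = axis * AXIS_DIM
        for k, w in enumerate(inv_freq()):
            c, s = math.cos(p * w), math.sin(p * w)
            lo, hi = x[base + k], x[base + HALF + k]
            out[base + k], out[base + HALF + k] = lo * c - hi * s, hi * c + lo * s
    return out


def permute(v, perm):
    return [v[i] for i in perm]
```

Permuting and then rotating per section gives the same vector as rotating in the HF layout and then permuting, and because q and k share the permutation their dot product is unchanged.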

Pushed it in 296f98b.

I also checked image understanding, including OCR and object positioning, and did not see any regressions versus the previous implementation.

Claude was used to help reason through the equivalence and draft the conversion side permutation code.

Comment thread tools/mtmd/models/minimax_m3.cpp Outdated
@ngxson

ngxson commented Jun 29, 2026

Collaborator

A standard ViT backbone (separate biased q/k/v/o, LayerNorm, GELU MLP, full bidirectional attention, no mask, no windowing) that diverges from vanilla CLIP in four ways:

* **Conv3D patch embed, run as summed Conv2D slices.** The HF model uses a Conv3D patch embedding with `temporal_patch_size` slices; conversion splits the 5D weight into per-slice Conv2D kernels (`V_ENC_EMBD_PATCH` + `.weight.{t}`) and the graph sums the outputs. Exact for still images (video out of scope). No patch-embed bias (asserted).

* **Custom 3-axis (T/H/W) RoPE.** `axis_dim = 26`, `rope_dim = 3·26 = 78`, applied to the first 78 channels of each head with HF `rotate_half` semantics, tail passed through. Cos/sin are host-precomputed and fed as graph inputs (`minimax_cos`/`minimax_sin`). Since `rope_dim (78) < d_head (80)` this is partial rotary, same pattern as the text tower, 3-axis.

* **2×2 spatial-merge token reduction.** Patches are reordered raster -> block (matching the HF flatten) and merged 2×2, so the projector consumes groups of 4. `spatial_merge_size` is emitted in conversion.

* **No class token, no absolute position table, no post-layernorm.** A `pre_layernorm` only; sinks / abs-pos / class-embedding all absent and asserted null.

@timkhronos did you even read what AI generates here? and ask yourself how wrong it is?

there is no "vanilla CLIP" in mtmd, this model is just qwen-vl with some subtle differences

Rewrite code comment based on feedback and to better reflect the actual architecture, and reuse existing build_vit
@ngxson

ngxson commented Jun 29, 2026

Collaborator

I refuse to proceed with this PR until you are honest about AI usage

@timkhronos

Contributor Author

@ngxson I expanded the AI disclosure in the PR description to cover precisely what AI was used to assist with.

Comment thread tools/mtmd/models/minimax-m3.cpp
Comment thread tools/mtmd/clip.cpp Outdated
Comment thread tools/mtmd/clip.cpp Outdated
@timkhronos

timkhronos commented Jul 1, 2026

Contributor Author

For anyone testing, the latest changes require the mmproj file to be reconverted.

I'll upload one later, but for the time being you can convert a fresh one with this PR by running:

python convert_hf_to_gguf.py MiniMaxAI/MiniMax-M3 --mmproj --remote --outtype bf16 --outfile Minimax-mmproj.gguf



3 participants