Add Vision Support for Minimax-M3 by timkhronos · Pull Request #25113 · ggml-org/llama.cpp

timkhronos · 2026-06-28T23:11:44Z

Overview

Implement MiniMax-M3 vision support. The vision tower itself is a Qwen2.5-VL style ViT (now reuses build_vit). The major differences are that M3 uses a 3-axis (T/H/W) RoPE, a gate-less GELU-erf FFN and a two-stage patch-merge projector.

Stacked on #24908, so the full diff carries the MSA base until that merges. The vision-only changes are [here]

Additional information

The preprocessing matches Qwen2.5-VL's. Further, in the graph the summed-temporal-Conv2D patch embed, the 2×2 spatial-merge reorder, separate biased q/k/v attention, and pre-LN should also match.
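To make the summed-temporal-Conv2D patch embed concrete, here is a minimal NumPy sketch, not the llama.cpp graph. It assumes the stride equals the patch size (so each patch is a single dot product) and that a still image is presented as identical duplicated frames. The function names and shapes are illustrative only.

```python
import numpy as np

def patch_embed_conv3d(frames, w):
    """Reference: one Conv3D whose kernel spans the full temporal extent.

    frames: [T, C, H, W], w: [out, C, T, P, P], stride == kernel size.
    Returns [H/P, W/P, out].
    """
    T, C, H, W = frames.shape
    P = w.shape[-1]
    gh, gw = H // P, W // P
    # Cut into non-overlapping patches: [gh, gw, T, C, P, P]
    patches = frames.reshape(T, C, gh, P, gw, P).transpose(2, 4, 0, 1, 3, 5)
    return np.einsum("hwtcpq,octpq->hwo", patches, w)

def patch_embed_summed_conv2d(frames, w):
    """Same result as a sum of per-slice Conv2D outputs (one kernel per t)."""
    T, C, H, W = frames.shape
    P = w.shape[-1]
    gh, gw = H // P, W // P
    total = np.zeros((gh, gw, w.shape[0]))
    for t in range(T):
        w_t = w[:, :, t]  # per-slice Conv2D kernel: [out, C, P, P]
        patches = frames[t].reshape(C, gh, P, gw, P).transpose(1, 3, 0, 2, 4)
        total += np.einsum("hwcpq,ocpq->hwo", patches, w_t)
    return total

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 8, 8))
frames = np.repeat(image[None], 2, axis=0)   # still image, duplicated frame
weight = rng.normal(size=(5, 3, 2, 4, 4))    # hypothetical Conv3D weight
print(np.allclose(patch_embed_conv3d(frames, weight),
                  patch_embed_summed_conv2d(frames, weight)))
```

The equivalence holds because the convolution is linear in its temporal taps, so splitting the kernel along T and summing the slice outputs changes nothing numerically beyond floating-point rounding.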

Expanding on the differences, the most substantial one is the 3-axis RoPE. The three bands are laid out as cat([f,f]) with HF split-half pairing and an axis_dim-keyed frequency schedule. I don't think this can reuse the existing Qwen path (ggml_rope_multi with ggml_rope_type_vision): the existing op can't express it without a q/k weight permute at conversion plus a vision mode that doesn't exist. The graph-level cos/sin matches HF directly and uses the same approach build_rope_2d already uses for the 2-axis vision RoPE, generalized to 3 axes. T is the temporal axis; for still images its coordinate is 0, but it should stay in the layout so that H/W keep the same channels as in HF.

The vision MLP is a plain (gate-less) GELU-erf FFN, whereas Qwen2.5-VL uses a gated FFN.

The projector is two-stage: a per-patch MLP (mm.1 / mm.2), a 2×2 group concat, then a merge MLP (mm.merge.fc1 / fc2), with both MLPs using GELU-erf. Qwen uses a single post-merge MLP.
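As a rough illustration of the data flow just described, the sketch below chains a per-patch MLP, a 2×2 group concat, and a merge MLP in NumPy. It is an assumption-laden toy: biases are omitted, the weight shapes are made up, the GELU is assumed to sit between the two linear layers of each MLP, and the raster-to-block reorder is folded into the grouping step for brevity (where it happens in the real graph is not shown here).

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu_erf(x):
    """Exact (erf-based) GELU, as opposed to the tanh approximation."""
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def group_2x2(x, gh, gw, m=2):
    """x: [gh*gw, C] in raster order -> [gh*gw/(m*m), m*m*C].

    Each output row is the concatenation of one m x m block of tokens.
    """
    c = x.shape[1]
    x = x.reshape(gh // m, m, gw // m, m, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, m * m * c)

def two_stage_projector(x, gh, gw, w1, w2, w_fc1, w_fc2):
    h = gelu_erf(x @ w1) @ w2            # stage 1: per-patch MLP (mm.1 / mm.2)
    g = group_2x2(h, gh, gw)             # 2x2 group concat
    return gelu_erf(g @ w_fc1) @ w_fc2   # stage 2: merge MLP (fc1 / fc2)

rng = np.random.default_rng(0)
gh, gw, c = 4, 4, 8                      # hypothetical grid and channel sizes
x = rng.normal(size=(gh * gw, c))
out = two_stage_projector(
    x, gh, gw,
    rng.normal(size=(c, 12)), rng.normal(size=(12, c)),
    rng.normal(size=(4 * c, 16)), rng.normal(size=(16, 6)),
)
print(out.shape)  # 16 patches become 4 merged tokens
```

A single post-merge MLP, by contrast, would skip stage 1 and apply only the second call to the concatenated groups.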

There is also no post-layer norm and no window attention, only pre_layernorm.

Validation

The metrics below predate the build_vit change; I will retest.
Generated vision embeddings vs the HF reference on an identical sample image:

shape : 256 tokens x 6144 embd
overall cosine : 0.999949
per-token cosine: mean=0.999454 min=0.963887 (worst token 95)
relative L2 err : 0.010137
abs err : mean=0.03815 max=15.16844
(The high max-abs is most likely a single high-magnitude channel; cosine and relative-L2 are the embedding-level metrics.)
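For reference, metrics of this kind can be computed along the following lines. This is a sketch under the usual definitions (relative L2 as ‖test − ref‖ / ‖ref‖, cosine over flattened and per-token vectors); the exact definitions used by the comparison script are not shown in this PR, and the function name is hypothetical.

```python
import numpy as np

def embedding_parity(test, ref):
    """Compare two [n_tokens, n_embd] embedding matrices."""
    test = np.asarray(test, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    if test.shape != ref.shape:
        raise ValueError(f"shape mismatch: {test.shape} vs {ref.shape}")

    # Cosine over the whole matrix, treated as one long vector.
    overall = np.vdot(test, ref) / (np.linalg.norm(test) * np.linalg.norm(ref))
    # Cosine per token (row).
    per_token = np.sum(test * ref, axis=1) / (
        np.linalg.norm(test, axis=1) * np.linalg.norm(ref, axis=1)
    )
    diff = np.abs(test - ref)
    return {
        "overall_cosine": float(overall),
        "per_token_mean": float(per_token.mean()),
        "per_token_min": float(per_token.min()),
        "worst_token": int(per_token.argmin()),
        "relative_l2": float(np.linalg.norm(test - ref) / np.linalg.norm(ref)),
        "abs_mean": float(diff.mean()),
        "abs_max": float(diff.max()),
    }

rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 8))
noisy = ref + 1e-3 * rng.normal(size=ref.shape)
for name, value in embedding_parity(noisy, ref).items():
    print(f"{name:15s}: {value}")
```

Because max-abs is driven by the single largest element, one high-magnitude channel can dominate it while the cosine and relative-L2 figures stay close to ideal.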

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES.

AI assistance disclosure

AI assistance was used during development, but the code is not an unreviewed AI-generated code drop.

Scope of AI assistance:

Helped write and debug a small local Python comparison script that was used to dump MiniMax-M3-VL vision embeddings from the Hugging Face implementation for parity checks.
Helped review and organize explanations of the implementation while I was debugging.
Helped reason through code comments and possible wording.
The submitted llama.cpp implementation was not accepted blindly from AI output.
I can explain the implemented code paths and the MiniMax/HF behavior they are matching.

If a stricter or differently formatted disclosure is preferred, please specify the exact wording/fields expected.

Text-only port that re-uses existing components: MiniMax-M2 style GQA with per-head QK-norm and partial rotary, DeepSeek-V3 style leading-dense and routed/shared experts, and swigluoai activation. Sparse attention is not yet supported (dense fallback); vision tower and MTP heads are dropped.

ngxson

feels like > 90% of the code is AI-generated, do you really understand it?

ngxson · 2026-06-29T08:47:24Z

+// Apply MiniMax-M3 3D RoPE using host-precomputed cos/sin (filled in set_input).
+//   x        : [d_head, n_head, n_pos]
+//   rope_cos : [rope_dim, 1, n_pos]   (rope_dim = 3*axis_dim = 78, broadcasts over heads)
+// First rope_dim dims are rotated (HF block rotate_half); the tail passes through.
+ggml_tensor * clip_graph_minimax_m3::apply_rope(
+        ggml_tensor * x, ggml_tensor * rope_cos, ggml_tensor * rope_sin) {


this sounds like slop AI-generated code; unless you can explain with your own words how it's different from qwen-vl's mrope, I'm not convinced that we need a dedicated apply_rope here

As far as I can tell from the HF file, M3 isn't using Qwen's mrope. The existing ggml_rope_multi with ggml_rope_type_vision is a 4-section split using ggml's own channel pairing, while M3's HF rope is 3 axes (T/H/W) on a shared axis_dim, laid out as cat([f,f]) and rotated split-half (dim i paired with i+39), over the first 78 of the 80 dims.

Using the existing native op would mean permuting the q/k weights at conversion to match ggml's pairing.

Also, while T is the temporal axis and its coordinate is 0 (so it does nothing for still images), I believe it has to stay in the layout, or H/W will land on different channels than in HF.
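The layout described above can be illustrated with a small NumPy sketch. It follows the description in this thread (cat([f,f]), split-half pairing of dim i with i+39, first 78 of 80 dims), but the frequency schedule and the base of 10000 are assumptions, and the function name is made up.

```python
import numpy as np

D_HEAD, AXIS_DIM = 80, 26
HALF_AXIS = AXIS_DIM // 2        # 13 frequencies per axis
ROPE_DIM = 3 * AXIS_DIM          # 78 rotated dims
HALF = ROPE_DIM // 2             # 39: dim i pairs with dim i + 39
ROPE_BASE = 10000.0              # assumption, not taken from the model config
INV_FREQ = ROPE_BASE ** (-np.arange(0, AXIS_DIM, 2) / AXIS_DIM)

def rope_hf_3axis(x, pos):
    """x: [n_pos, d_head], pos: [n_pos, 3] holding (t, h, w) per token."""
    # f = [T freqs | H freqs | W freqs], 13 each -> 39 angles per token
    f = np.concatenate([pos[:, a:a + 1] * INV_FREQ for a in range(3)], axis=1)
    emb = np.concatenate([f, f], axis=1)                 # cat([f, f]) -> 78
    cos, sin = np.cos(emb), np.sin(emb)
    xr = x[:, :ROPE_DIM]
    rotated = np.concatenate([-xr[:, HALF:], xr[:, :HALF]], axis=1)  # rotate_half
    return np.concatenate([xr * cos + rotated * sin, x[:, ROPE_DIM:]], axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, D_HEAD))
out = rope_hf_3axis(x, np.array([[0, 3, 5]]))            # still image: t = 0

# T occupies channels 0..12 and 39..51; with t = 0 they pass through unchanged.
print(np.allclose(out[:, :HALF_AXIS], x[:, :HALF_AXIS]))
print(np.allclose(out[:, HALF:HALF + HALF_AXIS], x[:, HALF:HALF + HALF_AXIS]))
# H occupies channels 13..25 and 52..64 and is rotated.
print(np.allclose(out[:, HALF_AXIS:AXIS_DIM], x[:, HALF_AXIS:AXIS_DIM]))
```

Dropping the T band from the layout would shift H and W down by 13 channels in each half, so the rotation would hit different weight rows than the ones the HF checkpoint was trained with.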

I'm not convinced by your explanation.

I asked claude to explain the differences among 3 modes:

note that GGML_ROPE_TYPE_VISION is NOT 4 dim, only 2 first dims are actually used.

unless you can prove me wrong: sin/cos grids are NOT necessary. this rope impl can be done by splitting the tensor into different sections, then applying rope independently to each part. look at gemma4v.cpp and build_rope_2d

2 possible implementations (I haven't tried):

split tensor into 4 sections, use ggml_rope_ext for w and h parts, then ggml_concat back

split into 2 sections (t, h) and (w, p), then GGML_ROPE_TYPE_VISION on each part, then ggml_concat back

I think the best way is to permute q/k at conversion, which lets the runtime use rope_ext normally. The HF layout is [Ta Ha Wa | Tb Hb Wb | pad], which is technically a 3-section Neox RoPE, but each axis's paired channels are not contiguous. So that the runtime graph can use the native ggml_rope_ext, I permute the Q/K projection weights and biases into [Ta Tb | Ha Hb | Wa Wb | pad], which makes each T/H/W axis a contiguous 26-dim Neox block. Since the same permutation is applied to both q and k, it cancels in Q·K^T, so the attention output remains unchanged.

This way the graph can pass pos_t, pos_h, and pos_w directly to apply_rope(), slice Q/K into T/H/W/pad sections, apply ggml_rope_ext in Neox mode to each 26-dim section, and concatenate the result back.
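The equivalence claimed here can be checked numerically. The sketch below is a NumPy model, not the ggml code: it permutes head channels from [Ta Ha Wa | Tb Hb Wb | pad] to [Ta Tb | Ha Hb | Wa Wb | pad], applies a NeoX-style rotation to each 26-dim section, and compares the attention scores with the HF-style rotation on the original layout. The frequency schedule and base are assumptions, and permuting activation columns stands in for permuting the Q/K weight rows and biases at conversion.

```python
import numpy as np

D_HEAD, AXIS_DIM = 80, 26
HALF_AXIS, ROPE_DIM = AXIS_DIM // 2, 3 * AXIS_DIM
HALF = ROPE_DIM // 2
INV_FREQ = 10000.0 ** (-np.arange(0, AXIS_DIM, 2) / AXIS_DIM)  # assumed schedule

def rope_hf(x, pos):
    """HF-style: cat([f, f]) angles, dim i paired with dim i + 39."""
    f = np.concatenate([pos[:, a:a + 1] * INV_FREQ for a in range(3)], axis=1)
    emb = np.concatenate([f, f], axis=1)
    xr = x[:, :ROPE_DIM]
    rot = np.concatenate([-xr[:, HALF:], xr[:, :HALF]], axis=1)
    return np.concatenate([xr * np.cos(emb) + rot * np.sin(emb), x[:, ROPE_DIM:]], axis=1)

def rope_sections(x, pos):
    """Permuted layout: three contiguous 26-dim NeoX blocks, pad untouched."""
    out = x.copy()
    for a in range(3):
        lo, mid, hi = a * AXIS_DIM, a * AXIS_DIM + HALF_AXIS, (a + 1) * AXIS_DIM
        ang = pos[:, a:a + 1] * INV_FREQ
        x1, x2 = x[:, lo:mid], x[:, mid:hi]
        out[:, lo:mid] = x1 * np.cos(ang) - x2 * np.sin(ang)
        out[:, mid:hi] = x2 * np.cos(ang) + x1 * np.sin(ang)
    return out

def qk_permutation():
    """Index map: new channel j reads old channel perm[j]."""
    idx = []
    for a in range(3):
        idx += range(a * HALF_AXIS, (a + 1) * HALF_AXIS)                # Xa
        idx += range(HALF + a * HALF_AXIS, HALF + (a + 1) * HALF_AXIS)  # Xb
    idx += range(ROPE_DIM, D_HEAD)                                      # pad
    return np.array(idx)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(6, D_HEAD)), rng.normal(size=(6, D_HEAD))
pos = rng.integers(0, 16, size=(6, 3))
perm = qk_permutation()

scores_hf = rope_hf(q, pos) @ rope_hf(k, pos).T
scores_perm = rope_sections(q[:, perm], pos) @ rope_sections(k[:, perm], pos).T
print(np.allclose(scores_hf, scores_perm))
```

The scores match because a dot product is invariant under the same channel reordering on both operands, and the per-section rotation pairs exactly the channels that the HF layout pairs.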

Pushed it in 296f98b.

I also checked image understanding, including OCR and object positioning, and did not see any regressions versus the previous implementation.

Claude was used to help reason through the equivalence and draft the conversion side permutation code.

ngxson · 2026-06-29T08:58:10Z

A standard ViT backbone (separate biased q/k/v/o, LayerNorm, GELU MLP, full bidirectional attention, no mask, no windowing) that diverges from vanilla CLIP in four ways:

* **Conv3D patch embed, run as summed Conv2D slices.** The HF model uses a Conv3D patch embedding with `temporal_patch_size` slices; conversion splits the 5D weight into per-slice Conv2D kernels (`V_ENC_EMBD_PATCH` + `.weight.{t}`) and the graph sums the outputs. Exact for still images (video out of scope). No patch-embed bias (asserted).

* **Custom 3-axis (T/H/W) RoPE.** `axis_dim = 26`, `rope_dim = 3·26 = 78`, applied to the first 78 channels of each head with HF `rotate_half` semantics, tail passed through. Cos/sin are host-precomputed and fed as graph inputs (`minimax_cos`/`minimax_sin`). Since `rope_dim (78) < d_head (80)` this is partial rotary, the same pattern as the text tower but 3-axis.

* **2×2 spatial-merge token reduction.** Patches are reordered raster -> block (matching the HF flatten) and merged 2×2, so the projector consumes groups of 4. `spatial_merge_size` is emitted in conversion.

* **No class token, no absolute position table, no post-layernorm.** A `pre_layernorm` only; sinks / abs-pos / class-embedding all absent and asserted null.

@timkhronos did you even read what AI generates here? and ask yourself how wrong it is?

there is no "vanilla CLIP" in mtmd, this model is just qwen-vl with some subtle differences

ngxson · 2026-06-29T11:05:27Z

I refuse to proceed with this PR until you are honest about AI usage

timkhronos · 2026-06-29T12:45:05Z

@ngxson I expanded on the AI disclosure in the PR description, to cover precisely what AI was used to assist with.

timkhronos · 2026-07-01T14:04:23Z

For anyone testing, the latest changes require the mmproj file to be reconverted.

I'll upload one later, but for the time being you can convert a fresh one with this PR using:

python convert_hf_to_gguf.py MiniMaxAI/MiniMax-M3 --mmproj --remote --outtype bf16 --outfile Minimax-mmproj.gguf

danielhanchen and others added 30 commits June 22, 2026 15:34

MiniMax-M3 vision tower (mmproj + clip graph)

53c81dd

Delete m3_vision_ref.py

f07f1d4

Update clip.cpp

4a3206f

MSA

bbf1a80

Update constants.py

09657dd

Update minimax.py

8fe2e01

Cache creation. Working withotu flash attention

8c953a9

Added flash attention for sparse layers

ea6fbd6

Decomposed slow cpu OP into GPU + CPU ops. Massive speedup over long ctx

2e82759

Rewrote indexer op to be cuda native. Modified flash attention to mat…

0152226

…ch per group block picking

Implement sparse attention calc out of stock ops.

d1a04f7

Fix a cache allocation and cont issue

b1b174e

Fixed -fa auto crash, flagged debug spots

cea714a

Delete vocab.json

69c958d

Delete model.safetensors.index.json

afc09f1

Delete generation_config.json

f40d0f5

Delete Minimax directory

714bbe9

Handled multi stream case to fall back on Dense Attention

8136a9c

Development scaffolding cleanup. No functional change to the decode or

3ed9b18

4-way paths. Full debug harness remains at <8136a9c68ed7a5eb009aa67bba3fda8062f4648f> for reproducing the selection-parity validation.

Remove redundant comment from minimax-m3.cpp

79a6eec

Changed 3 Gelu Ops for vision into Gelu_erf ops

35990be

Assert that n_kv is multiple of 128

fa15850

Rename MSA index tensors to indexer convention

eae55e2

Note: All GGUFs generated before this change will need to be regenerated.

Fix incorrect Assert

2bb7eeb

Review driven changes (#3)

d6f9426

Remove comment from conversion minimax.py

0a7b2dc

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Remove whitespaces from constants.py

636143a

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Tighten comment in minimax.py

7b7ff65

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

inherit MiniMax-M3 from MiniMax-M2

1cd03ef

timkhronos and others added 8 commits June 28, 2026 21:41

Fix conversion error /gguf_writer.py

618e145

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Update gguf-py/gguf/gguf_writer.py

5a24782

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Update gguf-py/gguf/tensor_mapping.py

5dfe838

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Update conversion/minimax.py

b99a8d5

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Update conversion/minimax.py

56ba541

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Remove whitespace in src/llama-kv-cache.cpp

044391a

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Remove Whitespace in Update src/llama-model.h

c314faf

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Remove whitespace in src/llama-hparams.h

5015a6b

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

github-actions bot added the model (Model specific), testing (Everything test related), mtmd (Related to multimodal functionality (video/image/audio)), and conversion labels Jun 28, 2026

timkhronos mentioned this pull request Jun 28, 2026

Add MiniMax-M3 (MSA: MiniMax Sparse Attention) support #24908

Open

ngxson requested changes Jun 29, 2026

Update minimax_m3.cpp

768346a

Rewrite code comment based on feedback and to better reflect the actual architecture, and reuse existing build_vit

ngxson reviewed Jun 29, 2026

Comment thread tools/mtmd/models/minimax-m3.cpp

ngxson reviewed Jun 29, 2026

Comment thread tools/mtmd/clip.cpp Outdated

ngxson reviewed Jun 29, 2026

Comment thread tools/mtmd/clip.cpp Outdated

timkhronos added 7 commits June 29, 2026 17:33

Rename minimax_m3.cpp to minimax-m3.cpp

e819f32

Update CMakeLists.txt

492e357

Remove debug code from clip.cpp

0d64726

Update clip.cpp

98e1571

Update comments in tools/mtmd/models/minimax-m3.cpp

9693dc2

Merge branch 'master' into MSA-Vision

06833d2

Permute Q/K at conversion, drop precomputed sin/cos

296f98b

timkhronos wants to merge 50 commits into ggml-org:master from timkhronos:MSA-Vision
