Add MiniMax-M3 (MSA: MiniMax Sparse Attention) support by timkhronos · Pull Request #24908 · ggml-org/llama.cpp

timkhronos · 2026-06-22T15:04:43Z

Overview

Adds support for MiniMax-M3, a 60-layer / 128-expert MoE (3 dense + 57 MoE layers) using MiniMax Sparse Attention (MSA), a GQA-based block-sparse attention where a lightweight per-GQA-group indexer selects a small set of key/value blocks (top-k), and the main attention runs only over the selected blocks. This PR implements the architecture from the MSA paper arXiv:2606.13392v2 and follows the transformers reference (modular_minimax_m3_vl) for the indexer construction, selection, and RoPE.

Vision support moved to separate PR #25113

This implements MiniMax-M3 support: text tower (MSA attention + MoE) ~~and the vision tower / mmproj (CLIP-style ViT, Conv3d patch embed, 3D-RoPE, the two-stage patch-merge projector).~~ MTP heads are not currently present in the released model file.

Additional information

What MSA does (and why the code looks the way it does)

Per sparse layer, for each query and GQA group:

Index branch scores the causal context against a single shared index-key head via a small Q_idx·K_idxᵀ, max-pools scores into blocks ( with each block size being: block_size = 128), and selects top_k = 16 blocks. The local (query's own) block is always force-included.
Main branch attends only over the tokens in the selected blocks, which is a fixed budget of k·B_k = 16·128 = 2048 keys per query, independent of context length, meaning decode attention cost stops scaling with context.

This PR implements two execution paths, which are selected at runtime:

MSA Decode path

(build_attn_msa_decode, single-token batches ONLY): top-k -> gather the 2048 selected KV -> flash attention. This is the fast path, which only attends over the tokens in the selected blocks.

4-way / prefill path

(build_attn_msa_4way, single or multi-token batches): builds a per-group block mask and runs masked flash attention over the context. By default, only used for prefill (n_tokens > 1) and as the simpler, more directly-auditable reference path.

Both are per-GQA-group (4 groups, one index selection each, shared across the group's 16 query heads), matching the reference.

Correctness

vs. paper + HF reference: indexer projections/norms, partial-RoPE on the index heads, block max-pool, per-group top-k, forced-local-block, and the 1/√d omission (order-preserving, irrelevant to top-k) all follow the reference. The +1 Gemma-RMSNorm baking is applied to the indexer norms during conversion (index_{q,k}_norm); dumped converted weights show: (means ≈ 1.2, indistinguishable from attn_q_norm).

Selection parity, MSA decode vs. 4-way: SELDUMP(dumps the selected blocks for every index head) was ran on the first decoded token at a fixed prefix, and shows the two selectors agree except for occasional single-block flips at the rank-16 boundary, which are TF32-vs-f32 score-precision ties only appearing near the marginal, lowest-mass blocks. Forced blocks (sink, local) and high-mass blocks are always identical.

Debugs (MSA_DUMP_SEL, MSA_VERIFY_BS4, MSA_BYPASS, MSA_FORCE_4WAY) are left in, tagged with DEBUGREMOVE, I'll strip them before merge on request.

Index KV-cache (notable implementation detail)

The indexer needs its own per-layer key cache (single head, index_head_size = 128). It is allocated on the same backend buffer as the main K/V cache, not on a host buffer. An earlier host-resident version caused a full [d_idx, n_kv] host->device copy of the index cache on every sparse layer, every decode step, which dominated long-context decode (a ~26 ms/token slope at 30k that grew with context). Keeping it co-resident with K/V eliminates that.

However this introduced a separate issue. The cache is a strided view into the ring buffer, and an M=4 (index-heads) matmul against a strided operand falls off the fast path. The solution I found, is to have the index-cache view fed to the scoring mul_mat be forced contiguous (ggml_cont) before the matmul. Without it, long-context decode regresses badly even with the cache on-GPU. That cont is load bearing.

Flash-attention requirement

The MSA in this PR requires flash attention. Both build_attn_msa_decode and build_attn_msa_4way call ggml_flash_attn_ext directly and assume the non-transposed V layout that llama.cpp only provides when FA is enabled (v_trans = !cparams.flash_attn). There is intentionally no full-attention-materialization fallback, as the only environment that can't run flash_attn_ext is CPU-class, and materializing full attention scores 57×/token over long context, would perform catastrophically there. Dense fallback is strictly better behavior in that case.

The PR gates MSA on resolved cparams.flash_attn:

FA on -> MSA (decode for single-token, 4way for prefill).
FA off -> dense build_attn (no sparsity) + one-time warning that output is degraded.
MSA_BYPASS forces dense; MSA_FORCE_4WAY forces the 4way path at decode (debug flags, will be stripped before merging).

The FA nodes are named with the LLAMA_TENSOR_NAME_FATTN convention so the --flash-attn auto resolver in sched_reserve recognizes them (it identifies FA nodes by name to check their device assignment). Without the naming, server auto aborts at the FATTN-name assert before ever reaching the gate.

Auto-FA resolution is verified on CUDA only. The gate keys on resolved cparams.flash_attn as mentioned earlier, so other backends should behave correctly where they support flash_attn_ext for head-dim 128 (falling back to dense otherwise), however ROCm/HIP, Vulkan, and Metal are untested. Reports from those backends welcome.

Limitations/scope

Single stream only (n_seq_max == 1 / -np 1). The MSA paths requires single-stream; multi-sequence batching is not yet handled, and falls back to dense attention.
Block selection is anchored to absolute KV slots, like the HF reference, which is correct for single-sequence, right-aligned decode, and has not been validated for left-padding / mixed-sequence layouts.
FA required (above); falls back to dense otherwise.
While decode performance speed should approach the official implementation, prefill will not.
Unquantized KV cache only: Everything was tested with f16 K/V. The MSA decode path gathers selected KV via get_rows and feeds flash_attn_ext; quantized cache types (-ctk/-ctv q8_0 etc.) are untested and not supported. The indexer's own key cache is f32, regardless of the main cache type (quantizing the selector would degrade block selection).

Quantization

Quants should keep indexer projections (index_{q,k}_proj) at fp32, they drive block selection, so quant error there changes which blocks are read (a discrete retrieval loss), unlike a value projection. Strongly recommend fp32 for those even in otherwise-aggressive quants. The weights are bf16, but fp16 conversion might introduce issues. The difference is about 1GB, and these are genuinely some of the highest value tensors in the model.

Performance

At the tested config (decode bounded by CPU-offloaded experts), MSA decode tracks the dense baseline (~8 tok/s at 20–30k), and prefill benefits slightly from the reduced sparse-attention FLOPs (~400+ tok/s). On a fully GPU-resident deployment (experts on device), MSA decode should approach the official implementation's decode characteristics by construction, both compute attention over only the fixed 2048 selected KV per query rather than the full cache. This is an architectural expectation, not a measured result; the tested config here is expert-offload-bound and only demonstrates parity with the dense baseline.

Conversion

convert_hf_to_gguf.py support for MiniMaxM3SparseForCausalLM: merges routed experts, bakes Gemma (1+w) norms (incl. indexer norms), emits indexer hparams (head_count, key_length, top_k, block_size, local_blocks) and leading_dense_block_count (derived from the leading zeros of moe_layer_freq).
~~Further added support for MiniMaxM3VisionModel.~~ Follor up PR#25113

How to test

All tests assume a GGUF of M3 specifically converted with this branch. Weights with the indexers stripped (such as those by Unsloth) will NOT work. All tests were ran against https://huggingface.co/avar6/minimax-m3-MSA-gguf , which keeps the Indexer weights. (It does incorrectly quantize the non norm indexer tensors to Q4. This is due to the GGUFs being made early on in the implementation process.) Also assumed CUDA build (-DGGML_CUDA=ON), but should work on other FA compatible backends.

Currently the existing GGUFs no longer work due to a tensor rename. New ones will have to be reconverted. Meanwhile the previous branch available over here is still able to load the Avar6 GGUFs in case anyone wants to experiment with the PR without converting their own GGUFs.

Performance can be compared directly with the dense attention. To test speed. feed a 4k and a 30k-token prompt and generate; decode throughput should stay roughly flat (the per-token attention budget is fixed at 2048 KV regardless of context). Compare against MSA_BYPASS=1 (dense) to see dense's KV-read grow while MSA stays fixed.

Debug variables:

MSA_DUMP_SEL=<token_position> # dumps per-group selected blocks for the specified token
MSA_FORCE_4WAY=1 # forces the 4way selector at decode (A/B vs decode path)
MSA_BYPASS=1 # dense attention, no MSA
Decode-path and 4way-path selections should match except for occasional single-block flips at the top-k boundary (score-precision ties). These env probes are tagged DEBUGREMOVE and will be stripped before merge.

Multimodal

Multimodal support has been moved into a separate PR upon maintained request. PR #25113

Vision tower (mmproj)

A CLIP-style ViT, separate biased q/k/v/o projections, LayerNorm, GELU MLP, full bidirectional attention (no mask, no windowing). It differs from a vanilla CLIP encoder in four specific ways, all handled:

Conv3D patch embed, run as summed Conv2D temporal slices. The HF model uses a Conv3D patch embedding with temporal_patch_size slices; conversion splits the 5D Conv3D weight into per-slice Conv2D kernels (V_ENC_EMBD_PATCH + .weight.{t}), and the graph sums the Conv2D outputs. For still images this is exact (video is out of scope here, see below). No patch-embed bias (asserted).
Custom 3D (T/H/W) RoPE. Position encoding is a 3-axis RoPE: axis_dim = 2·((2·(d_head/2)/3)/2) = 26, rope_dim = 3·axis_dim = 78, applied to the first rope_dim channels of each head with HF rotate_half semantics, tail passed through. cos/sin are host-precomputed and fed as graph inputs (minimax_cos/minimax_sin). Because rope_dim (78) < d_head (80), this is a partial rotary, the same partial-RoPE pattern as the text tower, just 3-axis.
2×2 spatial-merge token reduction. Patches are reordered raster->block (matching the HF flatten) and merged 2×2, so the projector consumes groups of 4 patches. spatial_merge_size is emitted in conversion.
No class token, no absolute position table, no pre/post-layernorm asymmetry beyond a pre_layernorm and no post_layernorm (both asserted). Sinks/abs-pos/class-embedding are all absent and asserted null.

Projector is two on-disk modules: a per-patch MLP (multi_modal_projector.linear_{1,2} -> mm.{1,2}) applied first, then the 2×2 group-of-4 concatenation, then a merge MLP (patch_merge_mlp.linear_{1,2} -> mm.merge.fc{1,2}), both GELU. Output is [proj_dim, n_pos/4] into the text embedding space.

Current vision limitation:

still images only (n_batch == 1). The Conv3D-as-summed-Conv2D path is image-correct; video (multiple temporal frames) is not yet wired and is out of scope for this PR. If preferred I could fold it in, however I believe it might be best left as a separate PR.

Vision testing

~~Requires a gguf mmproj. One can be found here: https://huggingface.co/Serpen/Minimax_M3_MMPROJ_GGUF/blob/main/MiniMaxAI-MiniMax-M3-bf16.gguf~~

Credits

The initial MiniMax-M3 scaffolding was based on work by @danielhanchen. The sparse-attention (MSA) indexer, the decode and 4-way paths, the KV-cache index handling, FA gating, and the vision tower in this PR are new work.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, Claude Opus was used to help write tests, and was used to help debug a speed related issue concerning the allocation of the indexer cache.

Text-only port that re-uses existing components: MiniMax-M2 style GQA with per-head QK-norm and partial rotary, DeepSeek-V3 style leading-dense and routed/shared experts, and swigluoai activation. Sparse attention is not yet supported (dense fallback); vision tower and MTP heads are dropped.

…ch per group block picking

ggml-gh-bot · 2026-06-22T15:09:31Z

Hi @timkhronos, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

timkhronos · 2026-06-22T15:20:20Z

Thanks. This is full support for a new model architecture (MiniMax-M3), so the core: arch definition, graph, KV-cache handling, indexer and conversion, has to land together to be functional; those aren't independently mergeable. The vision tower / mmproj is the one separable piece, but it's a small, self-contained addition (a standard CLIP-style ViT, the only model-specific parts are the patch embed, 3D-RoPE, and patch-merge projector), so I'd prefer to keep it in for complete model support in one pass rather than split it out. Of course, if a maintainer would still rather see vision as a follow-up, I'll split it.

I am happy to open a tracking issue / RFC for the architecture now if you'd like the design discussed before review, though the PR description already covers the design, correctness validation, and limitations in detail. I went straight to the PR since this began as a vision-tower change and grew into full MSA support once I traced the long-context degradation past ~6k to the sparse-attention path, but I'm glad to write up an issue retroactively if that's the preferred process.

4-way paths. Full debug harness remains at <8136a9c68ed7a5eb009aa67bba3fda8062f4648f> for reproducing the selection-parity validation.

…ollow value naming convention

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

CISC

Ok, this should be clean enough to proceed with reviewing MSA (@fairydreaming do you want to take a look?) and mmproj (unless @ngxson want that in a separate PR?).

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

usrlocalben · 2026-06-28T21:01:03Z

What should perf expectations be when testing? I tried the earlier commit w/Avar6 gguf and perf is poor: TG=6t/s and PP seems even slower (!?) R6KP(attn)+EPYC(moe) . I don't see how this could be anyone's "daily driver" as described above by @Ganju0

Ganju0 · 2026-06-28T21:11:23Z

What should perf expectations be when testing? I tried the earlier commit w/Avar6 gguf and perf is poor: TG=6t/s and PP seems even slower (!?) R6KP(attn)+EPYC(moe) . I don't see how this could be anyone's "daily driver" as described above by @Ganju0

Well I dont drive very fast. 6t/s is what I was getting

timkhronos · 2026-06-28T21:15:41Z

@usrlocalben with the Q4 gguf from Avar's repo, and the majority of model in sys ram, I have been getting 8 t/s for decode, and 450 t/s for prefill with -ngl 999 -ot "blk\.(0|1|2|3|4)\.ffn_.*=CUDA0" -ot "blk\..*_exps\.=CPU"

ngxson · 2026-06-28T21:22:05Z

Please split the multimodal changes into a dedicated PR

…arate PR

timkhronos · 2026-06-28T23:16:35Z

@ngxson I have split out the Vision support as you requested. It's draft since it needs Minimax-M3 support, and the diff carries that base until it merges, but the vision-only changes are in this compare link.

fairydreaming · 2026-06-29T06:22:08Z

Ok, this should be clean enough to proceed with reviewing MSA (@fairydreaming do you want to take a look?)

@CISC Sorry, I don't have this model downloaded and examined in detail yet, just skimmed over the paper so I'm not qualified.

ngxson · 2026-06-29T09:09:32Z

AI usage disclosure: YES, Claude Opus was used to help write tests, and was used to help debug a speed related issue concerning the allocation of the indexer cache.

Large part of the code is clearly AI-written. Please be honest about AI usage.

Proof: you manually modified AI-generated comments to make it human-like: 2e82759

(To be clear: I'm ok with using AI and I myself use it from time to time, however, dishonesty is not something I will tolerate)

CISC · 2026-06-30T14:04:22Z

The code could not load MiniMax-M3 model https://huggingface.co/avar6/minimax-m3-MSA-gguf/tree/main/Q4_K_M

Explained in #24908 (comment) and OP.

timkhronos · 2026-07-01T19:48:56Z

For anyone wishing to test, a new Q4 MOE optimized GGUF (~265GB) with the new correct tensor names, and the indexer tensors preserved at F32 is available here (Graciously provided by @AesSedai )

danielhanchen and others added 18 commits June 22, 2026 15:34

MiniMax-M3 vision tower (mmproj + clip graph)

53c81dd

Delete m3_vision_ref.py

f07f1d4

Update clip.cpp

4a3206f

MSA

bbf1a80

Update constants.py

09657dd

Update minimax.py

8fe2e01

Cache creation. Working withotu flash attention

8c953a9

Added flash attention for sparse layers

ea6fbd6

Decomposed slow cpu OP into GPU + CPU ops. Massive speedup over long ctx

2e82759

Rewrote indexer op to be cuda native. Modified flash attention to mat…

0152226

…ch per group block picking

Implement sparse attention calc out of stock ops.

d1a04f7

Fix a cache allocation and cont issue

b1b174e

Fixed -fa auto crash, flagged debug spots

cea714a

Delete vocab.json

69c958d

Delete model.safetensors.index.json

afc09f1

Delete generation_config.json

f40d0f5

Delete Minimax directory

714bbe9

timkhronos requested review from a team, CISC, JohannesGaessler and ggerganov as code owners June 22, 2026 15:04

github-actions Bot added model Model specific testing Everything test related examples python python script changes labels Jun 22, 2026

timkhronos added 2 commits June 23, 2026 02:15

Handled multi stream case to fall back on Dense Attention

8136a9c

Development scaffolding cleanup. No functional change to the decode or

3ed9b18

4-way paths. Full debug harness remains at <8136a9c68ed7a5eb009aa67bba3fda8062f4648f> for reproducing the selection-parity validation.

Reuse LLM_FFN_SWIGLU_OAI_MOE

d54b4eb

CISC reviewed Jun 28, 2026

View reviewed changes

Comment thread gguf-py/gguf/gguf_writer.py Outdated

Remove duplicate indexer setters, add only block_size/local_blocks, f…

f57efce

…ollow value naming convention

CISC reviewed Jun 28, 2026

View reviewed changes

Comment thread gguf-py/gguf/gguf_writer.py Outdated

Comment thread gguf-py/gguf/gguf_writer.py Outdated

Comment thread gguf-py/gguf/tensor_mapping.py Outdated

Comment thread conversion/minimax.py Outdated

Comment thread conversion/minimax.py Outdated

timkhronos and others added 5 commits June 28, 2026 21:41

Fix conversion error /gguf_writer.py

618e145

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Update gguf-py/gguf/gguf_writer.py

5a24782

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Update gguf-py/gguf/tensor_mapping.py

5dfe838

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Update conversion/minimax.py

b99a8d5

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Update conversion/minimax.py

56ba541

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

CISC reviewed Jun 28, 2026

View reviewed changes

Comment thread src/llama-hparams.h Outdated

Comment thread src/llama-kv-cache.cpp Outdated

Comment thread src/llama-model.h Outdated

timkhronos and others added 3 commits June 28, 2026 22:18

Remove whitespace in src/llama-kv-cache.cpp

044391a

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Remove Whitespace in Update src/llama-model.h

c314faf

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Remove whitespace in src/llama-hparams.h

5015a6b

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

remove multimodal code upon maintainer request. Will be made as a sep…

c97a2ac

…arate PR

timkhronos mentioned this pull request Jun 28, 2026

Add Vision Support for Minimax-M3 #25113

Draft

Whitespace clean in tensor_mapping.py

fcb9ed2

CISC removed the request for review from a team June 29, 2026 06:37

CISC removed the mtmd Related to multimodal functionality (video/image/audio) label Jun 29, 2026

This comment was marked as off-topic.

Sign in to view

timkhronos added 2 commits June 30, 2026 22:37

Merge branch 'master' into MSA

51e7bf8

Merge branch 'ggml-org:master' into MSA

6b14fea

Uh oh!

Conversation

timkhronos commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Additional information

What MSA does (and why the code looks the way it does)

MSA Decode path

4-way / prefill path

Correctness

Index KV-cache (notable implementation detail)

Flash-attention requirement

The PR gates MSA on resolved cparams.flash_attn:

Limitations/scope

Quantization

Performance

Conversion

How to test

Debug variables:

Multimodal

Vision tower (mmproj)

Current vision limitation:

Vision testing

Credits

Requirements

Uh oh!

ggml-gh-bot Bot commented Jun 22, 2026

Uh oh!

timkhronos commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CISC left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

usrlocalben commented Jun 28, 2026

Uh oh!

Ganju0 commented Jun 28, 2026

Uh oh!

timkhronos commented Jun 28, 2026

Uh oh!

ngxson commented Jun 28, 2026

Uh oh!

timkhronos commented Jun 28, 2026

Uh oh!

fairydreaming commented Jun 29, 2026

Uh oh!

ngxson commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as off-topic.

CISC commented Jun 30, 2026

Uh oh!

timkhronos commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants

timkhronos commented Jun 22, 2026 •

edited

Loading

timkhronos commented Jun 22, 2026 •

edited

Loading

ngxson commented Jun 29, 2026 •

edited

Loading