[NV] Add MiniMax M3 B300 Dynamo vLLM recipes with performance image#1890

Open
Oseltamivir wants to merge 2 commits into main from update/minimax-m3-b300-perf-image


@Oseltamivir Oseltamivir commented Jun 23, 2026

Collaborator

Summary

  • Reproduce the MiniMax-M3 MXFP8 B300 Dynamo-vLLM configuration, 16 srt-slurm recipes, runtime fixes, and B300 launcher integration from PR #1863 ([NV] Add MiniMax M3 B300 Dynamo vLLM recipes) on current main.
  • Keep the topology, concurrency, parallelism, CUDA graph, KV-transfer, colocation, and node-exclusion settings unchanged.
  • Use vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-7a67223 in the master config and every recipe.
  • Retain the MSA top-k contiguity runtime fix, but do not reapply the NIXL heterogeneous-TP patch because vLLM commit 7a67223 already includes vLLM #45879.
  • The Docker Hub manifest for this image is active and targets linux/amd64.

Validation

  • Generated all 16 matching B300 sweep entries with the requested image.
  • Confirmed all recipe containers match the master configuration.
  • Tested the MSA setup script twice against vLLM 7a67223 source to verify patching and idempotence.
  • bash -n runners/launch_b300-nv.sh
  • bash -n benchmarks/multi_node/srt-slurm-recipes/configs/minimax-m3-vllm-fixes.sh
  • python3 utils/validate_perf_changelog.py --base-ref origin/main --head-ref HEAD
  • uv run --with pytest --with pydantic --with pyyaml python -m pytest utils/matrix_logic/ utils/changelog_gate_tests/test_validate_perf_changelog.py -q (200 passed)
  • git diff --check origin/main...HEAD
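
The "all recipe containers match the master configuration" check above could be scripted roughly as follows. This is a minimal sketch, not the check actually used in this PR: the function name `find_image_mismatches` is hypothetical, and it assumes each recipe YAML mentions the image string literally somewhere in the file.

```shell
# List recipe YAMLs under a directory that do NOT mention the expected image.
# Hypothetical helper: assumes the image tag appears verbatim in each recipe.
find_image_mismatches() {
  local recipe_dir="$1" image="$2" mismatches
  # grep -L prints names of files without a match; -F treats the image as a literal.
  mismatches=$(find "$recipe_dir" -type f -name '*.yaml' \
    -exec grep -LF -- "$image" {} + | sort)
  if [ -n "$mismatches" ]; then
    printf '%s\n' "$mismatches"
    return 1
  fi
  return 0
}
```

Called as `find_image_mismatches benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3 "$IMAGE"`, it prints nothing and returns 0 when every recipe references the image, and prints the offending paths otherwise.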

Note

Low Risk
Benchmark and launcher configuration only; no production service code. Main operational risk is Slurm job misconfiguration or failed setup-script patching on cluster runs.

Overview
Adds minimaxm3-fp8-b300-dynamo-vllm to the NVIDIA master matrix with multinode disaggregated fixed-seq-len sweeps for 1k/1k and 8k/1k, wired to 16 local srt-slurm recipe YAMLs (DEP2 prefill, TEP8 / DEP8 / DEP4 / TP4+Marlin decode topologies).

launch_b300-nv.sh gains a minimaxm3 dynamo-vLLM path: overlay recipes/vllm/minimax-m3, pin sa-submission-q2-2026, apply the NVIDIA/srt-slurm#38 node-IP fix, run minimax-m3-vllm-fixes.sh via srtctl --setup-script, and inject Slurm exclude: b300-018 (overridable via env) with a post-submit sanity check on the rendered sbatch script.
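
For illustration, the env-overridable exclude and the post-submit sanity check described above might look like the sketch below. The variable name `MINIMAX_M3_EXCLUDE_NODE` and the function `sbatch_excludes_node` are assumptions, not names taken from the launcher, and the check assumes the rendered script uses the `#SBATCH --exclude=<nodelist>` form.

```shell
# Hypothetical env var name; falls back to the node named in this PR.
EXCLUDE_NODE="${MINIMAX_M3_EXCLUDE_NODE:-b300-018}"

# Return 0 if the rendered sbatch script excludes the given node.
# Assumes a comma-separated `#SBATCH --exclude=...` directive.
sbatch_excludes_node() {
  local script="$1" node="$2"
  # Match the node as a whole list entry, so b300-018 does not match b300-0180.
  grep -Eq "^#SBATCH[[:space:]]+--exclude=([^[:space:]]*,)?${node}(,|[[:space:]]|\$)" "$script"
}
```

A launcher could then run `sbatch_excludes_node "$rendered_script" "$EXCLUDE_NODE" || exit 1` after submission to catch a dropped exclude early.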

The runtime fix script is narrowed to a single idempotent MSA prefill_topk.contiguous() patch; obsolete NIXL string patches are dropped for image vllm-minimax-m3-perf-x86_64-13.0.1-7a67223 (upstream #45879). KLAUD_DEBUG.md documents that failure mode. perf-changelog.yaml records the new config key and operational notes (colocated TP4 + CUDA IPC, Marlin decode variants).
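
As a rough sketch of what an idempotent patch of this kind can look like (not the contents of minimax-m3-vllm-fixes.sh): the function name `patch_prefill_topk` and the shape of the target line, a bare `= prefill_topk` assignment, are hypothetical. The guard on the already-patched form is what makes a second run a no-op.

```shell
# Append .contiguous() to a bare `= prefill_topk` assignment, at most once.
# The target line shape is hypothetical; the real patch may match differently.
patch_prefill_topk() {
  local file="$1"
  # Already patched: return early so re-running changes nothing.
  if grep -qF 'prefill_topk.contiguous()' "$file"; then
    return 0
  fi
  # Fail loudly if the expected source line is missing.
  if ! grep -Eq '= prefill_topk$' "$file"; then
    echo "patch target not found in $file" >&2
    return 1
  fi
  # -i.bak works with both GNU and BSD sed; drop the backup on success.
  sed -i.bak 's/= prefill_topk$/= prefill_topk.contiguous()/' "$file" && rm -f "$file.bak"
}
```

Running it twice against the same file leaves a single `.contiguous()` call, which matches the "tested twice to verify patching and idempotence" step in the validation list.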

Reviewed by Cursor Bugbot for commit 7903b1f. Bugbot is set up for automated code reviews on this repo.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.



@Oseltamivir Oseltamivir force-pushed the update/minimax-m3-b300-perf-image branch from 832e0d8 to 03ce791 on June 23, 2026 at 02:47

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 03ce791. Configure here.

Comment thread runners/launch_b300-nv.sh
cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3" recipes/vllm/minimax-m3
SRTCTL_SETUP_SCRIPT="minimax-m3-vllm-fixes.sh"
# NVIDIA/srt-slurm#38
git show 22d46ba9971615016d2339c9ffbc7b4597accfad --format= -- src/srtctl/core/ip_utils/get_node_ip.sh | git apply - || exit 1

Hard fail on git apply

Medium Severity

The MiniMax M3 clone path pipes the get_node_ip.sh backport through git apply and exits the whole launcher on any apply failure. Once sa-submission-q2-2026 already contains that change (or the file diverges), apply fails and no srtctl job is submitted even though the fix is already present.
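
One possible way to address this, sketched under assumptions: save the backport to a file first, then use `git apply --reverse --check` to detect that the change is already present before attempting a forward apply. The function name `apply_patch_if_needed` is hypothetical and this is not the fix adopted in the PR.

```shell
# Apply a patch only if it is not already present in the working tree.
# Hypothetical helper; expects the patch saved to a file rather than piped.
apply_patch_if_needed() {
  local patch_file="$1"
  # If the patch reverses cleanly, its changes are already in place.
  if git apply --reverse --check "$patch_file" 2>/dev/null; then
    echo "already applied"
    return 0
  fi
  # Otherwise apply it, but only when it applies cleanly.
  if git apply --check "$patch_file" 2>/dev/null; then
    git apply "$patch_file" && echo "applied"
    return
  fi
  echo "patch neither applies nor is already present" >&2
  return 1
}
```

In the launcher this could be fed by writing the `git show 22d46ba9971615016d2339c9ffbc7b4597accfad --format= -- src/srtctl/core/ip_utils/get_node_ip.sh` output to a temporary file. A genuinely diverged file still fails, which seems preferable to submitting a job without the node-IP fix.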
