[AMD] Add MiniMax-M3-FP8 MI355X ATOMMESH #1865

Open
seungrokj wants to merge 35 commits into main from amd/atom_mesh_0619_m3_fp8

seungrokj (Collaborator) commented Jun 20, 2026

Summary

  • Add minimaxm3-fp8-mi355x-atom-disagg CI recipe: multi-node disaggregated prefill-decode on MI355X via ATOM for MiniMax-M3-MXFP8
  • Align server settings with slurm reference script: MEM_FRAC_STATIC=0.8, MAX_NUM_SEQS=128, BLOCK_SIZE=128, MAX_MODEL_LEN=32768, KV_CACHE_DTYPE=auto
  • server_atom.sh: fix _MAX_CONC assignment before cudagraph size check; gate ATOM_MOE_GU_ITLV and AITER_BF16_FP8_MOE_BOUND on DeepSeek-V4-Pro only; use ${KV_CACHE_DTYPE:-fp8} default
  • Search space: ISL=8192 and ISL=1024, 1P1D TP4, conc 1–512

Test plan

  • CI sweep on mi355x-disagg runner triggers correctly
  • --kv-cache-dtype is not passed when KV_CACHE_DTYPE=auto
  • Decode node cudagraph sizes scale with max concurrency
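The second test-plan item can be illustrated with a minimal sketch. It assumes the flag is spelled `--kv-cache-dtype` as in the test plan; `build_kv_cache_args` is a hypothetical helper name, not a function from `server_atom.sh`, and the actual script may structure this differently.

```shell
#!/usr/bin/env bash
# Sketch only: build_kv_cache_args is a hypothetical helper, not the code in
# server_atom.sh. The fp8 fallback mirrors the ${KV_CACHE_DTYPE:-fp8} default
# mentioned in the summary.
build_kv_cache_args() {
    local dtype="${KV_CACHE_DTYPE:-fp8}"
    # "auto" means the server picks the dtype itself, so emit no flag at all.
    if [ "$dtype" = "auto" ]; then
        return 0
    fi
    printf '%s %s\n' "--kv-cache-dtype" "$dtype"
}

KV_CACHE_DTYPE=auto build_kv_cache_args   # prints nothing
KV_CACHE_DTYPE=fp8  build_kv_cache_args   # prints: --kv-cache-dtype fp8
```

Returning an empty string for `auto` lets the caller splice the result into the server command line without a separate branch.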

🤖 Generated with Claude Code


Note

Medium Risk
Touches shared multi-node ATOM launch paths (server_atom.sh, Slurm env) used by other disagg recipes; mis-gated env flags or server CLI changes could affect DeepSeek-V4-Pro runs, though changes are mostly additive with model-specific guards.

Overview
Adds minimaxm3-fp8-mi355x-atom-disagg to AMD CI: multi-node 1P1D TP4 prefill/decode on MI355X via ATOM + atomesh, sweeping 1k/1k and 8k/1k at conc 1–512, with a new launcher minimaxm3_fp8_mi355x_atom-disagg.sh that sets reference tuning (MEM_FRAC_STATIC=0.8, BLOCK_SIZE=128, MAX_MODEL_LEN=32768, KV_CACHE_DTYPE=auto, no MTP).

ATOM disagg plumbing is generalized in server_atom.sh, job.slurm, env_atom.sh, and bench.sh: MEM_FRACTION → MEM_FRAC_STATIC, optional MAX_MODEL_LEN / MAX_NUM_BATCHED_TOKENS, --kv_cache_dtype skipped when the dtype is auto, MTP/spec args and model-specific flags (DeepSeek-V4-Pro TBO/HF overrides vs. AITER_QUICK_REDUCE_QUANTIZATION=INT4 for other models), BENCH_REQUEST_RATE passed into containers, and models_atom.yaml entries for MiniMax-M3 MXFP4/MXFP8.

Reviewed by Cursor Bugbot for commit 4beb48d.

@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get PR approval from their respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


seungrokj changed the title from "feat: MiniMax-M3 MXFP8 MI355X ATOM disaggregated PD benchmark" to "[AMD] Add MiniMax-M3-FP8 MI355X ATOMMESH" on Jun 20, 2026
Comment thread benchmarks/multi_node/minimaxm3_fp4_mi355x_atom-disagg.sh

functionstackx (Collaborator) left a comment

will review it in a bit but seems like we need to merge vllm disagg first?

i thought we chatted about this before, about sglang/vllm native engine first, back on april 17 & your thumbs up means u acknowledge the guidelines

#1043 (comment)

Comment on lines +2895 to +2907
prefill:
  num-worker: 1
  tp: 4
  ep: 1
  dp-attn: false
  additional-settings:
    - "PREFILL_NODES=1"
decode:
  num-worker: 1
  tp: 4
  ep: 1
  dp-attn: false
  additional-settings:

Collaborator

@seungrokj quick question out of curiosity: for TP4+TP4, is this over XGMI or RDMA?

Collaborator Author

@functionstackx this is over RDMA across 2 nodes.

Collaborator

@seungrokj thanks for your insight! this is with the mooncake kvcache transfer engine right?

Comment thread benchmarks/multi_node/amd_utils/server_atom.sh

cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit cb168e4.

Comment thread .github/configs/amd-master.yaml

Comment thread benchmarks/multi_node/minimaxm3_fp4_mi355x_atom-disagg.sh
functionstackx added the all-evals label ("Expand eval selection to every fixed-sequence config") on Jun 21, 2026
Oseltamivir added the evals-only label ("Suppress throughput and run only eval jobs; combine with all-evals to expand selection") on Jun 21, 2026
seungrokj and others added 15 commits June 23, 2026 10:36
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…isagg.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r default)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e --enable-tbo for non-DSv4 models

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…YPE default to empty for minimaxm3 disagg

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ode node

- Change runner from mi355x to mi355x-disagg in amd-master.yaml for minimaxm3-fp4 disagg
- Add dynamic CUDAGRAPH_SIZES selection in server_atom.sh based on max concurrency thresholds (512/1024/2048)
- Pass --cudagraph-capture-sizes to decode node server args

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
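
The threshold-based selection described in this commit could look roughly like the sketch below. Only the 512/1024/2048 thresholds come from the commit message. The function name `pick_cudagraph_sizes`, the direction of the comparisons, the power-of-two size list, and the comma-separated output format are all illustrative assumptions, not the contents of `server_atom.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a cudagraph capture-size list from the maximum
# benchmark concurrency. The thresholds (512/1024/2048) are from the commit
# message; the size lists themselves are assumptions made for illustration.
pick_cudagraph_sizes() {
    local max_conc="${1:?usage: pick_cudagraph_sizes MAX_CONC}"
    local cap sizes n

    # Pick the smallest threshold that covers the requested concurrency.
    if [ "$max_conc" -le 512 ]; then
        cap=512
    elif [ "$max_conc" -le 1024 ]; then
        cap=1024
    else
        cap=2048
    fi

    # Assumed size list: powers of two from 1 up to the chosen cap.
    sizes=1
    n=2
    while [ "$n" -le "$cap" ]; do
        sizes="${sizes},${n}"
        n=$((n * 2))
    done
    echo "$sizes"
}

pick_cudagraph_sizes 512   # prints: 1,2,4,8,16,32,64,128,256,512
```

The concurrency value has to be assigned before this check runs, which is the ordering problem a later commit in this PR fixes for `_MAX_CONC`.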
…4-Pro only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use ${KV_CACHE_DTYPE-fp8} so empty string (set by minimaxm3 script) is
left as-is, avoiding unintended --kv-cache-dtype pass-through.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dtype flag

Set KV_CACHE_DTYPE to auto in minimaxm3_fp4_mi355x_atom-disagg.sh and
revert server_atom.sh to use :- expansion (auto is explicitly excluded
from KV_CACHE_ARG in server_atom.sh, so the flag is not passed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
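
The two commits above switch between `${KV_CACHE_DTYPE-fp8}` and `${KV_CACHE_DTYPE:-fp8}`. The difference is standard shell parameter expansion and can be checked in isolation; the variable names other than `KV_CACHE_DTYPE` are made up for this demonstration.

```shell
#!/usr/bin/env bash
# Demonstrates "-" vs ":-" default expansion with a set-but-empty variable
# and with an unset one. Variable names on the left are illustrative only.
KV_CACHE_DTYPE=""
empty_dash="${KV_CACHE_DTYPE-fp8}"     # stays empty: "-" only fires when unset
empty_colon="${KV_CACHE_DTYPE:-fp8}"   # becomes fp8: ":-" also fires when empty

unset KV_CACHE_DTYPE
unset_dash="${KV_CACHE_DTYPE-fp8}"     # fp8
unset_colon="${KV_CACHE_DTYPE:-fp8}"   # fp8

printf 'empty: dash=[%s] colon=[%s]\n' "$empty_dash" "$empty_colon"
printf 'unset: dash=[%s] colon=[%s]\n' "$unset_dash" "$unset_colon"
```

With the recipe exporting `KV_CACHE_DTYPE=auto` explicitly, both forms yield `auto`, so the `:-` form can be kept and the `auto` value is filtered out later when the flag is built.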
- disagg.sh: export MEM_FRAC_STATIC=0.8 and MAX_NUM_SEQS=128
- server_atom.sh: fix missing _MAX_CONC assignment before cudagraph size check
- amd-master.yaml: trim ISL=8192 to 1P1D only, cap conc at 512 for both ISLs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ove stale perf-changelog entry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- amd-master.yaml: bump image to rocm/atom-dev:MiniMax-M3-20260622
- minimaxm3_fp8_mi355x_atom-disagg.sh: unconditionally set MAX_MODEL_LEN=32768
- server_atom.sh: minor comment cleanup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
seungrokj force-pushed the amd/atom_mesh_0619_m3_fp8 branch from 093756c to fa89765 on June 23, 2026 01:36
seungrokj added the full-sweep-enabled label and removed the evals-only label ("Suppress throughput and run only eval jobs; combine with all-evals to expand selection") on Jun 23, 2026
seungrokj changed the title from "[DNM][AMD] Add MiniMax-M3-FP8 MI355X ATOMMESH" to "[AMD] Add MiniMax-M3-FP8 MI355X ATOMMESH" on Jun 23, 2026
seungrokj (Collaborator, Author)

/reuse-sweep-run

seungrokj requested a review from functionstackx on June 23, 2026 04:06
chunfangamd (Collaborator) left a comment

LGTM

billishyahao (Collaborator) left a comment

Please resolve the conflict. LGTM

functionstackx (Collaborator)

@chunfangamd can u properly review the PR & follow the checklist? codeowners r entrusted to properly review the PR following the checklist https://github.com/SemiAnalysisAI/InferenceX/blob/main/docs/PR_REVIEW_CHECKLIST.md

seungrokj (Collaborator, Author)

plz ignore these, as /reuse-sweep-run didn't work
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28001429207
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28001429207

functionstackx (Collaborator)

@billishyahao can u properly review the PR & follow the checklist? codeowners r entrusted to properly review the PR following the checklist https://github.com/SemiAnalysisAI/InferenceX/blob/main/docs/PR_REVIEW_CHECKLIST.md

Labels

all-evals (Expand eval selection to every fixed-sequence config), AMD, full-sweep-enabled

5 participants