21 commits
ea33910
[AMD] server_atom: improve config print and cleanup
seungrokj Jun 19, 2026
027f3f1
update perf-changelog for dsv4-fp4-mi355x-atom-disagg-mtp
seungrokj Jun 19, 2026
2216d11
[AMD] fix DECODE_MTP_SIZE and BENCH_REQUEST_RATE propagation in atom-…
seungrokj Jun 19, 2026
cd745fa
[AMD] server_atom: pass SPEC_ARGS to prefill server
seungrokj Jun 19, 2026
baf0e06
[AMD] amd-master: fix comment for 1P1D TP8+DPA+TBO+MTP1 config
seungrokj Jun 19, 2026
1485744
[AMD] dsv4_atom-disagg: remove DECODE_MTP_SIZE from check_env_vars
seungrokj Jun 19, 2026
4e039bc
[AMD] bench: use --dsv4 flag for DeepSeek-V4-Pro MTP benchmarks
seungrokj Jun 19, 2026
0868467
[AMD] server_atom: export IS_MTP=true when SPEC_DECODING=mtp for benc…
seungrokj Jun 19, 2026
c7d48b0
[AMD] server_atom: fix hf-overrides JSON quoting
seungrokj Jun 19, 2026
39e62eb
update perf-changelog for minimaxm3-fp4-mi355x-atom
seungrokj Jun 19, 2026
ba37d04
update perf-changelog for dsv4-fp4-mi355x-atom-disagg-mtp
seungrokj Jun 19, 2026
eb7179f
Merge branch 'main' into amd/atom_mesh_0619_mtp
seungrokj Jun 19, 2026
6893a06
fix: inline --hf-overrides to avoid eval word-splitting, remove OPT_ARGS
seungrokj Jun 19, 2026
5106002
refactor: extract --hf-overrides into HF_OVERRIDES_ARG variable
seungrokj Jun 19, 2026
55c810d
fix: enable --hf-overrides only for DeepSeek-V4-Pro
seungrokj Jun 19, 2026
6386657
fix: add HF_OVERRIDES_ARG to INFO config print block
seungrokj Jun 19, 2026
92746e9
fix: replace broken-quote array splice with ${ARRAY[*]} in CMD strings
seungrokj Jun 19, 2026
97f0cab
fix: remove ${CUDAGRAPH_OPT} from decode CMD
seungrokj Jun 19, 2026
f9a93c4
feat: add 2P1D DPA+MTP3 search space to dsv4-fp4-mi355x-atom-disagg-m…
seungrokj Jun 19, 2026
4d4fe2b
Merge branch 'main' into amd/atom_mesh_0619_mtp
seungrokj Jun 19, 2026
8b4a94c
Merge branch 'main' into amd/atom_mesh_0619_mtp
seungrokj Jun 22, 2026
108 changes: 108 additions & 0 deletions .github/configs/amd-master.yaml
@@ -2521,6 +2521,114 @@ dsv4-fp4-mi355x-atom-disagg:
additional-settings:
- "DECODE_NODES=1"

dsv4-fp4-mi355x-atom-disagg-mtp:
image: rocm/atom-dev:nightly_202606181332
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x

Collaborator

Please use runner mi355x-disagg.

Wrong CI runner for disagg

Medium Severity

The new multinode disaggregated recipe sets runner: mi355x. Per runners.yaml and the other disagg recipes, disagg benchmarks on MI355X are scheduled on the mi355x-disagg pool, and a reviewer on this PR also requested mi355x-disagg.

Reviewed by Cursor Bugbot for commit 8b4a94c.
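
A check along these lines could catch this kind of mismatch before review. It is a minimal sketch, not part of the PR: the function name `check_disagg_runners` is hypothetical, the `-disagg` runner suffix is an assumption generalized from the `mi355x-disagg` pool named above, and the parsing assumes each recipe is a top-level key with indented `runner:` and `disagg:` fields.

```bash
#!/usr/bin/env bash
# Hypothetical lint: list recipes that set "disagg: true" while their
# runner name does not end in "-disagg" (assumed naming convention).
check_disagg_runners() {
  awk '
    function report() {
      if (name != "" && disagg == "true" && runner !~ /-disagg$/)
        printf "%s: runner %s\n", name, runner
    }
    # A top-level key starts a new recipe; report the previous one first.
    /^[^ #][^:]*:[ ]*$/ {
      report()
      name = $1; sub(/:$/, "", name)
      runner = ""; disagg = ""
      next
    }
    /^ +runner:/ { runner = $2 }
    /^ +disagg:/ { disagg = $2 }
    END { report() }
  ' "$1"
}
```

Called as `check_disagg_runners .github/configs/amd-master.yaml`, it prints one line per recipe that breaks the assumed convention and prints nothing when every disagg recipe uses a `-disagg` runner.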

precision: fp4
framework: atom-disagg
multinode: true
disagg: true
scenarios:
fixed-seq-len:
- isl: 8192
osl: 1024
search-space:
# 2P1D TP8+DPA+TBO+MTP1
- spec-decoding: "mtp"
conc-list: [ 256, 512, 768, 1024, 2048 ]
prefill:
num-worker: 2
tp: 8
ep: 1
dp-attn: true
additional-settings:
- "PREFILL_NODES=2"
decode:
num-worker: 1
tp: 8
ep: 1
dp-attn: true
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=1"
# 2P1D TP8+DPA+TBO+MTP3
- spec-decoding: "mtp"
conc-list: [ 256, 512, 768, 1024, 2048 ]
prefill:
num-worker: 2
tp: 8
ep: 1
dp-attn: true
additional-settings:
- "PREFILL_NODES=2"
decode:
num-worker: 1
tp: 8
ep: 1
dp-attn: true
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=3"
# 1P1D TP8+MTP3
- spec-decoding: "mtp"
conc-list: [ 1, 2, 4, 8, 16, 32, 64, 128, 256 ]
prefill:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=3"
# 1P1D TP8+DPA+TBO+MTP1
- isl: 1024

Missing ISL8192 DPA MTP sweep

Medium Severity

The 1P1D TP8+DPA+TBO+MTP1 comment at ISL8192 is not followed by a search-space entry, so that configuration never runs. The ISL8192 block ends after 1P1D TP8+MTP3, unlike the ISL1024 block where the matching DPA+MTP1 sweep is defined.

Reviewed by Cursor Bugbot for commit 8b4a94c.

osl: 1024
search-space:
- spec-decoding: "mtp"
conc-list: [ 64, 128, 256, 512, 1024 ]
prefill:
num-worker: 1
tp: 8
ep: 1
dp-attn: true
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 1
tp: 8
ep: 1
dp-attn: true
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=1"
# 1P1D TP8+MTP3
- spec-decoding: "mtp"
conc-list: [ 1, 2, 4, 8, 16, 32, 64, 128, 256 ]
prefill:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=3"

# MiniMax-M3 MXFP8 MI355X recipe:
# https://github.com/vllm-project/recipes/commit/2a3728ed9892debfd767a72a58ebc90b33f186e5
# MXFP8 runs from TP=4 on gfx950; block size 128 is mandatory for MSA.
6 changes: 5 additions & 1 deletion benchmarks/multi_node/amd_utils/bench.sh
@@ -80,7 +80,11 @@ for max_concurrency in "${chosen_concurrencies[@]}"; do
extra_flags="--trust-remote-code --tokenizer $MODEL_PATH"
else
if [ "$IS_MTP" = "true" ]; then
extra_flags="--use-chat-template"
if [[ "$MODEL_NAME" == "DeepSeek-V4-Pro" ]]; then
extra_flags="--dsv4"
else
extra_flags="--use-chat-template"
fi
fi
fi
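
The MTP branch of this hunk can be read as a small selection function. The sketch below covers only that branch; the function name `mtp_extra_flags` is hypothetical, and the outer condition that sets `--trust-remote-code --tokenizer` is left out because it is not visible in the hunk.

```bash
#!/usr/bin/env bash
# mtp_extra_flags IS_MTP MODEL_NAME
# Prints the extra benchmark flag chosen for MTP runs, or nothing otherwise.
mtp_extra_flags() {
  local is_mtp="$1" model_name="$2"
  if [ "$is_mtp" != "true" ]; then
    return 0                      # non-MTP runs get no extra flag here
  fi
  if [[ "$model_name" == "DeepSeek-V4-Pro" ]]; then
    echo "--dsv4"                 # new behavior added by this PR
  else
    echo "--use-chat-template"    # previous behavior for every MTP run
  fi
}
```

Because the model check is an exact string match, any other spelling of the model name falls through to `--use-chat-template`.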

3 changes: 2 additions & 1 deletion benchmarks/multi_node/amd_utils/job.slurm
@@ -363,6 +363,7 @@ DOCKER_ENV_COMMON=(
-e BENCH_RANDOM_RANGE_RATIO=\$BENCH_RANDOM_RANGE_RATIO
-e BENCH_NUM_PROMPTS_MULTIPLIER=\$BENCH_NUM_PROMPTS_MULTIPLIER
-e BENCH_MAX_CONCURRENCY=\$BENCH_MAX_CONCURRENCY
-e BENCH_REQUEST_RATE=\$BENCH_REQUEST_RATE
-e TQDM_MININTERVAL=\$TQDM_MININTERVAL
-e DRY_RUN=\$DRY_RUN
-e BENCHMARK_LOGS_DIR=/benchmark_logs
@@ -411,7 +412,7 @@ elif [[ "$ENGINE" == "atom-disagg" ]]; then
-e DECODE_PORT=${DECODE_PORT:-8020}
-e ROUTER_PORT=${ROUTER_PORT:-30000}
-e HANDSHAKE_PORT=${HANDSHAKE_PORT:-6301}
-e MEM_FRACTION=${MEM_FRACTION:-0.85}
-e MEM_FRAC_STATIC=${MEM_FRAC_STATIC:-0.85}
-e KV_CACHE_DTYPE=${KV_CACHE_DTYPE:-fp8}
-e BLOCK_SIZE=${BLOCK_SIZE:-16}
-e MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
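
The rename from MEM_FRACTION to MEM_FRAC_STATIC matters because `${VAR:-default}` only consults the variable it names. A small sketch of that behavior, assuming a caller that still exports the old name; the helper `resolve_mem_frac` is hypothetical.

```bash
#!/usr/bin/env bash
resolve_mem_frac() {
  # Same expansion as the new MEM_FRAC_STATIC line in this hunk.
  echo "${MEM_FRAC_STATIC:-0.85}"
}

unset MEM_FRAC_STATIC
MEM_FRACTION=0.90        # old name: no longer read, so the default applies
resolve_mem_frac         # prints 0.85

MEM_FRAC_STATIC=0.70     # new name: honored
resolve_mem_frac         # prints 0.70
```

Callers that set the old variable therefore fall back to 0.85 silently unless they are updated to the new name.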
78 changes: 63 additions & 15 deletions benchmarks/multi_node/amd_utils/server_atom.sh
@@ -35,14 +35,18 @@ DECODE_TP_SIZE="${DECODE_TP_SIZE:-8}"
DECODE_ENABLE_EP="${DECODE_ENABLE_EP}"
DECODE_ENABLE_DP="${DECODE_ENABLE_DP}"

# MTP
SPEC_DECODING="${SPEC_DECODING:-}"
DECODE_MTP_SIZE="${DECODE_MTP_SIZE:-1}"

# ATOM server ports (different from SGLang which uses 8000 for all)
PREFILL_PORT="${PREFILL_PORT:-8010}"
DECODE_PORT="${DECODE_PORT:-8020}"
ROUTER_PORT="${ROUTER_PORT:-8000}"
HANDSHAKE_PORT="${HANDSHAKE_PORT:-6301}"

# ATOM server tuning (from reference script defaults)
MEM_FRACTION="${MEM_FRACTION:-0.85}"
MEM_FRAC_STATIC="${MEM_FRAC_STATIC:-0.85}"
KV_CACHE_DTYPE="${KV_CACHE_DTYPE:-fp8}"
BLOCK_SIZE="${BLOCK_SIZE:-16}"
MAX_NUM_SEQS="${MAX_NUM_SEQS:-256}"
@@ -100,34 +104,67 @@ for i in $(seq 0 $((yD - 1))); do
DECODE_ARGS="$DECODE_ARGS --decode http://${IP_ARRAY[$idx]}:${DECODE_PORT}"
done

echo "Prefill IPs : ${PREFILL_IPS[*]}"
echo "Decode IPs : ${DECODE_IPS[*]}"

PREFILL_ENABLE_EP="${PREFILL_ENABLE_EP}"
PREFILL_ENABLE_DP="${PREFILL_ENABLE_DP}"
DECODE_ENABLE_EP="${DECODE_ENABLE_EP}"
DECODE_ENABLE_DP="${DECODE_ENABLE_DP}"

# Parallel args
PREFILL_PARALLEL_ARGS=(-tp "$PREFILL_TP_SIZE") #TP
if [ "$PREFILL_ENABLE_DP" = "true" ]; then
if [ "$PREFILL_ENABLE_EP" -gt 1 ]; then #DPA+EP
PREFILL_PARALLEL_ARGS=(-tp "$PREFILL_TP_SIZE" --enable-expert-parallel --enable-dp-attention )
else #DPA+TP
PREFILL_PARALLEL_ARGS=(-tp "$PREFILL_TP_SIZE" --enable-dp-attention )
else #TP+DPA+TBO
# (srok), TBO only on Prefill server
PREFILL_PARALLEL_ARGS=(-tp "$PREFILL_TP_SIZE" --enable-dp-attention --enable-tbo )
export GPU_MAX_HW_QUEUES=5
export ATOM_CPU_AFFINITY=1
fi
fi

DECODE_PARALLEL_ARGS=(-tp "$PREFILL_TP_SIZE") #TP
if [ "$DECODE_ENABLE_DP" = "true" ]; then
if [ "$DECODE_ENABLE_EP" -gt 1 ]; then #DPA+EP
DECODE_PARALLEL_ARGS=(-tp "$DECODE_TP_SIZE" --enable-expert-parallel --enable-dp-attention )
else #DPA+TP
else #TP+DPA+TBO
DECODE_PARALLEL_ARGS=(-tp "$DECODE_TP_SIZE" --enable-dp-attention )
export GPU_MAX_HW_QUEUES=5
export ATOM_CPU_AFFINITY=1
fi
fi

echo "Prefill Parallel args : ${PREFILL_PARALLEL_ARGS[*]}"
echo "Decode Parallel args : ${DECODE_PARALLEL_ARGS[*]}"
# MTP args
SPEC_ARGS=() #TP
if [ "$SPEC_DECODING" = "mtp" ]; then
SPEC_ARGS=(--method mtp --num-speculative-tokens "$DECODE_MTP_SIZE")
fi

# HF overrides (single-quoted JSON preserved through eval)
HF_OVERRIDES_ARG=""
if [[ "$MODEL_NAME" == "DeepSeek-V4-Pro" ]]; then
HF_OVERRIDES_ARG="--hf-overrides '{\"use_index_cache\":true,\"index_topk_freq\":4}'"
fi

cat <<INFO
=== Configuration ===
PREFILL : ${PREFILL_IPS[*]} (TP=${PREFILL_TP_SIZE}, EP=${PREFILL_ENABLE_EP:-false}, DP=${PREFILL_ENABLE_DP:-false}, port=${PREFILL_PORT})
DECODE : ${DECODE_IPS[*]} (TP=${DECODE_TP_SIZE}, EP=${DECODE_ENABLE_EP:-false}, DP=${DECODE_ENABLE_DP:-false}, port=${DECODE_PORT})
ROUTER : port=${ROUTER_PORT}
MODEL : ${MODEL_NAME}
BACKEND : atom (PD mooncake KV transfer)
MTP : method=mtp num_speculative_tokens=${DECODE_MTP_SIZE}
xP/yD : ${xP} / ${yD}
KV cache : dtype=${KV_CACHE_DTYPE} block_size=${BLOCK_SIZE} mem_frac=${MEM_FRAC_STATIC}
Prefill args : ${PREFILL_PARALLEL_ARGS[*]}
Decode args : ${DECODE_PARALLEL_ARGS[*]}
Spec args : ${SPEC_ARGS[*]}
Opt args : ${HF_OVERRIDES_ARG}
=====================
INFO

echo "=== Environment Variables ==="
printenv | sort
echo "============================="

# =============================================================================
# Node Role Assignment
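
The HF_OVERRIDES_ARG string in this hunk relies on the command being re-parsed by eval, as its comment says. A minimal sketch of the difference between the two expansion paths, with a hypothetical `second_arg` helper standing in for the server binary:

```bash
#!/usr/bin/env bash
# Stand-in for the server: prints the second argument it receives.
second_arg() { printf '%s\n' "$2"; }

HF_OVERRIDES_ARG="--hf-overrides '{\"use_index_cache\":true,\"index_topk_freq\":4}'"

# Re-parsed by eval: the single quotes are consumed as shell syntax,
# so the JSON arrives as one clean argument.
via_eval=$(eval "second_arg ${HF_OVERRIDES_ARG}")

# Plain unquoted expansion: only word splitting happens, so the
# single quotes stay in the argument as literal characters.
direct=$(second_arg ${HF_OVERRIDES_ARG})

printf 'eval:   %s\ndirect: %s\n' "$via_eval" "$direct"
```

The string form works only while the JSON contains no single quote and the command really goes through eval. Passing the same string to a command without eval hands the quotes to the program as part of the argument.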
@@ -153,12 +190,14 @@ if [ "$NODE_RANK" -eq 0 ]; then
--model ${MODEL_DIR}/${MODEL_NAME} \
--host 0.0.0.0 --server-port ${PREFILL_PORT} \
--trust-remote-code \
"${PREFILL_PARALLEL_ARGS[@]}" \
${PREFILL_PARALLEL_ARGS[*]} \
${SPEC_ARGS[*]} \
--kv_cache_dtype ${KV_CACHE_DTYPE} \
--block-size ${BLOCK_SIZE} \
--gpu-memory-utilization ${MEM_FRACTION} \
--gpu-memory-utilization ${MEM_FRAC_STATIC} \
--max-num-seqs ${MAX_NUM_SEQS} \
--no-enable_prefix_caching \
${HF_OVERRIDES_ARG} \
--kv-transfer-config '{\"kv_role\":\"kv_producer\",\"kv_connector\":\"mooncake\",\"proxy_ip\":\"${host_ip}\",\"handshake_port\":${HANDSHAKE_PORT}}' \
${EXTRA_SERVER_ARGS}"

@@ -248,6 +287,11 @@ if [ "$NODE_RANK" -eq 0 ]; then

cd $ATOM_WS_PATH

export IS_MTP="false"
if [ "$SPEC_DECODING" = "mtp" ]; then
export IS_MTP="true"
fi

BENCH_CMD="bash $ATOM_WS_PATH/bench.sh ${xP} ${yD} $((PREFILL_TP_SIZE*xP)) $((DECODE_TP_SIZE*yD)) \
$MODEL_DIR $MODEL_NAME /run_logs/slurm_job-${SLURM_JOB_ID} ${BENCH_INPUT_LEN} \
${BENCH_OUTPUT_LEN} \"${BENCH_MAX_CONCURRENCY}\" ${BENCH_REQUEST_RATE} \
@@ -367,12 +411,14 @@ elif [ "$NODE_RANK" -gt 0 ] && [ "$NODE_RANK" -lt "$NODE_OFFSET" ]; then
--model ${MODEL_DIR}/${MODEL_NAME} \
--host 0.0.0.0 --server-port ${PREFILL_PORT} \
--trust-remote-code \
"${PREFILL_PARALLEL_ARGS[@]}" \
${PREFILL_PARALLEL_ARGS[*]} \
${SPEC_ARGS[*]} \
--kv_cache_dtype ${KV_CACHE_DTYPE} \
--block-size ${BLOCK_SIZE} \
--gpu-memory-utilization ${MEM_FRACTION} \
--gpu-memory-utilization ${MEM_FRAC_STATIC} \
--max-num-seqs ${MAX_NUM_SEQS} \
--no-enable_prefix_caching \
${HF_OVERRIDES_ARG} \
--kv-transfer-config '{\"kv_role\":\"kv_producer\",\"kv_connector\":\"mooncake\",\"proxy_ip\":\"${host_ip}\",\"handshake_port\":${HANDSHAKE_PORT}}' \
${EXTRA_SERVER_ARGS}"

@@ -449,12 +495,14 @@ else
--model ${MODEL_DIR}/${MODEL_NAME} \
--host 0.0.0.0 --server-port ${DECODE_PORT} \
--trust-remote-code \
"${DECODE_PARALLEL_ARGS[@]}" \
${DECODE_PARALLEL_ARGS[*]} \
${SPEC_ARGS[*]} \
--kv_cache_dtype ${KV_CACHE_DTYPE} \
--block-size ${BLOCK_SIZE} \
--gpu-memory-utilization ${MEM_FRACTION} \
--gpu-memory-utilization ${MEM_FRAC_STATIC} \
--max-num-seqs ${DECODE_MAX_NUM_SEQS} \
--no-enable_prefix_caching \
${HF_OVERRIDES_ARG} \
--kv-transfer-config '{\"kv_role\":\"kv_consumer\",\"kv_connector\":\"mooncake\",\"proxy_ip\":\"${host_ip}\",\"handshake_port\":${HANDSHAKE_PORT}}' \
--cudagraph-capture-sizes "${CUDAGRAPH_SIZES}" \
${EXTRA_SERVER_ARGS}"
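
The CMD strings in these hunks splice arrays with `${ARRAY[*]}`, which joins the elements with spaces inside the double-quoted string. A sketch of that behavior, assuming the string is later run through eval; `build_cmd` and `count_args` are hypothetical stand-ins for the real command construction and the server binary.

```bash
#!/usr/bin/env bash
# Stand-in for the server: prints how many arguments it received.
count_args() { echo "$#"; }

# build_cmd TP_SIZE SPEC_DECODING MTP_SIZE
build_cmd() {
  local -a parallel_args=(-tp "$1" --enable-dp-attention)
  local -a spec_args=()
  if [ "$2" = "mtp" ]; then
    spec_args=(--method mtp --num-speculative-tokens "$3")
  fi
  # [*] flattens each array to a space-separated string inside the quotes.
  echo "count_args ${parallel_args[*]} ${spec_args[*]}"
}

eval "$(build_cmd 8 mtp 3)"   # 7 arguments
eval "$(build_cmd 8 "" 1)"    # 3 arguments: the empty array adds nothing
```

Flattening is safe for flag lists like these, whose elements contain no spaces. An element with embedded whitespace would be split again when eval re-parses the string.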
3 changes: 3 additions & 0 deletions benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh
@@ -61,6 +61,9 @@ else
export DECODE_ENABLE_DP=false
fi

export SPEC_DECODING="${SPEC_DECODING}"
export DECODE_MTP_SIZE="${DECODE_MTP_SIZE:-0}"

# Launch jobs based on ISL/OSL
# Replace ' ' in CONC_LIST with 'x' such that the concurrency list is represented
# by a list of numbers delimited by 'x'. This is because of how the underlying launch script
10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -3964,6 +3964,16 @@
- "Remove the runtime SupportsEagle3 source patch now included in the pinned nightly"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1843

- config-keys:
- dsv4-fp4-mi355x-atom-disagg-mtp
description:
- "Add dsv4-fp4-mi355x-atom-disagg-mtp recipe: multi-node disaggregated PD on MI355X via ATOM with MTP speculative decoding"
- "2P1D DPA+TBO+MTP1 sweep at ISL8192 (conc 256-2048)"
- "1P1D TP8+MTP3 sweep at ISL8192 (conc 4-128)"
- "1P1D TP8+DPA+MTP1 sweep at ISL1024 (conc 64-1024)"
- "Image: rocm/atom-dev:nightly_202606181332"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1855

- config-keys:
- minimaxm3-fp8-gb300-dynamo-vllm
description: