Add Slime multi-node RL training example #43

Open
xyuzh wants to merge 1 commit into anyscale:main from xyuzh:add-slime-multi-node-rl


@xyuzh xyuzh commented Feb 22, 2026

Summary

  • Adds a complete example for multi-node RL training using Slime (Megatron-LM + SGLang disaggregated rollout)
  • Runs GRPO training of Qwen3-1.7B on 2 workers x 4x A10G (8 GPUs total, TP=2 x PP=2 across nodes)
  • Includes Dockerfile with A10G (SM86) compatibility patches for sgl_kernel and Triton, PyTorch 2.7.1 pinning for pre-built wheel compatibility, and runtime patches for all nodes
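The GPU arithmetic behind the second bullet can be sanity-checked with a short sketch. It assumes, purely for illustration, that all 8 GPUs host training ranks; the PR does not state how GPUs are divided between Megatron-LM training and SGLang rollout, and `parallel_layout` is a hypothetical helper, not part of Slime.

```python
def parallel_layout(num_nodes: int, gpus_per_node: int, tp: int, pp: int) -> dict:
    """Return the world size and how many TP x PP model replicas fit in it.

    Assumption (not from the PR): every GPU is a training rank.
    """
    world_size = num_nodes * gpus_per_node
    model_parallel_size = tp * pp  # GPUs needed to hold one model replica
    if world_size % model_parallel_size != 0:
        raise ValueError("world size must be divisible by TP * PP")
    return {
        "world_size": world_size,
        "model_parallel_size": model_parallel_size,
        "replicas": world_size // model_parallel_size,
    }


# 2 workers x 4 A10G each, TP=2, PP=2 (values taken from the summary above)
layout = parallel_layout(num_nodes=2, gpus_per_node=4, tp=2, pp=2)
print(layout)
```

Under that assumption one replica spans 4 GPUs, so two replicas would fit on the cluster; if part of the cluster is reserved for rollout, the numbers change accordingly.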

Files

| File | Description |
| --- | --- |
| job.yaml | Anyscale job config (m5.2xlarge head + 2x g5.12xlarge workers) |
| Dockerfile.anyscale | Docker image with Slime, Megatron-LM, SGLang, flash-attn, TE |
| anyscale-smoke-2node-a10g.sh | Anyscale entrypoint (downloads model/data, converts weights, trains) |
| patch_all_nodes.py | Runtime patches for sgl_kernel on A10G via Ray |
| run-qwen3-4B-smoke-2node-a10g.sh | Bare-metal variant for Qwen3-4B |
| README.md | Documentation with cluster layout, quick start, and OOM tips |

Test plan

  • Submit job via `anyscale job submit -f job.yaml` on a fresh Anyscale workspace
  • Verify Docker image builds successfully (~15 min)
  • Verify weight conversion completes on GPU worker
  • Verify multi-node NCCL init succeeds for PP across workers
  • Verify training loss decreases over rollout-train cycles
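The last check in the test plan could be automated with something like the sketch below. It is one possible criterion, not the check used in this PR: `loss_is_decreasing` is a hypothetical helper, it compares the mean of the first and last few logged losses, and the sample values are synthetic placeholders rather than results from a real run.

```python
from statistics import mean


def loss_is_decreasing(losses: list[float], window: int = 3) -> bool:
    """Return True if the mean of the last `window` losses is below the mean of the first `window`."""
    if window < 1 or len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    return mean(losses[-window:]) < mean(losses[:window])


# Synthetic values for illustration only, not measured in this PR.
sample_losses = [1.00, 0.90, 0.95, 0.80, 0.70, 0.75]
print(loss_is_decreasing(sample_losses))
```

Averaging over a window rather than comparing single steps keeps the check tolerant of the step-to-step noise that is typical of RL training losses.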

🤖 Generated with Claude Code

Multi-node GRPO training of Qwen3-1.7B on 2 workers x 4x A10G using
Slime (Megatron-LM + SGLang disaggregated rollout). Includes Anyscale
job config, Dockerfile with A10G compatibility patches, and entrypoint
that handles model download, weight conversion, and training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>