Distributed Simulation Usage

This page focuses on the path that is easiest to bring up in practice: distributed simulation on a single host with multiple ranks.

For implementation details and design boundaries, see Distributed Design Notes.

Recommended mode pairs

Each pair below is FAKEGPU_MODE + FAKEGPU_DIST_MODE.

  • Stable first distributed bring-up: simulate + simulate (best maintained path)
  • Real local compute, simulated communication: hybrid + simulate (useful when a local GPU is available)
  • Real NCCL collectives with FakeGPU reports: hybrid + proxy (comparison-oriented and more experimental)
  • Minimal wrapping around real NCCL: passthrough + passthrough (not the best first step)

If you just want a single recommended starting point, use:

FAKEGPU_MODE=simulate
FAKEGPU_DIST_MODE=simulate

Prerequisites

Make sure you have built at least:

  • build/libnccl.so.2
  • build/fakegpu-coordinator
  • ./fgpu

Typical build commands:

cmake -S . -B build
cmake --build build -j4
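
After the build, a quick way to confirm the expected artifacts are in place (plain shell, nothing FakeGPU-specific):

ls -l build/libnccl.so.2 build/fakegpu-coordinator ./fgpu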

For torchrun-based validation you also need:

  • a Python environment with torch
  • torchrun available on PATH
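
To confirm both are usable from the environment you plan to launch from:

python3 -c "import torch; print(torch.__version__)"
command -v torchrun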

Important settings

  • FAKEGPU_MODE: compute mode
  • FAKEGPU_DIST_MODE: distributed mode
  • FAKEGPU_CLUSTER_CONFIG: cluster YAML path
  • FAKEGPU_COORDINATOR_TRANSPORT: unix or tcp
  • FAKEGPU_COORDINATOR_ADDR: absolute socket path or host:port
  • FAKEGPU_CLUSTER_REPORT_PATH: cluster report output path
  • FAKEGPU_STAGING_CHUNK_BYTES: chunk size threshold for staged transfers
  • FAKEGPU_STAGING_FORCE_SOCKET: force socket fallback instead of shared memory
  • FAKEGPU_DEVICE_COUNT: number of exposed fake devices

The same knobs are available through ./fgpu flags:

./fgpu --mode simulate --dist-mode simulate --cluster-config ... --coordinator-transport unix --coordinator-addr ...
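
For reference, the same setup can be expressed purely with environment variables. The values below are illustrative (they mirror the paths used later on this page), not defaults:

export FAKEGPU_MODE=simulate
export FAKEGPU_DIST_MODE=simulate
export FAKEGPU_CLUSTER_CONFIG="$PWD/verification/data/cluster_valid.yaml"
export FAKEGPU_COORDINATOR_TRANSPORT=unix
export FAKEGPU_COORDINATOR_ADDR=/tmp/fakegpu-coordinator.sock
export FAKEGPU_CLUSTER_REPORT_PATH=/tmp/fakegpu-cluster-report.json
export FAKEGPU_DEVICE_COUNT=4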

Minimal cluster config

version: 1
cluster:
  name: dev-cluster
  default_backend: nccl

nodes:
  - id: node0
    host: 127.0.0.1
    ranks: [0, 1]
    gpus:
      - profile: a100
      - profile: a100

  - id: node1
    host: 127.0.0.1
    ranks: [2, 3]
    gpus:
      - profile: h100
      - profile: h100

fabric:
  intra_node:
    type: nvlink
    bandwidth_gbps: 300
    latency_us: 3

  inter_node:
    type: infiniband
    bandwidth_gbps: 200
    latency_us: 15
    oversubscription: 1.5

Bundled examples live under verification/data/.

Important validation rules:

  • ranks should be unique and contiguous
  • each node should list the same number of ranks and gpus
  • FAKEGPU_COORDINATOR_ADDR must be an absolute path when using unix
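
A minimal standalone checker for these rules, assuming PyYAML is installed, could look like the sketch below. It is not part of FakeGPU; it simply reads the same YAML layout shown above and flags the most common mistakes before you start the coordinator:

import os
import sys

import yaml  # PyYAML, not bundled with FakeGPU


def check(path, coordinator_addr=None, transport="unix"):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    nodes = cfg["nodes"]
    ranks = [r for node in nodes for r in node["ranks"]]
    # ranks should be unique and contiguous (assumed to start at 0, as in the example above)
    assert len(ranks) == len(set(ranks)), "duplicate ranks in cluster config"
    assert sorted(ranks) == list(range(len(ranks))), "ranks are not contiguous from 0"
    # each node should list the same number of ranks and gpus
    for node in nodes:
        assert len(node["ranks"]) == len(node["gpus"]), f"{node['id']}: ranks/gpus mismatch"
    # the unix transport requires an absolute socket path
    if transport == "unix" and coordinator_addr:
        assert os.path.isabs(coordinator_addr), "FAKEGPU_COORDINATOR_ADDR must be an absolute path"
    print(f"{path}: ok ({len(ranks)} ranks across {len(nodes)} nodes)")


if __name__ == "__main__":
    check(
        sys.argv[1],
        os.environ.get("FAKEGPU_COORDINATOR_ADDR"),
        os.environ.get("FAKEGPU_COORDINATOR_TRANSPORT", "unix"),
    )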

Fastest way to verify the path

Use the maintained checks first:

python3 verification/test_coordinator_smoke.py
python3 test/test_allreduce_correctness.py
python3 verification/test_allgather_correctness.py
python3 verification/test_group_semantics.py
./test/run_hybrid_multinode.sh 2

The maintained checks above validate:

  • coordinator lifecycle
  • direct collective semantics
  • grouped submission semantics
  • hybrid compute + simulated communication integration

Recommended order:

  1. python3 verification/test_coordinator_smoke.py
  2. python3 test/test_allreduce_correctness.py
  3. python3 verification/test_allgather_correctness.py
  4. python3 verification/test_group_semantics.py
  5. ./test/run_multinode_sim.sh 2
  6. ./test/run_multinode_sim.sh 4
  7. ./test/run_ddp_multinode.sh 4
  8. ./test/run_hybrid_multinode.sh 2

The DDP-oriented scripts above are part of the maintained simulate-mode validation set. They provide smoke and control-flow coverage for ProcessGroupNCCL, but they do not imply full numerical or protocol parity with a real NCCL stack.

Manual coordinator startup

Unix socket example

SOCKET_PATH=/tmp/fakegpu-coordinator.sock
CLUSTER_CONFIG=$PWD/verification/data/cluster_valid.yaml

FAKEGPU_DIST_MODE=simulate \
FAKEGPU_CLUSTER_CONFIG="$CLUSTER_CONFIG" \
FAKEGPU_COORDINATOR_TRANSPORT=unix \
FAKEGPU_COORDINATOR_ADDR="$SOCKET_PATH" \
FAKEGPU_CLUSTER_REPORT_PATH=/tmp/fakegpu-cluster-report.json \
./build/fakegpu-coordinator --transport unix --address "$SOCKET_PATH"
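
Assuming the coordinator keeps running in the foreground, a quick check from a second shell that the socket actually exists before launching ranks:

test -S /tmp/fakegpu-coordinator.sock && echo "coordinator socket present"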

TCP example

COORD_ADDR=127.0.0.1:29591
CLUSTER_CONFIG=$PWD/verification/data/cluster_valid.yaml

FAKEGPU_DIST_MODE=simulate \
FAKEGPU_CLUSTER_CONFIG="$CLUSTER_CONFIG" \
FAKEGPU_COORDINATOR_TRANSPORT=tcp \
FAKEGPU_COORDINATOR_ADDR="$COORD_ADDR" \
FAKEGPU_CLUSTER_REPORT_PATH=/tmp/fakegpu-cluster-report.json \
./build/fakegpu-coordinator --transport tcp --address "$COORD_ADDR"

Manual torchrun template

SOCKET_PATH=/tmp/fakegpu-coordinator.sock
CLUSTER_CONFIG=$PWD/verification/data/cluster_valid.yaml

export LD_PRELOAD="$PWD/build/libnccl.so.2${LD_PRELOAD:+:$LD_PRELOAD}"

./fgpu \
  --mode simulate \
  --dist-mode simulate \
  --cluster-config "$CLUSTER_CONFIG" \
  --coordinator-transport unix \
  --coordinator-addr "$SOCKET_PATH" \
  --device-count 4 \
  torchrun \
  --nnodes=1 \
  --nproc_per_node=4 \
  --master_addr 127.0.0.1 \
  --master_port 29500 \
  your_training_script.py

Notes:

  • ./fgpu and python3 -m fakegpu are equivalent launch styles.
  • --device-count controls how many fake devices FakeGPU exposes to the process.
  • torchrun rendezvous settings are separate from the FakeGPU coordinator address.
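
For a first end-to-end run, your_training_script.py can be a minimal sketch like the one below. It assumes only standard torch.distributed with the nccl backend (which the preloaded libnccl.so.2 intercepts) and that the fake devices are visible to torch as CUDA devices; it runs a few DDP steps plus one explicit all_reduce so the cluster report has something to count:

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")  # rendezvous comes from torchrun
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(16, 16).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(3):
        x = torch.randn(8, 16, device=f"cuda:{local_rank}")
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()  # triggers the gradient all-reduce
        opt.step()

    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # explicit collective for the cluster report
    if rank == 0:
        print("all_reduce result:", t.item(), "world size:", dist.get_world_size())
    dist.destroy_process_group()


if __name__ == "__main__":
    main()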

Reading the output

When distributed mode is enabled and FAKEGPU_CLUSTER_REPORT_PATH is set, FakeGPU writes a cluster report with:

  • world-size and coordinator metadata
  • per-collective call counts, bytes, and estimated time
  • intra-node and inter-node link statistics
  • per-rank wait time, timeouts, and communicator-init counts
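
The report is plain JSON, so the exact field names can be inspected directly; for a first look, generic pretty-printing is enough (the path matches the FAKEGPU_CLUSTER_REPORT_PATH used earlier on this page):

python3 -m json.tool /tmp/fakegpu-cluster-report.json | head -n 40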

Common failure cases

  • FAKEGPU_COORDINATOR_ADDR missing while not in passthrough
  • rank or world-size mismatch between runtime environment and cluster config
  • forgetting to preload libnccl.so.2 for the distributed path (a quick check follows below)
  • using proxy or passthrough modes before the basic simulate path is known to work
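
For the preload issue in particular, when you set LD_PRELOAD yourself, this one-liner confirms the library is actually in it:

echo "$LD_PRELOAD" | tr ':' '\n' | grep -q libnccl.so.2 && echo "libnccl.so.2 preloaded" || echo "libnccl.so.2 NOT preloaded"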