This page focuses on the path that is easiest to bring up in practice: distributed simulation on a single host with multiple ranks.
For implementation details and design boundaries, see Distributed Design Notes.
| Goal | Recommended pair | Notes |
|---|---|---|
| Stable first distributed bring-up | simulate + simulate |
Best maintained path |
| Real local compute, simulated communication | hybrid + simulate |
Useful when a local GPU is available |
| Real NCCL collectives with FakeGPU reports | hybrid + proxy |
Comparison-oriented and more experimental |
| Minimal wrapping around real NCCL | passthrough + passthrough |
Not the best first step |
If you only need one answer for where to start, use:
FAKEGPU_MODE=simulate
FAKEGPU_DIST_MODE=simulateMake sure you have built at least:
build/libnccl.so.2build/fakegpu-coordinator./fgpu
Typical build command:
cmake -S . -B build
cmake --build build -j4For torchrun-based validation you also need:
- a Python environment with
torch torchrunavailable onPATH
| Variable | Meaning |
|---|---|
FAKEGPU_MODE |
Compute mode |
FAKEGPU_DIST_MODE |
Distributed mode |
FAKEGPU_CLUSTER_CONFIG |
Cluster YAML path |
FAKEGPU_COORDINATOR_TRANSPORT |
unix or tcp |
FAKEGPU_COORDINATOR_ADDR |
Absolute socket path or host:port |
FAKEGPU_CLUSTER_REPORT_PATH |
Cluster report output path |
FAKEGPU_STAGING_CHUNK_BYTES |
Chunk size threshold for staged transfers |
FAKEGPU_STAGING_FORCE_SOCKET |
Force socket fallback instead of shared memory |
FAKEGPU_DEVICE_COUNT |
Number of exposed fake devices |
The same knobs are available through ./fgpu flags:
./fgpu --mode simulate --dist-mode simulate --cluster-config ... --coordinator-transport unix --coordinator-addr ...version: 1
cluster:
name: dev-cluster
default_backend: nccl
nodes:
- id: node0
host: 127.0.0.1
ranks: [0, 1]
gpus:
- profile: a100
- profile: a100
- id: node1
host: 127.0.0.1
ranks: [2, 3]
gpus:
- profile: h100
- profile: h100
fabric:
intra_node:
type: nvlink
bandwidth_gbps: 300
latency_us: 3
inter_node:
type: infiniband
bandwidth_gbps: 200
latency_us: 15
oversubscription: 1.5Bundled examples live under verification/data/.
Important validation rules:
- ranks should be unique and contiguous
- each node should list the same number of
ranksandgpus FAKEGPU_COORDINATOR_ADDRmust be an absolute path when usingunix
Use the maintained checks first:
python3 verification/test_coordinator_smoke.py
python3 test/test_allreduce_correctness.py
python3 verification/test_allgather_correctness.py
python3 verification/test_group_semantics.py
./test/run_hybrid_multinode.sh 2The maintained checks above validate:
- coordinator lifecycle
- direct collective semantics
- grouped submission semantics
- hybrid compute + simulated communication integration
Recommended order:
python3 verification/test_coordinator_smoke.pypython3 test/test_allreduce_correctness.pypython3 verification/test_allgather_correctness.pypython3 verification/test_group_semantics.py./test/run_multinode_sim.sh 2./test/run_multinode_sim.sh 4./test/run_ddp_multinode.sh 4./test/run_hybrid_multinode.sh 2
The DDP-oriented scripts above are part of the maintained simulate-mode validation set. They provide smoke and control-flow coverage for ProcessGroupNCCL, but they do not imply full numerical or protocol parity with a real NCCL stack.
SOCKET_PATH=/tmp/fakegpu-coordinator.sock
CLUSTER_CONFIG=$PWD/verification/data/cluster_valid.yaml
FAKEGPU_DIST_MODE=simulate \
FAKEGPU_CLUSTER_CONFIG="$CLUSTER_CONFIG" \
FAKEGPU_COORDINATOR_TRANSPORT=unix \
FAKEGPU_COORDINATOR_ADDR="$SOCKET_PATH" \
FAKEGPU_CLUSTER_REPORT_PATH=/tmp/fakegpu-cluster-report.json \
./build/fakegpu-coordinator --transport unix --address "$SOCKET_PATH"COORD_ADDR=127.0.0.1:29591
CLUSTER_CONFIG=$PWD/verification/data/cluster_valid.yaml
FAKEGPU_DIST_MODE=simulate \
FAKEGPU_CLUSTER_CONFIG="$CLUSTER_CONFIG" \
FAKEGPU_COORDINATOR_TRANSPORT=tcp \
FAKEGPU_COORDINATOR_ADDR="$COORD_ADDR" \
FAKEGPU_CLUSTER_REPORT_PATH=/tmp/fakegpu-cluster-report.json \
./build/fakegpu-coordinator --transport tcp --address "$COORD_ADDR"SOCKET_PATH=/tmp/fakegpu-coordinator.sock
CLUSTER_CONFIG=$PWD/verification/data/cluster_valid.yaml
export LD_PRELOAD="$PWD/build/libnccl.so.2${LD_PRELOAD:+:$LD_PRELOAD}"
./fgpu \
--mode simulate \
--dist-mode simulate \
--cluster-config "$CLUSTER_CONFIG" \
--coordinator-transport unix \
--coordinator-addr "$SOCKET_PATH" \
--device-count 4 \
torchrun \
--nnodes=1 \
--nproc_per_node=4 \
--master_addr 127.0.0.1 \
--master_port 29500 \
your_training_script.pyNotes:
./fgpuandpython3 -m fakegpuare equivalent launch styles.--device-countcontrols how many fake devices FakeGPU exposes to the process.torchrunrendezvous settings are separate from the FakeGPU coordinator address.
When distributed mode is enabled and FAKEGPU_CLUSTER_REPORT_PATH is set, FakeGPU writes a cluster report with:
- world-size and coordinator metadata
- per-collective call counts, bytes, and estimated time
- intra-node and inter-node link statistics
- per-rank wait time, timeouts, and communicator-init counts
FAKEGPU_COORDINATOR_ADDRmissing while not inpassthrough- rank or world-size mismatch between runtime environment and cluster config
- forgetting to preload
libnccl.so.2for the distributed path - using proxy or passthrough modes before the basic simulate path is already known to work