Compact experiments for model training, inference export, and distributed-systems baselines on commodity hardware.
- `train.py`: TinyGPT char-level training with AMP, gradient accumulation, and optional DDP.
- `train_fsdp.py`: minimal FSDP wrapper entrypoint for sharding-strategy experiments.
- `bench_gpu.py`: FP16/FP32 matmul and MLP throughput micro-benchmarks.
- `bench_collectives.py`: collective-communication baseline (`all_reduce` or local-reduction fallback).
- `export_onnx.py`: ONNX export path for TensorRT / Triton serving pipelines.
- `infer_ort.py`: ONNX Runtime latency benchmark (CPU or CUDA EP).
- `infer_quant.py`: dynamic-quantization latency comparison on CPU.
- `scheduler_sim.py`: simple scheduler-policy simulation (FIFO vs greedy packing).
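The two policies `scheduler_sim.py` compares can be modeled in a few lines. A minimal sketch, assuming jobs are `(id, gpus_needed)` pairs, every scheduled job runs for one unit of time, and every job fits on the cluster; the function names here are illustrative, not the script's API:

```python
from collections import deque

def fifo(jobs, total_gpus):
    """Strict FIFO: a blocked head-of-line job stalls everything behind it."""
    queue, time, schedule = deque(jobs), 0, []
    while queue:
        free, batch = total_gpus, []
        # Only admit jobs from the front; stop at the first one that doesn't fit.
        while queue and queue[0][1] <= free:
            job = queue.popleft()
            free -= job[1]
            batch.append(job[0])
        schedule.append((time, batch))
        time += 1  # all jobs in the batch finish after one unit of time
    return schedule

def greedy_pack(jobs, total_gpus):
    """Greedy packing: scan past a blocked head to backfill idle GPUs."""
    pending, time, schedule = list(jobs), 0, []
    while pending:
        free, batch, rest = total_gpus, [], []
        for job in pending:
            if job[1] <= free:
                free -= job[1]
                batch.append(job[0])
            else:
                rest.append(job)
        schedule.append((time, batch))
        pending, time = rest, time + 1
    return schedule

jobs = [("a", 2), ("b", 3), ("c", 2)]  # (job id, GPUs needed)
print(fifo(jobs, 4))         # three time steps: "b" blocks "c"
print(greedy_pack(jobs, 4))  # two time steps: "c" backfills next to "a"
```

With 4 GPUs, FIFO needs three steps because `b` (3 GPUs) cannot start behind `a` and then blocks `c`, while greedy packing co-schedules `a` and `c` and finishes in two.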
```
cd gpu-llm-infra-lab
python -m venv .venv
.venv\Scripts\activate
pip install -U pip
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -e .
pip install onnx onnxruntime matplotlib
```

On every push/PR to main, GitHub Actions runs a CPU smoke pipeline: `configs/ci_smoke.yaml` train → `export_onnx --static` → `infer_ort`. See `.github/workflows/ci.yml`.
After exporting ONNX, benchmark latency without TensorRT:
```
python -m gpu_llm_infra_lab.export_onnx --ckpt runs/tinyshakespeare_300/ckpt_final.pt --out artifacts/tiny_gpt.onnx
python -m gpu_llm_infra_lab.infer_ort --onnx artifacts/tiny_gpt.onnx --ckpt runs/tinyshakespeare_300/ckpt_final.pt --seq 128 --steps 100
```

Local reference (Tiny Shakespeare checkpoint, CPUExecutionProvider): ~3.87 ms/run (mean of 100 runs after default warmup; seq=128, batch=1). Your numbers will vary by CPU and ORT build.
Use `--cuda` if you installed GPU-enabled ONNX Runtime and want the CUDAExecutionProvider.
The repository now includes a public corpus download helper:
```
python -m gpu_llm_infra_lab.fetch_data --dataset tinyshakespeare --out data/tinyshakespeare.txt
```

Source: Karpathy's tiny Shakespeare corpus.
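For this dataset the helper amounts to a one-file download. A standalone sketch with `urllib` (the raw-GitHub URL is the commonly used mirror of Karpathy's corpus and is an assumption here, not a path taken from the repo):

```python
import urllib.request
from pathlib import Path

# Commonly used mirror of the tiny Shakespeare corpus (assumed URL).
URL = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
       "master/data/tinyshakespeare/input.txt")

def fetch(out="data/tinyshakespeare.txt", url=URL):
    """Download the corpus to `out`, creating parent directories as needed."""
    out = Path(out)
    out.parent.mkdir(parents=True, exist_ok=True)
    text = urllib.request.urlopen(url, timeout=30).read().decode("utf-8")
    out.write_text(text, encoding="utf-8")
    print(f"wrote {len(text):,} chars to {out}")
```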
```
python -u -m gpu_llm_infra_lab.train --config configs/tinyshakespeare.yaml --max-iters 300 --out_dir runs/tinyshakespeare_300 | Tee-Object -FilePath runs/tinyshakespeare_300/train.log
python -m gpu_llm_infra_lab.plot_training --log runs/tinyshakespeare_300/train.log --out artifacts/train_curve_tinyshakespeare.png
python -m gpu_llm_infra_lab.export_onnx --ckpt runs/tinyshakespeare_300/ckpt_final.pt --out artifacts/tiny_gpt.onnx
```
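The AMP + gradient-accumulation step that `train.py` uses follows the standard PyTorch pattern. A generic sketch (not the repo's exact code; the function name and loader format are illustrative):

```python
import torch

def train_steps(model, loader, opt, accum=4, device="cuda"):
    """One pass over `loader` with autocast and gradient accumulation."""
    amp = device == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=amp)  # no-op when disabled
    model.train()
    for i, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type=device, enabled=amp):
            logits = model(x)
            loss = torch.nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        # Divide by accum so the summed gradients match a large-batch step.
        scaler.scale(loss / accum).backward()
        if (i + 1) % accum == 0:
            scaler.step(opt)
            scaler.update()
            opt.zero_grad(set_to_none=True)
```

The key detail is that `backward()` runs every micro-batch while `opt.step()` runs every `accum` micro-batches, trading activation memory for effective batch size.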
```
python -m gpu_llm_infra_lab.infer_quant --ckpt runs/tinyshakespeare_300/ckpt_final.pt --steps 100
python -m gpu_llm_infra_lab.bench_gpu
python -m gpu_llm_infra_lab.bench_collectives
python -m gpu_llm_infra_lab.scheduler_sim --gpus 4
python -u -m gpu_llm_infra_lab.train_fsdp --config configs/default.yaml --max-iters 80 --force-fsdp --out_dir runs/compare_fsdp1
```

Hardware and software (local run):
- GPU: NVIDIA GeForce RTX 2060 (6GB), driver 591.59
- Python: 3.13.8
- CUDA path verified by `bench_gpu` device detection
Observed numbers from current runs:
- `bench_gpu`:
  - FP16 4096x4096: ~7.49 ms, ~18.36 TFLOP/s
  - FP32 4096x4096: ~27.13 ms, ~5.07 TFLOP/s
  - MLP fp32: ~7.18 ms/step; AMP fp16: ~2.67 ms/step
- `train` (tinyshakespeare, 300 iters):
  - loss: 4.31 -> 2.36
  - throughput: ~60k to ~70k tokens/s
- `infer_quant` (CPU, 100 steps):
  - fp32 eager: ~5.73 ms/forward
  - dynamic quant: ~7.05 ms/forward
- `bench_collectives` (single-process baseline):
  - ~66.8 ms/iter for ~50M float elements (local-reduction path)
- `train_fsdp --force-fsdp` (single-process sharding sanity check):
  - ~47.8k tokens/s at iter 50
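The fp32-vs-dynamic-quant comparison above follows the standard `torch.ao.quantization.quantize_dynamic` recipe. A self-contained sketch (the function name is illustrative, not `infer_quant`'s API):

```python
import time
import torch

def compare_dynamic_quant(model, example, steps=100, warmup=10):
    """Time fp32 eager vs dynamically quantized Linear layers on CPU."""
    model = model.eval().cpu()
    # Replace nn.Linear with int8 dynamic-quant equivalents.
    qmodel = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)
    results = {}
    for name, m in [("fp32", model), ("int8-dynamic", qmodel)]:
        with torch.no_grad():
            for _ in range(warmup):
                m(example)
            t0 = time.perf_counter()
            for _ in range(steps):
                m(example)
        results[name] = (time.perf_counter() - t0) / steps * 1e3
    return results  # ms/forward per variant
```

On a model this small, per-call quantize/dequantize overhead can outweigh the int8 matmul savings, which is consistent with dynamic quant coming out slower than fp32 eager in the numbers above.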
- Training curve: `artifacts/train_curve_tinyshakespeare.png`
- Triton model-repository layout snapshot: see `deploy/triton/model_repository/` below
A minimal Triton model repository is provided under:
```
deploy/triton/model_repository/tiny_gpt_onnx
deploy/triton/model_repository/tiny_gpt_trt
```
Use the `config.pbtxt` files as templates and place the actual model files in the version folder `1/`.
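A minimal `config.pbtxt` for the ONNX variant might look like the fragment below. The tensor names, dtypes, and shapes here are assumptions for illustration; they must match what `export_onnx` actually emits (check the graph with Netron or `get_inputs()` in ONNX Runtime):

```protobuf
name: "tiny_gpt_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 0          # full shapes given below, no implicit batch dim
input [
  {
    name: "input_ids"      # assumed export name
    data_type: TYPE_INT64
    dims: [ 1, 128 ]
  }
]
output [
  {
    name: "logits"         # assumed export name
    data_type: TYPE_FP32
    dims: [ 1, 128, -1 ]   # vocab dimension left dynamic
  }
]
```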
If trtexec is installed on your machine:
```
trtexec --onnx=artifacts/tiny_gpt.onnx --saveEngine=artifacts/tiny_gpt_fp16.plan --fp16 --memPoolSize=workspace:1024
```

Add measured latency from your machine to this README once available.
bench_collectives.py currently records a single-node baseline. For NVLink / PCIe / RDMA comparisons, run the same script across the target environments and append the results in table form (same tensor shape, same iteration count) for fair comparison.
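A portable measurement loop in that spirit, with the same single-process fallback, can be sketched as follows (this is not `bench_collectives.py` itself; the function name is illustrative):

```python
import time
import torch
import torch.distributed as dist

def bench_allreduce(numel=50_000_000, iters=10, device="cpu"):
    """Time all_reduce when a process group exists, else a local reduction."""
    distributed = dist.is_available() and dist.is_initialized()
    x = torch.randn(numel, device=device)
    # One warmup iteration so allocator / NCCL setup is excluded from timing.
    dist.all_reduce(x) if distributed else x.sum()
    t0 = time.perf_counter()
    for _ in range(iters):
        if distributed:
            dist.all_reduce(x)
        else:
            x.sum()  # local-reduction fallback, same tensor shape
    if device == "cuda":
        torch.cuda.synchronize()  # CUDA work is async; wait before reading the clock
    return (time.perf_counter() - t0) / iters * 1e3  # ms/iter

# Multi-rank runs: torchrun --nproc_per_node=2 this_script.py
# (after dist.init_process_group, the all_reduce branch is taken)
```

Keeping `numel` and `iters` fixed across environments is what makes the NVLink / PCIe / RDMA columns comparable.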
MIT. See LICENSE.

