Compact experiments for model training, inference export, and distributed-systems baselines on commodity hardware.
- `train.py`: TinyGPT char-level training with AMP, gradient accumulation, and optional DDP.
- `train_fsdp.py`: minimal FSDP wrapper entrypoint for sharding-strategy experiments.
- `bench_gpu.py`: FP16/FP32 matmul and MLP throughput micro-benchmarks.
- `bench_collectives.py`: collective-communication baseline (`all_reduce` or local-reduction fallback).
- `export_onnx.py`: ONNX export path for TensorRT / Triton serving pipelines.
- `infer_ort.py`: ONNX Runtime latency benchmark (CPU or CUDA EP).
- `infer_quant.py`: dynamic-quantization latency comparison on CPU.
- `scheduler_sim.py`: simple scheduler-policy simulation (FIFO vs greedy packing).
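The two policies `scheduler_sim.py` compares can be modeled in a few lines. A minimal sketch, assuming jobs are `(id, gpus_needed)` pairs, every scheduled job runs for one unit of time, and every job fits on the cluster; the function names here are illustrative, not the script's API:

```python
from collections import deque

def fifo(jobs, total_gpus):
    """Strict FIFO: a blocked head-of-line job stalls everything behind it."""
    queue, time, schedule = deque(jobs), 0, []
    while queue:
        free, batch = total_gpus, []
        # Only admit jobs from the front; stop at the first one that doesn't fit.
        while queue and queue[0][1] <= free:
            job = queue.popleft()
            free -= job[1]
            batch.append(job[0])
        schedule.append((time, batch))
        time += 1  # all jobs in the batch finish after one unit of time
    return schedule

def greedy_pack(jobs, total_gpus):
    """Greedy packing: scan past a blocked head to backfill idle GPUs."""
    pending, time, schedule = list(jobs), 0, []
    while pending:
        free, batch, rest = total_gpus, [], []
        for job in pending:
            if job[1] <= free:
                free -= job[1]
                batch.append(job[0])
            else:
                rest.append(job)
        schedule.append((time, batch))
        pending, time = rest, time + 1
    return schedule

jobs = [("a", 2), ("b", 3), ("c", 2)]  # (job id, GPUs needed)
print(fifo(jobs, 4))         # three time steps: "b" blocks "c"
print(greedy_pack(jobs, 4))  # two time steps: "c" backfills next to "a"
```

With 4 GPUs, FIFO needs three steps because `b` (3 GPUs) cannot start behind `a` and then blocks `c`, while greedy packing co-schedules `a` and `c` and finishes in two.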
```
cd gpu-llm-infra-lab
python -m venv .venv
.venv\Scripts\activate
pip install -U pip
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -e .
pip install onnx onnxruntime matplotlib
```

On every push/PR to main, GitHub Actions runs a CPU smoke pipeline: `configs/ci_smoke.yaml` train → `export_onnx --static` → `infer_ort`. See `.github/workflows/ci.yml`.
After exporting ONNX, benchmark latency without TensorRT:
```
python -m gpu_llm_infra_lab.export_onnx --ckpt runs/tinyshakespeare_300/ckpt_final.pt --out artifacts/tiny_gpt.onnx
python -m gpu_llm_infra_lab.infer_ort --onnx artifacts/tiny_gpt.onnx --ckpt runs/tinyshakespeare_300/ckpt_final.pt --seq 128 --steps 100
```

Local reference (Tiny Shakespeare checkpoint, CPUExecutionProvider): ~3.87 ms/run (mean of 100 runs after default warmup; seq=128, batch=1). Your numbers will vary by CPU and ORT build.
Use `--cuda` if you installed GPU-enabled ONNX Runtime and want the CUDAExecutionProvider.
The repository now includes a public corpus download helper:
```
python -m gpu_llm_infra_lab.fetch_data --dataset tinyshakespeare --out data/tinyshakespeare.txt
```

Source: Karpathy's tiny Shakespeare corpus.
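For this dataset the helper amounts to a one-file download. A standalone sketch with `urllib` (the raw-GitHub URL is the commonly used mirror of Karpathy's corpus and is an assumption here, not a path taken from the repo):

```python
import urllib.request
from pathlib import Path

# Commonly used mirror of the tiny Shakespeare corpus (assumed URL).
URL = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
       "master/data/tinyshakespeare/input.txt")

def fetch(out="data/tinyshakespeare.txt", url=URL):
    """Download the corpus to `out`, creating parent directories as needed."""
    out = Path(out)
    out.parent.mkdir(parents=True, exist_ok=True)
    text = urllib.request.urlopen(url, timeout=30).read().decode("utf-8")
    out.write_text(text, encoding="utf-8")
    print(f"wrote {len(text):,} chars to {out}")
```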
```
python -u -m gpu_llm_infra_lab.train --config configs/tinyshakespeare.yaml --max-iters 300 --out_dir runs/tinyshakespeare_300 | Tee-Object -FilePath runs/tinyshakespeare_300/train.log
python -m gpu_llm_infra_lab.plot_training --log runs/tinyshakespeare_300/train.log --out artifacts/train_curve_tinyshakespeare.png
python -m gpu_llm_infra_lab.export_onnx --ckpt runs/tinyshakespeare_300/ckpt_final.pt --out artifacts/tiny_gpt.onnx
```
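The AMP + gradient-accumulation step that `train.py` uses follows the standard PyTorch pattern. A generic sketch (not the repo's exact code; the function name and loader format are illustrative):

```python
import torch

def train_steps(model, loader, opt, accum=4, device="cuda"):
    """One pass over `loader` with autocast and gradient accumulation."""
    amp = device == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=amp)  # no-op when disabled
    model.train()
    for i, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type=device, enabled=amp):
            logits = model(x)
            loss = torch.nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        # Divide by accum so the summed gradients match a large-batch step.
        scaler.scale(loss / accum).backward()
        if (i + 1) % accum == 0:
            scaler.step(opt)
            scaler.update()
            opt.zero_grad(set_to_none=True)
```

The key detail is that `backward()` runs every micro-batch while `opt.step()` runs every `accum` micro-batches, trading activation memory for effective batch size.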
```
python -m gpu_llm_infra_lab.infer_quant --ckpt runs/tinyshakespeare_300/ckpt_final.pt --steps 100
python -m gpu_llm_infra_lab.bench_gpu
python -m gpu_llm_infra_lab.bench_collectives
python -m gpu_llm_infra_lab.scheduler_sim --gpus 4
python -u -m gpu_llm_infra_lab.train_fsdp --config configs/default.yaml --max-iters 80 --force-fsdp --out_dir runs/compare_fsdp1
```

Hardware and software (local run):
- GPU: NVIDIA GeForce RTX 2060 (6GB), driver 591.59
- Python: 3.13.8
- CUDA path verified by `bench_gpu` device detection
Observed numbers from current runs:
- `bench_gpu`:
  - FP16 4096x4096: ~7.49 ms, ~18.36 TFLOP/s
  - FP32 4096x4096: ~27.13 ms, ~5.07 TFLOP/s
  - MLP fp32: ~7.18 ms/step; AMP fp16: ~2.67 ms/step
- `train` (tinyshakespeare, 300 iters):
  - loss: 4.31 -> 2.36
  - throughput: ~60k to ~70k tokens/s
- `infer_quant` (CPU, 100 steps):
  - fp32 eager: ~5.73 ms/forward
  - dynamic quant: ~7.05 ms/forward
- `bench_collectives` (single-process baseline):
  - ~66.8 ms/iter for ~50M float elements (local-reduction path)
- `train_fsdp --force-fsdp` (single-process sharding sanity check):
  - ~47.8k tokens/s at iter 50
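The fp32-vs-dynamic-quant comparison above follows the standard `torch.ao.quantization.quantize_dynamic` recipe. A self-contained sketch (the function name is illustrative, not `infer_quant`'s API):

```python
import time
import torch

def compare_dynamic_quant(model, example, steps=100, warmup=10):
    """Time fp32 eager vs dynamically quantized Linear layers on CPU."""
    model = model.eval().cpu()
    # Replace nn.Linear with int8 dynamic-quant equivalents.
    qmodel = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)
    results = {}
    for name, m in [("fp32", model), ("int8-dynamic", qmodel)]:
        with torch.no_grad():
            for _ in range(warmup):
                m(example)
            t0 = time.perf_counter()
            for _ in range(steps):
                m(example)
        results[name] = (time.perf_counter() - t0) / steps * 1e3
    return results  # ms/forward per variant
```

On a model this small, per-call quantize/dequantize overhead can outweigh the int8 matmul savings, which is consistent with dynamic quant coming out slower than fp32 eager in the numbers above.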
- Training curve: `artifacts/train_curve_tinyshakespeare.png`
- Triton model-repository layout snapshot: see `deploy/triton/model_repository/` below
A minimal Triton model repository is provided under:
```
deploy/triton/model_repository/tiny_gpt_onnx
deploy/triton/model_repository/tiny_gpt_trt
```
Use the `config.pbtxt` files as templates and place the actual model files in the version folder `1/`.
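A minimal `config.pbtxt` for the ONNX variant might look like the fragment below. The tensor names, dtypes, and shapes here are assumptions for illustration; they must match what `export_onnx` actually emits (check the graph with Netron or `get_inputs()` in ONNX Runtime):

```protobuf
name: "tiny_gpt_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 0          # full shapes given below, no implicit batch dim
input [
  {
    name: "input_ids"      # assumed export name
    data_type: TYPE_INT64
    dims: [ 1, 128 ]
  }
]
output [
  {
    name: "logits"         # assumed export name
    data_type: TYPE_FP32
    dims: [ 1, 128, -1 ]   # vocab dimension left dynamic
  }
]
```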
If trtexec is installed on your machine:
```
trtexec --onnx=artifacts/tiny_gpt.onnx --saveEngine=artifacts/tiny_gpt_fp16.plan --fp16 --memPoolSize=workspace:1024
```

Add measured latency from your machine to this README once available.
bench_collectives.py currently records a single-node baseline. For NVLink / PCIe / RDMA comparisons, run the same script across the target environments and append the results in table form (same tensor shape, same iteration count) for fair comparison.
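A portable measurement loop in that spirit, with the same single-process fallback, can be sketched as follows (this is not `bench_collectives.py` itself; the function name is illustrative):

```python
import time
import torch
import torch.distributed as dist

def bench_allreduce(numel=50_000_000, iters=10, device="cpu"):
    """Time all_reduce when a process group exists, else a local reduction."""
    distributed = dist.is_available() and dist.is_initialized()
    x = torch.randn(numel, device=device)
    # One warmup iteration so allocator / NCCL setup is excluded from timing.
    dist.all_reduce(x) if distributed else x.sum()
    t0 = time.perf_counter()
    for _ in range(iters):
        if distributed:
            dist.all_reduce(x)
        else:
            x.sum()  # local-reduction fallback, same tensor shape
    if device == "cuda":
        torch.cuda.synchronize()  # CUDA work is async; wait before reading the clock
    return (time.perf_counter() - t0) / iters * 1e3  # ms/iter

# Multi-rank runs: torchrun --nproc_per_node=2 this_script.py
# (after dist.init_process_group, the all_reduce branch is taken)
```

Keeping `numel` and `iters` fixed across environments is what makes the NVLink / PCIe / RDMA columns comparable.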
MIT. See LICENSE.

