# LLM-Speed

LLM-Speed is a compact CUDA kernel playground for LLM inference primitives: FlashAttention forward kernels, Tensor Core GEMM, high-performance GEMM tiling, and a thin PyTorch binding layer. The goal is not a giant framework; it is a focused, inspectable implementation that is easy to benchmark, study, and extend.
## Features

- FlashAttention forward path with O(N) memory behavior
- Tensor Core GEMM and optimized mixed-precision GEMM baselines
- PyTorch-facing bindings for experimentation from Python
- Benchmark and correctness harnesses for iteration and regression checks
- Readable CUDA code that keeps the optimization ladder visible: naive → tiled → FlashAttention → Tensor Core
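The O(N) memory behavior comes from the online-softmax idea FlashAttention is built on: attention is accumulated key-block by key-block while a running maximum and normalizer are maintained, so the full N×N score matrix is never materialized. A minimal NumPy sketch of the trick for a single query vector (illustrative only, not the repository's kernel code):

```python
import numpy as np

def online_softmax_attention(q, k, v, block=64):
    """Streaming softmax(q @ k.T) @ v over key blocks: O(block) extra memory."""
    m = -np.inf                    # running max of scores seen so far
    l = 0.0                        # running softmax normalizer
    acc = np.zeros_like(v[0])      # running weighted sum of value rows
    for start in range(0, k.shape[0], block):
        s = q @ k[start:start + block].T       # scores for this key block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)              # rescale old accumulator
        p = np.exp(s - m_new)                  # unnormalized block probabilities
        l = l * scale + p.sum()
        acc = acc * scale + p @ v[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(16)
k = rng.standard_normal((256, 16))
v = rng.standard_normal((256, 16))

out = online_softmax_attention(q, k, v)
# Matches the materialize-everything softmax to float precision:
s = q @ k.T
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum() @ v
```

The CUDA kernels apply the same rescaling per tile in registers/shared memory; the Python loop only shows the math.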
## Repository layout

| Area | Key files | Notes |
|---|---|---|
| Attention kernels | `src/naive_attention.cu`, `src/tiled_attention.cu`, `src/flash_attention.cu` | Correctness baseline through memory-efficient forward attention |
| GEMM kernels | `src/tensor_core_gemm.cu`, `src/hgemm_kernel.cu` | Tensor Core path plus optimized GEMM |
| Shared primitives | `include/*.cuh` | Online softmax, shared memory helpers, pipeline, warp ops |
| Python package | `cuda_llm_ops/` | pybind11 bindings, versioned package surface, profiler utilities |
| Validation | `tests/`, `benchmarks/` | CPU-safe tests, GPU tests, performance scripts |
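The naive kernel's job is to define what "correct" means for everything above it in the table. Its reference computation, softmax(QKᵀ/√d)·V, can be stated in a few lines of NumPy (a sketch of the math, not the repository's test code):

```python
import numpy as np

def reference_attention(q, k, v):
    """Naive softmax(Q K^T / sqrt(d)) V — the O(N^2)-memory correctness baseline."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # full (N, N) score matrix
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
out = reference_attention(q, k, v)   # shape (8, 4)
```

The tiled and FlashAttention kernels are then validated by comparing against this baseline within a floating-point tolerance.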
## Requirements

| Component | Minimum |
|---|---|
| Python | 3.8 |
| CUDA Toolkit | 11.0 |
| PyTorch | 2.0 |
| GPU | Volta / SM70 or newer |
Supported precision paths in the repository today: FP32, FP16, INT8.
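For a rough picture of what the INT8 path involves, symmetric per-tensor quantization maps FP32 values onto int8 with a single scale factor. The sketch below is a generic illustration in NumPy; the repository's actual quantization scheme (per-tensor vs. per-channel, rounding mode) may differ:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x ~= q * scale, q in [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
max_err = np.abs(x - x_hat).max()   # rounding error is bounded by scale / 2
```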
## Quick start

```bash
git clone https://github.com/LessUp/llm-speed.git
cd llm-speed
python3 -m venv .venv
. .venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt pytest hypothesis ruff pre-commit

# Build the extension when CUDA is available
pip install -e .
```

Smoke-test the user-facing package surface:

```bash
python -c "import cuda_llm_ops; print(cuda_llm_ops.__version__)"
```

Run the repository checks that do not require a GPU:

```bash
ruff check cuda_llm_ops/ tests/ benchmarks/
pytest tests/ -v -m "not cuda"
pre-commit run --all-files
```

## Usage

```python
import torch
from cuda_llm_ops import flash_attention, tensor_core_gemm

# FlashAttention forward: (batch, heads, seq_len, head_dim), FP16 on GPU
q = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
attn = flash_attention(q, k, v, is_causal=True)

# Tensor Core GEMM on FP16 operands
a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 1024, device="cuda", dtype=torch.float16)
c = tensor_core_gemm(a, b)
```

## Project structure

```
llm-speed/
├── cuda_llm_ops/   # Python package and bindings
├── src/            # CUDA kernels
├── include/        # CUDA helpers and primitives
├── tests/          # pytest-based validation
├── benchmarks/     # benchmark scripts
├── docs/           # user-facing docs
├── openspec/       # active specs and change tracking
├── AGENTS.md       # shared AI workflow contract
├── CLAUDE.md       # Claude-specific defaults
└── .github/        # workflows and Copilot instructions
```
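The first rung of the GEMM optimization ladder in `src/` can be previewed in plain Python: a tiled matmul computes the same product as a naive triple loop, but accumulates block by block, which on a GPU is what allows each tile to be staged in shared memory. This is an illustrative sketch; tile sizes and memory layout in the CUDA kernels differ:

```python
import numpy as np

def tiled_gemm(a, b, tile=32):
    """Blocked matmul: same result as a @ b, computed one tile at a time."""
    m, kk = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, kk, tile):
                # On a GPU, these sub-blocks would be staged in shared memory
                # so each element of a and b is read from global memory fewer times.
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((96, 64)).astype(np.float32)
b = rng.standard_normal((64, 128)).astype(np.float32)
c = tiled_gemm(a, b)
```

The Tensor Core kernels keep this blocked structure but hand each tile's inner product to the WMMA/MMA hardware path.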
## Governance

This project is governed by OpenSpec. Before changing behavior or interfaces:

```
/opsx:propose <change-name>
/opsx:apply <change-name>
```

See CONTRIBUTING.md, AGENTS.md, and CLAUDE.md for the working agreement.
## References

- Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- NVIDIA CUTLASS