# LLM-Speed

LLM-Speed is a compact CUDA kernel playground for LLM inference primitives: FlashAttention forward kernels, Tensor Core GEMM, high-performance GEMM tiling, and a thin PyTorch binding layer. The goal is not a giant framework; it is a focused, inspectable implementation that is easy to benchmark, study, and extend.
## Features

- FlashAttention forward path with O(N) memory behavior
- Tensor Core GEMM and optimized mixed-precision GEMM baselines
- PyTorch-facing bindings for experimentation from Python
- Benchmark and correctness harnesses for iteration and regression checks
- Readable CUDA code that keeps the optimization ladder visible: naive → tiled → FlashAttention → Tensor Core
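The O(N) memory behavior comes from the online-softmax idea FlashAttention is built on: attention is accumulated key-block by key-block while a running maximum and normalizer are maintained, so the full N×N score matrix is never materialized. A minimal NumPy sketch of the trick for a single query vector (illustrative only, not the repository's kernel code):

```python
import numpy as np

def online_softmax_attention(q, k, v, block=64):
    """Streaming softmax(q @ k.T) @ v over key blocks: O(block) extra memory."""
    m = -np.inf                    # running max of scores seen so far
    l = 0.0                        # running softmax normalizer
    acc = np.zeros_like(v[0])      # running weighted sum of value rows
    for start in range(0, k.shape[0], block):
        s = q @ k[start:start + block].T       # scores for this key block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)              # rescale old accumulator
        p = np.exp(s - m_new)                  # unnormalized block probabilities
        l = l * scale + p.sum()
        acc = acc * scale + p @ v[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(16)
k = rng.standard_normal((256, 16))
v = rng.standard_normal((256, 16))

out = online_softmax_attention(q, k, v)
# Matches the materialize-everything softmax to float precision:
s = q @ k.T
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum() @ v
```

The CUDA kernels apply the same rescaling per tile in registers/shared memory; the Python loop only shows the math.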
## Repository layout

| Area | Key files | Notes |
|---|---|---|
| Attention kernels | `src/naive_attention.cu`, `src/tiled_attention.cu`, `src/flash_attention.cu` | Correctness baseline through memory-efficient forward attention |
| GEMM kernels | `src/tensor_core_gemm.cu`, `src/hgemm_kernel.cu` | Tensor Core path plus optimized GEMM |
| Shared primitives | `include/*.cuh` | Online softmax, shared memory helpers, pipeline, warp ops |
| Python package | `cuda_llm_ops/` | pybind11 bindings, versioned package surface, profiler utilities |
| Validation | `tests/`, `benchmarks/` | CPU-safe tests, GPU tests, performance scripts |
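The naive kernel's job is to define what "correct" means for everything above it in the table. Its reference computation, softmax(QKᵀ/√d)·V, can be stated in a few lines of NumPy (a sketch of the math, not the repository's test code):

```python
import numpy as np

def reference_attention(q, k, v):
    """Naive softmax(Q K^T / sqrt(d)) V — the O(N^2)-memory correctness baseline."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # full (N, N) score matrix
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
out = reference_attention(q, k, v)   # shape (8, 4)
```

The tiled and FlashAttention kernels are then validated by comparing against this baseline within a floating-point tolerance.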
## Requirements

| Component | Minimum |
|---|---|
| Python | 3.8 |
| CUDA Toolkit | 11.0 |
| PyTorch | 2.0 |
| GPU | Volta / SM70 or newer |
Supported precision paths in the repository today: FP32, FP16, INT8.
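For a rough picture of what the INT8 path involves, symmetric per-tensor quantization maps FP32 values onto int8 with a single scale factor. The sketch below is a generic illustration in NumPy; the repository's actual quantization scheme (per-tensor vs. per-channel, rounding mode) may differ:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x ~= q * scale, q in [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
max_err = np.abs(x - x_hat).max()   # rounding error is bounded by scale / 2
```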
## Quick start

```bash
git clone https://github.com/LessUp/llm-speed.git
cd llm-speed
python3 -m venv .venv
. .venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt pytest hypothesis ruff pre-commit

# Build the extension when CUDA is available
pip install -e .
```

Smoke-test the user-facing package surface:

```bash
python -c "import cuda_llm_ops; print(cuda_llm_ops.__version__)"
```

Run the repository checks that do not require a GPU:

```bash
ruff check cuda_llm_ops/ tests/ benchmarks/
pytest tests/ -v -m "not cuda"
pre-commit run --all-files
```

## Usage

```python
import torch
from cuda_llm_ops import flash_attention, tensor_core_gemm

# FlashAttention forward: (batch, heads, seq_len, head_dim), FP16 on GPU
q = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
attn = flash_attention(q, k, v, is_causal=True)

# Tensor Core GEMM on FP16 operands
a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 1024, device="cuda", dtype=torch.float16)
c = tensor_core_gemm(a, b)
```

## Project structure

```
llm-speed/
├── cuda_llm_ops/   # Python package and bindings
├── src/            # CUDA kernels
├── include/        # CUDA helpers and primitives
├── tests/          # pytest-based validation
├── benchmarks/     # benchmark scripts
├── docs/           # user-facing docs
├── openspec/       # active specs and change tracking
├── AGENTS.md       # shared AI workflow contract
├── CLAUDE.md       # Claude-specific defaults
└── .github/        # workflows and Copilot instructions
```
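The first rung of the GEMM optimization ladder in `src/` can be previewed in plain Python: a tiled matmul computes the same product as a naive triple loop, but accumulates block by block, which on a GPU is what allows each tile to be staged in shared memory. This is an illustrative sketch; tile sizes and memory layout in the CUDA kernels differ:

```python
import numpy as np

def tiled_gemm(a, b, tile=32):
    """Blocked matmul: same result as a @ b, computed one tile at a time."""
    m, kk = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, kk, tile):
                # On a GPU, these sub-blocks would be staged in shared memory
                # so each element of a and b is read from global memory fewer times.
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((96, 64)).astype(np.float32)
b = rng.standard_normal((64, 128)).astype(np.float32)
c = tiled_gemm(a, b)
```

The Tensor Core kernels keep this blocked structure but hand each tile's inner product to the WMMA/MMA hardware path.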
## Governance

This project is governed by OpenSpec. Before changing behavior or interfaces:

```
/opsx:propose <change-name>
/opsx:apply <change-name>
```

See CONTRIBUTING.md, AGENTS.md, and CLAUDE.md for the working agreement.
## References

- Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- NVIDIA CUTLASS