
⚡ LLM-Speed

CI · Pages · License: MIT · Python 3.8+ · CUDA 11.0+

English | 简体中文 | Docs

LLM-Speed is a compact CUDA kernel playground for LLM inference primitives: FlashAttention forward kernels, Tensor Core GEMM, high-performance GEMM tiling, and a thin PyTorch binding layer. The goal is not to be a giant framework; it is to be a focused, inspectable implementation that is easy to benchmark, study, and extend.

Why this repo is useful

  • FlashAttention forward path with O(N) memory behavior
  • Tensor Core GEMM and optimized mixed-precision GEMM baselines
  • PyTorch-facing bindings for experimentation from Python
  • Benchmark and correctness harnesses for iteration and regression checks
  • Readable CUDA code that keeps the optimization ladder visible: naive → tiled → FlashAttention → Tensor Core
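
FlashAttention's O(N) memory behavior comes from an online softmax: the row maximum and normalizer are updated block by block, so the full score row never has to be materialized. A minimal NumPy sketch of that rescaling step (illustrative only — the repository implements this in CUDA, in include/ and src/flash_attention.cu):

```python
import numpy as np

def online_softmax(scores, block=4):
    """Streaming softmax statistics over `scores`, one block at a time.

    Keeps only a running max `m` and running sum `s`, so working memory
    stays O(block) instead of O(N) for the full score row -- the same
    rescaling idea used by FlashAttention's forward pass.
    """
    m = -np.inf   # running max seen so far
    s = 0.0       # running sum of exp(score - m)
    for start in range(0, len(scores), block):
        chunk = scores[start:start + block]
        m_new = max(m, chunk.max())
        # Rescale the old sum to the new max before adding the new block.
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()
        m = m_new
    return m, s

scores = np.array([1.0, 3.0, -2.0, 0.5, 4.0, 2.0])
m, s = online_softmax(scores, block=2)
# Matches the one-shot, full-row computation:
assert np.isclose(s, np.exp(scores - scores.max()).sum())
```

With `m` and `s` in hand, the normalized probabilities (and the attention output) can be recovered blockwise in a second streaming pass, which is exactly what lets the kernel avoid storing the N×N score matrix.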

What is included

| Area | Key files | Notes |
| --- | --- | --- |
| Attention kernels | src/naive_attention.cu, src/tiled_attention.cu, src/flash_attention.cu | Correctness baseline through memory-efficient forward attention |
| GEMM kernels | src/tensor_core_gemm.cu, src/hgemm_kernel.cu | Tensor Core path plus optimized GEMM |
| Shared primitives | include/*.cuh | Online softmax, shared memory helpers, pipeline, warp ops |
| Python package | cuda_llm_ops/ | pybind11 bindings, versioned package surface, profiler utilities |
| Validation | tests/, benchmarks/ | CPU-safe tests, GPU tests, performance scripts |

Requirements

| Component | Minimum |
| --- | --- |
| Python | 3.8 |
| CUDA Toolkit | 11.0 |
| PyTorch | 2.0 |
| GPU | Volta / SM70 or newer |

Supported precision paths in the repository today: FP32, FP16, INT8.
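
FP16 GEMM paths on Tensor Cores conventionally accumulate partial products in FP32 (the hardware MMA instructions support an FP32 accumulator for exactly this reason). A small CPU illustration of why accumulator width matters — pure NumPy, not the repository's kernels:

```python
import numpy as np

def dot_with_accumulator(x, y, acc_dtype):
    """Dot product that rounds the running sum to `acc_dtype` at every
    step, mimicking a hardware accumulator of that width."""
    acc = acc_dtype(0.0)
    for xi, yi in zip(x, y):
        acc = acc_dtype(acc + acc_dtype(xi) * acc_dtype(yi))
    return float(acc)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float16)
y = rng.standard_normal(4096).astype(np.float16)

# High-precision reference for the same FP16 inputs.
exact = float(np.dot(x.astype(np.float64), y.astype(np.float64)))

err_fp16 = abs(dot_with_accumulator(x, y, np.float16) - exact)
err_fp32 = abs(dot_with_accumulator(x, y, np.float32) - exact)

# Widening only the accumulator recovers most of the lost precision.
assert err_fp32 < err_fp16
```

The inputs stay FP16 in both cases; only the accumulator width changes, which is the trade the mixed-precision GEMM baselines are built around.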

Quick start

git clone https://github.com/LessUp/llm-speed.git
cd llm-speed

python3 -m venv .venv
. .venv/bin/activate

pip install -U pip setuptools wheel
pip install -r requirements.txt pytest hypothesis ruff pre-commit

# Build the extension when CUDA is available
pip install -e .

Smoke-test the user-facing package surface:

python -c "import cuda_llm_ops; print(cuda_llm_ops.__version__)"

Run the repository checks that do not require a GPU:

ruff check cuda_llm_ops/ tests/ benchmarks/
pytest tests/ -v -m "not cuda"
pre-commit run --all-files
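
The `-m "not cuda"` filter implies GPU-only tests are tagged with a custom `cuda` marker. A minimal sketch of how such a marker might be registered — hypothetical; the repository's actual conftest.py or pytest configuration may differ:

```python
# conftest.py (illustrative sketch)

def pytest_configure(config):
    # Register the custom "cuda" marker so `-m "not cuda"` deselects
    # GPU-only tests instead of triggering an unknown-marker warning.
    config.addinivalue_line("markers", "cuda: tests that require a CUDA GPU")
```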

Example

import torch
from cuda_llm_ops import flash_attention, tensor_core_gemm

q = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

attn = flash_attention(q, k, v, is_causal=True)

a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 1024, device="cuda", dtype=torch.float16)
c = tensor_core_gemm(a, b)
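
The tests/ and benchmarks/ harnesses compare kernel output against a framework reference, and the tiling idea behind the GEMM kernels can be sketched the same way on CPU. A NumPy sketch — the tile size and shapes here are illustrative, not values from the repository:

```python
import numpy as np

def tiled_gemm(a, b, tile=32):
    """Blocked matrix multiply: the same tiling the CUDA kernels do with
    shared memory, expressed on the CPU for clarity."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Each (i, j) output tile accumulates one K-slice of
                # partial products at a time, like a shared-memory stage.
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
                )
    return c

rng = np.random.default_rng(1)
a = rng.random((128, 96), dtype=np.float32)
b = rng.random((96, 64), dtype=np.float32)
assert np.allclose(tiled_gemm(a, b), a @ b, atol=1e-4)
```

The loop order and tile boundaries are where the interesting CUDA-side decisions live (shared-memory staging, double buffering, warp-level fragments); the NumPy version only fixes the arithmetic contract those kernels must satisfy.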

Project structure

llm-speed/
├── cuda_llm_ops/        # Python package and bindings
├── src/                 # CUDA kernels
├── include/             # CUDA helpers and primitives
├── tests/               # pytest-based validation
├── benchmarks/          # benchmark scripts
├── docs/                # user-facing docs
├── openspec/            # active specs and change tracking
├── AGENTS.md            # shared AI workflow contract
├── CLAUDE.md            # Claude-specific defaults
└── .github/             # workflows and Copilot instructions

Contributing

This project is governed by OpenSpec. Before changing behavior or interfaces:

/opsx:propose <change-name>
/opsx:apply <change-name>

See CONTRIBUTING.md, AGENTS.md, and CLAUDE.md for the working agreement.

References

  1. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  2. Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  3. NVIDIA CUTLASS