HPC-AI-Optimization-Lab is an educational and production-ready CUDA kernel library designed for AI inference workloads. It provides step-by-step optimized implementations of critical GPU operations, from basic elementwise operations to advanced Tensor Core matrix multiplication.
| Feature | HPC-AI-Lab | cuBLAS | CUTLASS |
|---|---|---|---|
| Learning Focus | ✅ Progressive optimization | ❌ Black box | |
| Production Ready | ✅ Tested & benchmarked | ✅ Highly optimized | ✅ Optimized |
| Easy to Use | ✅ Simple API + Python | ❌ C API | |
| Educational | ✅ 7-step GEMM journey | ❌ No | |
| Modern AI | ✅ FlashAttention, RoPE, FP8 | ✅ Yes | ✅ Yes |
Perfect for:
- 🎓 Students: Learn CUDA optimization from first principles
- 🔬 Researchers: Prototype new kernel optimizations
- 🏭 Engineers: Production-ready kernels for AI workloads
```bash
# Clone, build, and test
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . -j$(nproc)
ctest --output-on-failure
```

| Requirement | Version | Notes |
|---|---|---|
| CUDA Toolkit | 12.4+ | Download |
| CMake | 3.24+ | `pip install cmake` or system package |
| C++ Compiler | GCC 11+ / Clang 14+ | C++20 support required |
| NVIDIA GPU | Compute Capability 7.0+ | Volta, Turing, Ampere, Hopper |
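The compute-capability requirement above maps directly onto the values you pass to `-DCMAKE_CUDA_ARCHITECTURES`. As a quick reference, here is a small helper (hypothetical, not part of the library) encoding NVIDIA's public capability-to-architecture mapping:

```python
# Hypothetical helper: maps a GPU's compute capability (as used in
# -DCMAKE_CUDA_ARCHITECTURES, e.g. "80;90") to its architecture family.
CC_TO_ARCH = {
    70: "Volta",   # e.g. V100
    75: "Turing",  # e.g. T4, RTX 20xx
    80: "Ampere",  # e.g. A100
    86: "Ampere",  # e.g. RTX 30xx
    89: "Ada",     # e.g. RTX 40xx
    90: "Hopper",  # e.g. H100
}

def arch_name(cc: int) -> str:
    """Return the architecture family for a compute capability like 80."""
    if cc < 70:
        raise ValueError("This library requires compute capability 7.0+")
    return CC_TO_ARCH.get(cc, "unknown")
```

For example, `arch_name(80)` returns `"Ampere"`, matching the `"80;90"` A100 + H100 build shown below.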
```bash
# Basic build (core library only)
cmake .. -DCMAKE_BUILD_TYPE=Release

# With examples and Python bindings
cmake .. -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_EXAMPLES=ON \
  -DBUILD_PYTHON_BINDINGS=ON

# Target specific GPU architectures
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90"  # A100 + H100
```

```bash
# ReLU example (elementwise operation)
./examples/elementwise/relu_example

# GEMM benchmark (all 7 optimization steps)
./examples/gemm/gemm_benchmark

# Python usage (if bindings enabled)
python examples/python/basic_usage.py
```

| Step | Technique | Performance | Speedup |
|---|---|---|---|
| 1 | Naive | 0.5 TFLOPS | 1× (baseline) |
| 2 | Shared Memory Tiling | 2.0 TFLOPS | 4× |
| 3 | Double Buffering | 3.5 TFLOPS | 7× |
| 4 | Register Tiling | 6.0 TFLOPS | 12× |
| 5 | Tensor Core WMMA | 50+ TFLOPS | 100× |
| 6 | Tensor Core MMA PTX | 60+ TFLOPS | 120× |
| 7 | Software Pipelining | 70+ TFLOPS | 140× |
💡 Key Insight: Tensor Core acceleration provides a 100× speedup over the naive implementation!
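The TFLOPS figures in the table follow from the standard GEMM operation count: an M×K by K×N multiply performs 2·M·N·K floating-point operations (one multiply and one add per accumulation). A quick sketch in plain Python, independent of the library:

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for an M x K @ K x N matrix multiply.

    A GEMM performs one multiply and one add per (m, n, k) triple,
    i.e. 2*M*N*K floating-point operations in total.
    """
    flops = 2.0 * m * n * k
    return flops / seconds / 1e12

# Example: a 4096^3 GEMM finishing in 2.0 ms sustains ~68.7 TFLOPS,
# in the ballpark of the pipelined Tensor Core step above.
print(round(gemm_tflops(4096, 4096, 4096, 2.0e-3), 1))
```

This is also the formula the benchmarks use to convert kernel timings into the numbers reported above.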
| Module | Operations | FP32 Perf | Status |
|---|---|---|---|
| Elementwise | ReLU, Sigmoid, Transpose | Memory-bound | ✅ Stable |
| Reduction | Softmax, LayerNorm, RMSNorm | Optimized | ✅ Stable |
| GEMM | Matrix multiplication | 70+ TFLOPS | ✅ Stable |
| Attention | FlashAttention, RoPE | IO-aware | ✅ Stable |
| Convolution | Implicit GEMM | Competitive | ✅ Stable |
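The reduction module's softmax, and FlashAttention's IO-aware inner loop, both rest on the online (single-pass, max-tracked) softmax formulation. A NumPy reference sketch of the idea, not the library's kernel:

```python
import numpy as np

def online_softmax(x: np.ndarray) -> np.ndarray:
    """Single-pass ("online") softmax: track a running maximum and a
    rescaled running sum so the input is streamed once, the way a GPU
    kernel tiles it. Matches the standard max-subtracted softmax."""
    m = -np.inf   # running maximum
    s = 0.0       # running sum of exp(x_i - m)
    for v in x:
        m_new = max(m, v)
        # when the max improves, rescale the accumulated sum
        s = s * np.exp(m - m_new) + np.exp(v - m_new)
        m = m_new
    return np.exp(x - m) / s
```

Because the running sum is always expressed relative to the current maximum, the loop never overflows even for large inputs, which is exactly why the online form is used for both the reduction kernels and attention tiling.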
Visit our comprehensive documentation at: https://lessup.github.io/hpc-ai-optimization-lab/
| Topic | English | 中文 |
|---|---|---|
| Getting Started | Installation | 安装指南 |
| Quick Start | 5-min Guide | 快速入门 |
| GEMM Optimization | 7-Step Journey | GEMM优化 |
| Memory Optimization | Guide | 访存优化 |
| FlashAttention | Guide | FlashAttention |
| Performance Tuning | Guide | 性能调优 |
| API Reference | C++/Python API | API参考 |
```text
🌱 Beginner (1-2 weeks)
├── Installation & Quick Start
├── Memory Optimization (coalesced access, vectorization)
├── Reduction Operations (warp shuffle, online algorithms)
└── GEMM Steps 1-4 (shared memory to register tiling)

🚀 Intermediate (2-4 weeks)
├── GEMM Steps 5-7 (Tensor Core WMMA, MMA PTX, pipelining)
├── FlashAttention (IO-aware attention)
└── Profiling & Performance Tuning

🏆 Advanced (ongoing)
├── CUDA 13 Hopper Features (TMA, Clusters, FP8)
├── CUTLASS Source Code Study
└── Research Paper Implementations
```
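The "warp shuffle" item in the beginner track refers to the tree reduction a warp performs with `__shfl_down_sync`: each lane adds the value held by the lane `offset` positions away, halving `offset` each step. A Python model of that access pattern (illustrative only, lane indices stand in for threads):

```python
def warp_tree_reduce(vals):
    """Model of a warp-level tree sum: at each step, lane i adds the
    value held by lane i + offset, then the offset halves -- the
    pattern __shfl_down_sync enables on the GPU. len(vals) must be a
    power of two (32 for a real warp)."""
    lanes = list(vals)
    offset = len(lanes) // 2
    while offset > 0:
        for i in range(offset):
            lanes[i] += lanes[i + offset]  # "shuffle down" by offset
        offset //= 2
    return lanes[0]  # lane 0 ends up holding the warp sum
```

A 32-lane warp therefore needs only log2(32) = 5 steps and no shared memory, which is why the reduction kernels lean on it.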
```text
hpc-ai-optimization-lab/
├── src/                  # CUDA kernel implementations
│   ├── common/           # Shared utilities (Tensor, Timer, CUDA checks)
│   ├── elementwise/      # ReLU, Sigmoid, VectorAdd, Transpose
│   ├── reduction/        # Softmax, LayerNorm, RMSNorm
│   ├── gemm/             # 7-step GEMM optimization (flagship!)
│   ├── convolution/      # Implicit GEMM, Winograd
│   ├── attention/        # FlashAttention, RoPE, TopK
│   ├── quantization/     # INT8/FP8 quantization
│   └── cuda13/           # Hopper features (TMA, Clusters, FP8)
│
├── tests/                # Comprehensive test suite
│   ├── common/           # Utility tests
│   ├── elementwise/      # Elementwise tests
│   ├── gemm/             # GEMM tests (property-based)
│   └── ...               # All modules tested
│
├── examples/             # Standalone examples
│   ├── elementwise/      # ReLU example
│   ├── reduction/        # Softmax benchmark
│   ├── gemm/             # GEMM benchmark
│   ├── convolution/      # Conv example
│   ├── attention/        # FlashAttention example
│   ├── quantization/     # Quantization example
│   ├── cuda13/           # CUDA 13 example
│   └── python/           # Python usage examples
│
├── python/               # Python bindings (nanobind)
│   ├── bindings/         # C++ binding code
│   └── benchmark/        # Python benchmarks
│
├── docs/                 # Documentation (VitePress + Doxygen)
│   ├── en/               # English documentation
│   ├── zh-CN/            # Chinese documentation
│   └── .vitepress/       # VitePress configuration
│
├── docker/               # Docker environment
│   ├── Dockerfile
│   └── docker-compose.yml
│
└── .github/              # CI/CD workflows
    └── workflows/
        ├── ci.yml        # Continuous Integration
        └── pages.yml     # Documentation deployment
```
```cpp
#include "gemm/gemm.cuh"
#include "common/tensor.cuh"

// Allocate GPU tensors
auto A = hpc::common::make_tensor<float>(hpc::common::Device, {M, K});
auto B = hpc::common::make_tensor<float>(hpc::common::Device, {K, N});
auto C = hpc::common::make_tensor<float>(hpc::common::Device, {M, N});

// Launch optimized GEMM kernel
hpc::gemm::gemm<float, hpc::gemm::OptLevel::Advanced>(
    A.data(), B.data(), C.data(), M, N, K, stream);
```

```python
import hpc_ai_opt
import numpy as np

# Create input data
A = np.random.randn(1024, 1024).astype(np.float32)
B = np.random.randn(1024, 1024).astype(np.float32)

# Execute optimized GEMM
C = hpc_ai_opt.gemm(A, B)
print(f"Result shape: {C.shape}")
print(f"Performance: {hpc_ai_opt.last_tflops:.1f} TFLOPS")
```

Unit Tests (GoogleTest)

```bash
# Run all tests
ctest --output-on-failure

# Run specific test suite
./tests/gemm/test_gemm
```

Property-Based Tests (RapidCheck)
- Automatically generates edge cases
- Tests all input size combinations
- Finds numerical stability issues
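The C++ suite uses RapidCheck for this; the same idea can be sketched in plain Python with randomized shapes and NumPy as the oracle (function names here are hypothetical, not the library's test API):

```python
import numpy as np

def gemm_under_test(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Stand-in for the kernel being verified; the real suite would
    call the library's GEMM here and compare it to NumPy."""
    return a @ b

def check_gemm_property(trials: int = 20, seed: int = 0) -> bool:
    """Property: for random shapes and values, GEMM matches NumPy
    within an FP32-appropriate tolerance."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        m, n, k = rng.integers(1, 64, size=3)  # random edge-ish sizes
        a = rng.standard_normal((m, k), dtype=np.float32)
        b = rng.standard_normal((k, n), dtype=np.float32)
        if not np.allclose(gemm_under_test(a, b), np.matmul(a, b),
                           rtol=1e-4, atol=1e-5):
            return False
    return True
```

Randomizing the shapes is what surfaces the edge cases (tiny K, non-multiple-of-tile sizes) that fixed unit tests tend to miss.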
| Module | Unit Tests | Property Tests | Coverage |
|---|---|---|---|
| Elementwise | 12 | 48 | 95%+ |
| Reduction | 9 | 36 | 90%+ |
| GEMM | 15 | 60 | 98%+ |
| Attention | 8 | 32 | 92%+ |
| Total | 60+ | 200+ | 95%+ |
Use our pre-configured Docker environment for hassle-free development:
```bash
# Start development environment
cd docker && docker-compose up -d
docker exec -it hpc-ai-lab bash

# Inside container: everything is pre-installed!
cmake -S . -B build && cmake --build build -j$(nproc)
ctest --test-dir build
```

We welcome contributions! This project follows Spec-Driven Development (SDD).
CI Scope Note: This repository does not currently provide full native CUDA build-and-test coverage in CI. The CI pipeline focuses on code formatting, consistency checks, and documentation builds. GPU-dependent tests require local execution or self-hosted runners.
```bash
# 1. Fork and clone
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab

# 2. Create feature branch
git checkout -b feature/my-optimization

# 3. Make changes and add tests
# Follow specs/ directory for requirements

# 4. Ensure tests pass
cmake -S . -B build && cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure

# 5. Commit and push
git commit -m "feat: optimize GEMM step 3"
git push origin feature/my-optimization
```
See CONTRIBUTING.md for detailed guidelines.
- Elementwise operations (4 kernels)
- Reduction operations (3 kernels)
- GEMM optimization (7 steps)
- FlashAttention + RoPE + TopK
- INT8/FP8 quantization
- CUDA 13 Hopper features
- Python bindings (nanobind)
- Comprehensive documentation
- FP8 GEMM (Hopper native)
- Multi-GPU support
- CUTLASS integration
- Performance regression tests
- MoE (Mixture of Experts) support
- Sparse GEMM optimization
- Auto-tuning framework
- PyTorch integration
| Module | FP32 | FP16 | BF16 | INT8 | FP8 | Status |
|---|---|---|---|---|---|---|
| Elementwise | ✅ | ✅ | ✅ | - | - | Stable |
| Reduction | ✅ | ✅ | ✅ | - | - | Stable |
| GEMM | ✅ | ✅ | ✅ | ✅ | 🚧 | Stable |
| Convolution | ✅ | ✅ | - | - | - | Stable |
| Attention | ✅ | ✅ | - | - | - | Stable |
| Quantization | ✅ | ✅ | - | ✅ | 🚧 | Stable |
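For the INT8 column, the usual scheme (and a reasonable reading of this module) is symmetric per-tensor scaling: map the largest magnitude to 127, round, and clip. A NumPy round-trip sketch, illustrative rather than the library's actual API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: scale so the largest
    magnitude maps to 127, round to nearest, clip to int8 range."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

x = np.random.default_rng(0).standard_normal(1024, dtype=np.float32)
q, s = quantize_int8(x)
err = np.max(np.abs(dequantize_int8(q, s) - x))
# Round-to-nearest bounds the per-element error by half a step.
assert err <= 0.5 * s + 1e-6
```

The 🚧 FP8 path works the same way conceptually, just with a far coarser grid, which is why it currently ships as a scaled-FP16 demo on pre-Hopper hardware.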
| Feature | Status | Notes |
|---|---|---|
| FP8 GEMM | Demo | Scaled FP16 behavior |
| TMA | Fallback | Async copy instead |
| Thread Block Clusters | Fallback | Block reduction |
| Winograd Conv | Fallback | Implicit GEMM path |
- NVIDIA CUTLASS - Reference implementations
- FlashAttention - Attention optimization
- How to Optimize a CUDA Matmul - Excellent tutorial
- NVIDIA CUDA Samples - Best practices
This project is licensed under the Apache License 2.0 - see LICENSE for details.
⭐ Star this repo if you find it helpful!
Report Bug · Request Feature · Documentation
Made with ❤️ by the HPC-AI-Optimization-Lab Contributors