Skip to content

Latest commit

ย 

History

History
405 lines (297 loc) ยท 13.1 KB

File metadata and controls

405 lines (297 loc) ยท 13.1 KB

HPC-AI-Optimization-Lab

A Comprehensive CUDA Kernel Optimization Laboratory for AI Workloads

CUDA C++20 CMake License Docs

English | ็ฎ€ไฝ“ไธญๆ–‡


๐ŸŽฏ Overview

HPC-AI-Optimization-Lab is an educational and production-ready CUDA kernel library designed for AI inference workloads. It provides step-by-step optimized implementations of critical GPU operations, from basic elementwise operations to advanced Tensor Core matrix multiplication.

โœจ Why This Project?

Feature HPC-AI-Lab cuBLAS CUTLASS
Learning Focus โœ… Progressive optimization โŒ Black box โš ๏ธ Complex
Production Ready โœ… Tested & benchmarked โœ… Highly optimized โœ… Optimized
Easy to Use โœ… Simple API + Python โœ… API โš ๏ธ Templates
Educational โœ… 7-step GEMM journey โŒ No โš ๏ธ Advanced
Modern AI โœ… FlashAttention, RoPE, FP8 โœ… Yes โœ… Yes

Perfect for:

  • ๐ŸŽ“ Students: Learn CUDA optimization from first principles
  • ๐Ÿ”ฌ Researchers: Prototype new kernel optimizations
  • ๐Ÿญ Engineers: Production-ready kernels for AI workloads

๐Ÿš€ Quick Start

One-Minute Setup

# Clone, build, and test
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . -j$(nproc)
ctest --output-on-failure

Prerequisites

Requirement Version Notes
CUDA Toolkit 12.4+ Download
CMake 3.24+ pip install cmake or system package
C++ Compiler GCC 11+ / Clang 14+ C++20 support required
NVIDIA GPU Compute Capability 7.0+ Volta, Turing, Ampere, Hopper

Build Options

# Basic build (core library only)
cmake .. -DCMAKE_BUILD_TYPE=Release

# With examples and Python bindings
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DBUILD_EXAMPLES=ON \
         -DBUILD_PYTHON_BINDINGS=ON

# Target specific GPU architectures
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90"  # A100 + H100

Run Examples

# ReLU example (elementwise operation)
./examples/elementwise/relu_example

# GEMM benchmark (all 7 optimization steps)
./examples/gemm/gemm_benchmark

# Python usage (if bindings enabled)
python examples/python/basic_usage.py

๐Ÿ“Š Performance Highlights

GEMM Optimization Journey (FP32, 4096ร—4096, A100)

Step Technique Performance Speedup
1 Naive 0.5 TFLOPS 1ร— (baseline)
2 Shared Memory Tiling 2.0 TFLOPS 4ร—
3 Double Buffering 3.5 TFLOPS 7ร—
4 Register Tiling 6.0 TFLOPS 12ร—
5 Tensor Core WMMA 50+ TFLOPS 100ร—
6 Tensor Core MMA PTX 60+ TFLOPS 120ร—
7 Software Pipelining 70+ TFLOPS 140ร—

๐Ÿ’ก Key Insight: Tensor Core acceleration provides 100ร— speedup over naive implementation!

Module Performance Summary

Module Operations FP32 Perf Status
Elementwise ReLU, Sigmoid, Transpose Memory-bound โœ… Stable
Reduction Softmax, LayerNorm, RMSNorm Optimized โœ… Stable
GEMM Matrix multiplication 70+ TFLOPS โœ… Stable
Attention FlashAttention, RoPE IO-aware โœ… Stable
Convolution Implicit GEMM Competitive โœ… Stable

๐Ÿ“š Documentation

๐ŸŒ Online Documentation

Visit our comprehensive documentation at: https://lessup.github.io/hpc-ai-optimization-lab/

๐Ÿ“– Quick Links

Topic English ไธญๆ–‡
Getting Started Installation ๅฎ‰่ฃ…ๆŒ‡ๅ—
Quick Start 5-min Guide ๅฟซ้€Ÿๅ…ฅ้—จ
GEMM Optimization 7-Step Journey GEMMไผ˜ๅŒ–
Memory Optimization Guide ่ฎฟๅญ˜ไผ˜ๅŒ–
FlashAttention Guide FlashAttention
Performance Tuning Guide ๆ€ง่ƒฝ่ฐƒไผ˜
API Reference C++/Python API APIๅ‚่€ƒ

๐ŸŽ“ Recommended Learning Path

๐ŸŒฑ Beginner (1-2 weeks)
โ”œโ”€โ”€ Installation & Quick Start
โ”œโ”€โ”€ Memory Optimization (coalesced access, vectorization)
โ”œโ”€โ”€ Reduction Operations (warp shuffle, online algorithms)
โ””โ”€โ”€ GEMM Steps 1-4 (shared memory to register tiling)

๐Ÿš€ Intermediate (2-4 weeks)
โ”œโ”€โ”€ GEMM Steps 5-7 (Tensor Core WMMA, MMA PTX, pipelining)
โ”œโ”€โ”€ FlashAttention (IO-aware attention)
โ””โ”€โ”€ Profiling & Performance Tuning

๐Ÿ† Advanced (ongoing)
โ”œโ”€โ”€ CUDA 13 Hopper Features (TMA, Clusters, FP8)
โ”œโ”€โ”€ CUTLASS Source Code Study
โ””โ”€โ”€ Research Paper Implementations

๐Ÿ—๏ธ Project Structure

hpc-ai-optimization-lab/
โ”œโ”€โ”€ src/                        # CUDA kernel implementations
โ”‚   โ”œโ”€โ”€ common/                 # Shared utilities (Tensor, Timer, CUDA checks)
โ”‚   โ”œโ”€โ”€ elementwise/            # ReLU, Sigmoid, VectorAdd, Transpose
โ”‚   โ”œโ”€โ”€ reduction/              # Softmax, LayerNorm, RMSNorm
โ”‚   โ”œโ”€โ”€ gemm/                   # 7-step GEMM optimization (flagship!)
โ”‚   โ”œโ”€โ”€ convolution/            # Implicit GEMM, Winograd
โ”‚   โ”œโ”€โ”€ attention/              # FlashAttention, RoPE, TopK
โ”‚   โ”œโ”€โ”€ quantization/           # INT8/FP8 quantization
โ”‚   โ””โ”€โ”€ cuda13/                 # Hopper features (TMA, Clusters, FP8)
โ”‚
โ”œโ”€โ”€ tests/                      # Comprehensive test suite
โ”‚   โ”œโ”€โ”€ common/                 # Utility tests
โ”‚   โ”œโ”€โ”€ elementwise/            # Elementwise tests
โ”‚   โ”œโ”€โ”€ gemm/                   # GEMM tests (property-based)
โ”‚   โ””โ”€โ”€ ...                     # All modules tested
โ”‚
โ”œโ”€โ”€ examples/                   # Standalone examples
โ”‚   โ”œโ”€โ”€ elementwise/            # ReLU example
โ”‚   โ”œโ”€โ”€ reduction/              # Softmax benchmark
โ”‚   โ”œโ”€โ”€ gemm/                   # GEMM benchmark
โ”‚   โ”œโ”€โ”€ convolution/            # Conv example
โ”‚   โ”œโ”€โ”€ attention/              # FlashAttention example
โ”‚   โ”œโ”€โ”€ quantization/           # Quantization example
โ”‚   โ”œโ”€โ”€ cuda13/                 # CUDA 13 example
โ”‚   โ””โ”€โ”€ python/                 # Python usage examples
โ”‚
โ”œโ”€โ”€ python/                     # Python bindings (nanobind)
โ”‚   โ”œโ”€โ”€ bindings/               # C++ binding code
โ”‚   โ””โ”€โ”€ benchmark/              # Python benchmarks
โ”‚
โ”œโ”€โ”€ docs/                       # Documentation (VitePress + Doxygen)
โ”‚   โ”œโ”€โ”€ en/                     # English documentation
โ”‚   โ”œโ”€โ”€ zh-CN/                  # Chinese documentation
โ”‚   โ””โ”€โ”€ .vitepress/             # VitePress configuration
โ”‚
โ”œโ”€โ”€ docker/                     # Docker environment
โ”‚   โ”œโ”€โ”€ Dockerfile
โ”‚   โ””โ”€โ”€ docker-compose.yml
โ”‚
โ””โ”€โ”€ .github/                    # CI/CD workflows
    โ””โ”€โ”€ workflows/
        โ”œโ”€โ”€ ci.yml              # Continuous Integration
        โ””โ”€โ”€ pages.yml           # Documentation deployment

๐Ÿ’ป Usage Examples

C++ API

#include "gemm/gemm.cuh"
#include "common/tensor.cuh"

// Allocate GPU tensors
auto A = hpc::common::make_tensor<float>(hpc::common::Device, {M, K});
auto B = hpc::common::make_tensor<float>(hpc::common::Device, {K, N});
auto C = hpc::common::make_tensor<float>(hpc::common::Device, {M, N});

// Launch optimized GEMM kernel
hpc::gemm::gemm<float, hpc::gemm::OptLevel::Advanced>(
    A.data(), B.data(), C.data(), M, N, K, stream);

Python API

import hpc_ai_opt
import numpy as np

# Create input data
A = np.random.randn(1024, 1024).astype(np.float32)
B = np.random.randn(1024, 1024).astype(np.float32)

# Execute optimized GEMM
C = hpc_ai_opt.gemm(A, B)

print(f"Result shape: {C.shape}")
print(f"Performance: {hpc_ai_opt.last_tflops:.1f} TFLOPS")

๐Ÿงช Testing & Quality

Two-Tier Testing Strategy

Unit Tests (GoogleTest)

# Run all tests
ctest --output-on-failure

# Run specific test suite
./tests/gemm/test_gemm

Property-Based Tests (RapidCheck)

  • Automatically generates edge cases
  • Tests all input size combinations
  • Finds numerical stability issues

Test Coverage

Module Unit Tests Property Tests Coverage
Elementwise 12 48 95%+
Reduction 9 36 90%+
GEMM 15 60 98%+
Attention 8 32 92%+
Total 60+ 200+ 95%+

๐Ÿณ Docker Environment

Use our pre-configured Docker environment for hassle-free development:

# Start development environment
cd docker && docker-compose up -d
docker exec -it hpc-ai-lab bash

# Inside container: everything is pre-installed!
cmake -S . -B build && cmake --build build -j$(nproc)
ctest --test-dir build

๐Ÿค Contributing

We welcome contributions! This project follows Spec-Driven Development (SDD).

CI Scope Note: This repository does not currently provide full native CUDA build-and-test coverage in CI. The CI pipeline focuses on code formatting, consistency checks, and documentation builds. GPU-dependent tests require local execution or self-hosted runners.

Quick Start

# 1. Fork and clone
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab

# 2. Create feature branch
git checkout -b feature/my-optimization

# 3. Make changes and add tests
# Follow specs/ directory for requirements

# 4. Ensure tests pass
cmake -S . -B build && cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure

# 5. Commit and push
git commit -m "feat: optimize GEMM step 3"
git push origin feature/my-optimization

CI Status

โš ๏ธ Note: Current CI focuses on code formatting, consistency, and documentation. GPU tests require local execution or self-hosted runners.

See CONTRIBUTING.md for detailed guidelines.


๐Ÿ“ˆ Roadmap

Completed (v0.1.0 - v0.3.0) โœ…

  • Elementwise operations (4 kernels)
  • Reduction operations (3 kernels)
  • GEMM optimization (7 steps)
  • FlashAttention + RoPE + TopK
  • INT8/FP8 quantization
  • CUDA 13 Hopper features
  • Python bindings (nanobind)
  • Comprehensive documentation

In Progress (v0.4.0) ๐Ÿšง

  • FP8 GEMM (Hopper native)
  • Multi-GPU support
  • CUTLASS integration
  • Performance regression tests

Planned (v0.5.0+) ๐ŸŽฏ

  • MoE (Mixture of Experts) support
  • Sparse GEMM optimization
  • Auto-tuning framework
  • PyTorch integration

๐Ÿ“Š Support Matrix

Production-Ready โœ…

Module FP32 FP16 BF16 INT8 FP8 Status
Elementwise โœ… โœ… โœ… - - Stable
Reduction โœ… โœ… โœ… - - Stable
GEMM โœ… โœ… โœ… โœ… ๐Ÿšง Stable
Convolution โœ… โœ… - - - Stable
Attention โœ… โœ… - - - Stable
Quantization โœ… โœ… - โœ… ๐Ÿšง Stable

Experimental ๐Ÿงช

Feature Status Notes
FP8 GEMM Demo Scaled FP16 behavior
TMA Fallback Async copy instead
Thread Block Clusters Fallback Block reduction
Winograd Conv Fallback Implicit GEMM path

๐Ÿ™ Acknowledgments


๐Ÿ“„ License

This project is licensed under the Apache License 2.0 - see LICENSE for details.


โญ Star this repo if you find it helpful!

Report Bug ยท Request Feature ยท Documentation

Made with โค๏ธ by the HPC-AI-Optimization-Lab Contributors