GPU Linear Algebra for Single-block Systems — a header-only CUDA library of BLAS- and LAPACK-like device functions designed for use within a single thread block.
GLASS functions are __device__ helpers that operate on data in shared or device memory. Every function assumes it runs within one CUDA block — the caller is responsible for launching one block per independent data item. This design enables composable GPU kernels for applications like model-predictive control and rigid-body dynamics.
GLASS started as a small set of hand-rolled SIMT subroutines (glass::, glass::cgrps::) tuned for very small matrices where the launch and dispatch overhead of a vendor library would dominate the actual work. It has since grown into a unified single-block linear-algebra surface that wraps NVIDIA's state-of-the-art device-side libraries — CUB (L1 reductions), cuBLASDx (L2/L3 GEMV/GEMM, including batched), and cuSOLVERDx (LAPACK: Cholesky, LU, QR, triangular and least-squares solves) — under the same __device__ calling convention. The intent is to give callers one consistent API across the full block-size compute scale: pure-SIMT for tiny matrices where the vendor path can't beat unrolled SIMT, and tensor-core-tuned vendor kernels for everything large enough to benefit from them.
Three namespaces are provided:
| Namespace | Backend | Header |
|---|---|---|
| `glass::` | Hand-rolled SIMT, `threadIdx.{x,y,z}` / `blockDim.*` — no cgrps dep | `glass.cuh` |
| `glass::cgrps::` | Hand-rolled SIMT, `g.thread_rank()` / `g.size()` — cooperative groups | `glass-cgrps.cuh` |
| `glass::nvidia::` | CUB (L1) + cuBLASDx (L2/L3, batched) + cuSOLVERDx (LAPACK) — compile-time sizes only | `glass-nvidia.cuh` |
Both glass:: and glass::cgrps:: offer runtime (size as function arg) and compile-time (size as template arg) overloads for every function. The glass::nvidia:: wrappers preserve the same one-block, __device__ calling convention so a single kernel can mix hand-rolled and vendor-backed primitives without leaving the block — and switch between them by changing the namespace prefix when profiling shows one is faster than the other at a given size.
Reduction operations additionally offer glass::low_memory:: (no scratch, thread 0 accumulates) and glass::high_speed:: (warp-shuffle + shared-memory inter-warp reduction) sub-namespaces.
Three questions decide which API to call:
- Are sizes known at compile time?
- Is matrix size large enough that vendor-tuned tensor-core kernels matter?
- Can you launch with the thread count the backend wants?
| Scenario | Use | Reason |
|---|---|---|
| Sizes only known at runtime | `glass::gemm(m, n, k, ...)` | Pure-SIMT, accepts dynamic args |
| Compile-time sizes, small matrices (≤ ~8×8), simple kernel | `glass::gemm<float, M, N, K>(...)` | Compiler unrolls inner loops; ~1 µs/op overhead is hard to beat for tiny sizes |
| Compile-time sizes, larger matrices, tensor-core hardware | `glass::nvidia::gemm<float, M, N, K>(...)` | cuBLASDx generates SM-specific tensor-core code |
| Compile-time sizes inside an existing kernel that uses a different thread count (e.g. GRiD's 352-thread launches) | `glass::nvidia::gemm<float, M, N, K, TC>(...)` with `DEFINE_NVIDIA_GEMM_BLOCKDIM(M,N,K,TC)` | Pins cuBLASDx's `BlockDim<TC,1,1>`; lets you launch with any thread count ≥ TC (P0-1 in the proposal) |
| Need a transposed B (or row-major A/B/C) in the NVIDIA path | `glass::nvidia::gemm<...,LA,LB,LC>` with `DEFINE_NVIDIA_GEMM_BLOCKDIM_LAYOUT(...)` or the `_TRANSB` alias | cuBLASDx `Arrangement`; pure-SIMT fallback no longer needed |
| Linear solve (Mx = b for SPD M) | `glass::nvidia::posv<float, N, NRHS>(...)` | cuSOLVERDx fused factor + solve; faster than chol+trsm at N ≥ 8 |
| Cholesky alone (factor only, custom solve) | `glass::nvidia::chol_inplace<float, N>(...)` | cuSOLVERDx potrf |
| General linear solve (non-SPD) | `glass::nvidia::gesv_no_pivot<float, N, NRHS>(...)` | cuSOLVERDx LU + solve |
| Least-squares / over- or under-determined | `glass::nvidia::gels<float, M, N, NRHS>(...)` | cuSOLVERDx QR (or LQ for under-det) + solve |
| BATCH independent GEMMs of the same shape, want to amortize launch | `glass::nvidia::gemm_batched<...,BATCH,TC>` | Single block, all batches active via `threadIdx.y` |
Heuristic for choosing between the compile-time `glass::` and `glass::nvidia::` GEMM/GEMV at the same size: for square sizes ≤ 12 the pure-SIMT compile-time variant is usually within 2× of cuBLASDx and avoids cuBLASDx's compile-time cost (instantiation can add 30 s to a build). For sizes ≥ 16 (or any rectangular shape that maps cleanly to tensor cores) `glass::nvidia::` typically wins. The benchmark suite (`bench/run_bench.py`) prints both side by side — measure on your target SM before committing.
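As an illustration of the namespace switch described earlier, the sketch below calls both paths from the same one-block kernel; the kernel name, the 6/24 sizes, and the single shared-memory buffer are assumptions for the example, not library defaults:

```cuda
// Sketch only: pure-SIMT for the tiny GEMM, vendor-backed path for the larger one.
// Both sizes are pre-instantiated; the launch must still satisfy the cuBLASDx
// thread-count rule for the 24x24x24 call.
#include "glass.cuh"
#include "glass-nvidia.cuh"

__global__ void mixed_kernel(float* A6, float* B6, float* C6,
                             float* A24, float* B24, float* C24) {
    extern __shared__ __align__(16) char smem_buf[];
    glass::gemm<float, 6, 6, 6>(1.f, A6, B6, 0.f, C6);               // tiny: unrolled SIMT
    __syncthreads();
    glass::nvidia::gemm<float, 24, 24, 24>(1.f, A24, B24, 0.f, C24,  // larger: tensor cores
                                           smem_buf);
}

// Host side: the vendor call dictates threads and shared memory.
// constexpr auto threads = glass::nvidia::gemm_threads<float, 24, 24, 24>();
// constexpr auto smem    = glass::nvidia::gemm_smem_size<float, 24, 24, 24>();
// mixed_kernel<<<1, threads, smem>>>(dA6, dB6, dC6, dA24, dB24, dC24);
```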
When NOT to use `glass::nvidia::`:
- Sizes only known at runtime (the templates require compile-time `M, N, K`).
- You can't add a `DEFINE_NVIDIA_GEMM*` macro for the size you need (e.g. you want every conceivable `(M, N, K)` triple — the macro instantiation cost grows fast).
- You're stuck on an SM cuBLASDx doesn't tune for (it falls back to a generic config; the pure-SIMT compile-time path is often competitive there).
#include "glass.cuh"
__global__ void my_kernel(float* A, float* B, float* C, int m, int n, int k) {
// Runtime size: all threads in the block cooperate
glass::gemm(m, n, k, 1.f, A, B, 0.f, C);
}Launch with one block per data item:
my_kernel<<<num_items, 256>>>(A, B, C, m, n, k);#include "glass.cuh"
__global__ void k(float* A, float* B, float* C, int m, int n, int k) {
glass::gemm(m, n, k, 1.f, A, B, 0.f, C); // GEMM
glass::gemv(m, n, 1.f, A, B, 0.f, C); // GEMV
glass::axpy(n, 1.5f, A, B); // y = alpha*x + y
}#include "glass.cuh"
__global__ void k(float* A, float* B, float* C) {
// Sizes baked in as template params — compiler can unroll loops
glass::gemm<float, 6, 6, 6>(1.f, A, B, 0.f, C);
glass::gemv<float, 6, 6>(1.f, A, B, 0.f, C);
glass::axpy<float, 36>(1.5f, A, B);
}#include "glass-cgrps.cuh"
#include <cooperative_groups.h>
namespace cgrps = cooperative_groups;
__global__ void k(float* A, float* B, float* C) {
// Whole block (default)
glass::cgrps::gemm<float, 6, 6, 6>(1.f, A, B, 0.f, C);
// Warp-level tiling: each warp independently computes a 4×4×4 GEMM
auto warp = cgrps::tiled_partition<32>(cgrps::this_thread_block());
glass::cgrps::gemm<float, 4, 4, 4>(1.f, A, B, 0.f, C, warp);
}#include "glass.cuh"
__global__ void k(float* x, int n) {
// Scratch: ceil(blockDim.x / 32) * sizeof(float) bytes
extern __shared__ float scratch[];
glass::high_speed::reduce(n, x, scratch); // result in x[0]
glass::high_speed::l2norm(n, x, scratch); // ‖x‖₂ in x[0]
}#include "glass-nvidia.cuh"
// Host: query required smem and thread count (both constexpr)
constexpr auto smem = glass::nvidia::gemm_smem_size<float, 6, 6, 6>();
constexpr auto threads = glass::nvidia::gemm_threads<float, 6, 6, 6>();
__global__ void k(float* A, float* B, float* C) {
extern __shared__ __align__(16) char smem_buf[];
glass::nvidia::gemm<float, 6, 6, 6>(1.f, A, B, 0.f, C, smem_buf);
}
// Launch with the EXACT thread count cuBLASDx wants — mismatch deadlocks.
k<<<1, threads, smem>>>(dA, dB, dC);The default form forces every kernel that touches glass::nvidia::gemm to launch with the exact thread count cuBLASDx picked for that GEMM size. When the rest of your kernel needs a different launch (e.g. GRiD codegen launches with 352 threads for the surrounding RNEA/CRBA work), use the BlockDim form:
```cuda
namespace glass { namespace nvidia {
DEFINE_NVIDIA_GEMM_BLOCKDIM(6, 6, 6, 352)   // pin BlockDim<352,1,1>
}}

__global__ void k(float* A, float* B, float* C) {
    extern __shared__ __align__(16) char smem_buf[];
    // OK to launch with 352 threads — extras go idle inside the GEMM.
    glass::nvidia::gemm<float, 6, 6, 6, 352>(1.f, A, B, 0.f, C, smem_buf);
}

constexpr auto smem = glass::nvidia::gemm_smem_size<float, 6, 6, 6, 352>();
k<<<1, 352, smem>>>(dA, dB, dC);
```

The query API tells you the smallest TC cuBLASDx will accept for a (T, M, N, K, SM) tuple:

```cuda
static_assert(glass::nvidia::gemm_block_threads_valid<float, 6, 6, 6, 352>(),
              "352 threads should be enough for 6x6x6 on this SM");
constexpr uint32_t MIN = glass::nvidia::gemm_min_block_threads<float, 6, 6, 6>();
// MIN is the natural BlockDim cuBLASDx picks; pin TC >= MIN.
```

`glass::nvidia::gemm` accepts layout `LA`, `LB`, `LC` template parameters — these mirror cuBLASDx's `Arrangement<>` and let you express transpose / row-major storage without falling back to the pure-SIMT path:

```cuda
namespace glass { namespace nvidia {
// A · Bᵀ (B is row-major, equivalent to "transposed col-major")
DEFINE_NVIDIA_GEMM_BLOCKDIM_TRANSB(6, 6, 6, 352)        // alias for LB=row_major
// Or fully explicit (LA=row, LB=col, LC=col):
DEFINE_NVIDIA_GEMM_BLOCKDIM_LAYOUT(6, 6, 6, 352, 1, 0, 0)
}}

__global__ void k(float* A, float* B, float* C) {
    extern __shared__ __align__(16) char smem_buf[];
    using L = glass::nvidia::layout;
    glass::nvidia::gemm<float, 6, 6, 6, 352,
                        L::col_major, L::row_major, L::col_major>(
        1.f, A, B, 0.f, C, smem_buf);
}
```

Pin the SM at the call site so a single fatbinary can ship tuned code for multiple architectures:
```cuda
namespace glass { namespace nvidia {
DEFINE_NVIDIA_GEMM_BLOCKDIM_SM(6, 6, 6, 352, 890)
DEFINE_NVIDIA_GEMM_BLOCKDIM_SM(6, 6, 6, 352, 1200)
}}

__global__ void k(float* A, float* B, float* C) {
    extern __shared__ __align__(16) char smem_buf[];
    using L = glass::nvidia::layout;
    glass::nvidia::gemm<float, 6, 6, 6, 352,
                        L::col_major, L::col_major, L::col_major,
#if __CUDA_ARCH__ >= 1200
                        1200
#else
                        890
#endif
                        >(1.f, A, B, 0.f, C, smem_buf);
}
```

```cuda
#include "glass-nvidia.cuh"

namespace glass { namespace nvidia {
DEFINE_NVIDIA_POSV_BLOCKDIM(7, 1, 256)   // 7×7 SPD, 1 RHS, BlockDim<256>
}}

__global__ void k(float* A, float* b) {
    extern __shared__ __align__(16) char smem_buf[];
    // Solves A·x = b in place: A := L (lower Cholesky), b := x
    glass::nvidia::posv<float, 7, 1, 256>(A, b, smem_buf);
}

constexpr auto smem = glass::nvidia::posv_smem_size<float, 7, 1, 256>();
k<<<1, 256, smem>>>(dA, db);
```

Available cuSOLVERDx wrappers (all follow the same `DEFINE_NVIDIA_<NAME>` / `_BLOCKDIM` / `_SM` / `_BLOCKDIM_SM` macro pattern):
| Function | Signature | NumPy / SciPy equivalent |
|---|---|---|
| `chol_inplace<T, N>` | `(A, smem)` | `np.linalg.cholesky(A)` (lower) |
| `trsm<T, M, N>` | `(alpha, L, B, smem)` | `scipy.linalg.solve_triangular(L, alpha*B, lower=True)` |
| `posv<T, N, NRHS>` | `(A, B, smem)` | `np.linalg.solve(A, B)` (SPD A; A is destroyed) |
| `potrs<T, N, NRHS>` | `(L, B, smem)` | `scipy.linalg.cho_solve((L, True), B)` |
| `getrf_no_pivot<T, N>` | `(A, smem)` | `scipy.linalg.lu_factor(A)` (no pivoting; A := LU) |
| `getrs_no_pivot<T, N, NRHS>` | `(LU, B, smem)` | `scipy.linalg.lu_solve((LU, ...), B)` |
| `gesv_no_pivot<T, N, NRHS>` | `(A, B, smem)` | `np.linalg.solve(A, B)` (general A; A is destroyed) |
| `geqrf<T, M, N>` | `(A, tau, smem)` | `scipy.linalg.qr(A, mode='raw')` |
| `gels<T, M, N, NRHS>` | `(A, tau, B, smem)` | `np.linalg.lstsq(A, B)` |
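As a sketch of how the other wrappers are called, the least-squares case below follows the same `DEFINE` + smem-query pattern as the posv example above; the exact `DEFINE_NVIDIA_GELS_BLOCKDIM` argument order and the `gels_smem_size` helper name are assumptions inferred from that pattern, not confirmed API:

```cuda
#include "glass-nvidia.cuh"

namespace glass { namespace nvidia {
// Assumed argument order (M, N, NRHS, TC), mirroring DEFINE_NVIDIA_POSV_BLOCKDIM(N, NRHS, TC)
DEFINE_NVIDIA_GELS_BLOCKDIM(8, 4, 1, 256)
}}

__global__ void lstsq_kernel(float* A, float* tau, float* b) {
    extern __shared__ __align__(16) char smem_buf[];
    // Minimizes ‖A·x − b‖₂ for an 8×4 A: A is overwritten by its QR factors, b by x.
    glass::nvidia::gels<float, 8, 4, 1, 256>(A, tau, b, smem_buf);
}

// constexpr auto smem = glass::nvidia::gels_smem_size<float, 8, 4, 1, 256>();  // assumed helper
// lstsq_kernel<<<1, 256, smem>>>(dA, dtau, db);
```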
Linking note: cuSOLVERDx ships a precompiled device library. Add -rdc=true -dlto -L$MATHDX_ROOT/lib -lcusolverdx -lcublas -lcusolver -lcudart to your nvcc command (the bench/run_bench.py driver does this automatically when cusolverdx.hpp is present).
```cuda
namespace glass { namespace nvidia {
DEFINE_NVIDIA_GEMM_BATCHED_BLOCKDIM(6, 6, 6, /*BATCH=*/16, /*TC=*/64)
}}

__global__ void k(float* const* A, float* const* B, float* const* C) {
    extern __shared__ __align__(16) char smem_buf[];
    glass::nvidia::gemm_batched<float, 6, 6, 6, 16, 64>(
        1.f, A, B, 0.f, C, smem_buf);
}

constexpr auto smem = glass::nvidia::gemm_batched_smem_size<float, 6, 6, 6, 16, 64>();
k<<<1, dim3(64, 16), smem>>>(dA_ptrs, dB_ptrs, dC_ptrs);
```

Each batch element is identified by `threadIdx.y`; each per-batch GEMM uses TC threads in `threadIdx.x`. The total launch is `dim3(TC, BATCH)`. See bench/bench_gemm_batched.cu for a full working example with `T**` pointer arrays.
#include "glass.cuh"
__global__ void k(float* A, float* B, float* C, int m, int n, int k) {
extern __shared__ float smem[];
float* s_A = smem;
float* s_B = smem + m * 8;
glass::gemm_tiled<float, 8>(m, n, k, 1.f, A, B, 0.f, C, s_A, s_B);
}
// Host:
size_t smem = (m * 8 + 8 * k) * sizeof(float);
k<<<1, 256, smem>>>(A, B, C, m, n, k);| Component | C++ Standard | Optional deps |
|---|---|---|
glass.cuh, glass-cgrps.cuh |
C++17 (nvcc -std=c++17) |
— |
glass-nvidia.cuh (L1 only) |
C++17 | CUB (bundled with CUDA 11+) |
glass-nvidia.cuh (L2/L3 GEMM/GEMV/batched) |
C++17 + --expt-relaxed-constexpr |
cuBLASDx |
glass-nvidia.cuh (LAPACK: chol/trsm/posv/getrf/gesv/geqrf/gels) |
C++17 + --expt-relaxed-constexpr + -rdc=true -dlto -lcusolverdx -lcublas -lcusolver -lcudart |
cuSOLVERDx |
| Test suite | C++17 | — |
| Benchmark suite | C++17 | (matches the variants you opt in) |
The wrappers gate themselves with GLASS_HAVE_CUBLASDX / GLASS_HAVE_CUSOLVERDX, both auto-detected from include order. To force-enable when including glass-nvidia.cuh from a TU that hasn't pre-included the headers, define GLASS_BENCH_CUBLASDX and/or GLASS_BENCH_CUSOLVERDX (the bench harness sets these). See bench/INSTALL.md for download + linking instructions.
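For example, a standalone translation unit can force both backends on before the include — a minimal sketch of the force-enable macros described above (the MathDx headers must still be on the include path):

```cuda
// Force-enable the cuBLASDx and cuSOLVERDx wrappers even though this TU
// does not pre-include the MathDx headers itself.
#define GLASS_BENCH_CUBLASDX
#define GLASS_BENCH_CUSOLVERDX
#include "glass-nvidia.cuh"
```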
All L1 functions accept uint32_t n (runtime) or <T, N> template args (compile-time).
| Function | Description | NumPy equivalent | Scratch |
|---|---|---|---|
| `axpy(n, alpha, x, y)` | `y = alpha*x + y` | `y += alpha*x` | none |
| `axpy(n, alpha, x, y, z)` | `z = alpha*x + y` | `z = alpha*x + y` | none |
| `axpby(n, alpha, x, beta, y, z)` | `z = alpha*x + beta*y` | `z = alpha*x + beta*y` | none |
| `copy(n, x, y)` | `y = x` | `y = x.copy()` | none |
| `copy(n, alpha, x, y)` | `y = alpha*x` | `y = alpha*x` | none |
| `scal(n, alpha, x)` | `x = alpha*x` (in-place) | `x *= alpha` | none |
| `scal(n, alpha, x, y)` | `y = alpha*x` | `y = alpha*x` | none |
| `swap(n, x, y)` | swap x and y | — | none |
| `dot(n, x, y)` | `y[0] = x·y` (in-place, uses y as scratch) | `np.dot(x,y)` | none |
| `reduce(n, x)` | `x[0] = sum(x)` (in-place) | `np.sum(x)` | none |
| `l2norm(n, x)` | `x[0] = ‖x‖₂` (in-place, destructive) | `np.linalg.norm(x)` | none |
| `infnorm(n, x)` | `x[0] = ‖x‖∞` (in-place) | `np.max(np.abs(x))` | none |
| `asum(n, x, out)` | `out[0] = Σ\|xᵢ\|` | `np.sum(np.abs(x))` | none |
| `clip(n, x, l, u)` | `x = clamp(x, l, u)` | `np.clip(x, l, u)` | none |
| `set_const(n, alpha, x)` | `x = [alpha, …]` | `np.full(n, alpha)` | none |
| `loadIdentity(n, A)` | `A = I_n` (column-major) | `np.eye(n)` | none |
| `addI(n, A, alpha)` | `A += alpha*I` | `A += alpha*np.eye(n)` | none |
| `transpose(N, M, a, b)` | `b = aᵀ` (col-major) | `b = a.T` | none |
| `transpose(N, a)` | in-place transpose N×N | — | none |
| `elementwise_add(N, a, b, c)` | `c = a + b` | `c = a + b` | none |
| `elementwise_sub(N, a, b, c)` | `c = a - b` | `c = a - b` | none |
| `elementwise_mult(N, a, b, c)` | `c = a ⊙ b` | `c = a * b` | none |
| `elementwise_abs(N, a, b)` | `b = \|a\|` | `np.abs(a)` | none |
| `elementwise_max(N, a, b, c)` | `c = max(a,b)` | `np.maximum(a,b)` | none |
| `elementwise_min(N, a, b, c)` | `c = min(a,b)` | `np.minimum(a,b)` | none |
| `prefix_sum_exclusive(x, out, n)` | exclusive scan | — | none |
| `prefix_sum_inclusive(x, out, n)` | inclusive scan | `np.cumsum(x)` | none |
```cuda
glass::reduce(n, x);                       // default: halving reduce
glass::low_memory::reduce(n, x);           // thread 0 accumulates; no scratch
glass::high_speed::reduce(n, x, scratch);  // warp-shuffle; faster for large n
```

Scratch required for high_speed: `ceil(blockDim.x / 32) * sizeof(T)` bytes.
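On the host that works out to one element per warp; a minimal sketch of the launch-side computation, assuming a 256-thread launch of a kernel like the high_speed example earlier (the kernel name `k` is that example's, not a library symbol):

```cuda
// ceil(blockDim.x / 32) floats of dynamic shared memory for the high_speed variants.
constexpr uint32_t THREADS = 256;
constexpr size_t scratch_bytes = ((THREADS + 31) / 32) * sizeof(float);
k<<<1, THREADS, scratch_bytes>>>(x, n);
```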
#include "glass-nvidia.cuh"
extern __shared__ float scratch[]; // sizeof(cub::BlockReduce<T,THREADS>::TempStorage)
glass::nvidia::reduce<float, N, THREADS>(x, scratch);
glass::nvidia::dot<float, N, THREADS>(x, y, out, scratch);
glass::nvidia::l2norm<float, N, THREADS>(x, out, scratch);
// Query scratch size (host-callable):
constexpr auto bytes = glass::nvidia::reduce_smem_size<float, THREADS>();Matrices default to column-major order. Pass ROW_MAJOR=true to use row-major (C-style).
| Function | Description | NumPy equivalent |
|---|---|---|
| `gemv<T>(m, n, alpha, A, x, beta, y)` | `y = alpha*A*x + beta*y` (col-major A) | `y = alpha*A@x + beta*y` |
| `gemv<T,true>(m, n, alpha, A, x, beta, y)` | `y = alpha*Aᵀ*x + beta*y` | `y = alpha*A.T@x + beta*y` |
| `gemv<T,false,true>(m, n, alpha, A, x, beta, y)` | same, but A is row-major | `y = alpha*A@x + beta*y` |
| `gemv_ex<T,TRANSPOSE,ROW_MAJOR_A>(...)` | per-matrix layout control | |
| `ger(m, n, alpha, x, y, A)` | `A += alpha*x*yᵀ` | `A += alpha*np.outer(x,y)` |
Compile-time overloads omit the `m, n` args:

```cuda
glass::gemv<float, 6, 6>(1.f, A, x, 0.f, y);            // runtime: glass::gemv(6, 6, 1.f, A, x, 0.f, y)
glass::cgrps::gemv<float, 6, 6>(1.f, A, x, 0.f, y, g);  // with explicit group
```

Matrices default to column-major order.
| Function | Description | NumPy equivalent | Scratch |
|---|---|---|---|
| `gemm<T>(m,n,k, alpha, A, B, beta, C)` | `C = alpha*A*B + beta*C` (col-major) | `C = alpha*A@B + beta*C` | none |
| `gemm<T,true>(m,n,k, alpha, A, B, beta, C)` | `C = alpha*A*Bᵀ + beta*C` | — | none |
| `gemm<T,false,true>(m,n,k, alpha, A, B, beta, C)` | same, all matrices row-major | — | none |
| `gemm_ex<T,TRANSPOSE_B,ROW_A,ROW_B,ROW_C>(...)` | per-matrix layout control | — | none |
| `gemm_tiled<T,TILE>(m,n,k, alpha, A, B, beta, C, s_A, s_B)` | tiled gemm using shared memory | — | `(m*TILE + TILE*k)*sizeof(T)` |
| `gemm_dispatch<T>(m,n,k, alpha, A, B, beta, C, s_A, s_B)` | auto-selects tiled or plain | — | see below |
| `invertMatrix(n, A, s_temp)` | `A = A⁻¹` in-place (Gauss-Jordan) | `np.linalg.inv(A)` | `(2n+1)*sizeof(T)` |
| `cholDecomp_InPlace(n, A)` | Cholesky `A → L` (lower triangular) | `np.linalg.cholesky(A)` | none |
| `trsm(n, L, b)` | solve `Lx=b` in-place (forward substitution) | `scipy.linalg.solve_triangular(L,b,lower=True)` | none |
Compile-time overloads omit the `m, n, k` args:

```cuda
glass::gemm<float, 6, 6, 6>(1.f, A, B, 0.f, C);
glass::cgrps::gemm<float, 6, 6, 6>(1.f, A, B, 0.f, C, g);
```

`glass::gemm_dispatch` selects tiled or plain gemm at runtime based on whether scratch pointers are provided and `m*k ≤ blockDim`:

```cuda
extern __shared__ float scratch[];
float *s_A = scratch, *s_B = scratch + m * 8;
glass::gemm_dispatch(m, n, k, 1.f, A, B, 0.f, C,
                     (smem > 0) ? s_A : nullptr,
                     (smem > 0) ? s_B : nullptr);
```

Use the host helper to compute required scratch at launch:

```cuda
#include "glass.cuh"

std::size_t smem = glass_gemm_dispatch_smem(m, k, /*block_threads=*/256);
my_kernel<<<grid, 256, smem>>>(m, n, k, A, B, C);
```

```cuda
// Column-major (default): A[row + col*m]
glass::gemm<float>(m, n, k, 1.f, A, B, 0.f, C);

// Row-major (all matrices): A[row*cols + col]
glass::gemm<float, false, true>(m, n, k, 1.f, A, B, 0.f, C);

// Mixed layouts (row-major A and C, column-major B):
glass::gemm_ex<float, false, true, false, true>(m, n, k, 1.f, A, B, 0.f, C);
```

cuBLASDx computes at warp/tensor-core level and requires compile-time matrix sizes. cuSOLVERDx (factorizations + solves) likewise requires compile-time sizes and additionally requires linking against a precompiled device library. Include `glass-nvidia.cuh` to get pre-instantiated sizes; both backends are auto-detected.
Pre-instantiated GEMM/GEMV sizes (square): 4, 6, 8, 12, 14, 24, 64. cuSOLVERDx wrappers are not pre-instantiated; you must call the appropriate DEFINE_NVIDIA_* macro per size in your .cu file (inside namespace glass::nvidia).
```cuda
// Common parameter pack for gemm / gemv (and gemm_batched, with extra BATCH):
template <typename T,
          uint32_t M, uint32_t N, uint32_t K,
          uint32_t BLOCK_THREADS = 0,       // 0 = let cuBLASDx pick
          layout LA = layout::col_major,    // arrangement of A
          layout LB = layout::col_major,    // arrangement of B
          layout LC = layout::col_major,    // arrangement of C
          uint32_t SM_VAL = SMS>            // SM arch (default = SMS macro)
__device__ void gemm(T alpha, T* A, T* B, T beta, T* C, char* smem);
```

`SMS` defaults to 860 and may be overridden with `-DSMS=XXX` at compile time. See sections 5b–5d above for usage examples (BlockDim, layout, multi-arch).
Every wrapper exposes the same family of compile-time DEFINE macros. Substitute GEMM / GEMV / GEMM_BATCHED / CHOL / TRSM / POSV / POTRS / GETRF / GETRS / GESV / GEQRF / GELS:
| Macro form | Specializes |
|---|---|
| `DEFINE_NVIDIA_<NAME>(...)` | default block_dim, all col_major, SM = SMS |
| `DEFINE_NVIDIA_<NAME>_BLOCKDIM(..., TC)` | pinned `BlockDim<TC,1,1>`, all col_major, SM = SMS |
| `DEFINE_NVIDIA_<NAME>_LAYOUT(..., LA, LB, LC)` (GEMM only) | default block_dim, custom layouts, SM = SMS |
| `DEFINE_NVIDIA_<NAME>_BLOCKDIM_LAYOUT(..., TC, LA, LB, LC)` | pinned + custom layouts |
| `DEFINE_NVIDIA_<NAME>_SM(..., SM)` | explicit SM, default block_dim, all col_major |
| `DEFINE_NVIDIA_<NAME>_BLOCKDIM_SM(..., TC, SM)` | pinned + explicit SM |
| `DEFINE_NVIDIA_<NAME>_LAYOUT_SM(..., LA, LB, LC, SM)` (GEMM only) | custom layouts + explicit SM |
| `DEFINE_NVIDIA_<NAME>_BLOCKDIM_LAYOUT_SM(..., TC, LA, LB, LC, SM)` | every parameter explicit |
Layout arguments (`LA`, `LB`, `LC`) are integer literals: 0 = col_major, 1 = row_major. GRiD-flag aliases for the common transpose case:

```cuda
#define DEFINE_NVIDIA_GEMM_TRANSB(M, N, K)                // alias for LAYOUT(M, N, K, 0, 1, 0)
#define DEFINE_NVIDIA_GEMM_BLOCKDIM_TRANSB(M, N, K, TC)   // alias for BLOCKDIM_LAYOUT(M, N, K, TC, 0, 1, 0)
```

To add a custom size, place the macro inside `namespace glass::nvidia` in your .cu file:

```cuda
#include "glass-nvidia.cuh"

namespace glass { namespace nvidia {
DEFINE_NVIDIA_GEMM(16, 16, 16)
DEFINE_NVIDIA_GEMV(16, 16)
DEFINE_NVIDIA_GEMM_BLOCKDIM(16, 16, 16, 256)          // pinned thread count
DEFINE_NVIDIA_GEMM_BLOCKDIM_TRANSB(16, 16, 16, 256)   // pinned + B is row-major
DEFINE_NVIDIA_POSV_BLOCKDIM(16, 4, 256)               // SPD solve, 4 RHS
}}
```

```cuda
// Smallest BlockDim cuBLASDx accepts for (T, M, N, K, SM). All constexpr.
template <typename T, uint32_t M, uint32_t N, uint32_t K, uint32_t SM_VAL = SMS>
constexpr uint32_t glass::nvidia::gemm_min_block_threads();

// True iff BLOCK_THREADS >= the minimum.
template <typename T, uint32_t M, uint32_t N, uint32_t K, uint32_t BT, uint32_t SM_VAL = SMS>
constexpr bool glass::nvidia::gemm_block_threads_valid();

// Same for gemv (Size<M, 1, N>):
constexpr uint32_t glass::nvidia::gemv_min_block_threads<T, M, N, SM_VAL>();
constexpr bool glass::nvidia::gemv_block_threads_valid<T, M, N, BT, SM_VAL>();
```

These do not require a `DEFINE_NVIDIA_*` macro — they construct the GEMM type inline and read `block_dim` directly. Useful for codegen that wants to pick SUGGESTED_THREADS at generation time.
Compile without -DNDEBUG and the wrappers assert(blockDim >= GEMM::block_dim) inside every run(). Misconfigured launches now fail with a clean assertion message rather than silently deadlocking. The assertions compile out under -DNDEBUG.
```bash
cd test
pip install -r requirements.txt
pytest -v
```

The first run compiles the CUDA test binaries using nvcc. Subsequent runs skip recompilation unless source files have changed (cached by content hash). Compiled binaries are placed in test/build/.

```bash
pytest test_l1.py -v
pytest test_l1.py -k "simple"      # glass:: threadIdx variants only
pytest test_l1.py -k "cg"          # glass::cgrps:: cooperative groups variants
pytest test_l1.py -k "simple_hs"   # high_speed warp-shuffle variants
```

The benchmark suite compares GLASS variants against block-level CUDA library baselines and against each other:
| File | Comparison |
|---|---|
| `bench_reduce.cu` | `glass::*::reduce/dot/l2norm` (plain, low_memory, high_speed, compile-time) vs CUB `BlockReduce` vs `glass::nvidia::reduce` (CUB-backed) |
| `bench_gemv.cu` | `glass::gemv` (runtime + compile-time) vs raw cuBLASDx vs `glass::nvidia::gemv` (default block_dim + caller-pinned `BlockDim<256>`) |
| `bench_gemm.cu` | `glass::gemm` (plain, tiled, compile-time) vs raw cuBLASDx vs `glass::nvidia::gemm` (default + caller-pinned) |
| `bench_blockdim.cu` | `glass::nvidia::gemm` with cuBLASDx-chosen block_dim vs caller-pinned `BlockDim<128>` vs `BlockDim<352>` (the GRiD iiwa14 launch that pre-fix would have deadlocked) |
| `bench_gemm_batched.cu` | `glass::nvidia::gemm_batched<...,BATCH>` vs a naive `for(b)` loop calling `gemm` BATCH times, for BATCH ∈ {4, 8, 16, 32} |
| `bench_lapack.cu` (needs cuSOLVERDx) | pure-SIMT `glass::cholDecomp_InPlace` / `glass::trsm` vs `glass::nvidia::chol_inplace` / `trsm` / `posv` (fused) |
CUB is bundled with CUDA 11+. cuBLASDx and cuSOLVERDx ship together in NVIDIA MathDx (see bench/INSTALL.md).
```bash
# Set MATHDX_ROOT to your MathDx installation, then:
python3 bench/run_bench.py

# Custom iteration count (default: 10000):
python3 bench/run_bench.py --iters 50000

# Skip cuBLASDx (bench_gemv/gemm/blockdim/batched/lapack will not run):
python3 bench/run_bench.py --no-cublasdx
```

bench_lapack is automatically skipped if cusolverdx.hpp is not present under `$MATHDX_ROOT/include/`.
Anti-optimization safeguards are baked into every bench loop: per-iteration writes to a volatile sink defeat dead-store elimination; destructive inputs (Cholesky, LU, QR all overwrite their input) are reloaded from a master copy each iteration; nvcc -Xptxas -O1 is enforced. Numbers below ~0.1 µs/op for a non-trivial kernel almost always indicate the bench was elided — recheck the safeguards if you see that.
Timing uses the GRiD pattern — the iteration loop runs inside the kernel to amortize launch overhead. Results are printed as a Markdown table and saved to bench/results/bench_<hostname>.json.
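A minimal sketch of that pattern (illustrative names, not the actual harness code), combining the in-kernel iteration loop with the reload-and-sink safeguards described above:

```cuda
#include "glass.cuh"

// Illustrative bench kernel: iterate inside the kernel, reload the destroyed
// input from a master copy each pass, and write to a volatile sink so the
// compiler cannot elide the factorization.
__global__ void bench_chol(float* A_master, float* A_work,
                           volatile float* sink, int n, int iters) {
    for (int it = 0; it < iters; ++it) {
        glass::copy(n * n, A_master, A_work);        // reload destroyed input (y = x)
        __syncthreads();
        glass::cholDecomp_InPlace(n, A_work);        // factor overwrites A_work
        __syncthreads();
        if (threadIdx.x == 0) *sink = A_work[0];     // defeat dead-store elimination
    }
}
```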
- Matrices default to column-major (Fortran) order, consistent with cuBLAS. Pure-SIMT `glass::` uses a `ROW_MAJOR=true` template parameter; `glass::nvidia::` uses the `layout` enum (`col_major` / `row_major`) per matrix via the `LA`, `LB`, `LC` template arguments.
- `dot` and the reduction variants modify the input array in-place (result in `x[0]`); `l2norm` squares elements before reducing.
- `cholDecomp_InPlace` only fills the lower triangle of the matrix; the upper triangle retains input values.
- `glass::gemm` (pure-SIMT) with `TRANSPOSE_B=true` currently requires B to be square (n×n). The `glass::nvidia::gemm` path has no such restriction — use `LB = layout::row_major` (or `DEFINE_NVIDIA_GEMM_BLOCKDIM_TRANSB`).
- `glass::nvidia::*` (default form) requires exactly the thread count returned by `gemm_threads<T,M,N,K>()`. Use the `BLOCK_THREADS` template parameter (with `DEFINE_NVIDIA_<NAME>_BLOCKDIM`) to launch with any thread count `≥ gemm_min_block_threads<T,M,N,K>()` — extras go idle inside the GEMM. Compile without `-DNDEBUG` to get a clean assertion if the launch is too small instead of a silent deadlock.
- `glass::nvidia::trsm` does not accept a non-1.0 `alpha` natively (cuSOLVERDx's `trsm` has no alpha); the wrapper pre-multiplies `B` by `alpha` in shared memory before calling execute.