GPU utility functions and memory management for ThemisDB.
Provides GPU compute integration for ThemisDB, implementing VRAM management with tenant quotas, multi-GPU load balancing, circuit breaker safe-fail, kernel validation, and parallel query acceleration.
In scope: VRAM allocation and tenant quotas, CUDA/ROCm device enumeration, circuit breaker with GPU→CPU fallback, audit event log, capability gate, kernel whitelist, Prometheus metrics, multi-GPU load balancer, parallel scan/filter/sort/aggregate/join, ROCm/HIP stream and device-memory backend.
Out of scope: GPU kernel implementations for specific algorithms (handled by acceleration module), model training orchestration (handled by training module).
- `gpu_memory_manager_edition.cpp` — VRAM slab allocator with tenant quotas
- `safe_fail.cpp` — GPU→CPU safe-fail (circuit breaker)
- `device_discovery.cpp` — CUDA/ROCm device discovery
- `query_accelerator.cpp` — parallel query operations
- `metrics.cpp` — Prometheus metrics
Maturity: 🟡 Beta — VRAM management, circuit breaker, parallel query acceleration, and ROCm/HIP backend parity operational; multi-node coordination in progress.
| Header | Source | Description |
|---|---|---|
| `memory_manager.h` | `gpu_memory_manager_edition.cpp` | Edition-aware VRAM allocation, tenant quotas, pre-allocation hints |
| `device_discovery.h` | `device_discovery.cpp` | Enumerate CUDA/ROCm devices; CPU-fallback sentinel |
| `safe_fail.h` | `safe_fail.cpp` | Circuit-breaker safe-fail with GPU→CPU fallback |
| `audit_log.h` | `audit_log.cpp` | Ring-buffer structured audit event log |
| `policy.h` | `policy.cpp` | Default-deny capability gate for GPU usage |
| `memory_pool.h` | `memory_pool.cpp` | Slab-based pre-allocator with fragmentation tracking |
| `metrics.h` | `metrics.cpp` | Prometheus-compatible counter/gauge registry |
| `config.h` | `config.cpp` | GPU config validation, dry-run simulation |
| `kernel_validator.h` | `kernel_validator.cpp` | FNV-1a checksum whitelist; validate-before-launch |
| `alerts.h` | `alerts.cpp` | Threshold-based alert manager with callbacks |
| `launcher.h` | `launcher.cpp` | Typed async work-item / batch launcher |
| `load_balancer.h` | `load_balancer.cpp` | Multi-GPU load balancer (ROUND_ROBIN/LEAST_LOADED/FIRST_HEALTHY) |
| `feature_flags.h` | `feature_flags.cpp` | Per-edition GPU feature gates with runtime overrides |
| `admin_api.h` | `admin_api.cpp` | JSON admin stats, tenant breakdown, dry-run simulation |
| `gpu_module.h` | `gpu_module.cpp` | Integration facade: policy→circuit-breaker→alloc→launch |
| `stream_manager.h` | `stream_manager.cpp` | Named async GPU streams with CPU fallback budget |
| `query_accelerator.h` | `query_accelerator.cpp` | Parallel scan/filter/sort/aggregate/join with GPU threshold dispatch |
| `tensor_buffer.h` | `tensor_buffer.cpp` | Typed tensor containers with shape/dtype, views, checkpointing |
| `training_loop.h` | `training_loop.cpp` | Training loop coordinator: batch iteration, loss tracking, early stopping |
| `rocm_backend.h` | `rocm_backend.cpp` | ROCm/HIP backend: stream lifecycle, device memory, launcher BackendFn |
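The kernel validator row above describes an FNV-1a checksum whitelist with validate-before-launch semantics. A minimal standalone sketch of that idea (the `KernelWhitelist` class and its method names are illustrative, not the ThemisDB `kernel_validator.h` API):

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

// FNV-1a 64-bit hash over the kernel image bytes.
uint64_t Fnv1a64(const std::string& bytes) {
    uint64_t hash = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : bytes) {
        hash ^= c;
        hash *= 1099511628211ULL;             // FNV-1a 64-bit prime
    }
    return hash;
}

// Validate-before-launch: only kernel images whose checksum was
// registered ahead of time may be launched.
class KernelWhitelist {
public:
    void Register(const std::string& kernel_image) {
        allowed_.insert(Fnv1a64(kernel_image));
    }
    bool ValidateBeforeLaunch(const std::string& kernel_image) const {
        return allowed_.count(Fnv1a64(kernel_image)) != 0;
    }
private:
    std::unordered_set<uint64_t> allowed_;
};
```

Hashing the image rather than its name means a tampered binary fails validation even if it keeps a whitelisted identifier.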
GPUModule (gpu_module.h)
├── GPUPolicy – default-deny capability gate
├── GPUSafeFailManager – circuit-breaker, GPU→CPU fallback
├── GPUMemoryManager – edition-aware VRAM, tenant quotas, hints
├── GPUMemoryPool – slab pre-allocator, zero-on-free
├── GPUDeviceDiscovery – enumerate devices, CPU-fallback sentinel
├── GPULoadBalancer – multi-device dispatch strategies
├── GPULauncher – typed async work-item / batch launcher
├── GPUStreamManager – named streams, CPU budget enforcement
├── GPUKernelValidator – checksum whitelist, validate-before-launch
├── GPUMetricsRegistry – Prometheus-compatible counters/gauges
├── GPUAlertManager – threshold alerts with callbacks
├── GPUAuditLog – ring-buffer structured event log
├── GPUAdminAPI – JSON stats, tenants, simulate endpoints
├── GPUFeatureFlags – per-edition feature gates
├── GPUConfig – startup validation, dry-run simulation
├── GPUQueryAccelerator – scan/filter/sort/aggregate/join
├── GPUTensorBuffer – typed tensors, views, checkpointing
├── GPUTrainingLoop – batch training coordinator
└── ROCmBackend – HIP stream lifecycle, device memory, launcher backend
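The tree above lists `GPULoadBalancer` with three dispatch strategies. A hedged sketch of how such strategies could choose a device — the `DeviceState` struct and `PickDevice` helper are assumptions for illustration, not the actual `load_balancer.h` interface:

```cpp
#include <cstddef>
#include <vector>

struct DeviceState {
    bool   healthy;
    size_t active_jobs;  // proxy for current load
};

enum class Strategy { ROUND_ROBIN, LEAST_LOADED, FIRST_HEALTHY };

// Returns the chosen device index, or -1 if no healthy device exists.
// rr_cursor carries round-robin state across calls.
int PickDevice(const std::vector<DeviceState>& devices,
               Strategy strategy, size_t& rr_cursor) {
    int best = -1;
    switch (strategy) {
    case Strategy::ROUND_ROBIN:
        for (size_t i = 0; i < devices.size(); ++i) {
            size_t idx = (rr_cursor + i) % devices.size();
            if (devices[idx].healthy) { rr_cursor = idx + 1; return (int)idx; }
        }
        break;
    case Strategy::LEAST_LOADED:
        for (size_t i = 0; i < devices.size(); ++i) {
            if (!devices[i].healthy) continue;
            if (best < 0 || devices[i].active_jobs < devices[best].active_jobs)
                best = (int)i;
        }
        break;
    case Strategy::FIRST_HEALTHY:
        for (size_t i = 0; i < devices.size(); ++i)
            if (devices[i].healthy) return (int)i;
        break;
    }
    return best;
}
```

All three strategies skip unhealthy devices, so a failed GPU drops out of rotation without any caller-side handling.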
| Edition | VRAM Limit | Notes |
|---|---|---|
| Community | 0 GB | CPU-only; all GPU paths fall back gracefully |
| Professional | 8 GB | Small models, limited acceleration |
| Enterprise | 24 GB | Medium models, production use |
| Hyperscaler/Unlimited | No limit | Large models, research |
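The edition limits above reduce to a simple quota check before each allocation. A minimal sketch under stated assumptions — the `Edition` enum and helper names are illustrative, not the module's real API:

```cpp
#include <cstdint>
#include <limits>

enum class Edition { Community, Professional, Enterprise, Hyperscaler };

constexpr uint64_t kGiB = 1ULL << 30;

// Per-edition VRAM ceiling, matching the table above.
uint64_t VramLimitBytes(Edition e) {
    switch (e) {
    case Edition::Community:    return 0;            // CPU-only
    case Edition::Professional: return 8  * kGiB;
    case Edition::Enterprise:   return 24 * kGiB;
    case Edition::Hyperscaler:  return std::numeric_limits<uint64_t>::max();
    }
    return 0;
}

// Would an allocation of `request` bytes fit under the edition limit,
// given `in_use` bytes already allocated? Written to avoid overflow.
bool FitsEditionQuota(Edition e, uint64_t in_use, uint64_t request) {
    uint64_t limit = VramLimitBytes(e);
    return request <= limit && in_use <= limit - request;
}
```

In the Community edition every request fails this check, which is what lets all GPU paths fall back to CPU gracefully.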
```cpp
#include "themis/gpu/gpu_module.h"
using namespace themis::gpu;

// Submit GPU work through the integration facade.
// Policy, circuit-breaker, VRAM allocation, metrics and audit are handled automatically.
GPUModule module;
module.SubmitWork("my-tenant", "index-build", [](float* buf, size_t n) {
    // CPU-side stub — replace with real CUDA/ROCm kernel call
    for (size_t i = 0; i < n; ++i) buf[i] *= 2.0f;
});
```

```cpp
#include "themis/gpu/memory_manager.h"
using namespace themis::gpu;

auto& mgr = GPUMemoryManager::GetInstance();
if (mgr.TryAllocateGPU(1ULL << 30, "vector-index", "tenant-a")) {
    // use 1 GB VRAM ...
    mgr.DeallocateGPU(1ULL << 30, "tenant-a");
}
auto stats = mgr.GetStats();
// stats.allocated_bytes, peak_bytes, allocation_count, deallocation_count
```

All components are thread-safe (mutex-protected). Concurrent alloc/dealloc, metric writes, and audit-log appends are safe.
- Edition Module: edition-specific VRAM limits (`gpu_memory_manager_edition.cpp`)
- C++17 standard library: `<mutex>`, `<thread>`, `<atomic>` — no external deps
- CUDA/ROCm (optional): hardware integration tracked in `FUTURE_ENHANCEMENTS.md`
- v1.0.0: Edition-aware GPU memory manager
- v1.1.0: Device discovery, safe-fail circuit breaker, audit log, policy gate
- v1.2.0: Memory pool, metrics, config validation, kernel validator, alerts, launcher, load balancer, feature flags, admin API, integration facade, stream manager, query accelerator, tensor buffer, training loop
- v1.3.0: ROCm/HIP backend parity (`rocm_backend.cpp`): HIP stream lifecycle, device memory (`hipMalloc`/`hipFree`/`hipMemset`), launcher `BackendFn` with CPU fallback; `GPUStreamManager` default backend now wires through `ROCmBackend`
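The v1.3.0 "launcher BackendFn with CPU fallback" pattern can be sketched without any HIP dependency. The `BackendFn` signature and `RunWithFallback` helper below are assumptions for illustration; the real types live in `launcher.h`/`rocm_backend.h`:

```cpp
#include <cstddef>
#include <functional>

// A backend tries to run the work on the device and reports success.
// If it declines or fails (e.g. HIP unavailable, circuit open),
// the launcher runs the equivalent CPU path instead.
using BackendFn = std::function<bool(float* buf, size_t n)>;

void RunWithFallback(const BackendFn& gpu_backend, float* buf, size_t n) {
    if (gpu_backend && gpu_backend(buf, n))
        return;                         // handled on the device
    for (size_t i = 0; i < n; ++i)      // CPU fallback path
        buf[i] *= 2.0f;
}
```

Keeping the fallback inside the launcher means callers never branch on device availability themselves.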
- LLM Module — GPU model inference
- Vector Index — GPU-accelerated indexing