Ultra-fast, modular LLM inference engine with a Vulkan compute backend.
Designed to surpass llama.cpp in throughput and VRAM efficiency.
┌──────────────────────────────────────────────────────────────────────────┐
│ Sovereign Engine v0.2.10 │
│ C++20 · Vulkan 1.3 · SPIR-V Compute · pybind11 · Mixed INT4 Quant │
└──────────────────────────────────────────────────────────────────────────┘
- Overview
- Key Features
- Architecture
- Project Structure
- Requirements
- Building
- Usage
- The .sovereign Format
- Quantiser
- Memory Manager
- KV Cache (PagedAttention)
- Vulkan Compute Shaders
- Running Tests
- Roadmap
- Contributing
- License
Sovereign Engine is a from-scratch, GPU-first LLM inference runtime written in C++20.
It targets local inference on consumer hardware (NVIDIA/AMD/Intel) using Vulkan compute as the sole GPU backend, which means:
- No CUDA dependency — runs on any Vulkan 1.2+ GPU.
- Tight control over VRAM: paged KV cache, async layer streaming, dynamic CPU offload.
- Mixed-precision quantisation inspired by EXL2 and HQQ — assign INT4/INT3/INT2 per-tensor based on measured sensitivity.
- A clean Python API (via pybind11) and a stable C ABI for FFI from any language.
| Feature | Details |
|---|---|
| Vulkan backend | Compute-only, no graphics queue needed. Works on NVIDIA, AMD, Intel, ARM Mali. |
| Mixed-precision quantisation | FP16 → INT8 → Q4_K → Q3_K → Q2_K per tensor, HQQ solver, EXL2-style importance scoring. |
| Async layer pipeline | Double-buffered PCIe staging: GPU runs layer N while CPU DMA-copies layer N+1. |
| PagedAttention KV cache | Block-based VRAM pool, copy-on-write forking, O(1) alloc/free. |
| Dynamic CPU offload | Falls back to AVX-512 / NEON when VRAM pressure exceeds threshold. |
| Streaming generation | Token-by-token callback; GIL-safe Python generator. |
| Rich sampling | Temperature, Top-P, Top-K, Min-P, Repetition Penalty, Mirostat v1/v2, GBNF grammar, JSON schema. |
| Proprietary .sovereign format | Page-aligned mmap, per-tensor CRC32C, zero-copy Vulkan upload. |
| GQA / MHA / MQA | All attention variants supported via a single fused GLSL shader. |
| RoPE + sliding window | Inline rotary embeddings, optional Mistral/Gemma sliding-window mask. |
┌─────────────────────────────────────────────────────────────────────────┐
│                            Python / C++ / C                             │
│                      (sovereign_inference.Engine)                       │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │       engine.cpp        │  prefill / decode_step /
                    │    (inference loop)     │  generate / forward
                    └──┬──────┬───────┬───────┘
                       │      │       │
         ┌─────────────▼─┐ ┌──▼────┐ ┌▼───────────────────┐
         │ VulkanContext │ │Quant  │ │ AsyncMemoryManager │
         │ (device,      │ │izer   │ │ (layer streaming,  │
         │  pipelines,   │ │       │ │  CPU offload)      │
         │  cmd bufs)    │ └───────┘ └─────────┬──────────┘
         └───────┬───────┘                     │
                 │                   ┌─────────▼──────────┐
     ┌───────────▼────────────┐      │ PagedKVCache       │
     │ SPIR-V Compute Shaders │      │ (block pool,       │
     │  ┌─────────────────┐   │      │  CoW fork,         │
     │  │ rmsnorm.comp    │   │      │  descriptor sets)  │
     │  │ matmul_int4.comp│   │      └────────────────────┘
     │  │ attention_gqa   │   │
     │  │ silu_gate.comp  │   │
     │  │ sampler.comp    │   │
     │  └─────────────────┘   │
     └────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│                   sovereign-convert CLI                    │
│  SafeTensors → profile → budget allocate → HQQ quant       │
│  → pack INT4/3/2 → write .sovereign                        │
└────────────────────────────────────────────────────────────┘
sovereign-engine/
├── CMakeLists.txt              # Root build configuration
├── README.md
├── .gitignore
│
├── include/sovereign/          # Public C++ headers
│   ├── engine.hpp              # Top-level inference API
│   ├── format.hpp              # .sovereign binary format spec
│   ├── vulkan_context.hpp      # Vulkan device + pipeline management
│   ├── memory_manager.hpp      # Async pipeline memory manager
│   ├── kv_cache.hpp            # PagedAttention KV cache
│   └── quantizer.hpp           # Mixed-precision quantiser
│
├── src/
│   ├── vulkan/
│   │   └── vulkan_context.cpp
│   ├── format/
│   │   └── format.cpp
│   ├── compute/
│   │   └── kv_cache.cpp
│   ├── inference/
│   │   └── engine.cpp
│   ├── quantizer/
│   │   └── quantizer.cpp
│   └── memory/
│       └── memory_manager.cpp
│
├── shaders/                    # GLSL compute shaders (compiled to SPIR-V)
│   ├── rmsnorm.comp
│   ├── matmul_int4.comp
│   ├── attention_gqa.comp
│   ├── silu_gate.comp
│   └── sampler.comp
│
├── bindings/
│   └── python/
│       └── sovereign_py.cpp    # pybind11 Python bindings
│
├── tools/
│   └── converter/
│       └── main.cpp            # sovereign-convert CLI
│
├── tests/
│   ├── CMakeLists.txt
│   ├── test_format.cpp
│   ├── test_quantizer.cpp
│   ├── test_kv_cache.cpp
│   └── test_engine.cpp
│
├── package.json                # Shader compiler package metadata
├── package-lock.json           # Shader compiler lock file
│
├── examples/
│   └── basic_generate.py       # Python streaming example
│
├── scripts/
│   ├── build.sh                # Build helper script
│   └── compile_shaders.js      # Shader compiler tool using WebGPU glslang
│
└── third_party/
    ├── volk/                   # Meta-loader for dynamic Vulkan loading (tracked)
    │   ├── volk.h
    │   └── volk.c
    └── vk_mem_alloc.h          # Fetched automatically via CMake (not tracked)
- Vulkan 1.2+ Compatible GPU: Works on NVIDIA, AMD, Intel, Apple Silicon (via MoltenVK), and ARM Mali.
- GPU Driver: Must support Vulkan 1.2 and the required extensions listed below. No SDK required at runtime!
Thanks to the dynamic meta-loader (volk) and automatic CMake dependency management, the Vulkan SDK is optional when building Sovereign Engine.
| Dependency | Version | Mandatory? | Notes |
|---|---|---|---|
| CMake | ≥ 3.25 | Yes | Handles the build orchestration |
| C++ Compiler | C++20 | Yes | MSVC 2022 / GCC 12+ / Clang 15+ |
| Vulkan SDK | ≥ 1.3 | No (Optional) | If absent, CMake automatically fetches headers; uses precompiled SPIR-V shaders |
| Python | ≥ 3.9 | No (Optional) | Only required to compile Python/pybind11 bindings |
Your GPU driver must support:
VK_KHR_timeline_semaphore (core in 1.2)
VK_KHR_synchronization2 (core in 1.3)
VK_EXT_memory_budget
VK_KHR_buffer_device_address
VK_KHR_shader_float16_int8
VK_EXT_scalar_block_layout
VK_KHR_8bit_storage
VK_KHR_16bit_storage
# Clone
git clone https://github.com/corbac10099/sovereign-engine.git
cd sovereign-engine
# Build (fetches vk_mem_alloc.h automatically)
chmod +x scripts/build.sh
./scripts/build.sh
# Or with all options explicit:
./scripts/build.sh --release --tests --python --avx512
# Manual CMake build
mkdir build && cd build
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DSOVEREIGN_BUILD_PYTHON=ON \
-DSOVEREIGN_BUILD_TESTS=ON \
-DSOVEREIGN_ENABLE_AVX512=ON
cmake --build . --parallel $(nproc)
# Debug build
./scripts/build.sh --debug

Download a HuggingFace model (e.g. Gemma 4B) in SafeTensors format, then convert:
# Basic conversion – mixed quantisation targeting 4.5 bpw
./build/sovereign-convert \
--input /path/to/gemma-4b/ \
--output gemma-4b.sovereign \
--arch gemma \
--quant mixed \
--bpw 4.5
# With calibration corpus for better importance scoring
./build/sovereign-convert \
--input /path/to/gemma-4b/ \
--output gemma-4b-calibrated.sovereign \
--quant mixed \
--bpw 4.5 \
--calib calibration_corpus.txt \
--verbose

Quantisation modes:
| Mode | Approx bpw | Description |
|---|---|---|
| fp16 | 16 | No quantisation, maximum quality |
| int8 | 8 | Symmetric INT8 throughout |
| q4k | 4.5 | Q4_K block quantisation |
| q3k | 3.5 | Q3_K block quantisation |
| q2k | 2.6 | Q2_K aggressive compression |
| mixed | target | Adaptive per-tensor (recommended) |
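As a rough guide to what a bpw figure means in practice, weight storage is approximately parameters × bpw / 8 bytes. The sketch below applies that arithmetic to the approximate bpw values from the table. The function name and the 4-billion parameter count are illustrative assumptions, and a real file also carries header, tokenizer, and index overhead.

```python
def estimated_weight_bytes(n_params: int, bpw: float) -> int:
    """Approximate weight storage: parameters x bits-per-weight / 8."""
    return int(n_params * bpw / 8)


# Approximate bpw values copied from the quantisation modes table.
MODES = {"fp16": 16.0, "int8": 8.0, "q4k": 4.5, "q3k": 3.5, "q2k": 2.6}

if __name__ == "__main__":
    n_params = 4_000_000_000  # illustrative assumption, not a measured model size
    for mode, bpw in MODES.items():
        gib = estimated_weight_bytes(n_params, bpw) / 2**30
        print(f"{mode:5s} ~{gib:5.2f} GiB")
```

The KV cache and activations need VRAM on top of this estimate, so treat it as a lower bound when choosing a mode.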
You can also convert models directly inside Python without having to compile the C++ CLI tool:
import sovereign_inference
result = sovereign_inference.convert(
    input_dir="/path/to/gemma-4b/",
    output_path="gemma-4b.sovereign",
    arch="gemma",
    quant="mixed",
    bpw=4.5,
    verbose=True,
)
if result["success"]:
    print(f"Conversion successful! Achieved bpw: {result['achieved_bpw']:.2f}")
else:
    print(f"Conversion failed: {result['error_message']}")

import sovereign_inference
# Load model
cfg = sovereign_inference.LoadConfig()
cfg.gpu_layer_count = 2**31 - 1  # load everything into VRAM
cfg.kv_cache_vram_fraction = 0.80
with sovereign_inference.Engine.load("gemma-4b.sovereign", cfg) as engine:
    print(f"Model : {engine.model_name}")
    print(f"Device : {engine.device_name} ({engine.vram_gib:.1f} GiB)")
    # --- Streaming generation ---
    params = sovereign_inference.GenerateParams()
    params.max_new_tokens = 512
    params.sampling.temperature = 0.7
    params.sampling.top_p = 0.9
    params.sampling.min_p = 0.05
    params.sampling.repetition_penalty = 1.1
    stats = engine.generate(
        prompt="Explain quantum entanglement briefly:",
        params=params,
        # print() returns None, so "or True" keeps generation going
        callback=lambda tok, tid, lp: print(tok, end="", flush=True) or True,
    )
    print(f"\n[{stats.tokens_per_second:.1f} tok/s | {stats.generated_tokens} tokens]")
    # --- Generator protocol ---
    for text, token_id, logprob in engine.stream("Once upon a time", params):
        print(text, end="", flush=True)
    # --- Raw logits for custom sampling ---
    ids = engine.tokenize("The sky is")
    logits = engine.forward(ids)  # numpy float32 array [vocab_size]

#include "sovereign/engine.hpp"
#include <cstdio>    // std::fprintf
#include <iostream>  // std::cout
int main() {
    sovereign::LoadConfig cfg;
    cfg.kv_cache_vram_fraction = 0.80;
    auto engine = sovereign::Engine::load("gemma-4b.sovereign", cfg);
    if (!engine) return 1;
    sovereign::GenerateParams params;
    params.max_new_tokens = 512;
    params.sampling.temperature = 0.7f;
    params.sampling.top_p = 0.9f;
    auto stats = engine->generate(
        "Explain quantum entanglement:",
        params,
        [](std::string_view tok, sovereign::TokenId, float) {
            std::cout << tok << std::flush;
            return true;  // return false to stop early
        });
    std::fprintf(stderr, "\n%.1f tok/s\n", stats.tokens_per_second);
}

#include "sovereign/engine.hpp" // exposes extern "C" block
SovereignEngine* engine = sovereign_engine_load(
    "gemma-4b.sovereign",
    0,    // vram_budget (0 = auto)
    ~0u,  // gpu_layers (all)
    true  // use_mmap
);
sovereign_engine_generate(
    engine,
    "Hello, world!",
    0.7f, 0.9f, 0, 0.05f, 1.1f,  // temperature, top_p, top_k, min_p, rep_penalty
    256,
    my_callback, NULL
);
sovereign_engine_free(engine);

The .sovereign binary format is designed for zero-copy, memory-mapped inference:
┌──────────────┬────────────────────────────────────────────────────────────┐
│ Offset       │ Section                                                    │
├──────────────┼────────────────────────────────────────────────────────────┤
│ 0x0000       │ FileHeader      (256 bytes, fixed)                         │
│ 0x0100       │ ModelConfig     (256 bytes, padded to 64B)                 │
│ aligned      │ TokenizerBlob   (UTF-8 JSON)                               │
│ aligned      │ TensorIndex[]   (N × 192 bytes each)                       │
│ PAGE-ALIGNED │ TensorDataBlock (mmap-ready, 4K page aligned)        ◀──┐  │
└──────────────┴────────────────────────────────────────────────────────────┘
                                                                         │
  Vulkan can mmap this block directly into a VkBuffer via                │
  VK_EXT_external_memory_host: zero CPU copy during weight loading. ─────┘
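To make the layout concrete, the sketch below computes where the page-aligned TensorDataBlock would start for a given tokenizer size and tensor count. The fixed sizes (256, 256, 192, 4K) come from the table above. The 64-byte alignment of the "aligned" sections and the function names are assumptions made for illustration, not a statement of the on-disk spec.

```python
PAGE_SIZE = 4096        # 4K page alignment of the tensor data block
HEADER_SIZE = 256       # FileHeader
CONFIG_SIZE = 256       # ModelConfig
INDEX_ENTRY_SIZE = 192  # one TensorIndex entry
SECTION_ALIGN = 64      # assumption: "aligned" sections use 64-byte alignment


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def tensor_data_offset(tokenizer_bytes: int, n_tensors: int) -> int:
    """Offset of the page-aligned TensorDataBlock for a given file shape."""
    tokenizer_start = align_up(HEADER_SIZE + CONFIG_SIZE, SECTION_ALIGN)
    index_start = align_up(tokenizer_start + tokenizer_bytes, SECTION_ALIGN)
    index_end = index_start + n_tensors * INDEX_ENTRY_SIZE
    return align_up(index_end, PAGE_SIZE)


if __name__ == "__main__":
    print(hex(tensor_data_offset(tokenizer_bytes=1000, n_tensors=10)))
```

Because the data block starts on a page boundary, an mmap of the file yields a host pointer with page alignment, which is the kind of alignment host-memory import paths generally require.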
Key properties:
- Magic bytes: SVRN (0x53, 0x56, 0x52, 0x4E)
- All multi-byte fields: little-endian
- Per-tensor CRC32C checksums (hardware-accelerated via SSE4.2)
- Per-tensor DType field: supports F32, F16, BF16, INT8, INT4, INT3, INT2, Q4_K, Q3_K, Q2_K
- Feature flags bitmask: MMAP_READY, HAS_TOKENIZER, GROUPED_QUERY, RoPE_SCALED, …
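Two of these properties are easy to check from any language: the magic bytes and the CRC32C checksum. The sketch below is a pure-Python reference, assuming the magic sits at the very start of the file. The function names are hypothetical, and the table-driven CRC32C is a portable stand-in for the SSE4.2 instruction, not the engine's own code.

```python
MAGIC = bytes([0x53, 0x56, 0x52, 0x4E])  # ASCII "SVRN"


def has_sovereign_magic(header: bytes) -> bool:
    """True if the buffer starts with the SVRN magic (assumed to be at offset 0)."""
    return header[:4] == MAGIC


def _make_crc32c_table():
    poly = 0x82F63B78  # reflected Castagnoli polynomial used by CRC32C
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Table-driven CRC32C over data, with the usual init and final XOR."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    print(has_sovereign_magic(b"SVRN" + bytes(252)))
    print(hex(crc32c(b"123456789")))  # standard CRC32C check value: 0xe3069283
```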
The quantiser runs a 3-phase pipeline:
Phase 1 (profiling) computes per-tensor activation statistics on a small calibration corpus (≥ 512 tokens):
- Hessian proxy (mean squared activation magnitude)
- Outlier ratio (fraction with |w| > 3σ)
- Kurtosis (distribution peakedness)
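The three statistics can be computed in a single pass over a tensor's values. The sketch below uses only the standard library. The function name, the returned dictionary keys, and the use of population (not sample) moments are illustrative assumptions.

```python
import math


def profile_tensor(values, sigma=3.0):
    """Return mean squared magnitude, outlier ratio, and kurtosis of values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    # Hessian proxy: mean squared magnitude.
    mean_sq = sum(v * v for v in values) / n
    # Outlier ratio: fraction of values with |v| > sigma * std.
    outlier_ratio = sum(1 for v in values if abs(v) > sigma * std) / n
    # Kurtosis: fourth central moment over squared variance (3.0 for a Gaussian).
    fourth = sum((v - mean) ** 4 for v in values) / n
    kurtosis = fourth / (var * var) if var > 0 else 0.0
    return {"mean_sq": mean_sq, "outlier_ratio": outlier_ratio, "kurtosis": kurtosis}


if __name__ == "__main__":
    spiky = [0.0] * 99 + [10.0]  # one large outlier among zeros
    print(profile_tensor(spiky))
```

A tensor with high kurtosis or a high outlier ratio loses more accuracy under coarse quantisation, which is why such statistics are useful inputs to an importance score.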
Phase 2 (budget allocation) assigns a DType to each tensor to hit a target average bpw:
importance ≥ 0.75 → FP16 / INT8 (embeddings, first/last layers, norms)
importance ≥ 0.50 → INT4 / Q4_K (Q/K/V projections)
importance ≥ 0.25 → Q3_K
importance < 0.25 → Q2_K
Iteratively rebalances until |achieved_bpw - target_bpw| < 5%.
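One way to picture the allocation step is a greedy loop over a precision ladder. The sketch below starts from the importance thresholds listed above, then demotes the least important tensors (or promotes the most important ones) until the average bpw lands within tolerance. Picking INT8 and Q4_K as the representative of each tier, reading the 5% as relative to the target, and the greedy rebalancing rule itself are all assumptions; the actual allocator may work differently.

```python
LADDER = ["Q2_K", "Q3_K", "Q4_K", "INT8"]  # lowest to highest precision
BPW = {"Q2_K": 2.6, "Q3_K": 3.5, "Q4_K": 4.5, "INT8": 8.0}


def initial_level(importance):
    """Map an importance score to a ladder index using the listed thresholds."""
    if importance >= 0.75:
        return 3
    if importance >= 0.50:
        return 2
    if importance >= 0.25:
        return 1
    return 0


def achieved_bpw(tensors, levels):
    bits = sum(t["numel"] * BPW[LADDER[l]] for t, l in zip(tensors, levels))
    return bits / sum(t["numel"] for t in tensors)


def allocate(tensors, target_bpw, tol=0.05):
    levels = [initial_level(t["importance"]) for t in tensors]
    order = sorted(range(len(tensors)), key=lambda i: tensors[i]["importance"])
    direction = 0
    for _ in range(len(tensors) * len(LADDER)):  # hard bound on iterations
        bpw = achieved_bpw(tensors, levels)
        if abs(bpw - target_bpw) <= tol * target_bpw:
            break
        step = -1 if bpw > target_bpw else 1
        if direction and step != direction:
            break  # overshot the target; stop instead of oscillating
        direction = step
        if step < 0:  # too many bits: demote the least important tensor
            candidates = [i for i in order if levels[i] > 0]
        else:         # too few bits: promote the most important tensor
            candidates = [i for i in reversed(order) if levels[i] < len(LADDER) - 1]
        if not candidates:
            break
        levels[candidates[0]] += step
    return [LADDER[l] for l in levels]


if __name__ == "__main__":
    tensors = [
        {"name": "embed", "numel": 1000, "importance": 0.9},
        {"name": "q_proj", "numel": 1000, "importance": 0.6},
        {"name": "ffn_up", "numel": 1000, "importance": 0.3},
        {"name": "ffn_down", "numel": 1000, "importance": 0.1},
    ]
    print(allocate(tensors, target_bpw=4.0))
```

The loop can stop without reaching the tolerance when the ladder is too coarse for the requested target, so a caller should check the achieved bpw afterwards.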
Phase 3 (HQQ solver) runs a per-block iterative solver minimising the Hessian-weighted MSE:
min_{scale, zero} ‖W − dequant(quant(W, scale, zero))‖²_H
Default: 20 iterations, block size 128 elements, FP16 scale storage.
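The objective can be illustrated with a simplified alternating least-squares loop: fix the integer codes and refit scale and zero by weighted linear regression, then requantise with the new parameters. This is a stand-in chosen for clarity, not the half-quadratic formulation HQQ itself uses. The diagonal Hessian weights `h` and all function names are assumptions.

```python
def quantise(block, scale, zero, qmax):
    """Nearest integer code in [0, qmax] for each weight."""
    return [min(qmax, max(0, round(w / scale + zero))) for w in block]


def weighted_error(block, h, q, scale, zero):
    """Hessian-weighted squared error of dequant(q) against the block."""
    return sum(hi * (w - (qi - zero) * scale) ** 2 for w, hi, qi in zip(block, h, q))


def solve_block(block, h, bits=4, iterations=20):
    qmax = (1 << bits) - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / qmax or 1.0  # min-max initialisation
    zero = -lo / scale
    q = quantise(block, scale, zero, qmax)
    hsum = sum(h)
    for _ in range(iterations):
        # Weighted least squares of w on q: w ~ a * q + c.
        q_mean = sum(hi_ * qi for hi_, qi in zip(h, q)) / hsum
        w_mean = sum(hi_ * w for hi_, w in zip(h, block)) / hsum
        var_q = sum(hi_ * (qi - q_mean) ** 2 for hi_, qi in zip(h, q))
        if var_q == 0:
            break
        cov = sum(hi_ * (qi - q_mean) * (w - w_mean) for hi_, qi, w in zip(h, q, block))
        a = cov / var_q
        if a <= 0:
            break
        c = w_mean - a * q_mean
        scale, zero = a, -c / a  # dequant(q) = (q - zero) * scale = a * q + c
        q = quantise(block, scale, zero, qmax)
    return scale, zero, q


if __name__ == "__main__":
    block = [((i * 37) % 101) / 50.0 - 1.0 for i in range(128)]
    h = [1.0 + (i % 4) for i in range(128)]
    scale, zero, q = solve_block(block, h)
    print(scale, zero, weighted_error(block, h, q, scale, zero))
```

Each half-step can only lower or preserve the weighted error: the regression is optimal for fixed codes, and nearest-code rounding is optimal for fixed scale and zero.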
The AsyncMemoryManager implements a double-buffered layer-streaming pipeline:
CPU Thread            GPU Compute Queue          DMA Transfer Queue
──────────            ─────────────────          ──────────────────
[Layer N-1 ready] ──▶ Compute(Layer N-1)
                              │
[Stream Layer N+1] ───────────┼────────────────▶ DMA(Layer N+1)
(from mmap/RAM)               │                        │
                              ▼                        ▼
                      Compute(Layer N) ◀──────── Layer N ready
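The overlap in the diagram can be sketched on the CPU with a single worker thread standing in for the DMA transfer queue: while layer N is being computed, the load of layer N+1 is already in flight. The `load_layer` and `compute_layer` callables are hypothetical placeholders, and the real engine uses Vulkan queues and timeline semaphores, not Python threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers(n_layers, load_layer, compute_layer, x):
    """Run layers in order, prefetching layer N+1 while layer N computes."""
    if n_layers == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as transfer:  # stands in for the DMA queue
        pending = transfer.submit(load_layer, 0)
        for n in range(n_layers):
            weights = pending.result()  # block until layer N is resident
            if n + 1 < n_layers:
                pending = transfer.submit(load_layer, n + 1)  # start streaming N+1
            x = compute_layer(weights, x)  # overlaps with the transfer above
    return x


if __name__ == "__main__":
    loaded = []

    def load_layer(n):
        loaded.append(n)
        return n + 1  # pretend the "weights" of layer n are the number n + 1

    print(run_layers(4, load_layer, lambda w, x: x * w, 1))
    print(loaded)
```

Only two layers are ever resident at once (the one computing and the one arriving), which is what makes the scheme double-buffered.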
VRAM pressure response:
- 88% → start evicting LRU layers (LRU free-list)
- 95% → force CPU offload via AVX-512 / NEON kernels
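The two thresholds amount to a small policy function plus an LRU order over resident layers. In the sketch below, treating the thresholds as inclusive, the enum and class names, and the use of an `OrderedDict` for the LRU order are all assumptions made for illustration.

```python
from collections import OrderedDict
from enum import Enum


class PressureAction(Enum):
    NONE = "none"
    EVICT_LRU = "evict_lru"
    CPU_OFFLOAD = "cpu_offload"


EVICT_THRESHOLD = 0.88    # start evicting least-recently-used layers
OFFLOAD_THRESHOLD = 0.95  # force CPU offload


def pressure_action(used_bytes, budget_bytes):
    """Pick a response from the VRAM usage fraction."""
    usage = used_bytes / budget_bytes
    if usage >= OFFLOAD_THRESHOLD:
        return PressureAction.CPU_OFFLOAD
    if usage >= EVICT_THRESHOLD:
        return PressureAction.EVICT_LRU
    return PressureAction.NONE


class LayerLRU:
    """Tracks resident layers in least-recently-used order."""

    def __init__(self):
        self._order = OrderedDict()

    def touch(self, layer):
        self._order.pop(layer, None)
        self._order[layer] = True  # most recently used goes to the end

    def evict(self):
        layer, _ = self._order.popitem(last=False)  # oldest entry first
        return layer


if __name__ == "__main__":
    lru = LayerLRU()
    for layer in (0, 1, 2, 0):
        lru.touch(layer)
    print(pressure_action(90, 100), lru.evict())
```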
Inspired by vLLM's PagedAttention:
- One giant VRAM pool pre-allocated at startup (no per-block VkBuffer overhead).
- Block size: 16 tokens per block (configurable, must be power-of-2).
- Copy-on-write forking: beam search / speculative decoding shares blocks until a write occurs.
- Descriptor sets pre-allocated per (block_id × layer) pair to avoid per-inference allocation.
- Optional ConstantContextCache for RWKV / Mamba models (O(1) memory regardless of sequence length).
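The block pool and copy-on-write behaviour described above can be modelled with a free list and per-block reference counts. The sketch below tracks block ownership only and holds no tensor data. The class and method names are hypothetical and do not mirror the C++ PagedKVCache interface.

```python
class BlockPool:
    """Fixed pool of KV blocks with a free list and reference counts."""

    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))  # O(1) pop / append
        self.refcount = [0] * n_blocks

    def alloc(self):
        if not self.free:
            raise MemoryError("KV pool exhausted")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """Block table for one sequence; forks share blocks until written."""

    def __init__(self, pool, block_size=16):
        self.pool, self.block_size = pool, block_size
        self.blocks, self.length = [], 0

    def fork(self):
        child = Sequence(self.pool, self.block_size)
        child.blocks, child.length = list(self.blocks), self.length
        for block in self.blocks:
            self.pool.refcount[block] += 1  # shared, not copied
        return child

    def append_token(self):
        """Reserve a slot for one token; returns (block_id, copied_from)."""
        copied_from = None
        if self.length % self.block_size == 0:
            self.blocks.append(self.pool.alloc())  # last block is full
        elif self.pool.refcount[self.blocks[-1]] > 1:
            # Copy-on-write: the partially filled tail block is shared.
            copied_from = self.blocks[-1]
            self.blocks[-1] = self.pool.alloc()
            self.pool.release(copied_from)
        self.length += 1
        return self.blocks[-1], copied_from


if __name__ == "__main__":
    pool = BlockPool(8)
    parent = Sequence(pool)
    for _ in range(20):
        parent.append_token()
    child = parent.fork()
    print(child.append_token(), len(pool.free))
```

When `copied_from` is not `None`, a real implementation would also copy the existing KV entries of that block into the new one before writing the new token.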
All shaders are compiled from GLSL (.comp) to SPIR-V at CMake configure time:
| Shader | Purpose |
|---|---|
| rmsnorm.comp | Fused RMSNorm with subgroup reduction; supports Gemma variant |
| matmul_int4.comp | Tiled INT4×FP16 GEMM with on-the-fly dequantisation and double-buffered B tiles |
| attention_gqa.comp | GQA/MHA/MQA fused attention: RoPE inline, PagedAttention block table, Flash-Attention tiled softmax |
| silu_gate.comp | Fused SwiGLU (SiLU × hadamard) for LLaMA/Gemma FFN |
| sampler.comp | GPU-resident sampling: temperature → top-K → softmax → top-P → min-P → multinomial |
All shaders use GL_EXT_scalar_block_layout and GL_KHR_shader_subgroup_arithmetic for efficient subgroup reductions.
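For readers who want a CPU reference for the sampler's stage order (temperature → top-K → softmax → top-P → min-P → multinomial), the sketch below follows the same sequence in plain Python. The default values, treating `top_k=0` as "disabled", and defining min-P relative to the most likely token are assumptions; the shader's exact semantics may differ.

```python
import math
import random


def sample(logits, temperature=0.7, top_k=40, top_p=0.9, min_p=0.05, rng=random):
    """Sample one token id; temperature must be > 0."""
    # 1. Temperature scaling.
    scaled = [l / temperature for l in logits]
    # 2. Top-K: keep the k largest logits (0 disables the filter).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # 3. Softmax over the survivors, shifted by the max for stability.
    peak = scaled[order[0]]
    exps = [math.exp(scaled[i] - peak) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-P: smallest prefix whose cumulative probability reaches top_p.
    keep, cumulative = 0, 0.0
    for p in probs:
        keep += 1
        cumulative += p
        if cumulative >= top_p:
            break
    # 5. Min-P: drop tokens below min_p times the top probability.
    cutoff = min_p * probs[0]
    kept = [(i, p) for i, p in zip(order[:keep], probs[:keep]) if p >= cutoff]
    # 6. Multinomial draw; choices() renormalises the weights itself.
    ids = [i for i, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(sample([1.0, 2.0, 3.0, 0.5], rng=random.Random(0)))
```

Passing a seeded `random.Random` instance as `rng` makes draws reproducible, which helps when comparing against GPU output.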
# Build and run all tests
./scripts/build.sh --tests
cd build && ctest --output-on-failure
# Run a specific suite
./build/test_quantizer --success
./build/test_format --success
./build/test_kv_cache --success
./build/test_engine --success
# Integration test (requires a converted model)
SOVEREIGN_TEST_MODEL=gemma-4b.sovereign ctest -R test_integration

- Continuous batching: interleave multiple requests in a single GPU pass
- Speculative decoding: draft model integration for 2-4× decode speedup
- Cooperative matrix: VK_KHR_cooperative_matrix path for tensor-core acceleration
- io_uring Direct Storage: bypass staging buffers for PCIe 4.0+ NVMe
- Rust bindings: PyO3 alternative to pybind11
- Windows support: MinGW + Vulkan SDK on Windows
- Web UI: minimal OpenAI-compatible HTTP server (compatible with llama.cpp clients)
- LoRA / adapter merging: runtime LoRA weight injection without repack
- RWKV / Mamba: constant-memory inference via ConstantContextCache
- Benchmark suite: automated comparison vs llama.cpp on standard prompts
Contributions are welcome. Please open an issue before submitting large pull requests.
# Fork, clone, then create a feature branch
git checkout -b feat/my-feature
# Build with tests + debug symbols
./scripts/build.sh --debug --tests
# Make sure all tests pass before submitting
cd build && ctest --output-on-failure

Code style: follow the existing C++20 conventions (no exceptions in hot paths, [[nodiscard]] everywhere, PIMPL for public headers, RAII for all Vulkan handles).
MIT License — see LICENSE for details.
Sovereign Engine is an independent project and is not affiliated with Google, NVIDIA, AMD, or any model vendor.