Sovereign Engine

Ultra-fast, modular LLM inference engine with a Vulkan compute backend
Designed to surpass llama.cpp in throughput and VRAM efficiency.

┌──────────────────────────────────────────────────────────────────────────┐
│  Sovereign Engine  v0.2.10                                               │
│  C++20 · Vulkan 1.3 · SPIR-V Compute · pybind11 · Mixed INT4 Quant      │
└──────────────────────────────────────────────────────────────────────────┘


Overview

Sovereign Engine is a from-scratch, GPU-first LLM inference runtime written in C++20.
It targets local inference on consumer hardware (NVIDIA/AMD/Intel) using Vulkan compute as the sole GPU backend, which means:

  • No CUDA dependency — runs on any Vulkan 1.2+ GPU.
  • Tight control over VRAM: paged KV cache, async layer streaming, dynamic CPU offload.
  • Mixed-precision quantisation inspired by EXL2 and HQQ — assign INT4/INT3/INT2 per-tensor based on measured sensitivity.
  • A clean Python API (via pybind11) and a stable C ABI for FFI from any language.

Key Features

| Feature | Details |
|---|---|
| Vulkan backend | Compute-only, no graphics queue needed. Works on NVIDIA, AMD, Intel, ARM Mali. |
| Mixed-precision quantisation | FP16 → INT8 → Q4_K → Q3_K → Q2_K per tensor, HQQ solver, EXL2-style importance scoring. |
| Async layer pipeline | Double-buffered PCIe staging: GPU runs layer N while CPU DMA-copies layer N+1. |
| PagedAttention KV cache | Block-based VRAM pool, copy-on-write forking, O(1) alloc/free. |
| Dynamic CPU offload | Falls back to AVX-512 / NEON when VRAM pressure exceeds threshold. |
| Streaming generation | Token-by-token callback; GIL-safe Python generator. |
| Rich sampling | Temperature, Top-P, Top-K, Min-P, Repetition Penalty, Mirostat v1/v2, GBNF grammar, JSON schema. |
| Proprietary .sovereign format | Page-aligned mmap, per-tensor CRC32C, zero-copy Vulkan upload. |
| GQA / MHA / MQA | All attention variants supported via a single fused GLSL shader. |
| RoPE + sliding window | Inline rotary embeddings, optional Mistral/Gemma sliding-window mask. |

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           Python / C++ / C                               │
│                    (sovereign_inference.Engine)                          │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │       engine.cpp         │  prefill / decode_step /
                    │    (inference loop)      │  generate / forward
                    └──┬──────┬───────┬───────┘
                       │      │       │
         ┌─────────────▼─┐ ┌──▼────┐ ┌▼──────────────────┐
         │  VulkanContext │ │Quant  │ │ AsyncMemoryManager │
         │  (device,      │ │izer   │ │ (layer streaming,  │
         │   pipelines,   │ │       │ │  CPU offload)      │
         │   cmd bufs)    │ └───────┘ └────────────────────┘
         └───────┬────────┘                    │
                 │                   ┌─────────▼──────────┐
     ┌───────────▼────────────┐      │  PagedKVCache       │
     │  SPIR-V Compute Shaders│      │  (block pool,       │
     │  ┌─────────────────┐   │      │   CoW fork,         │
     │  │ rmsnorm.comp    │   │      │   descriptor sets)  │
     │  │ matmul_int4.comp│   │      └────────────────────┘
     │  │ attention_gqa   │   │
     │  │ silu_gate.comp  │   │
     │  │ sampler.comp    │   │
     │  └─────────────────┘   │
     └────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│                  sovereign-convert CLI                      │
│  SafeTensors → profile → budget allocate → HQQ quant       │
│               → pack INT4/3/2 → write .sovereign           │
└────────────────────────────────────────────────────────────┘

Project Structure

sovereign-engine/
├── CMakeLists.txt              # Root build configuration
├── README.md
├── .gitignore
│
├── include/sovereign/          # Public C++ headers
│   ├── engine.hpp              # Top-level inference API
│   ├── format.hpp              # .sovereign binary format spec
│   ├── vulkan_context.hpp      # Vulkan device + pipeline management
│   ├── memory_manager.hpp      # Async pipeline memory manager
│   ├── kv_cache.hpp            # PagedAttention KV cache
│   └── quantizer.hpp           # Mixed-precision quantiser
│
├── src/
│   ├── vulkan/
│   │   └── vulkan_context.cpp
│   ├── format/
│   │   └── format.cpp
│   ├── compute/
│   │   └── kv_cache.cpp
│   ├── inference/
│   │   └── engine.cpp
│   ├── quantizer/
│   │   └── quantizer.cpp
│   └── memory/
│       └── memory_manager.cpp
│
├── shaders/                    # GLSL compute shaders (compiled to SPIR-V)
│   ├── rmsnorm.comp
│   ├── matmul_int4.comp
│   ├── attention_gqa.comp
│   ├── silu_gate.comp
│   └── sampler.comp
│
├── bindings/
│   └── python/
│       └── sovereign_py.cpp    # pybind11 Python bindings
│
├── tools/
│   └── converter/
│       └── main.cpp            # sovereign-convert CLI
│
├── tests/
│   ├── CMakeLists.txt
│   ├── test_format.cpp
│   ├── test_quantizer.cpp
│   ├── test_kv_cache.cpp
│   └── test_engine.cpp
├── package.json                # Shader compiler package metadata
├── package-lock.json           # Shader compiler lock file
│
├── examples/
│   └── basic_generate.py       # Python streaming example
│
├── scripts/
│   ├── build.sh                # Build helper script
│   └── compile_shaders.js      # Shader compiler tool using WebGPU glslang
│
└── third_party/
    ├── volk/                   # Meta-loader for dynamic Vulkan loading (tracked)
    │   ├── volk.h
    │   └── volk.c
    └── vk_mem_alloc.h          # Fetched automatically via CMake (not tracked)

Requirements

Runtime

  • Vulkan 1.2+ Compatible GPU: Works on NVIDIA, AMD, Intel, Apple Silicon (via MoltenVK), and ARM Mali.
  • GPU Driver: Must support Vulkan 1.2 and the required extensions listed below. No SDK required at runtime!

Build (Zero-Dependency & SDK-Free)

Thanks to the volk dynamic meta-loader and automatic CMake dependency management, the Vulkan SDK is optional when building Sovereign Engine. Only CMake and a C++20 compiler are mandatory.

| Dependency | Version | Mandatory? | Notes |
|---|---|---|---|
| CMake | ≥ 3.25 | Yes | Handles the build orchestration |
| C++ Compiler | C++20 | Yes | MSVC 2022 / GCC 12+ / Clang 15+ |
| Vulkan SDK | ≥ 1.3 | No (optional) | If absent, CMake automatically fetches headers; uses precompiled SPIR-V shaders |
| Python | ≥ 3.9 | No (optional) | Only required to compile Python/pybind11 bindings |

Required Vulkan Extensions

Your GPU driver must support:

VK_KHR_timeline_semaphore        (core in 1.2)
VK_KHR_synchronization2          (core in 1.3)
VK_EXT_memory_budget
VK_KHR_buffer_device_address
VK_KHR_shader_float16_int8
VK_EXT_scalar_block_layout
VK_KHR_8bit_storage
VK_KHR_16bit_storage
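
A quick way to reason about this list is to treat each extension as satisfied either because the device's API version already includes it in core, or because the driver advertises it by name. The sketch below is not part of Sovereign Engine: `missing_extensions` is a hypothetical helper, and the advertised extension names are assumed to come from a tool such as `vulkaninfo`.

```python
# Hypothetical helper, not part of Sovereign Engine.
# Each required extension maps to the Vulkan version that promoted it to core
# (None = never promoted, so it must be advertised as an extension).
REQUIRED = {
    "VK_KHR_timeline_semaphore": (1, 2),
    "VK_KHR_synchronization2": (1, 3),
    "VK_EXT_memory_budget": None,
    "VK_KHR_buffer_device_address": (1, 2),
    "VK_KHR_shader_float16_int8": (1, 2),
    "VK_EXT_scalar_block_layout": (1, 2),
    "VK_KHR_8bit_storage": (1, 2),
    "VK_KHR_16bit_storage": (1, 1),
}

def missing_extensions(api_version, advertised):
    """Return required extensions that are neither core nor advertised."""
    advertised = set(advertised)
    missing = []
    for name, core_since in REQUIRED.items():
        in_core = core_since is not None and api_version >= core_since
        if not in_core and name not in advertised:
            missing.append(name)
    return missing

if __name__ == "__main__":
    # A Vulkan 1.2 device that only advertises VK_EXT_memory_budget.
    print(missing_extensions((1, 2), ["VK_EXT_memory_budget"]))
```

Note that promotion to core only guarantees the API is present; the matching feature bits can still be optional, so a real check would also query the device features.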

Building

Quick start

# Clone
git clone https://github.com/corbac10099/sovereign-engine.git
cd sovereign-engine

# Build (fetches vk_mem_alloc.h automatically)
chmod +x scripts/build.sh
./scripts/build.sh

# Or with all options explicit:
./scripts/build.sh --release --tests --python --avx512

Manual CMake

mkdir build && cd build
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DSOVEREIGN_BUILD_PYTHON=ON \
    -DSOVEREIGN_BUILD_TESTS=ON \
    -DSOVEREIGN_ENABLE_AVX512=ON
cmake --build . --parallel $(nproc)

Debug build with AddressSanitizer

./scripts/build.sh --debug

Usage

1. Convert a Model

Download a Hugging Face model (e.g. Gemma 4B) in SafeTensors format, then convert it:

# Basic conversion – mixed quantisation targeting 4.5 bpw
./build/sovereign-convert \
    --input  /path/to/gemma-4b/ \
    --output gemma-4b.sovereign \
    --arch   gemma \
    --quant  mixed \
    --bpw    4.5

# With calibration corpus for better importance scoring
./build/sovereign-convert \
    --input  /path/to/gemma-4b/ \
    --output gemma-4b-calibrated.sovereign \
    --quant  mixed \
    --bpw    4.5 \
    --calib  calibration_corpus.txt \
    --verbose

Quantisation modes:

| Mode | Approx bpw | Description |
|---|---|---|
| fp16 | 16 | No quantisation, maximum quality |
| int8 | 8 | Symmetric INT8 throughout |
| q4k | 4.5 | Q4_K block quantisation |
| q3k | 3.5 | Q3_K block quantisation |
| q2k | 2.6 | Q2_K aggressive compression |
| mixed | target | Adaptive per-tensor (recommended) |
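
To get a feel for what a given bpw means on disk and in VRAM, the weight size is roughly parameters × bits per weight / 8. The sketch below is a back-of-the-envelope estimate only: `estimated_weight_bytes` is a hypothetical helper, the 4-billion parameter count is an assumption, and the result ignores the tokenizer, headers, and the KV cache.

```python
# Approximate bits per weight for each fixed mode, taken from the table above.
APPROX_BPW = {"fp16": 16.0, "int8": 8.0, "q4k": 4.5, "q3k": 3.5, "q2k": 2.6}

def estimated_weight_bytes(n_params, bpw):
    """Approximate size of the quantised weights: parameters x bits / 8."""
    return int(n_params * bpw / 8)

if __name__ == "__main__":
    n_params = 4_000_000_000  # ASSUMPTION: a nominal "4B" parameter count
    for mode, bpw in APPROX_BPW.items():
        gib = estimated_weight_bytes(n_params, bpw) / 2**30
        print(f"{mode:>5}: {gib:5.2f} GiB")
```

Under these assumptions a 4.5 bpw target comes to 2.25 × 10^9 bytes of weights, about 2.1 GiB.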

Python-based Conversion

You can also convert models from Python, without building the C++ CLI tool:

import sovereign_inference

result = sovereign_inference.convert(
    input_dir="/path/to/gemma-4b/",
    output_path="gemma-4b.sovereign",
    arch="gemma",
    quant="mixed",
    bpw=4.5,
    verbose=True
)

if result["success"]:
    print(f"Conversion successful! Achieved bpw: {result['achieved_bpw']:.2f}")
else:
    print(f"Conversion failed: {result['error_message']}")

2. Python API

import sovereign_inference

# Load model
cfg = sovereign_inference.LoadConfig()
cfg.gpu_layer_count       = 2**31 - 1   # load everything into VRAM
cfg.kv_cache_vram_fraction = 0.80

with sovereign_inference.Engine.load("gemma-4b.sovereign", cfg) as engine:
    print(f"Model  : {engine.model_name}")
    print(f"Device : {engine.device_name}  ({engine.vram_gib:.1f} GiB)")

    # --- Streaming generation ---
    params = sovereign_inference.GenerateParams()
    params.max_new_tokens        = 512
    params.sampling.temperature  = 0.7
    params.sampling.top_p        = 0.9
    params.sampling.min_p        = 0.05
    params.sampling.repetition_penalty = 1.1

    stats = engine.generate(
        prompt   = "Explain quantum entanglement briefly:",
        params   = params,
        callback = lambda tok, tid, lp: print(tok, end="", flush=True) or True,
    )
    print(f"\n[{stats.tokens_per_second:.1f} tok/s | {stats.generated_tokens} tokens]")

    # --- Generator protocol ---
    for text, token_id, logprob in engine.stream("Once upon a time", params):
        print(text, end="", flush=True)

    # --- Raw logits for custom sampling ---
    ids    = engine.tokenize("The sky is")
    logits = engine.forward(ids)   # numpy float32 array [vocab_size]

3. C++ API

#include <cstdio>        // std::fprintf
#include <iostream>      // std::cout
#include <string_view>

#include "sovereign/engine.hpp"

int main() {
    sovereign::LoadConfig cfg;
    cfg.kv_cache_vram_fraction = 0.80;

    auto engine = sovereign::Engine::load("gemma-4b.sovereign", cfg);
    if (!engine) return 1;

    sovereign::GenerateParams params;
    params.max_new_tokens       = 512;
    params.sampling.temperature = 0.7f;
    params.sampling.top_p       = 0.9f;

    auto stats = engine->generate(
        "Explain quantum entanglement:",
        params,
        [](std::string_view tok, sovereign::TokenId, float) {
            std::cout << tok << std::flush;
            return true;   // return false to stop early
        });

    std::fprintf(stderr, "\n%.1f tok/s\n", stats.tokens_per_second);
}

4. C API (FFI)

#include "sovereign/engine.hpp"   // exposes extern "C" block

SovereignEngine* engine = sovereign_engine_load(
    "gemma-4b.sovereign",
    0,       // vram_budget (0 = auto)
    ~0u,     // gpu_layers  (all)
    true     // use_mmap
);

sovereign_engine_generate(
    engine,
    "Hello, world!",
    0.7f, 0.9f, 0, 0.05f, 1.1f,  // temperature, top_p, top_k, min_p, rep_penalty
    256,
    my_callback, NULL
);

sovereign_engine_free(engine);

The .sovereign Format

The .sovereign binary format is designed for zero-copy, memory-mapped inference:

┌──────────────┬──────────────────────────────────────────────────────┐
│ Offset       │ Section                                              │
├──────────────┼──────────────────────────────────────────────────────┤
│ 0x0000       │ FileHeader        (256 bytes, fixed)                 │
│ 0x0100       │ ModelConfig       (256 bytes, padded to 64B)         │
│ aligned      │ TokenizerBlob     (UTF-8 JSON)                       │
│ aligned      │ TensorIndex[]     (N × 192 bytes each)               │
│ PAGE-ALIGNED │ TensorDataBlock   (mmap-ready, 4K page aligned) ◀──┐ │
└──────────────┴──────────────────────────────────────────────────────┘
                                                                       │
Vulkan can mmap this block directly into a VkBuffer via               │
VK_EXT_external_memory_host — zero CPU copy during weight loading. ───┘

Key properties:

  • Magic bytes: SVRN (0x53, 0x56, 0x52, 0x4E)
  • All multi-byte fields: little-endian
  • Per-tensor CRC32C checksums (hardware-accelerated via SSE4.2)
  • Per-tensor DType field: supports F32, F16, BF16, INT8, INT4, INT3, INT2, Q4_K, Q3_K, Q2_K
  • Feature flags bitmask: MMAP_READY, HAS_TOKENIZER, GROUPED_QUERY, RoPE_SCALED, …
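
The layout rules above can be sketched in a few lines. This is an illustration rather than the reference parser: the helper names are hypothetical, the 64-byte alignment of the intermediate sections is an assumption (the table only says "aligned"), and the bitwise CRC32C is a slow software stand-in for the SSE4.2 instruction.

```python
MAGIC = b"SVRN"          # 0x53 0x56 0x52 0x4E
HEADER_SIZE = 256        # FileHeader, fixed
CONFIG_SIZE = 256        # ModelConfig
INDEX_ENTRY_SIZE = 192   # one TensorIndex entry
PAGE = 4096              # TensorDataBlock alignment
SECTION_ALIGN = 64       # ASSUMPTION: alignment of the "aligned" sections

def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment

def has_magic(buf):
    """True if the buffer starts with the SVRN magic bytes."""
    return buf[:4] == MAGIC

def section_offsets(tokenizer_len, n_tensors):
    """Byte offsets of the variable sections for a given tokenizer size and tensor count."""
    tokenizer = align_up(HEADER_SIZE + CONFIG_SIZE, SECTION_ALIGN)
    index = align_up(tokenizer + tokenizer_len, SECTION_ALIGN)
    data = align_up(index + n_tensors * INDEX_ENTRY_SIZE, PAGE)
    return {"tokenizer": tokenizer, "tensor_index": index, "tensor_data": data}

def crc32c(data, crc=0):
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

if __name__ == "__main__":
    print(section_offsets(tokenizer_len=1000, n_tensors=10))
    print(hex(crc32c(b"123456789")))  # standard CRC32C check value: 0xe3069283
```

Which bytes each per-tensor checksum covers is not specified here; the sketch assumes it is computed over the tensor's raw data.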

Quantiser

The quantiser runs a 3-phase pipeline:

Phase 1 – Calibration Profiling

Computes per-tensor activation statistics on a small calibration corpus (≥ 512 tokens):

  • Hessian proxy (mean squared activation magnitude)
  • Outlier ratio (fraction with |w| > 3σ)
  • Kurtosis (distribution peakedness)
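
As a rough illustration of these three statistics, the sketch below computes them over a flat list of values. `tensor_stats` is a hypothetical name; kurtosis is taken as the plain (non-excess) fourth standardised moment, which is an assumption about the definition used, and how the statistics are combined into a single importance score is not shown.

```python
import math

def tensor_stats(values):
    """Mean squared magnitude, 3-sigma outlier ratio and kurtosis of a tensor."""
    n = len(values)
    if n == 0:
        raise ValueError("empty tensor")
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    # Hessian proxy: mean squared magnitude.
    mean_sq = sum(v * v for v in values) / n
    # Fraction of entries whose magnitude exceeds three standard deviations.
    outlier_ratio = sum(1 for v in values if abs(v) > 3 * std) / n if std > 0 else 0.0
    # Non-excess kurtosis: about 3 for a Gaussian, larger for heavy tails.
    kurtosis = (sum((v - mean) ** 4 for v in values) / n) / var ** 2 if var > 0 else 0.0
    return {"mean_sq": mean_sq, "outlier_ratio": outlier_ratio, "kurtosis": kurtosis}

if __name__ == "__main__":
    print(tensor_stats([0.0] * 99 + [10.0]))
```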

Phase 2 – Budget Allocation

Assigns a DType to each tensor to hit a target average bpw:

importance ≥ 0.75  →  FP16 / INT8   (embeddings, first/last layers, norms)
importance ≥ 0.50  →  INT4 / Q4_K  (Q/K/V projections)
importance ≥ 0.25  →  Q3_K
importance <  0.25  →  Q2_K

Iteratively rebalances until |achieved_bpw - target_bpw| < 5%.
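
One way to picture the rebalancing loop is as a search over a global bias added to every importance score before applying the thresholds. The sketch below is a simplified stand-in, not the engine's allocator: `pick_dtype` and `allocate` are hypothetical, the top tier is always INT8 rather than FP16 / INT8, the bpw values come from the quantisation modes table, and the 5% tolerance is interpreted as relative to the target.

```python
BPW = {"INT8": 8.0, "Q4_K": 4.5, "Q3_K": 3.5, "Q2_K": 2.6}

def pick_dtype(importance, bias=0.0):
    """Apply the importance thresholds, shifted by a global bias."""
    score = importance + bias
    if score >= 0.75:
        return "INT8"
    if score >= 0.50:
        return "Q4_K"
    if score >= 0.25:
        return "Q3_K"
    return "Q2_K"

def average_bpw(tensors, bias):
    """Element-weighted average bpw for tensors given as (importance, n_elements)."""
    bits = sum(n * BPW[pick_dtype(imp, bias)] for imp, n in tensors)
    return bits / sum(n for _, n in tensors)

def allocate(tensors, target_bpw, tol=0.05, steps=40):
    """Bisect on the bias until the achieved bpw is within tol of the target."""
    lo, hi = -1.0, 1.0  # -1 forces Q2_K everywhere, +1 forces INT8 everywhere
    bias = 0.0
    achieved = average_bpw(tensors, bias)
    for _ in range(steps):
        bias = (lo + hi) / 2
        achieved = average_bpw(tensors, bias)
        if abs(achieved - target_bpw) < tol * target_bpw:
            break
        if achieved > target_bpw:
            hi = bias   # too many bits: lower every score
        else:
            lo = bias   # too few bits: raise every score
    return bias, achieved, [pick_dtype(imp, bias) for imp, _ in tensors]

if __name__ == "__main__":
    tensors = [(0.9, 1000), (0.6, 4000), (0.4, 3000), (0.1, 2000)]
    print(allocate(tensors, target_bpw=4.5))
```

Because the achieved bpw is a step function of the bias, the search may end without meeting the tolerance; in that case the sketch returns its last attempt.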

Phase 3 – HQQ Quantisation

Per-block iterative solver minimising the Hessian-weighted MSE:

min_{scale, zero} ‖W − dequant(quant(W, scale, zero))‖²_H

Default: 20 iterations, block size 128 elements, FP16 scale storage.
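
To make the objective concrete, here is a minimal sketch that minimises the same Hessian-weighted error for one block. It is not the HQQ solver itself: it swaps in a plain alternating scheme (re-round the codes, then refit scale and zero by weighted least squares), and `solve_block`, `quantise_codes`, and `weighted_error` are hypothetical names. `h` stands for the per-element Hessian weights and is assumed to be positive.

```python
def quantise_codes(w, scale, zero, qmax):
    """Round each weight to the nearest integer code in [0, qmax]."""
    return [min(qmax, max(0, round((x - zero) / scale))) for x in w]

def weighted_error(w, h, q, scale, zero):
    """Hessian-weighted squared error between w and its dequantised codes."""
    return sum(hw * (x - (scale * c + zero)) ** 2 for x, c, hw in zip(w, q, h))

def solve_block(w, h, bits=4, iters=20):
    """Alternate between re-rounding codes and refitting (scale, zero)."""
    qmax = (1 << bits) - 1
    w_min, w_max = min(w), max(w)
    scale = (w_max - w_min) / qmax or 1.0   # min-max initialisation
    zero = w_min
    q = quantise_codes(w, scale, zero, qmax)
    total = sum(h)
    for _ in range(iters):
        # Weighted least-squares fit of w ~ scale * q + zero with q held fixed.
        mean_q = sum(hw * c for c, hw in zip(q, h)) / total
        mean_w = sum(hw * x for x, hw in zip(w, h)) / total
        var_q = sum(hw * (c - mean_q) ** 2 for c, hw in zip(q, h))
        cov = sum(hw * (c - mean_q) * (x - mean_w) for x, c, hw in zip(w, q, h))
        if var_q == 0 or cov <= 0:
            break   # degenerate block: keep the current parameters
        scale = cov / var_q
        zero = mean_w - scale * mean_q
        q = quantise_codes(w, scale, zero, qmax)
    return q, scale, zero

if __name__ == "__main__":
    w = [0.0, 0.1, 0.25, 0.4, 0.55, 0.7, 0.9, 3.0]
    h = [1.0] * 7 + [0.01]   # the outlier matters little to the output
    q, scale, zero = solve_block(w, h)
    print(q, scale, zero, weighted_error(w, h, q, scale, zero))
```

Each half-step can only keep or lower the weighted error, so the result is never worse than the min-max starting point (up to floating-point rounding).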


Memory Manager

The AsyncMemoryManager implements a double-buffered layer-streaming pipeline:

CPU Thread             GPU Compute Queue       DMA Transfer Queue
──────────             ─────────────────       ─────────────────

[Layer N-1 ready] ──▶  Compute(Layer N-1)
                                │
[Stream Layer N+1] ─────────────┼──────────▶ DMA(Layer N+1)
  (from mmap/RAM)               │                  │
                                ▼                  ▼
                        Compute(Layer N) ◀── Layer N ready
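
The overlap in the diagram can be mimicked on the CPU with a single worker thread standing in for the DMA queue. This is only a sketch of the scheduling idea: `run_layers`, `load_layer`, and `compute_layer` are hypothetical, and the real pipeline uses Vulkan queues and semaphores rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layers(n_layers, load_layer, compute_layer):
    """Overlap the load of layer N+1 with the compute of layer N."""
    if n_layers == 0:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(load_layer, 0)
        for n in range(n_layers):
            weights = pending.result()                   # wait until layer N is resident
            if n + 1 < n_layers:
                pending = dma.submit(load_layer, n + 1)  # start streaming layer N+1
            results.append(compute_layer(n, weights))    # compute layer N meanwhile
    return results

if __name__ == "__main__":
    out = run_layers(3, lambda n: f"weights{n}", lambda n, w: (n, w))
    print(out)
```

With one transfer in flight at a time, two staging buffers are enough: one being consumed by compute, one being filled.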

VRAM pressure response:

  • 88% → start evicting least-recently-used layers (tracked in an LRU free-list)

  • 95% → force CPU offload via AVX-512 / NEON kernels
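
The two thresholds can be read as a small policy: evict least-recently-used layers once usage reaches 88%, and signal CPU offload if usage is still at or above 95% afterwards. The sketch below encodes that reading; `LayerResidency` is a hypothetical class, and the exact interaction between the two thresholds is an assumption.

```python
from collections import OrderedDict

EVICT_THRESHOLD = 0.88
OFFLOAD_THRESHOLD = 0.95

class LayerResidency:
    """Tracks which layers are resident in VRAM, oldest first."""

    def __init__(self, vram_budget):
        self.budget = vram_budget
        self.resident = OrderedDict()   # layer_id -> size in bytes

    def used(self):
        return sum(self.resident.values())

    def touch(self, layer_id, size):
        """Mark a layer as used (loading it if needed), then react to pressure."""
        self.resident[layer_id] = size
        self.resident.move_to_end(layer_id)
        evicted = []
        # Evict the least recently used layers, but never the one just touched.
        while self.used() / self.budget >= EVICT_THRESHOLD and len(self.resident) > 1:
            victim, _ = self.resident.popitem(last=False)
            evicted.append(victim)
        offload = self.used() / self.budget >= OFFLOAD_THRESHOLD
        return evicted, offload

if __name__ == "__main__":
    vram = LayerResidency(vram_budget=100)
    for layer, size in [(0, 40), (1, 40), (2, 40), (3, 96)]:
        print(layer, vram.touch(layer, size))
```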


KV Cache (PagedAttention)

Inspired by vLLM's PagedAttention:

  • One giant VRAM pool pre-allocated at startup (no per-block VkBuffer overhead).
  • Block size: 16 tokens per block (configurable, must be power-of-2).
  • Copy-on-write forking: beam search / speculative decoding shares blocks until a write occurs.
  • Descriptor sets pre-allocated per (block_id × layer) pair to avoid per-inference allocation.
  • Optional ConstantContextCache for RWKV / Mamba models (O(1) memory regardless of sequence length).
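
The block table and copy-on-write behaviour can be sketched with a free-list and per-block reference counts. This is a bookkeeping model only: the class below is hypothetical, it shares the name PagedKVCache purely for readability, and it tracks block ids without storing any key/value data or Vulkan descriptors.

```python
class PagedKVCache:
    """Block-table bookkeeping with copy-on-write forking (no tensor data)."""

    def __init__(self, num_blocks, block_size=16):
        if block_size <= 0 or block_size & (block_size - 1):
            raise ValueError("block size must be a power of two")
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free-list: O(1) pop / append
        self.refcount = [0] * num_blocks
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens written

    def _alloc(self):
        block = self.free.pop()               # IndexError when the pool is exhausted
        self.refcount[block] = 1
        return block

    def _release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

    def new_sequence(self, seq_id):
        self.tables[seq_id] = []
        self.lengths[seq_id] = 0

    def fork(self, parent, child):
        """Share every block of the parent; nothing is copied yet."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for block in self.tables[child]:
            self.refcount[block] += 1

    def append_token(self, seq_id):
        """Reserve a slot for one token and return the block that holds it."""
        table, pos = self.tables[seq_id], self.lengths[seq_id]
        if pos % self.block_size == 0:
            table.append(self._alloc())       # previous block is full
        elif self.refcount[table[-1]] > 1:
            old = table[-1]                   # shared block: copy on write
            table[-1] = self._alloc()         # a real cache would copy the contents here
            self._release(old)
        self.lengths[seq_id] = pos + 1
        return table[-1]

    def free_sequence(self, seq_id):
        for block in self.tables.pop(seq_id):
            self._release(block)
        del self.lengths[seq_id]
```

After a fork, both sequences keep pointing at the same full blocks; only the partially filled tail block is duplicated, and only when one of them writes to it.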

Vulkan Compute Shaders

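Shaders are written in GLSL (.comp) and compiled to SPIR-V at CMake configure time when the Vulkan SDK is available; without the SDK, the build uses the precompiled SPIR-V shaders (see Requirements):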

| Shader | Purpose |
|---|---|
| rmsnorm.comp | Fused RMSNorm with subgroup reduction; supports Gemma variant |
| matmul_int4.comp | Tiled INT4×FP16 GEMM with on-the-fly dequantisation and double-buffered B tiles |
| attention_gqa.comp | GQA/MHA/MQA fused attention: RoPE inline, PagedAttention block table, Flash-Attention tiled softmax |
| silu_gate.comp | Fused SwiGLU (SiLU × hadamard) for LLaMA/Gemma FFN |
| sampler.comp | GPU-resident sampling: temperature → top-K → softmax → top-P → min-P → multinomial |

All shaders use GL_EXT_scalar_block_layout for tightly packed buffer layouts and GL_KHR_shader_subgroup_arithmetic for efficient subgroup reductions.
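
The stage order listed for sampler.comp can be followed on the CPU in a few lines. The sketch below is a reference-style illustration, not a port of the shader: `sample` is a hypothetical function, `top_k=0` is assumed to mean "disabled", and `temperature` must be greater than zero.

```python
import math
import random

def sample(logits, temperature=0.7, top_k=0, top_p=0.9, min_p=0.05, rng=random):
    """temperature -> top-K -> softmax -> top-P -> min-P -> multinomial."""
    # 1. Temperature scaling, keeping track of the token ids.
    scaled = sorted(((l / temperature, i) for i, l in enumerate(logits)), reverse=True)
    # 2. Top-K: keep the K largest logits (0 disables the filter).
    if top_k > 0:
        scaled = scaled[:top_k]
    # 3. Softmax over the survivors, shifted by the max for numerical stability.
    peak = scaled[0][0]
    exps = [math.exp(l - peak) for l, _ in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-P: smallest prefix whose cumulative probability reaches top_p.
    keep, cumulative = 0, 0.0
    for p in probs:
        keep += 1
        cumulative += p
        if cumulative >= top_p:
            break
    # 5. Min-P: drop tokens whose probability is below min_p times the best one.
    floor = min_p * probs[0]
    candidates = [(p, i) for p, (_, i) in zip(probs[:keep], scaled) if p >= floor]
    # 6. Multinomial draw; choices() renormalises the weights itself.
    ids = [i for _, i in candidates]
    weights = [p for p, _ in candidates]
    return rng.choices(ids, weights=weights, k=1)[0]

if __name__ == "__main__":
    print(sample([1.0, 2.0, 3.0], rng=random.Random(0)))
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when comparing against GPU output.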


Running Tests

# Build and run all tests
./scripts/build.sh --tests
cd build && ctest --output-on-failure

# Run a specific suite
./build/test_quantizer --success
./build/test_format    --success
./build/test_kv_cache  --success
./build/test_engine    --success

# Integration test (requires a converted model)
SOVEREIGN_TEST_MODEL=gemma-4b.sovereign ctest -R test_integration

Roadmap

  • Continuous batching — interleave multiple requests in a single GPU pass
  • Speculative decoding — draft model integration for 2-4× decode speedup
  • Cooperative matrix — VK_KHR_cooperative_matrix path for tensor-core acceleration
  • io_uring Direct Storage — bypass staging buffers for PCIe 4.0+ NVMe
  • Rust bindings — PyO3 alternative to pybind11
  • Windows support — MinGW + Vulkan SDK on Windows
  • Web UI — minimal OpenAI-compatible HTTP server (compatible with llama.cpp clients)
  • LoRA / adapter merging — runtime LoRA weight injection without repack
  • RWKV / Mamba — constant-memory inference via ConstantContextCache
  • Benchmark suite — automated comparison vs llama.cpp on standard prompts

Contributing

Contributions are welcome. Please open an issue before submitting large pull requests.

# Fork, clone, then create a feature branch
git checkout -b feat/my-feature

# Build with tests + debug symbols
./scripts/build.sh --debug --tests

# Make sure all tests pass before submitting
cd build && ctest --output-on-failure

Code style: follow the existing C++20 conventions (no exceptions in hot paths, [[nodiscard]] everywhere, PIMPL for public headers, RAII for all Vulkan handles).


License

MIT License — see LICENSE for details.


Sovereign Engine is an independent project and is not affiliated with Google, NVIDIA, AMD, or any model vendor.
