ThemisDB Architecture Documentation

Overview

ThemisDB is a high-performance, multi-model database system that integrates relational, graph, vector, and document models with native AI/LLM capabilities. The architecture is split into modular components, each in its own C++ namespace, that together provide a complete enterprise database solution.

Core Principles:

Modularity: Optional components, selectable at build time (MINIMAL, COMMUNITY, ENTERPRISE, HYPERSCALER editions)
Layered Architecture: Clear separation between API, Query, Storage, and Distributed concerns
Namespace Organization: Logical grouping using C++ namespaces (themis::*)
High Performance: GPU acceleration, SIMD optimizations, adaptive indexing
Enterprise Ready: ACID transactions, encryption, audit logging, observability

Main Directory Structure

`/src/` - Implementation (44 Core Components)

Directory	Purpose	Key Classes
acceleration/	GPU & hardware backends (CUDA, HIP, Vulkan, OpenCL)	CudaBackend, HipBackend, VulkanBackend
analytics/	Process mining, OLAP, diff engine, NLP analysis	OlapEngine, DiffEngine, ProcessAnalyzer
api/	GraphQL API, HTTP server setup	GraphQLAPI
aql/	AQL-specific handlers and assistant functions	LlmAqlHandler, DocsAssistant
auth/	Authentication (JWT, GSSAPI, MFA)	JWTValidator, GSSAPIAuthenticator
base/	Core module loader and initialization	ModuleLoader
cache/	Semantic caching, query caching, embedding caching	SemanticCache, AdaptiveQueryCache
cdc/	Change Data Capture and changefeeds	ChangeFeed, ChangeBuffer
chimera/	Adapter factory for database compatibility	ThemisDBAdapter, IDatabaseAdapter
config/	Backward-compatible config path resolution, LRU caching, JSON Schema validation	ConfigPathResolver, ConfigSchemaValidator, ConfigAuditLog
content/	Multimodal ingestion (PDF, images, audio, video, CAD)	ContentManager, AsyncIngestionWorker
core/	Security initialization, concerns context (logging, tracing)	ConcernsContext, SecurityInit
exporters/	Data export in various formats	JsonlLlmExporter
geo/	Geospatial query processing and indexing	SpatialBackend, GpuBackend
governance/	Policy engine, compliance, versioning	PolicyEngine, ComplianceReporter
graph/	Property graphs, graph indexing, path constraints	PropertyGraph, GraphIndex
gpu/	GPU-specific memory and acceleration	GpuMemoryManager
importers/	Data import (PostgreSQL, etc.)	PostgresImporter
index/	Vector indexing (HNSW, quantization), graph indices	VectorIndex, GraphIndex, HnswIndex
ingestion/	Multi-source data intake (filesystem, HuggingFace, REST API), rate limiting, checkpointing	IngestionManager, FileSystemIngester, HuggingFaceConnector
llm/	LLM integration, inference, LoRA, embeddings, vision	EmbeddedLlm, LoraFramework, FlashAttention
metadata/	Schema management	SchemaManager
network/	Wire protocol, socket management	WireProtocolServer
observability/	Metrics, profiling, alerting	MetricsCollector, QueryProfiler
performance/	Advanced data structures (RCU, LIRS, lock-free buffers)	PerformanceOptimizations
plugins/	Plugin system, hot-plugging, RPC interfaces	PluginManager, PluginRegistry
prompt_engineering/	Prompt template lifecycle, version control, A/B testing, self-optimization, injection detection	PromptManager, PromptOptimizer, SelfImprovementOrchestrator, PromptInjectionDetector
query/	AQL parser, optimizer, execution engine	QueryEngine, AqlParser, QueryOptimizer
rag/	RAG evaluation (faithfulness, relevance, bias detection)	RagJudge, CoherenceEvaluator
replication/	Multi-master replication	ReplicationManager
scheduler/	Task scheduling, retention management	TaskScheduler, HybridRetentionManager
search/	Hybrid search (vector + full-text)	HybridSearch
security/	Encryption, key management, PKI, RBAC, audit	RbacManager, FieldEncryption, KeyProvider
server/	HTTP/gRPC servers, 40+ API handlers, rate limiting, tenant management	HttpServer, ApiGateway, QueryApiHandler, RateLimiter, TenantManager
sharding/	Horizontal scaling, consensus (Raft/Paxos/Gossip)	ShardRouter, RaftConsensus, DistributedCoordinator
storage/	RocksDB wrapper, compression, blob storage, transactions	StorageEngine, BlobStorageManager
temporal/	Conflict resolution for temporal data	TemporalConflictResolver
timeseries/	Time series compression (Gorilla), aggregates, retention	TimeSeriesManager, GorillaEncoder, GorillaDecoder
training/	Domain-specific LLM fine-tuning, LoRA adapter management, knowledge graph enrichment	LegalAutoLabeler, IncrementalLoRATrainer, KnowledgeGraphEnricher
transaction/	ACID transactions, SAGA pattern, branching	TransactionManager, SagaManager
updates/	Hot reload, manifest management, version control	HotReloadEngine, ReleaseManifest
utils/	Logging, PII detection, compression, utilities	Logger, PiiDetector, Serialization
voice/	Voice assistant integration	VoiceAssistant

`/include/` - Public Headers

Headers are organized by component and follow consistent namespace patterns. They cover query engines, storage, sharding, LLM frameworks, indexing, security, servers, content processing, and governance.

Architectural Layers

1. API & Protocol Layer (Server Tier)

Namespace: themis::server::*

Handles multiple protocol frontends:

HTTP/2 and HTTP/3: REST API, GraphQL endpoint
gRPC: Binary protocol for high-performance clients
WebSocket: Real-time streaming and subscriptions
PostgreSQL Wire Protocol: PostgreSQL client compatibility
MQTT: IoT device integration
Binary Wire Protocol: Custom high-performance protocol

Key Components:

40+ specialized API handlers for different domains (query, storage, LLM, geo, graph, etc.)
API Gateway with authentication, rate limiting (V1/V2), load shedding
Tenant management with resource quotas and isolation
SSE connection manager for changefeed streaming with rate limits
Request routing and protocol translation

2. Query Processing Layer

Namespace: themis::query::*

Complete SQL-like query processing pipeline:

Request → Parser → Translator → Optimizer → Executor → ResultStream

Features:

AQL Parser: Parses Advanced Query Language (AQL) queries into an AST
Query Optimizer: Cost-based optimization with learned models
Execution Engine: Streaming execution with pipelining
100+ Built-in Functions: Across 12 categories (vector, graph, geo, string, math, etc.)
CTE Support: Common Table Expressions with caching
Window Functions: Ranking, aggregation over partitions
UDF Support: User-defined functions

Function Categories:

Vector operations (similarity, embeddings)
Graph algorithms (traversal, shortest path, community detection)
Geospatial (distance, containment, spatial operations)
String manipulation (concatenation, regex, NLP)
Mathematical (arithmetic, statistical)
Relational (joins, aggregations, window functions)
Array operations
Date/time functions
AI/ML functions
Ethics functions (fairness metrics, bias detection)
Security functions (encryption, hashing)
LoRA-specific operations

3. Index & Vector Layer

Namespace: themis::index::*

Advanced indexing for multi-model data:

Vector Indexing:

HNSW (Hierarchical Navigable Small World) index
GPU-accelerated vector search (CUDA, HIP, Vulkan)
Quantization support (product quantization, scalar quantization)
Faiss integration for large-scale vector search
Hybrid search combining vector and full-text
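To make the quantization bullet concrete, here is a minimal sketch of scalar quantization: each float is mapped to an 8-bit code using a per-vector minimum and step size. The `QuantizedVector`, `quantize`, and `dequantize` names are hypothetical and are not taken from the ThemisDB sources; the real index may use a different scheme (for example, per-dimension ranges or product quantization).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical illustration type; not part of the ThemisDB API.
struct QuantizedVector {
    float min;                        // smallest value of the original vector
    float scale;                      // width of one quantization step
    std::vector<std::uint8_t> codes;  // one 8-bit code per dimension
};

// Maps every component to an integer code in [0, 255].
QuantizedVector quantize(const std::vector<float>& v) {
    assert(!v.empty());
    const auto [lo, hi] = std::minmax_element(v.begin(), v.end());
    QuantizedVector q;
    q.min = *lo;
    q.scale = (*hi - *lo) / 255.0f;
    if (q.scale == 0.0f) q.scale = 1.0f;  // constant vector: any step size works
    q.codes.reserve(v.size());
    for (float x : v) {
        const long code = std::clamp(std::lround((x - q.min) / q.scale), 0L, 255L);
        q.codes.push_back(static_cast<std::uint8_t>(code));
    }
    return q;
}

// Reconstructs an approximation of the original vector from its codes.
std::vector<float> dequantize(const QuantizedVector& q) {
    std::vector<float> out;
    out.reserve(q.codes.size());
    for (std::uint8_t c : q.codes) {
        out.push_back(q.min + q.scale * static_cast<float>(c));
    }
    return out;
}
```

The trade-off is a 4x reduction in memory per vector against a reconstruction error of at most half a quantization step per dimension.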

Graph Indexing:

Property graph index
Path constraint optimization
Community detection indices

Spatial Indexing:

R-tree for 2D/3D spatial queries
H3 hexagonal hierarchical indexing
GPU-accelerated spatial operations

4. LLM Integration Layer

Namespace: themis::llm::*

Native large language model capabilities integrated directly into the database:

Core Components:

EmbeddedLlm: Native llama.cpp integration for inference
LoRA Framework: Multi-GPU training with NCCL/RCCL
Flash Attention: Optimized attention mechanisms (CUDA, HIP, Vulkan)
Vision Processing: Image and video understanding, CLIP integration
RAG Evaluation: Faithfulness, coherence, relevance, bias detection

Features:

Async inference engine with continuous batching
Paged KV-cache for memory efficiency
Prefix caching for repeated prompts
Adaptive VRAM allocation
Mixed precision training (FP16, BF16, INT8)
Quantized model support (GGUF format)
Grammar-constrained generation
Multi-LoRA adapter management
Model hot-swapping
Ethical guidelines enforcement

LoRA Framework Components (40+):

Multi-GPU distributed training
Gradient checkpointing
Mixed precision optimization
Quantization-aware training
Adaptive batching
Resource profiling
GPU memory management
Model compatibility checks
Audit logging
Feedback collection

5. Storage Layer

Namespace: themis::storage::*

Robust persistent storage built on RocksDB:

Features:

RocksDB-based key-value store
Compression (LZ4, Zstd, Snappy)
Field-level encryption
Blob storage backends:
- S3-compatible storage
- Azure Blob Storage
- WebDAV
- Local filesystem
Erasure coding for redundancy
Write-ahead logging (WAL)
Snapshot isolation

Storage Engine:

Column families for data organization
Bloom filters for fast lookups
LSM tree optimization
Compaction strategies
Cache management

6. Distributed & Sharding Layer

Namespace: themis::sharding::*

Horizontal scaling and distributed coordination:

Consensus Algorithms:

Raft: Leader-based consensus for strong consistency
Paxos: Leaderless consensus for fault tolerance
Gossip: Eventual consistency for high availability

Features:

Cross-shard transactions
Distributed query execution
Cluster management
Health monitoring
Automatic failover
Data rebalancing
Geo-sharding support

Components:

ShardRouter: Query routing to responsible shards
DistributedCoordinator: Cluster state management
ConsensusFactory: Pluggable consensus implementations
HealthMonitor: Node health tracking
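This document does not describe how ShardRouter maps a key to a shard. As an illustration only, the sketch below assumes the simplest scheme, a stable hash of the key modulo the shard count. `ModuloShardRouter` and `fnv1a` are hypothetical names; FNV-1a is used because, unlike `std::hash`, it produces the same value on every node.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// 64-bit FNV-1a: a simple hash that is stable across processes and platforms.
std::uint64_t fnv1a(std::string_view key) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Hypothetical router; the real ShardRouter interface is not shown in this document.
class ModuloShardRouter {
public:
    explicit ModuloShardRouter(std::uint32_t shardCount) : shards_(shardCount) {
        assert(shardCount > 0);
    }

    // Returns the index of the shard responsible for the key.
    std::uint32_t shardFor(std::string_view key) const {
        return static_cast<std::uint32_t>(fnv1a(key) % shards_);
    }

    // True when the key can be served by this node without a remote hop.
    bool isLocal(std::string_view key, std::uint32_t localShard) const {
        return shardFor(key) == localShard;
    }

private:
    std::uint32_t shards_;
};
```

Modulo routing remaps most keys when the shard count changes, so a system that rebalances data usually prefers consistent hashing or an explicit range map.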

7. Transaction Layer

Namespace: themis::transaction::*

ACID guarantees for reliable data operations:

Features:

MVCC (Multi-Version Concurrency Control)
Snapshot isolation
SAGA pattern for distributed transactions
Transaction branching
Versioning and time-travel queries
Optimistic concurrency control
Deadlock detection
Automatic rollback on failure
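The interaction between MVCC and snapshot isolation can be sketched as follows: every committed write creates a new version tagged with a commit timestamp, and a reader only sees the newest version at or before its snapshot. `VersionedStore` is a hypothetical, single-threaded toy and leaves out locking, garbage collection, and write-conflict detection.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical toy store; not the ThemisDB TransactionManager.
class VersionedStore {
public:
    // Commits a new version of the key under the next commit timestamp.
    void commit(const std::string& key, std::string value) {
        versions_[key][++clock_] = std::move(value);
    }

    // A snapshot is simply the latest commit timestamp at the time of the call.
    std::uint64_t snapshot() const { return clock_; }

    // Returns the newest version committed at or before the snapshot.
    std::optional<std::string> read(const std::string& key,
                                    std::uint64_t snapshotTs) const {
        assert(snapshotTs <= clock_);  // snapshots cannot come from the future
        auto it = versions_.find(key);
        if (it == versions_.end()) return std::nullopt;
        auto newer = it->second.upper_bound(snapshotTs);  // first version after snapshot
        if (newer == it->second.begin()) return std::nullopt;
        return std::prev(newer)->second;
    }

private:
    std::uint64_t clock_ = 0;
    std::unordered_map<std::string, std::map<std::uint64_t, std::string>> versions_;
};
```

A reader that took its snapshot before a later commit keeps seeing the older value, which is also the basis for time-travel queries over retained versions.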

Transaction Manager:

Begin/commit/rollback operations
Lock management
Conflict resolution
Transaction log
Recovery mechanisms

8. Content & Data Processing Layer

Namespace: themis::content::*

Multimodal data ingestion and processing:

Supported Formats:

Documents: PDF, Word, Excel, PowerPoint
Images: JPEG, PNG, TIFF, WebP
Audio: MP3, WAV, FLAC
Video: MP4, AVI, MKV
CAD: DWG, DXF, STL
Archives: ZIP, TAR, GZ

Features:

Async processing pipelines
Bulk upload support
Content version management
Metadata extraction (ML-based)
Document classification
Text extraction (OCR)
Image analysis
Audio transcription

9. Analytics & Observability Layer

Namespace: themis::analytics::*, themis::observability::*

Business intelligence and system monitoring:

Analytics:

OLAP queries with columnar processing
Process mining and workflow analysis
Diff engine for change analysis
NLP-based text analytics
Statistical analysis functions

Observability:

Metrics collection (Prometheus-compatible)
Query profiling
Performance monitoring
Alerting system
Distributed tracing (OpenTelemetry)
Log aggregation
Health checks

10. Governance & Compliance Layer

Namespace: themis::governance::*

Policy enforcement and regulatory compliance:

Features:

Policy engine for data governance
Compliance reporting (GDPR, HIPAA, SOC2)
Data lineage tracking
Version control for schema and policies
Automated policy reviews
Audit trail
Data retention policies
PII detection and masking

11. Configuration & Ingestion Layer

Namespaces: themis::config::*, themis::ingestion::*

Infrastructure for configuration management and data intake:

Config:

Backward-compatible legacy-to-new config path resolution (50+ mapped paths)
LRU cache with configurable capacity and TTL for resolved paths
Path traversal prevention and symlink escape detection
JSON Schema (Draft 7) validation for YAML/JSON configuration files
Typed exception hierarchy (ConfigNotFoundException, InvalidPathException)
Deprecation metadata with migration guide URLs per path
Prometheus metrics export via /metrics endpoint
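The LRU cache with capacity and TTL mentioned above can be sketched with a linked list for recency order and a hash map for lookup. `LruTtlCache` is a hypothetical name and the real ConfigPathResolver cache may differ; the clock is passed in as a parameter so that expiry can be tested deterministically.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch; not thread-safe.
class LruTtlCache {
public:
    using Clock = std::chrono::steady_clock;

    LruTtlCache(std::size_t capacity, Clock::duration ttl)
        : capacity_(capacity), ttl_(ttl) {
        assert(capacity > 0);
    }

    // Inserts or replaces an entry and evicts the least recently used one if full.
    void put(const std::string& key, std::string value,
             Clock::time_point now = Clock::now()) {
        if (auto it = index_.find(key); it != index_.end()) {
            entries_.erase(it->second);
            index_.erase(it);
        }
        entries_.push_front(Entry{key, std::move(value), now + ttl_});
        index_[key] = entries_.begin();
        if (entries_.size() > capacity_) {
            index_.erase(entries_.back().key);
            entries_.pop_back();
        }
    }

    // Returns the value if present and not expired, and marks it most recently used.
    std::optional<std::string> get(const std::string& key,
                                   Clock::time_point now = Clock::now()) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        if (now >= it->second->expires) {  // entry outlived its TTL
            entries_.erase(it->second);
            index_.erase(it);
            return std::nullopt;
        }
        entries_.splice(entries_.begin(), entries_, it->second);
        return it->second->value;
    }

private:
    struct Entry {
        std::string key;
        std::string value;
        Clock::time_point expires;
    };

    std::size_t capacity_;
    Clock::duration ttl_;
    std::list<Entry> entries_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Both `put` and `get` run in expected constant time, because `std::list::splice` moves a node without invalidating the iterators stored in the index.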

Ingestion:

Multi-source document intake: filesystem, HuggingFace datasets, generic REST APIs
Parallel source orchestration via thread pool with configurable concurrency
Token-bucket rate limiting per source
Incremental checkpoint-based ingestion (skip already-processed records)
Quarantine queue for persistently failing documents with per-record retry
Dry-run mode for pipeline validation without database writes
Binary MIME detection (PDF, DOCX) before content dispatch
Prometheus-compatible throughput and error metrics
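Checkpoint-based ingestion and the quarantine queue fit together roughly as in the sketch below: records already in the checkpoint are skipped, each remaining record gets a bounded number of attempts, and records that keep failing are set aside. `CheckpointedIngester` and `IngestStats` are hypothetical names, and the in-memory checkpoint stands in for whatever durable store the real IngestionManager uses.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical counters for one ingestion run.
struct IngestStats {
    std::size_t processed = 0;
    std::size_t skipped = 0;
    std::size_t quarantined = 0;
};

// Hypothetical sketch; a real checkpoint would be persisted between runs.
class CheckpointedIngester {
public:
    explicit CheckpointedIngester(int maxAttempts) : maxAttempts_(maxAttempts) {
        assert(maxAttempts > 0);
    }

    // `process` returns true on success; it is retried up to maxAttempts times.
    IngestStats run(const std::vector<std::string>& recordIds,
                    const std::function<bool(const std::string&)>& process) {
        IngestStats stats;
        for (const std::string& id : recordIds) {
            if (checkpoint_.count(id) != 0) {  // already ingested in an earlier run
                ++stats.skipped;
                continue;
            }
            bool ok = false;
            for (int attempt = 0; attempt < maxAttempts_ && !ok; ++attempt) {
                ok = process(id);
            }
            if (ok) {
                checkpoint_.insert(id);
                ++stats.processed;
            } else {
                quarantine_.push_back(id);  // persistently failing record
                ++stats.quarantined;
            }
        }
        return stats;
    }

    const std::vector<std::string>& quarantine() const { return quarantine_; }

private:
    int maxAttempts_;
    std::unordered_set<std::string> checkpoint_;
    std::vector<std::string> quarantine_;
};
```

A dry-run mode could be layered on top by passing a `process` callback that validates the record but performs no database write.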

12. Prompt Engineering & Training Layer

Namespaces: themis::prompt_engineering::*, themis::training::*

Lifecycle management for LLM prompts and domain-specific fine-tuning adapters:

Prompt Engineering:

Template CRUD with RocksDB persistence and YAML bulk-load
Git-like version control: branches, commits, diffs, rollback
Iterative prompt optimization via meta-prompts and feedback loops
A/B testing with statistical significance (Welch's t-test / normal CDF)
Self-improvement orchestrator: background thread auto-detects underperforming templates
Prompt injection detection and sanitization (10+ built-in patterns)
Prometheus metrics with crash-safe snapshot/restore
Integration façade combining all subsystems
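For the A/B testing bullet, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom for two samples of per-request scores, then approximates the two-sided p-value with the normal CDF, as the bullet suggests. `welchTest` and `WelchResult` are hypothetical names, and how the real PromptManager collects scores is not covered here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct WelchResult {
    double t;   // test statistic
    double df;  // Welch-Satterthwaite degrees of freedom
    double p;   // two-sided p-value under the normal approximation
};

static double sampleMean(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum / static_cast<double>(x.size());
}

// Unbiased sample variance (divides by n - 1).
static double sampleVariance(const std::vector<double>& x, double mean) {
    double sum = 0.0;
    for (double v : x) sum += (v - mean) * (v - mean);
    return sum / static_cast<double>(x.size() - 1);
}

WelchResult welchTest(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() >= 2 && b.size() >= 2);
    const double na = static_cast<double>(a.size());
    const double nb = static_cast<double>(b.size());
    const double ma = sampleMean(a);
    const double mb = sampleMean(b);
    const double va = sampleVariance(a, ma) / na;  // squared standard error of mean a
    const double vb = sampleVariance(b, mb) / nb;  // squared standard error of mean b
    assert(va + vb > 0.0);
    const double t = (ma - mb) / std::sqrt(va + vb);
    const double df = (va + vb) * (va + vb) /
                      (va * va / (na - 1.0) + vb * vb / (nb - 1.0));
    // Two-sided tail of the standard normal: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2)).
    const double p = std::erfc(std::fabs(t) / std::sqrt(2.0));
    return {t, df, p};
}
```

The normal approximation understates the p-value for small samples; with few observations per variant, the Student t distribution with `df` degrees of freedom is the more accurate reference.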

Training:

LegalAutoLabeler: NLP modality extraction from domain documents (legal, multi-language)
IncrementalLoRATrainer: LoRA adapter training with checkpoint/resume, configurable rank/alpha/lr
KnowledgeGraphEnricher: AQL-based context enrichment via graph traversal
Adapter version management: deploy, rollback, traffic splitting
Confidence gating for human review of low-confidence training samples
Pimpl pattern for ABI stability across all components
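The Pimpl pattern named in the last bullet keeps all data members behind an opaque pointer, so the layout of the public class does not change when the implementation does. The sketch below shows the usual header/source split in one listing; `AdapterRegistry` is a hypothetical class, not one of the training components.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// --- adapter_registry.h: public header, exposes no data members ---
class AdapterRegistry {
public:
    AdapterRegistry();
    ~AdapterRegistry();  // declared here, defined where Impl is complete
    AdapterRegistry(AdapterRegistry&&) noexcept;
    AdapterRegistry& operator=(AdapterRegistry&&) noexcept;

    void deploy(std::string version);
    const std::string& active() const;

private:
    struct Impl;                  // defined only in the source file
    std::unique_ptr<Impl> impl_;  // the only member: one pointer, stable size
};

// --- adapter_registry.cpp: free to change without breaking the ABI ---
struct AdapterRegistry::Impl {
    std::string active = "none";
};

AdapterRegistry::AdapterRegistry() : impl_(std::make_unique<Impl>()) {}
AdapterRegistry::~AdapterRegistry() = default;
AdapterRegistry::AdapterRegistry(AdapterRegistry&&) noexcept = default;
AdapterRegistry& AdapterRegistry::operator=(AdapterRegistry&&) noexcept = default;

void AdapterRegistry::deploy(std::string version) {
    assert(!version.empty());
    impl_->active = std::move(version);
}

const std::string& AdapterRegistry::active() const { return impl_->active; }
```

The destructor and move operations are defaulted in the source file rather than the header because `std::unique_ptr<Impl>` needs the complete `Impl` type at the point where it is destroyed.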

Namespace Organization

Hierarchy

themis::                          # Primary root namespace (most components)
themisdb::                        # Secondary root namespace (sharding, replication, some query functions)
├── query::
│   ├── functions::               # Query functions (12+ categories)
│   │   ├── vector_functions
│   │   ├── graph_functions
│   │   ├── geo_functions
│   │   ├── ethics_functions
│   │   └── [8+ more categories]
│   ├── parser::                  # AQL parser
│   └── optimizer::               # Query optimization
├── llm::
│   ├── lora_framework::          # Multi-GPU LoRA training
│   │   ├── cuda::                # CUDA kernels
│   │   ├── hip::                 # AMD HIP kernels
│   │   ├── directx::             # DirectX compute
│   │   └── vulkan::              # Vulkan compute
│   ├── attention::               # Flash Attention implementations
│   ├── applications::            # LLM applications
│   └── security::                # LLM security validators
├── sharding::                    # Consensus & distributed coordination
│   ├── raft::                    # Raft consensus
│   ├── paxos::                   # Paxos consensus
│   └── gossip::                  # Gossip protocol
├── storage::                     # RocksDB & blob storage
│   ├── blob::                    # Blob storage backends
│   └── compression::             # Compression algorithms
├── index::                       # Vector, graph, spatial indices
│   ├── vector::                  # Vector indexing
│   ├── graph::                   # Graph indexing
│   └── spatial::                 # Spatial indexing
├── server::                      # API handlers & protocols
│   ├── rpc::                     # RPC handlers
│   └── handlers::                # Protocol-specific handlers
├── security::                    # Encryption & access control
│   ├── encryption::              # Encryption services
│   ├── rbac::                    # Role-based access control
│   └── audit::                   # Audit logging
├── content::
│   └── pipeline::                # Content processing pipelines
├── governance::                  # Policy & compliance
├── acceleration::                # GPU backends
│   ├── cuda::                    # NVIDIA CUDA
│   ├── hip::                     # AMD HIP
│   ├── vulkan::                  # Vulkan compute
│   └── opencl::                  # OpenCL
├── analytics::                   # OLAP & process mining
├── transaction::                 # Transaction management
├── auth::                        # Authentication
├── cache::                       # Caching layers
├── config::                      # Config path resolution & schema validation
├── geo::                         # Geospatial operations
├── graph::                       # Graph processing
├── ingestion::                   # Multi-source data intake pipeline
├── metadata::                    # Schema management
├── network::                     # Network protocols
├── observability::               # Monitoring & metrics
├── plugins::                     # Plugin system
├── prompt_engineering::          # Prompt template lifecycle & optimization
├── rag::                         # RAG evaluation
├── replication::                 # Data replication
├── scheduler::                   # Task scheduling
├── search::                      # Search functionality
├── temporal::                    # Temporal operations
├── timeseries::                  # Time series data
├── training::                    # Domain-specific LLM fine-tuning
├── updates::                     # Hot reload & updates
├── utils::                       # Utility functions
│   ├── geo::                     # Geo utilities
│   └── memory::                  # Memory utilities
└── voice::                       # Voice assistant

Key Architectural Patterns

1. Namespace Isolation

Each component lives in its own namespace, preventing naming conflicts and making dependencies explicit. This enables:

Clear component boundaries
Easy dependency tracking
Modular compilation
Independent testing

2. Interface-Based Design

Critical systems use abstract interfaces enabling pluggable implementations:

QueryInterface: Pluggable query engines
IndexInterface: Different indexing strategies
StorageInterface: Multiple storage backends
ConsensusInterface: Various consensus protocols

3. Consensus Abstraction

Different replication scenarios use pluggable consensus via ConsensusFactory:

RaftConsensus: Leader-based replication for strong consistency
PaxosConsensus: Leaderless replication for fault tolerance
GossipConsensus: Eventual consistency for high availability

Selection is based on:

Consistency requirements
Latency tolerance
Network partition behavior
Geographic distribution
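A pluggable consensus layer of this kind is typically an abstract interface plus a factory function. The class names below come from this document, but every member function and the `ConsensusKind` enum are assumptions made for illustration; the real ConsensusInterface and ConsensusFactory signatures are not shown here.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical selector; the real factory may key on configuration instead.
enum class ConsensusKind { Raft, Paxos, Gossip };

// Assumed minimal interface; the real one would expose propose/commit operations.
class ConsensusInterface {
public:
    virtual ~ConsensusInterface() = default;
    virtual std::string name() const = 0;
    virtual bool stronglyConsistent() const = 0;
};

class RaftConsensus final : public ConsensusInterface {
public:
    std::string name() const override { return "raft"; }
    bool stronglyConsistent() const override { return true; }
};

class PaxosConsensus final : public ConsensusInterface {
public:
    std::string name() const override { return "paxos"; }
    bool stronglyConsistent() const override { return true; }
};

class GossipConsensus final : public ConsensusInterface {
public:
    std::string name() const override { return "gossip"; }
    bool stronglyConsistent() const override { return false; }  // eventual consistency
};

// Callers depend only on ConsensusInterface, so protocols can be swapped freely.
std::unique_ptr<ConsensusInterface> makeConsensus(ConsensusKind kind) {
    switch (kind) {
        case ConsensusKind::Raft:   return std::make_unique<RaftConsensus>();
        case ConsensusKind::Paxos:  return std::make_unique<PaxosConsensus>();
        case ConsensusKind::Gossip: return std::make_unique<GossipConsensus>();
    }
    assert(false && "unknown ConsensusKind");
    return nullptr;
}
```

Because the rest of the system holds a `std::unique_ptr<ConsensusInterface>`, a deployment can choose its protocol from configuration without recompiling the callers.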

4. SAGA Pattern

SagaManager coordinates multi-step distributed transactions with:

Automatic rollback on failure
Compensation actions for each step
Progress tracking
Idempotent operations
Retry mechanisms
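The core of the SAGA pattern is that every step carries a compensation action, and a failure triggers the compensations of all completed steps in reverse order. The sketch below shows only that control flow; `SagaStep` and `runSaga` are hypothetical names, and the real SagaManager additionally handles persistence, retries, and idempotency.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical step description: a forward action and its undo.
struct SagaStep {
    std::string name;
    std::function<bool()> action;      // returns false on failure
    std::function<void()> compensate;  // undoes a previously successful action
};

// Runs steps in order. On the first failure, compensates the completed steps
// in reverse order and returns false. `log` records what happened.
bool runSaga(const std::vector<SagaStep>& steps, std::vector<std::string>& log) {
    std::size_t done = 0;
    for (; done < steps.size(); ++done) {
        assert(steps[done].action && steps[done].compensate);
        if (!steps[done].action()) break;
        log.push_back("done:" + steps[done].name);
    }
    if (done == steps.size()) return true;

    log.push_back("failed:" + steps[done].name);
    while (done > 0) {
        --done;
        steps[done].compensate();
        log.push_back("undo:" + steps[done].name);
    }
    return false;
}
```

The failed step itself is not compensated, since it never completed; only the steps before it are rolled back.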

5. Adaptive Optimization

QueryOptimizer uses cost-based planning with:

Learned cost models from query history
Adaptive index selection based on data distribution
Runtime plan adjustments
Statistics collection
Cardinality estimation

6. Plugin Architecture

PluginManager supports dynamic loading with:

Hot-reloading without downtime
Versioning and compatibility checks
Sandboxed execution
Plugin discovery
RPC interface for plugin communication

Types of plugins:

Content processors
LLM models and adapters
Custom query functions
Storage backends
Authentication providers

Request Flow

Complete Request Path

Client (HTTP/gRPC/WebSocket/MQTT)
    ↓
┌─────────────────────────────────────┐
│  API Gateway & Middleware           │
│  - Authentication (JWT/GSSAPI/MFA)  │
│  - Rate Limiting                    │
│  - Load Shedding                    │
│  - Request Logging                  │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  Query Parser (AqlParser)           │
│  - Lexical analysis                 │
│  - Syntax parsing                   │
│  - Semantic validation              │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  Query Optimizer                    │
│  - Cost-based optimization          │
│  - Index selection                  │
│  - Join ordering                    │
│  - Predicate pushdown               │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  Execution Engine                   │
│  - Operator pipeline                │
│  - Result streaming                 │
└─────────────────────────────────────┘
    ↓
        ┌────────────────┐
        │ Routing?       │
        │ Local/Remote   │
        └────────────────┘
          ↓           ↓
    [Local]      [Remote Shard]
          ↓           ↓
    ┌──────────────────────┐
    │  Index Selection     │
    │  - Vector (HNSW)     │
    │  - Graph             │
    │  - Spatial (R-tree)  │
    │  - or Direct Storage │
    └──────────────────────┘
          ↓
    ┌──────────────────────┐
    │  Storage Engine      │
    │  (RocksDB)           │
    │  + Cache Layer       │
    └──────────────────────┘
          ↓
    ┌──────────────────────┐
    │  Replication         │
    │  Raft/Paxos/Gossip   │
    └──────────────────────┘
          ↓
    ┌──────────────────────┐
    │  Persistence         │
    │  (Disk + WAL)        │
    └──────────────────────┘

Query Execution Flow Details

Request Reception: Protocol-specific server receives request
Authentication: Validate credentials, check permissions
Rate Limiting: Enforce request rate limits per client (V1 token bucket or V2 priority lanes)
Tenant Quota Check: Verify tenant resource quotas (connections, queries, storage)
Parsing: Convert query string to AST (Abstract Syntax Tree)
Validation: Check schema, permissions, syntax
Optimization: Generate optimal execution plan
Execution: Execute plan with pipelining
Index Usage: Utilize appropriate indices for fast lookups
Storage Access: Read/write data from/to RocksDB
Replication: Replicate writes to other nodes (if configured)
Result Formatting: Convert internal format to requested format
Response: Send results back to client

Resource Limits & Protection

ThemisDB implements comprehensive resource limits to ensure system stability and fair resource allocation:

Rate Limiting:

V1 (Legacy): Token bucket with per-IP and per-user limits
V2 (Preferred): Priority lanes (HIGH/NORMAL/LOW) for VIP client support
Configurable burst capacity and sustained rates
Automatic idle client cleanup
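A token bucket, as used by the V1 limiter, separates the sustained rate from the burst capacity: tokens refill continuously at the sustained rate, and the bucket size caps how many requests can pass back to back. The sketch below is a minimal single-threaded version; `TokenBucket` and its parameters are hypothetical and do not reflect the actual RateLimiter interface.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch; time is supplied by the caller as monotonic seconds.
class TokenBucket {
public:
    TokenBucket(double ratePerSecond, double burstCapacity)
        : rate_(ratePerSecond), capacity_(burstCapacity), tokens_(burstCapacity) {
        assert(ratePerSecond > 0.0 && burstCapacity >= 1.0);
    }

    // Refills according to the elapsed time, then tries to take `cost` tokens.
    bool tryAcquire(double nowSeconds, double cost = 1.0) {
        assert(nowSeconds >= last_);
        tokens_ = std::min(capacity_, tokens_ + (nowSeconds - last_) * rate_);
        last_ = nowSeconds;
        if (tokens_ < cost) return false;  // over the limit: reject or queue
        tokens_ -= cost;
        return true;
    }

private:
    double rate_;      // sustained rate: tokens added per second
    double capacity_;  // burst capacity: maximum tokens held
    double tokens_;    // current fill level, starts full
    double last_ = 0.0;
};
```

A per-IP or per-user limiter would keep one bucket per client key, and the idle-client cleanup removes buckets that have not been touched for some time.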

Tenant Quotas:

Per-tenant resource isolation (storage, documents, collections)
Connection limits (default: 50 per tenant)
Concurrent query limits (default: 100 per tenant)
Rate limits per tenant (default: 1000 req/s)
Enforced at HTTP server and API Gateway layers

SSE/Changefeed Limits:

Per-connection rate caps (max events/second)
Buffer limits (default: 1000 events per connection)
Heartbeat mechanism (15s interval) to prevent timeouts
Configurable overflow policy (drop oldest/newest)
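The buffer limit and overflow policy can be pictured as a bounded queue per connection. In the sketch below, a full buffer either discards its oldest event to admit the new one or discards the incoming event. `EventBuffer` and `OverflowPolicy` are hypothetical names and not the SSE connection manager's actual types.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <string>
#include <utility>

enum class OverflowPolicy { DropOldest, DropNewest };

// Hypothetical per-connection buffer; not thread-safe.
class EventBuffer {
public:
    EventBuffer(std::size_t capacity, OverflowPolicy policy)
        : capacity_(capacity), policy_(policy) {
        assert(capacity > 0);
    }

    // Returns true if the event was stored without dropping anything.
    bool push(std::string event) {
        if (events_.size() < capacity_) {
            events_.push_back(std::move(event));
            return true;
        }
        ++dropped_;
        if (policy_ == OverflowPolicy::DropOldest) {
            events_.pop_front();  // make room by discarding the oldest event
            events_.push_back(std::move(event));
        }
        // DropNewest: the incoming event is discarded and the buffer is unchanged.
        return false;
    }

    // Removes and returns the oldest buffered event, if any.
    std::optional<std::string> pop() {
        if (events_.empty()) return std::nullopt;
        std::string event = std::move(events_.front());
        events_.pop_front();
        return event;
    }

    std::size_t dropped() const { return dropped_; }

private:
    std::size_t capacity_;
    OverflowPolicy policy_;
    std::deque<std::string> events_;
    std::size_t dropped_ = 0;
};
```

Dropping the oldest events favors freshness for slow consumers, while dropping the newest preserves ordering from the point where the client fell behind.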

Connection Limits:

Max concurrent requests (default: 1000)
HTTP/2 stream limits (default: 100)
Request body size limits
Load shedding at 90% capacity

See RESOURCE_LIMITS_GUIDE.md for detailed configuration and best practices.

Edition Differences

ThemisDB offers different build editions to suit various deployment scenarios:

MINIMAL

Basic database functionality without advanced features.

Components:

Core query engine
Storage layer (RocksDB)
Basic indexing
HTTP API
Authentication

Use Cases:

Development and testing
Embedded applications
Resource-constrained environments

COMMUNITY

Adds replication and basic AI capabilities.

Additional Components:

Raft replication
llama.cpp LLM integration
Vector indexing (CPU-only)
GraphQL API
Audit logging

Use Cases:

Small to medium deployments
Basic AI workloads
Open-source projects

ENTERPRISE

Full-featured edition with advanced AI and security.

Additional Components:

GPU acceleration (CUDA, HIP, Vulkan)
LoRA training framework
Field-level encryption
RBAC and MFA
Governance and compliance tools
Paxos and Gossip consensus
Change Data Capture
Content processing (all formats)
Advanced observability

Use Cases:

Production deployments
Regulated industries
Complex AI workloads
Multi-region deployments

HYPERSCALER

Maximum scale and resilience for cloud deployments.

Additional Components:

GPU erasure coding
Predictive failure detection
Geo-sharding with cross-region coordination
Advanced load balancing
Automated scaling
Cost optimization

Use Cases:

Cloud-native deployments
Global scale applications
Mission-critical systems
Multi-cloud strategies

Build Configuration: Editions are selected via CMake:

cmake -B build -DTHEMISDB_EDITION=ENTERPRISE

Performance Characteristics

Benchmarks (Single Node)

Operation	Throughput	Latency (p99)
Writes	45,000 ops/s	8ms
Reads	120,000 ops/s	2ms
Vector Search (CPU)	5,000 queries/s	15ms
Vector Search (GPU)	25,000 queries/s	3ms
Graph Traversal	10,000 queries/s	10ms

Scalability

Horizontal Scaling: Linear scale-out to 100+ nodes
Storage: Petabyte-scale with blob storage backends
Concurrent Connections: 100,000+ with connection pooling
Transaction Rate: 1M+ transactions/second (clustered)

Optimization Techniques

SIMD: Vectorized operations for data processing
GPU Acceleration: CUDA, HIP, Vulkan for compute-intensive tasks
Lock-Free Structures: High-concurrency data structures
Cache Optimization: Multi-level caching (L1, L2, distributed)
Compression: Reduces storage and network overhead
Pipelining: Overlapped execution stages
Adaptive Indexing: Automatic index creation based on workload

Security Features

Encryption

TLS 1.3: Encrypted network communication
Field-Level Encryption: Encrypt sensitive fields at rest
Key Management: Integration with HSM and key vaults
Certificate Management: Automatic certificate rotation

Authentication

JWT: Token-based authentication
GSSAPI: Kerberos integration
MFA: Multi-factor authentication
OAuth2/OIDC: Third-party authentication

Authorization

RBAC: Role-based access control with fine-grained permissions
Column-Level Security: Restrict access to specific columns
Row-Level Security: Filter data based on user context
Query-Level Policies: Enforce policies at query time

Audit & Compliance

Audit Logging: Comprehensive audit trail for all operations
PII Detection: Automatic detection and masking of sensitive data
Compliance Reports: GDPR, HIPAA, SOC2 reporting
Data Lineage: Track data origin and transformations

Starting Points for Exploration

For Developers

Query Execution: include/query/query_engine.h
- Understand how queries are parsed, optimized, and executed
Storage Layer: include/storage/storage_engine.h
- Explore RocksDB integration and storage abstractions
Sharding: include/sharding/distributed_coordinator.h
- Learn about distributed query execution and coordination
LLM Integration: include/llm/embedded_llm.h
- Discover native LLM capabilities
API Handlers: include/server/*_api_handler.h
- See how different APIs are implemented

For Operations

Configuration: config/ directory
- Server configuration, clustering setup
Deployment: deploy/ and helm/ directories
- Docker, Kubernetes deployment manifests
Monitoring: prometheus/ and grafana/ directories
- Metrics and dashboards
Security: security/ directory
- Certificate management, key configuration

For Build System

CMake Build: CMakeLists.txt
- Main build configuration
Edition Selection: cmake/editions/
- Different build editions and feature flags
Feature Modules: cmake/features/
- Optional feature configuration
Dependencies: vcpkg.json
- Dependency management

Integration Points

Client Libraries

C++: Native client
Python: themisdb-python SDK
JavaScript/TypeScript: themisdb-js SDK
Java: JDBC driver
Go: Go client library
Rust: Rust client library

External Systems

PostgreSQL: Wire protocol compatibility
S3: Storage backend integration
Prometheus: Metrics export
Grafana: Visualization dashboards
OpenTelemetry: Distributed tracing
Kafka: Event streaming (via CDC)

Plugins

Content processors (custom document formats)
LLM models (custom models via llama.cpp)
Authentication providers (custom auth backends)
Storage backends (custom storage systems)

Development Workflow

Building from Source

# Clone repository
git clone https://github.com/makr-code/ThemisDB.git
cd ThemisDB

# Configure with vcpkg
cmake -B build -DCMAKE_TOOLCHAIN_FILE=vcpkg/scripts/buildsystems/vcpkg.cmake

# Build
cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build

Running Locally

# Start server
./build/themisdb --config config/config.yaml

# Or with Docker
docker-compose -f docker/docker-compose.yml up -d

Testing

Unit Tests: tests/unit/
Integration Tests: tests/integration/
Performance Tests: benchmarks/
Fuzz Tests: fuzz/

Contributing

When contributing to the ThemisDB architecture:

Namespace Consistency: Follow existing namespace patterns
Interface Design: Use abstract interfaces for pluggability
Documentation: Update this file when adding major components
Testing: Add tests for new functionality
Performance: Benchmark performance-critical changes
Security: Consider security implications

See CONTRIBUTING.md for detailed guidelines.

Visual Architecture Diagrams

Component Interaction Diagram

graph TB
    subgraph "Client Layer"
        C1[HTTP/REST Client]
        C2[gRPC Client]
        C3[WebSocket Client]
        C4[PostgreSQL Client]
    end
    
    subgraph "API Layer - themis::server::"
        API[API Gateway]
        AUTH[Authentication]
        RATE[Rate Limiter]
    end
    
    subgraph "Query Layer - themis::query::"
        PARSER[AQL Parser]
        OPT[Query Optimizer]
        EXEC[Execution Engine]
    end
    
    subgraph "Index Layer - themis::index::"
        VIDX[Vector Index HNSW]
        GIDX[Graph Index]
        SIDX[Spatial Index]
    end
    
    subgraph "Storage Layer - themis::storage::"
        ROCKS[RocksDB Engine]
        BLOB[Blob Storage]
        CACHE[Cache Manager]
    end
    
    subgraph "LLM Layer - themis::llm::"
        LLM[llama.cpp Engine]
        LORA[LoRA Framework]
        EMB[Embeddings]
    end
    
    subgraph "Distributed Layer - themis::sharding::"
        RAFT[Raft Consensus]
        SHARD[Shard Router]
        COORD[Coordinator]
    end
    
    C1 & C2 & C3 & C4 --> API
    API --> AUTH --> RATE
    RATE --> PARSER --> OPT --> EXEC
    EXEC --> VIDX & GIDX & SIDX
    EXEC --> LLM
    VIDX & GIDX & SIDX --> ROCKS
    ROCKS --> BLOB
    ROCKS <--> CACHE
    LLM --> LORA --> EMB
    ROCKS <--> RAFT
    RAFT --> SHARD --> COORD

Data Flow Diagram

sequenceDiagram
    participant Client
    participant API as API Gateway
    participant Auth as Authentication
    participant Parser as Query Parser
    participant Optimizer as Query Optimizer
    participant Executor as Execution Engine
    participant Index as Index Layer
    participant Storage as Storage Engine
    participant Replication as Replication
    
    Client->>API: HTTP/gRPC Request
    API->>Auth: Validate Credentials
    Auth-->>API: Token Valid
    API->>Parser: Parse Query
    Parser-->>API: AST
    API->>Optimizer: Optimize Query
    Optimizer-->>API: Execution Plan
    API->>Executor: Execute Plan
    Executor->>Index: Lookup Data
    Index->>Storage: Read/Write
    Storage->>Replication: Replicate (if write)
    Storage-->>Index: Data
    Index-->>Executor: Results
    Executor-->>API: Response
    API-->>Client: JSON/Binary Response

Namespace Dependency Graph

graph LR
    subgraph "Core"
        CORE[themis::core]
        UTILS[themis::utils]
        BASE[themis::base]
    end
    
    subgraph "Storage"
        STORAGE[themis::storage]
        CACHE[themis::cache]
        INDEX[themis::index]
    end
    
    subgraph "Query"
        QUERY[themis::query]
        FUNCTIONS[themis::query::functions]
    end
    
    subgraph "API"
        SERVER[themis::server]
        AUTH[themis::auth]
        NETWORK[themis::network]
    end
    
    subgraph "Advanced"
        LLM[themis::llm]
        LORA[themis::llm::lora_framework]
        SHARDING[themis::sharding]
        GRAPH[themis::graph]
    end
    
    CORE --> UTILS
    STORAGE --> CORE
    CACHE --> STORAGE
    INDEX --> STORAGE
    QUERY --> STORAGE
    QUERY --> INDEX
    FUNCTIONS --> QUERY
    SERVER --> AUTH
    SERVER --> QUERY
    AUTH --> CORE
    NETWORK --> CORE
    LLM --> STORAGE
    LORA --> LLM
    SHARDING --> STORAGE
    SHARDING --> NETWORK
    GRAPH --> INDEX
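
One property worth checking on a graph like the one above is that it has no cycles, so the namespaces can be built and initialized in dependency order. The sketch below copies the edges from the graph and computes such an order with a depth-first search. The names kEdges, visit, and buildOrder are illustrative; nothing here is taken from the ThemisDB build system. themis::base is omitted because the graph gives it no edges.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Edge = std::pair<std::string, std::string>;  // first depends on second

// Edges copied from the graph above (themis:: prefix dropped).
const std::vector<Edge> kEdges = {
    {"core", "utils"}, {"storage", "core"}, {"cache", "storage"},
    {"index", "storage"}, {"query", "storage"}, {"query", "index"},
    {"query::functions", "query"}, {"server", "auth"}, {"server", "query"},
    {"auth", "core"}, {"network", "core"}, {"llm", "storage"},
    {"llm::lora_framework", "llm"}, {"sharding", "storage"},
    {"sharding", "network"}, {"graph", "index"},
};

using DepMap = std::map<std::string, std::vector<std::string>>;

// Post-order DFS: a node is appended only after all of its dependencies.
// Returns false when a node is reached again while still in progress (cycle).
bool visit(const std::string& node, const DepMap& deps,
           std::set<std::string>& done, std::set<std::string>& inProgress,
           std::vector<std::string>& order) {
    assert(deps.count(node) == 1);
    if (done.count(node)) return true;
    if (!inProgress.insert(node).second) return false;
    for (const auto& dep : deps.at(node)) {
        if (!visit(dep, deps, done, inProgress, order)) return false;
    }
    inProgress.erase(node);
    done.insert(node);
    order.push_back(node);
    return true;
}

// Returns the namespaces with every dependency ahead of its dependents,
// or an empty vector if the edges contain a cycle.
std::vector<std::string> buildOrder(const std::vector<Edge>& edges) {
    DepMap deps;
    for (const auto& [from, to] : edges) {
        deps[from].push_back(to);
        deps[to];  // make sure nodes without outgoing edges exist too
    }
    std::vector<std::string> order;
    std::set<std::string> done, inProgress;
    for (const auto& entry : deps) {
        if (!visit(entry.first, deps, done, inProgress, order)) return {};
    }
    return order;
}
```

For the edges listed, buildOrder(kEdges) returns all 14 namespaces with utils first, since every other namespace reaches it through core.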

Dependency Management

Core Dependencies

Dependency	Purpose	Version	License
RocksDB	Storage engine	8.x+	Apache 2.0
Boost	C++ libraries (ASIO, Beast)	1.70+	BSL-1.0
llama.cpp	LLM inference	Latest	MIT
OpenSSL	TLS/Encryption	3.x	Apache 2.0
gRPC	RPC framework	1.50+	Apache 2.0
Protobuf	Serialization	3.21+	BSD
FAISS	Vector search	1.7+	MIT

Optional Dependencies (by Edition)

COMMUNITY:

libcurl - HTTP client
yaml-cpp - Configuration parsing

ENTERPRISE:

CUDA - NVIDIA GPU acceleration
HIP - AMD GPU acceleration
Vulkan - Cross-platform GPU compute
NCCL/RCCL - Multi-GPU communication

HYPERSCALER:

Additional cloud SDK integrations
Advanced monitoring libraries

Build System

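# Edition selection
cmake -B build -DTHEMIS_EDITION=ENTERPRISE

# Feature flags are passed on the same configure command line:
#   ENABLE_LLM         LLM integration
#   ENABLE_GPU         GPU acceleration
#   ENABLE_ENCRYPTION  Field encryption
#   ENABLE_SHARDING    Distributed mode
cmake -B build -DTHEMIS_EDITION=ENTERPRISE \
  -DENABLE_LLM=ON -DENABLE_GPU=ON \
  -DENABLE_ENCRYPTION=ON -DENABLE_SHARDING=ON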

See cmake/editions/ and cmake/features/ for detailed options.

Common Development Patterns

Adding a New Query Function

Location: src/query/functions/ and include/query/functions/

// 1. Define in header (include/query/functions/my_functions.h)
namespace themis::query::functions {
    
class MyFunction : public FunctionInterface {
public:
    Value execute(const std::vector<Value>& args) override;
    std::string getName() const override { return "MY_FUNC"; }
    size_t getMinArgs() const override { return 1; }
};

} // namespace themis::query::functions

// 2. Implement (src/query/functions/my_functions.cpp)
namespace themis::query::functions {

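Value MyFunction::execute(const std::vector<Value>& args) {
    // Placeholder body: echo the first argument; replace with real logic.
    // getMinArgs() returns 1, so args is assumed to be non-empty here.
    return args.front();
}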

} // namespace themis::query::functions

// 3. Register in function registry
REGISTER_FUNCTION(MyFunction);
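
To make the registration step less opaque, here is a self-contained sketch of one way a function registry and registration macro could work. It is not the ThemisDB implementation: Value is reduced to double, FunctionInterface is re-declared locally with the three methods shown above, and registry, registerFunction, SKETCH_REGISTER_FUNCTION, SumFunction, and call are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

using Value = double;  // stand-in; the real Value type is richer

class FunctionInterface {
public:
    virtual ~FunctionInterface() = default;
    virtual Value execute(const std::vector<Value>& args) = 0;
    virtual std::string getName() const = 0;
    virtual std::size_t getMinArgs() const = 0;
};

// Name -> factory map, created on first use so that registration from
// static initializers does not depend on initialization order.
using Factory = std::function<std::unique_ptr<FunctionInterface>()>;
std::map<std::string, Factory>& registry() {
    static std::map<std::string, Factory> instance;
    return instance;
}

// Registers T under the name it reports. Returns true so the result can
// initialize a namespace-scope variable.
template <typename T>
bool registerFunction() {
    registry()[T().getName()] = [] { return std::make_unique<T>(); };
    return true;
}

// One possible shape for a registration macro (an assumption).
#define SKETCH_REGISTER_FUNCTION(T) \
    static const bool T##_registered = registerFunction<T>()

class SumFunction : public FunctionInterface {
public:
    Value execute(const std::vector<Value>& args) override {
        Value total = 0;
        for (Value v : args) total += v;
        return total;
    }
    std::string getName() const override { return "SUM_ALL"; }
    std::size_t getMinArgs() const override { return 1; }
};

SKETCH_REGISTER_FUNCTION(SumFunction);

// Looks a function up by name and checks its minimum arity before calling.
Value call(const std::string& name, const std::vector<Value>& args) {
    auto fn = registry().at(name)();
    assert(args.size() >= fn->getMinArgs());
    return fn->execute(args);
}
```

With this in place, call("SUM_ALL", {1.0, 2.0, 3.5}) resolves the function by name and returns the sum of its arguments. Keeping the map behind a function-local static is a common way to make registration from other translation units safe.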

Adding a New API Handler

Location: src/server/ and include/server/

// 1. Create handler (include/server/my_api_handler.h)
namespace themis::server {

class MyApiHandler : public ApiHandlerInterface {
public:
    void handleRequest(const Request& req, Response& resp) override;
    std::string getPath() const override { return "/api/v1/myendpoint"; }
};

} // namespace themis::server

// 2. Register in server initialization
server.registerHandler(std::make_unique<MyApiHandler>());
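
The following sketch shows how path-based handler registration and dispatch might fit together. It assumes a much simpler server than ThemisDB's: Request, Response, Server, dispatch, and EchoHandler are hypothetical, and ApiHandlerInterface is re-declared locally with only the two methods used above.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal request/response types.
struct Request  { std::string path; std::string body; };
struct Response { int status = 200; std::string body; };

class ApiHandlerInterface {
public:
    virtual ~ApiHandlerInterface() = default;
    virtual void handleRequest(const Request& req, Response& resp) = 0;
    virtual std::string getPath() const = 0;
};

class Server {
public:
    // Takes ownership of the handler and indexes it by its path.
    void registerHandler(std::unique_ptr<ApiHandlerInterface> handler) {
        assert(handler != nullptr);
        std::string path = handler->getPath();  // read before moving
        handlers_[std::move(path)] = std::move(handler);
    }

    // Routes by exact path match; unknown paths get a 404.
    Response dispatch(const Request& req) const {
        Response resp;
        auto it = handlers_.find(req.path);
        if (it == handlers_.end()) {
            resp.status = 404;
            resp.body = "not found";
        } else {
            it->second->handleRequest(req, resp);
        }
        return resp;
    }

private:
    std::map<std::string, std::unique_ptr<ApiHandlerInterface>> handlers_;
};

// Example handler that echoes the request body.
class EchoHandler : public ApiHandlerInterface {
public:
    void handleRequest(const Request& req, Response& resp) override {
        resp.body = req.body;
    }
    std::string getPath() const override { return "/api/v1/echo"; }
};
```

The path is copied out of the handler before the handler is moved into the map, which avoids relying on the evaluation order of the two sides of the assignment.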

Adding a New Index Type

Location: src/index/ and include/index/

// 1. Implement index interface
namespace themis::index {

class MyIndex : public IndexInterface {
public:
    void insert(const Key& key, const Value& value) override;
    std::vector<Value> search(const Query& query) override;
    void remove(const Key& key) override;
};

} // namespace themis::index

// 2. Register index type
REGISTER_INDEX_TYPE("my_index", MyIndex);
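
As an illustration of the three interface methods, the sketch below implements a small in-memory prefix index and a type-name factory. It rests on several assumptions: Key and Value are plain strings, Query carries only a key prefix, and indexTypes, createIndex, and PrefixIndex are hypothetical names rather than ThemisDB APIs.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

using Key = std::string;    // stand-ins for the real key/value types
using Value = std::string;
struct Query { std::string keyPrefix; };

class IndexInterface {
public:
    virtual ~IndexInterface() = default;
    virtual void insert(const Key& key, const Value& value) = 0;
    virtual std::vector<Value> search(const Query& query) = 0;
    virtual void remove(const Key& key) = 0;
};

// Ordered in-memory index: keys sharing a prefix are adjacent in the map.
class PrefixIndex : public IndexInterface {
public:
    void insert(const Key& key, const Value& value) override {
        entries_.emplace(key, value);
    }
    std::vector<Value> search(const Query& query) override {
        const std::string& p = query.keyPrefix;
        std::vector<Value> hits;
        // Start at the first key >= prefix, stop at the first non-match.
        for (auto it = entries_.lower_bound(p);
             it != entries_.end() && it->first.compare(0, p.size(), p) == 0;
             ++it) {
            hits.push_back(it->second);
        }
        return hits;
    }
    void remove(const Key& key) override { entries_.erase(key); }

private:
    std::multimap<Key, Value> entries_;
};

// Type name -> factory, analogous in spirit to a registration macro.
using IndexFactory = std::function<std::unique_ptr<IndexInterface>()>;
std::map<std::string, IndexFactory>& indexTypes() {
    static std::map<std::string, IndexFactory> instance;
    return instance;
}

static const bool prefixRegistered = [] {
    indexTypes()["prefix"] = [] { return std::make_unique<PrefixIndex>(); };
    return true;
}();

// Creates an index by type name; returns nullptr for unknown types.
std::unique_ptr<IndexInterface> createIndex(const std::string& type) {
    assert(!type.empty());
    auto it = indexTypes().find(type);
    if (it == indexTypes().end()) return nullptr;
    return it->second();
}
```

Callers obtain an index with createIndex("prefix") and use it only through IndexInterface, so adding another index type would not change calling code.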

Troubleshooting Guide

Build Issues

Problem: CMake configuration fails

# Solution: Update vcpkg and dependencies
cd vcpkg && git pull
./bootstrap-vcpkg.sh
./vcpkg install

Problem: Linker errors with RocksDB

# Solution: Clean and rebuild with verbose output
rm -rf build/
cmake -B build -DCMAKE_VERBOSE_MAKEFILE=ON
cmake --build build -j$(nproc) 2>&1 | tee build.log

Runtime Issues

Problem: GPU not detected

# Check GPU availability
nvidia-smi  # NVIDIA
rocm-smi    # AMD

# Enable GPU for vector indexing in config/config.yaml
vector_index:
  use_gpu: true

Problem: Out of memory during vector indexing

# Adjust vector index and storage cache settings in config/config.yaml
vector_index:
  # Reduce concurrency or segment size if you see OOM during indexing
  max_concurrent_builds: 4
  max_segment_size_mb: 256

rocksdb:
  # Lower the block cache if RAM is tight; raise it only when there is headroom, to reduce read amplification
  block_cache_size_mb: 2048

Performance Tuning

Query Performance:

Check query plan with EXPLAIN command
Verify appropriate indices exist
Monitor with PROFILE command
Review slow query log

Storage Performance:

Adjust RocksDB compaction settings
Enable compression (LZ4/Zstd)
Monitor disk I/O with iostat
Check cache hit rates

Replication Performance:

Tune Raft heartbeat interval
Adjust batch size for writes
Monitor network latency
Use async replication to serve reads where slightly stale data is acceptable

Glossary

Term	Definition
AQL	Advanced Query Language - ThemisDB's SQL-like query language
HNSW	Hierarchical Navigable Small World - Vector search algorithm
LoRA	Low-Rank Adaptation - Efficient LLM fine-tuning method
MVCC	Multi-Version Concurrency Control - Transaction isolation technique
SAGA	Sequence of local transactions with compensating actions, used for distributed operations
Raft	Consensus algorithm for leader-based replication
Paxos	Consensus algorithm for leaderless replication
KV-Cache	Key-Value cache for LLM inference optimization
Flash Attention	Memory-efficient attention mechanism for transformers
Sharding	Horizontal data partitioning across nodes
RBAC	Role-Based Access Control
CDC	Change Data Capture - Track data modifications
PITR	Point-In-Time Recovery
WAL	Write-Ahead Log
LSM	Log-Structured Merge-tree (RocksDB storage structure)

Frequently Asked Questions

General

Q: What makes ThemisDB different from other databases?
A: Native multi-model support (relational, graph, vector, document) with integrated LLM capabilities, all with full ACID transactions.

Q: Can I use ThemisDB without the LLM features?
A: Yes, LLM features are optional. Use MINIMAL or COMMUNITY editions for traditional database functionality.

Q: Is ThemisDB production-ready?
A: ThemisDB is designed for production use, with comprehensive testing, monitoring, and enterprise features. The current version is v1.5.0-dev. See CHANGELOG.md for version-specific details and README.md for current production status.

Architecture

Q: How does ThemisDB handle distributed transactions?
A: Using the SAGA pattern with compensation actions, coordinated via Raft/Paxos consensus.
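
The compensation idea can be sketched in a few lines. The example below runs steps in order and, on the first failure, undoes the completed steps in reverse. It is a single-process illustration under stated assumptions: SagaStep and runSaga are hypothetical names, and the Raft/Paxos coordination mentioned above is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One saga step: a forward action plus the action that undoes it.
struct SagaStep {
    std::string name;
    std::function<bool()> action;      // returns false on failure
    std::function<void()> compensate;  // undoes a completed action
};

// Runs the steps in order. On the first failure, compensates the steps
// that already completed, newest first, and returns false.
bool runSaga(const std::vector<SagaStep>& steps,
             std::vector<std::string>& log) {
    std::size_t completed = 0;
    for (; completed < steps.size(); ++completed) {
        const SagaStep& step = steps[completed];
        assert(step.action && step.compensate);  // both must be set
        if (!step.action()) {
            log.push_back("failed:" + step.name);
            break;
        }
        log.push_back("done:" + step.name);
    }
    if (completed == steps.size()) return true;

    // Roll back in reverse order; the failed step itself is not undone.
    while (completed > 0) {
        --completed;
        steps[completed].compensate();
        log.push_back("undone:" + steps[completed].name);
    }
    return false;
}
```

If a debit step succeeds and the following credit step fails, runSaga calls the debit's compensation, so the account ends where it started even though no global lock was held.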

Q: What consensus algorithms are supported?
A: Raft (leader-based), Paxos (leaderless), and Gossip (eventual consistency) - selectable based on requirements.

Q: How is data partitioned across shards?
A: Hash-based or range-based partitioning, with support for custom partition functions.
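
The two partitioning schemes differ mainly in how a key maps to a shard number, which the following sketch illustrates. The function names (fnv1a, hashShard, rangeShard) and the Partitioner alias are hypothetical, and FNV-1a is used here only because it is simple and stable across platforms; the hash ThemisDB actually uses is not stated in this document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// 64-bit FNV-1a: a small, platform-independent string hash.
std::uint64_t fnv1a(const std::string& key) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Hash-based: spreads keys evenly but scatters adjacent keys.
std::size_t hashShard(const std::string& key, std::size_t shardCount) {
    assert(shardCount > 0);
    return static_cast<std::size_t>(fnv1a(key) % shardCount);
}

// Range-based: upperBounds holds sorted, exclusive upper bounds for all
// shards except the last one, so N bounds describe N + 1 shards.
std::size_t rangeShard(const std::string& key,
                       const std::vector<std::string>& upperBounds) {
    auto it = std::upper_bound(upperBounds.begin(), upperBounds.end(), key);
    return static_cast<std::size_t>(it - upperBounds.begin());
}

// A custom partition function can be any callable with this shape.
using Partitioner = std::function<std::size_t(const std::string&)>;
```

Range partitioning keeps neighbouring keys on the same shard, which helps range scans, while hash partitioning avoids hot spots from skewed key ranges. A Partitioner can wrap either function or a tenant-specific rule.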

Performance

Q: What are the performance characteristics?
A: 45K writes/s, 120K reads/s (single node). Linear scale-out to 100+ nodes. GPU acceleration provides 5x speedup for vector operations.

Q: How much memory does ThemisDB require?
A: Minimum 4GB for MINIMAL edition, 16GB recommended for ENTERPRISE with LLM features.

Q: Can ThemisDB handle petabyte-scale data?
A: Yes, with blob storage backends (S3, Azure) and horizontal sharding.

Development

Q: What languages can I use to interact with ThemisDB?
A: Official client libraries are available for C++, Python, JavaScript/TypeScript, Java, Go, and Rust.

Q: How do I contribute to ThemisDB?
A: See CONTRIBUTING.md for guidelines. Follow namespace patterns and add tests for new features.

Q: Where can I find code examples?
A: Check the examples/ directory and online documentation.

Additional Resources

README.md: Project overview and quick start
QUICKSTART.md: Getting started guide
CHANGELOG.md: Version history and release notes
Documentation: Full online documentation
CONTRIBUTING.md: Contribution guidelines
SECURITY.md: Security policies and reporting
API Reference: API documentation
Examples: Code examples and tutorials
BRANCHING_STRATEGY.md: Git workflow guide
BENCHMARK_RUNBOOK.md: Performance testing guide

Acceleration Module ROADMAP Audit

The acceleration module ROADMAP (src/acceleration/ROADMAP.md) is audited automatically to check that each checkbox status ([x], [P], [I], etc.) matches the actual GitHub issue state and the presence of implementation files.

Running the Audit

# Unauthenticated (60 req/h rate limit, sufficient for one-off runs)
python3 scripts/acceleration_roadmap_audit.py

# Authenticated (recommended, 5,000 req/h rate limit)
GITHUB_TOKEN=ghp_xxx python3 scripts/acceleration_roadmap_audit.py

# Pull token from gh CLI session
python3 scripts/acceleration_roadmap_audit.py --gh-cli

# Write reports to a custom directory
python3 scripts/acceleration_roadmap_audit.py --output-dir /tmp/audit

Reports are written to:

docs/audits/acceleration-roadmap-audit.json — machine-readable
docs/audits/acceleration-roadmap-audit.md — human-readable

The script exits with code 0 when no discrepancies are found, 1 when discrepancies are detected, and 2 on fatal errors (missing file, API failure).

Token / Scopes

Set GITHUB_TOKEN (or GH_TOKEN) to a personal access token with at least the public_repo scope. Without a token, the tool falls back to unauthenticated requests, which are rate-limited to 60 requests per hour per IP address.

ROADMAP Checkbox Policy

Status	Meaning	When to use
`[x]`	Done	A merged PR or commit exists AND files are present on disk
`[~]`	In progress	Active work; open PR or ongoing commits
`[P]`	PR open	PR exists but not yet merged
`[I]`	Issue open	GitHub Issue is open, work not started
`[ ]`	Planned	Planned item; no issue yet
`[?]`	Blocked	Needs human decision
`[!]`	Unclear	Status unknown; needs investigation

Rule: A closed GitHub issue alone is not sufficient to mark an item [x]. The implementation files must exist in the repository. If they do not, use [~] (in progress) or [?] (blocked) until the code is merged.
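
The rule and the documented exit codes can be expressed as two small predicates. This is only a sketch of the policy as written here, not the logic of scripts/acceleration_roadmap_audit.py. ItemEvidence, isDiscrepancy, and exitCode are hypothetical names, and only the "[x]" marker is checked.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Evidence an audit might gather for one ROADMAP item (hypothetical).
struct ItemEvidence {
    bool mergedChangeExists;  // a merged PR or commit exists
    bool filesPresent;        // implementation files are on disk
};

// "[x]" requires both a merged change and files on disk. A closed issue
// is deliberately not part of the evidence, matching the rule above.
bool isDiscrepancy(const std::string& marker, const ItemEvidence& evidence) {
    assert(marker.size() == 3);         // markers look like "[x]" or "[ ]"
    if (marker != "[x]") return false;  // this sketch checks only done items
    return !(evidence.mergedChangeExists && evidence.filesPresent);
}

// Exit codes as documented: 0 clean, 1 discrepancies found, 2 fatal error.
int exitCode(bool fatalError, std::size_t discrepancyCount) {
    if (fatalError) return 2;
    return discrepancyCount > 0 ? 1 : 0;
}
```

An item marked "[x]" with a merged PR but no files on disk counts as a discrepancy, which is the case the rule is meant to catch.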

CI Integration

A GitHub Actions workflow (.github/workflows/acceleration-roadmap-audit.yml) runs the audit automatically on:

Pushes to develop or main that touch src/acceleration/ROADMAP.md
Pull requests that touch src/acceleration/** or ROADMAP.md
Manual workflow_dispatch

The workflow uploads the JSON and Markdown reports as artifacts and fails the build if any discrepancies are found.
