Skip to content

Latest commit

 

History

History
926 lines (734 loc) · 32.5 KB

File metadata and controls

926 lines (734 loc) · 32.5 KB

ThemisDB Query Module

Module Purpose

The Query module provides ThemisDB's AQL (Advanced Query Language) query engine, featuring a cost-based optimizer, multi-model execution pipeline, and comprehensive caching infrastructure. AQL is based on ArangoDB's query language but significantly extended to support multiple query paradigms including relational, document, graph, vector, spatial, and timeseries models. It translates AQL queries into optimized execution plans and executes them across all data models with hybrid query support.

About AQL

AQL (Advanced Query Language) is ThemisDB's multi-paradigm query language inspired by ArangoDB's AQL but extended with:

  • Multi-Model Support: Seamlessly query relational tables, document collections, property graphs, vector embeddings, geospatial data, and timeseries
  • Hybrid Queries: Combine vector similarity with geospatial filters, fulltext search with graph traversal, and more
  • SQL-Like Syntax: Familiar declarative syntax with FOR-FILTER-SORT-RETURN pattern
  • Advanced Features: Window functions, CTEs, recursive queries, subqueries, LLM integration
  • 100+ Functions: String, math, array, date, geo (ST_*), vector, graph, document, JSON, AI/ML, process mining, ethics, and more

Compared to ArangoDB AQL:

  • ✅ Maintains core AQL syntax (FOR, FILTER, SORT, LIMIT, RETURN, LET, COLLECT)
  • ✅ Compatible graph traversal syntax with enhanced spatial constraints
  • ➕ Extended with vector similarity functions (SIMILARITY, COSINE_DISTANCE, L2_DISTANCE)
  • ➕ Enhanced geospatial support (ST_* functions for complex spatial queries)
  • ➕ LLM integration (LLM INFER, LLM RAG, LLM EMBED)
  • ➕ Timeseries-specific syntax and window functions
  • ➕ Advanced process mining and ethics functions
  • ➕ Query federation across distributed databases

Relevant Interfaces

Interface / File Role
aql_parser.cpp / aql_parser_json.cpp AQL and JSON query syntax parsing; AST generation
aql_translator.cpp AST → internal query representation
query_optimizer.cpp / adaptive_optimizer.cpp Logical and cost-based plan optimization
optimizer_cost_model.cpp Cardinality and selectivity estimation
query_engine.cpp Physical execution (scan, filter, sort, aggregate, join)
aql_runner.cpp Top-level query dispatch (AQL, SQL, RLS, limits, cancellation)
sql_parser.cpp SQL SELECT/INSERT/UPDATE/DELETE → AQL compatibility layer
sparql_parser.cpp SPARQL → AQL translation for RDF/knowledge-graph queries
query_cache.cpp / query_cache_manager.cpp Exact-match query result cache
semantic_cache.cpp Embedding-based similarity cache
cte_cache.cpp / cte_subquery.cpp / materialized_cte.cpp CTE evaluation and caching
cross_cluster_federation.cpp Distributed federated AQL across cluster endpoints
query_federation.cpp Internal shard-level query federation
vectorized_execution.cpp Column-store batch execution (SIMD via analytics layer)
runtime_reoptimizer.cpp Adaptive re-optimization from runtime statistics
query_plan_visualizer.cpp EXPLAIN / EXPLAIN ANALYZE plan output
query_canceller.cpp Cooperative query cancellation via request ID
result_stream.cpp / result_type_annotation.cpp Result streaming, pagination, type inference
window_evaluator.cpp / statistical_aggregator.cpp Window and statistical aggregate functions
let_evaluator.cpp LET variable binding evaluation
workload_cache_strategy.cpp Adaptive workload-based cache eviction
functions/ 100+ AQL built-in functions (math, string, geo, vector, graph, AI, …)

Scope

In Scope:

  • AQL syntax parsing and AST generation
  • Cost-based query optimization with adaptive learning
  • Multi-model query execution (relational, document, graph, vector, spatial)
  • Hybrid queries (vector+geo, fulltext+geo, graph+spatial)
  • Query result streaming and pagination
  • Query caching (exact, semantic, CTE)
  • Expression evaluation and function registry (25+ categories)
  • Subquery and CTE (Common Table Expression) support
  • Graph traversal (BFS/DFS, shortest path, recursive queries)
  • Window functions and statistical aggregation
  • Distributed query federation

Out of Scope:

  • Data storage and persistence (handled by storage module)
  • Index structures and maintenance (handled by index module)
  • Network protocol handling (handled by server module)
  • Authentication and authorization (handled by auth module)

Key Components

Note on Parser Implementations: The primary query parser is aql_parser.cpp (and aql_parser_json.cpp for JSON query objects). A separate SQL compatibility parser (sql_parser.cpp) translates basic SELECT/INSERT/UPDATE/DELETE into AQL. A SPARQL parser (sparql_parser.cpp) handles RDF / knowledge-graph queries.

AQL Parser

Location: aql_parser.cpp, ../include/query/aql_parser.h

Converts AQL query strings into Abstract Syntax Trees (AST) for execution.

Features:

  • AQL Syntax Support: FOR, FILTER, SORT, LIMIT, RETURN, LET, COLLECT, WITH
  • Expression Types: Binary/unary operators, literals, field access, function calls, arrays, objects
  • Specialized Expressions:
    • SimilarityCall: Vector similarity search
    • ProximityCall: Geospatial proximity queries
    • SubqueryExpr: Nested subqueries
    • AnyExpr, AllExpr: Quantifier expressions (ANY/ALL)
  • AST Node Types: QueryNode, ForNode, FilterNode, SortNode, LimitNode, ReturnNode, LetNode, CollectNode, WithNode
  • Error Recovery: Detailed parse error messages with line/column information

Thread Safety:

  • Parser instances are NOT thread-safe (create per-thread or use mutex)
  • Parsed AST nodes are immutable and thread-safe for reading

Usage Example:

#include "query/aql_parser.h"

// Parse a query
AQLParser parser;
std::string query = R"(
  FOR user IN users
    FILTER user.age > 30 AND user.city == 'Seattle'
    SORT user.name ASC
    LIMIT 10
    RETURN user
)";

auto result = parser.parse(query);
if (!result) {
  std::cerr << "Parse error: " << result.error() << std::endl;
  return;
}

auto ast = result.value();
// AST is ready for translation and execution

Supported AQL Features:

  • Single and multi-collection FOR loops
  • Complex boolean expressions (AND, OR, NOT)
  • Arithmetic operators (+, -, *, /, %)
  • Comparison operators (==, !=, <, >, <=, >=)
  • IN operator for membership tests
  • String, number, boolean, null literals
  • Array and object literals
  • Function calls with arbitrary arguments
  • Field access with dot notation (user.address.city)

Query Optimizer

Location: query_optimizer.cpp, ../include/query/query_optimizer.h

Cost-based optimizer that reorders predicates and selects execution strategies for minimal resource usage.

Features:

  • Cost-Based Optimization: Cardinality estimation and selectivity-based predicate ordering
  • Traditional Plan: chooseOrderForAndQuery() - uses selectivity statistics
  • NLP-Enhanced Plan: chooseOrderForAndQueryWithNLP() - incorporates semantic hints
  • Adaptive Optimization: adaptive_optimizer.cpp - runtime statistics collection and adjustment
  • Hybrid Query Plans:
    • Vector+Geo: SpatialThenVector vs VectorThenSpatial strategies
    • Fulltext+Geo: FulltextFirst vs SpatialFirst strategies
    • Graph Traversal: Cost estimation with branching factor and spatial constraints
  • Distributed Optimization: Partition pruning, shard selection, parallelism tuning
  • Cost Model: Histogram-based cardinality estimation (optimizer_cost_model.cpp)

Thread Safety:

  • Read-safe: Statistics can be read concurrently
  • Write-safe: Statistics updates use internal locking
  • Optimizer instances are NOT thread-safe (create per-query)

Configuration Example:

#include "query/query_optimizer.h"

QueryOptimizer optimizer;

// Configure cardinality estimation
optimizer.setEstimatedCardinality("user.age > 30", 1000);
optimizer.setEstimatedCardinality("user.city == 'Seattle'", 500);

// Optimize AND query
std::vector<Predicate> predicates = {
  Predicate::parse("user.age > 30"),
  Predicate::parse("user.city == 'Seattle'")
};

auto optimized_order = optimizer.chooseOrderForAndQuery(predicates);
// Executes predicates in order of increasing cardinality

Optimization Strategies:

  • Selectivity-Based: Execute most selective predicates first
  • Index Awareness: Prefer indexed predicates when available
  • Cost Model Integration: Use histogram statistics for accurate estimates
  • Adaptive Learning: Adjust estimates based on actual execution statistics

Performance Characteristics:

  • Optimization overhead: 0.1-5ms for simple queries
  • Complex queries (10+ predicates): 5-50ms optimization time
  • Adaptive adjustment: 10-20% accuracy improvement over time
  • Cost model evaluation: ~100μs per predicate

Query Engine

Location: query_engine.cpp, ../include/query/query_engine.h

Main execution hub that processes optimized queries against storage.

Features:

  • Predicate Execution:
    • AND queries: executeAndKeys(), executeAndEntities() with sequential ordering
    • OR queries: executeOrKeys(), executeOrEntities() with optional fallback
  • Graph Traversal:
    • executeRecursivePathQuery(): Multi-hop path queries with depth limits
    • executeGeneralTraversal(): BFS/DFS with filtering
  • Hybrid Multi-Model Queries:
    • executeVectorGeoQuery(): Vector similarity + spatial constraints
    • executeFilteredVectorSearch(): Attribute pre/post-filtering
    • executeRadiusVectorSearch(): Epsilon-based neighbor search
    • executeContentSearch(): Fulltext + metadata filtering
    • executeContentGeoQuery(): Fulltext + spatial hybrid
  • Advanced Features:
    • executeJoin(): Multi-FOR with LET/FILTER
    • executeGroupBy(): COLLECT/AGGREGATE operations
    • executeCTEs(): Common Table Expression support
  • Expression Evaluation: Boolean evaluation with context binding

Thread Safety:

  • Engine instances are thread-safe for concurrent query execution
  • Uses read locks for concurrent access to shared state
  • Each query gets isolated execution context

Usage Example:

#include "query/query_engine.h"

QueryEngine engine(storage, index_manager);

// Execute a simple AND query
std::vector<Predicate> predicates = {
  Predicate::fieldEquals("age", 30),
  Predicate::fieldGreaterThan("salary", 50000)
};

auto result = engine.executeAndEntities("users", predicates);
for (const auto& entity : result.value()) {
  std::cout << entity.toJson() << std::endl;
}

// Execute a vector+geo hybrid query
VectorGeoQuery vgq;
vgq.collection = "places";
vgq.vector_field = "embedding";
vgq.query_vector = {0.1, 0.2, 0.3, ...};
vgq.k = 10;
vgq.center_lat = 47.6062;
vgq.center_lon = -122.3321;
vgq.radius_km = 5.0;

auto places = engine.executeVectorGeoQuery(vgq);

Execution Strategies:

  • Short-Circuit Evaluation: Stop early when result set becomes empty
  • Lazy Evaluation: Stream results without materializing full result set
  • Batch Processing: Process entities in batches for efficiency
  • Cache Integration: Automatic cache lookups and updates

Performance Characteristics:

  • Simple queries (1-2 predicates): 1-10ms
  • Complex queries (5-10 predicates): 10-100ms
  • Graph traversal (depth 3-5): 50-500ms
  • Hybrid queries (vector+geo): 10-50ms (depends on k and radius)

Query Cache System

Location: query_cache.cpp, query_cache_manager.cpp, semantic_cache.cpp, cte_cache.cpp

Multi-tiered caching infrastructure for query results.

Query Cache (Exact Match)

Features:

  • Exact string-based result caching with TTL
  • LRU eviction policy
  • Hit rate tracking and statistics
  • Thread-safe concurrent access

Configuration:

QueryCache cache;
cache.setMaxSize(1000);  // 1000 entries
cache.setDefaultTTL(std::chrono::seconds(300));  // 5 minutes

// Cache a result
std::string query = "FOR u IN users FILTER u.age > 30 RETURN u";
nlohmann::json result = execute_query(query);
cache.put(query, result);

// Retrieve from cache
if (auto cached = cache.get(query)) {
  return *cached;  // Cache hit
}

Semantic Cache (Similarity-Based)

Features:

  • Embedding-based query similarity matching
  • Configurable similarity threshold (default: 0.90)
  • Matches semantically similar queries ("age > 30" ≈ "age >= 31")
  • LRU eviction with TTL

Usage:

SemanticCache semantic_cache(embedding_model);
semantic_cache.setSimilarityThreshold(0.90);

// Similar queries match
std::string query1 = "FOR u IN users FILTER u.age > 30 RETURN u";
std::string query2 = "FOR u IN users FILTER u.age >= 31 RETURN u";
// May hit semantic cache if similarity > 0.90

CTE Cache (Temporary Results)

Features:

  • Memory-managed CTE result storage
  • Automatic spill-to-disk (100MB default threshold)
  • Transparent disk/memory access
  • Reference counting and cleanup

Configuration:

CTECache cte_cache;
cte_cache.setMemoryThreshold(100 * 1024 * 1024);  // 100MB
cte_cache.setTempDirectory("/tmp/themisdb_cte");

// Store CTE result
std::string cte_name = "temp_users";
std::vector<Entity> cte_result = execute_subquery(...);
cte_cache.put(cte_name, cte_result);

// Access CTE (transparent disk/memory)
auto cte_data = cte_cache.get(cte_name);

Result Streaming

Location: result_stream.cpp, ../include/query/result_stream.h

Lazy result evaluation with batching, pagination, and backpressure handling.

Features:

  • Pagination Strategies:
    • OFFSET_LIMIT: Traditional offset/limit pagination
    • CURSOR_BASED: Opaque cursor for stateless pagination
    • KEYSET: Keyset pagination for stable ordering
  • Lazy Evaluation: Results generated on-demand
  • Batch Processing: Configurable batch size (default: 100 rows)
  • Backpressure Handling: Slows query execution when consumer is slow
  • Statistics Tracking: Rows returned, batch count, execution time

Usage Example:

#include "query/result_stream.h"

ResultStream stream(query_engine, query);
stream.setBatchSize(100);
stream.setPaginationStrategy(PaginationStrategy::CURSOR_BASED);

// Stream results
while (stream.hasMore()) {
  auto batch = stream.nextBatch();
  for (const auto& row : batch) {
    process(row);
  }
}

// Get statistics
auto stats = stream.getStatistics();
std::cout << "Rows returned: " << stats.rows_returned << std::endl;
std::cout << "Execution time: " << stats.execution_time_ms << "ms" << std::endl;

Function Registry

Location: ../include/query/functions/*.h, functions/*.cpp

25+ specialized function categories for domain-specific operations.

Function Categories:

  • AI/ML: ai_ml_functions.h - Inference, embeddings, model management
  • Graph: graph_functions.h, graph_extensions.h - Traversal, shortest path, centrality
  • Geospatial: geo_functions.h, crs_functions.h - ST_Distance, ST_Within, coordinate transforms
  • Vector: vector_functions.h - Cosine similarity, L2 distance, dot product
  • Fulltext: fulltext_functions.h - Text search, tokenization, ranking
  • JSON: json_path_functions.h - JSONPath queries, manipulation
  • Temporal: date_functions.h, holiday_provider.h - Date arithmetic, formatting, holiday detection
  • Domain-Specific:
    • process_mining_functions.h - Process discovery, conformance checking
    • lora_functions.h - LoRA adapter management
    • ethics_functions.h - Bias detection, fairness metrics
  • Security: security_functions.h, retention_functions.h - Encryption, access control, data retention
  • Data Manipulation: string_functions.h, math_functions.h, array_functions.h, collection_functions.h, document_functions.h, file_functions.h, relational_functions.h

Usage in AQL:

// Geospatial function
FOR place IN places
  FILTER ST_Distance(place.location, [47.6062, -122.3321]) < 5000
  RETURN place

// Vector similarity
FOR doc IN documents
  FILTER COSINE_SIMILARITY(doc.embedding, @query_vector) > 0.9
  SORT COSINE_SIMILARITY(doc.embedding, @query_vector) DESC
  LIMIT 10
  RETURN doc

// Graph traversal
FOR v, e, p IN 1..3 OUTBOUND 'users/alice' GRAPH 'social'
  RETURN p

// Process mining
FOR event IN process_events
  LET trace = PROCESS_TRACE(event.case_id, 'process_model')
  FILTER CONFORMANCE_CHECK(trace) < 0.5
  RETURN {case_id: event.case_id, conformance: CONFORMANCE_CHECK(trace)}

Advanced Features

Window Functions

Location: window_evaluator.cpp, ../include/query/window_evaluator.h

Analytical window functions for advanced aggregation.

Supported Functions:

  • ROW_NUMBER(): Sequential row numbering
  • RANK(): Rank with gaps for ties
  • DENSE_RANK(): Rank without gaps
  • NTILE(n): Distribute rows into n buckets
  • Aggregate window functions: SUM(), AVG(), MIN(), MAX(), COUNT()

Example:

FOR sale IN sales
  WINDOW w AS (
    PARTITION BY sale.region
    ORDER BY sale.amount DESC
  )
  RETURN {
    region: sale.region,
    amount: sale.amount,
    rank: RANK() OVER w,
    running_total: SUM(sale.amount) OVER w
  }

Statistical Aggregation

Location: statistical_aggregator.cpp, ../include/query/statistical_aggregator.h

Advanced statistical functions for COLLECT/GROUP BY operations.

Functions:

  • Basic: COUNT(), SUM(), AVG(), MIN(), MAX()
  • Statistical: STDDEV(), VARIANCE(), MEDIAN(), PERCENTILE()
  • Distributional: MODE(), RANGE(), IQR()

Query Federation

Location: query_federation.cpp, ../include/query/query_federation.h

Distributed query execution coordination across multiple ThemisDB nodes.

Features:

  • Partition-aware query routing
  • Parallel execution across shards
  • Result merging and aggregation
  • Cross-shard joins and unions

Architecture

Query Execution Pipeline

Query String
     ↓
[AQL Parser] ────────> AST (Abstract Syntax Tree)
     ↓
[AQL Translator] ────> Internal Query Representation
     ↓
[Query Optimizer] ───> Optimized Execution Plan
     ↓                   (with predicate ordering, index selection)
[Cache Lookup] ──────> [Query Cache / Semantic Cache]
     ↓ (miss)
[Query Engine] ──────> Storage / Index Managers
     ↓
[Result Stream] ─────> Paginated Results
     ↓
[Cache Store] ───────> Update caches
     ↓
Return Results

Thread Safety Model

  • Parser/Translator: NOT thread-safe, create per-query or use mutex
  • Optimizer: NOT thread-safe, statistics reads are safe
  • Engine: Thread-safe, concurrent query execution supported
  • Caches: Thread-safe, concurrent reads/writes with locking
  • Result Streams: NOT thread-safe, single consumer per stream

Optimization Process

  1. Parse: Query string → AST
  2. Translate: AST → Internal query with predicates
  3. Estimate: Calculate selectivity for each predicate using statistics
  4. Order: Sort predicates by selectivity (most selective first)
  5. Select Strategy: Choose execution strategy (sequential, index-based, hybrid)
  6. Execute: Run optimized plan with short-circuit evaluation
  7. Learn: Update statistics for adaptive optimization

Integration Points

Storage Module Integration

#include "query/query_engine.h"
#include "storage/storage_engine.h"

// Inject storage dependency
auto storage = std::make_shared<StorageEngine>(rocksdb_config);
QueryEngine engine(storage, index_manager);

// Engine uses storage for:
// - Entity retrieval (get, scan, prefix scan)
// - Transaction coordination
// - Batch operations

Index Module Integration

#include "query/query_engine.h"
#include "index/index_manager.h"

// Inject index manager
auto index_mgr = std::make_shared<IndexManager>(storage);
QueryEngine engine(storage, index_mgr);

// Engine uses indexes for:
// - Vector similarity search
// - Spatial range queries
// - Graph traversal
// - Secondary index lookups

LLM Integration (AQL Extensions)

#include "query/llm_aql_handler.h"

LLMAQLHandler llm_handler(llm_engine);

// LLM-specific AQL commands
std::string query = R"(
  MODEL LOAD 'gpt-4' WITH {context_size: 8192}
  INFER 'Summarize this document' ON documents LIMIT 10
)";

auto result = llm_handler.execute(query);

API Reference

Basic Query Execution

#include "query/aql_runner.h"

// High-level API (recommended)
std::string query = "FOR u IN users FILTER u.age > 30 RETURN u";
auto result = executeAql(query, engine);

if (result) {
  nlohmann::json data = result.value();
  std::cout << data.dump(2) << std::endl;
} else {
  std::cerr << "Query error: " << result.error() << std::endl;
}

Per-Query Resource Limits

executeAqlWithLimits() wraps executeAql() and enforces configurable caps on row count, memory, and wall-clock execution time. A limit value of 0 means unlimited for that dimension.

Header: include/query/query_resource_limits.h

#include "query/aql_runner.h"
#include "query/query_resource_limits.h"

QueryResourceLimits limits;
limits.max_rows         = 1000;             // max result rows  (0 = unlimited)
limits.max_memory_bytes = 4 * 1024 * 1024;  // 4 MB result size (0 = unlimited)
limits.timeout_ms       = 5000;             // 5 s wall-clock   (0 = unlimited)

auto result = executeAqlWithLimits(aql, engine, limits);
if (!result) {
  if (result.error().code() == errors::ErrorCode::ERR_QUERY_TIMEOUT) {
    // Query exceeded configured timeout
  } else if (result.error().code() == errors::ErrorCode::ERR_QUERY_RESOURCE_EXHAUSTED) {
    // Row count or memory limit exceeded
  }
}

Enforcement semantics:

Limit Checkpoint Error code
max_rows After execution – checks results array length ERR_QUERY_RESOURCE_EXHAUSTED
max_memory_bytes After execution – uses serialized JSON byte size as proxy ERR_QUERY_RESOURCE_EXHAUSTED
timeout_ms After execution returns – wall-clock from call entry ERR_QUERY_TIMEOUT

Low-level helper — QueryResourceGuard:

For code that produces rows iteratively (e.g. custom execution paths), use the QueryResourceGuard RAII class to track violations incrementally:

#include "query/query_resource_limits.h"

QueryResourceGuard guard(limits);

for (auto& row : rows) {
  auto violation = guard.checkRow(/*estimated_bytes=*/row.dump().size());
  if (violation != QueryResourceGuard::Violation::None) {
    // handle RowLimit / MemoryLimit / Timeout
    break;
  }
  process(row);
}

Query Cancellation via Request ID

executeAqlCancellable() wraps executeAql() and registers the query under a caller-assigned request ID in the process-wide QueryCanceller registry. Any other thread can cancel it at any time by calling QueryCanceller::cancel().

Header: include/query/query_canceller.h

#include "query/aql_runner.h"
#include "query/query_canceller.h"

// Execute (e.g. in an HTTP worker thread):
auto result = executeAqlCancellable(aql, engine, "request-42");

// Cancel from any other thread (e.g. HTTP cancel handler):
bool was_live = QueryCanceller::instance().cancel("request-42");

On cancellation executeAqlCancellable() returns Err(ERR_QUERY_CANCELLED, request_id). The registration is automatically cleaned up when the function returns.

Advanced Query Execution

// Lower-level API (for fine-grained control)
AQLParser parser;
AQLTranslator translator;
QueryOptimizer optimizer;

// Parse
auto ast = parser.parse(query).value();

// Translate
auto internal_query = translator.translate(ast);

// Optimize
auto optimized_plan = optimizer.optimize(internal_query);

// Execute
auto result = engine.execute(optimized_plan);

// Stream results
ResultStream stream(result);
while (stream.hasMore()) {
  auto batch = stream.nextBatch();
  process(batch);
}

Hybrid Query Example

// Vector + Geo hybrid query
std::string query = R"(
  FOR place IN places
    FILTER ST_Distance(place.location, [47.6062, -122.3321]) < 5000
    AND COSINE_SIMILARITY(place.embedding, @query_vector) > 0.9
    SORT COSINE_SIMILARITY(place.embedding, @query_vector) DESC
    LIMIT 10
    RETURN place
)";

// Bind parameters
QueryParameters params;
params.bindVector("query_vector", embedding);

auto result = executeAql(query, engine, params);

Graph Traversal Example

// Shortest path query
std::string query = R"(
  FOR v, e, p IN OUTBOUND SHORTEST_PATH
    'users/alice' TO 'users/bob'
    GRAPH 'social'
    RETURN p
)";

auto result = executeAql(query, engine);

// Recursive path query with depth limit
std::string recursive_query = R"(
  FOR v, e, p IN 1..3 OUTBOUND 'users/alice' GRAPH 'social'
    FILTER v.country == 'USA'
    RETURN p
)";

auto recursive_result = executeAql(recursive_query, engine);

Dependencies

Internal Dependencies

  • storage: Storage engine for data persistence and transaction coordination
  • index: Index manager for vector, spatial, graph, and secondary indexes
  • core: Base types, configuration, metrics
  • llm (optional): LLM engine for INFER/RAG commands

External Dependencies

Required:

  • None (query module has no external dependencies)

Optional:

  • nlohmann/json: JSON parsing (for JSON function library)
  • Boost.Geometry: Advanced geospatial functions

Build Configuration

# CMakeLists.txt
add_library(themisdb_query
  src/query/aql_parser.cpp
  src/query/aql_translator.cpp
  src/query/query_optimizer.cpp
  src/query/query_engine.cpp
  src/query/query_cache.cpp
  src/query/semantic_cache.cpp
  src/query/cte_cache.cpp
  src/query/result_stream.cpp
  src/query/aql_runner.cpp
  # ... additional files
)

target_link_libraries(themisdb_query
  PUBLIC themisdb_storage
  PUBLIC themisdb_index
  PUBLIC themisdb_core
)

Performance Characteristics

Query Execution Latency

  • Simple queries (1-2 predicates, no joins): 1-10ms
  • Complex queries (5-10 predicates, joins): 10-100ms
  • Graph traversal (depth 3-5, 100-1000 edges): 50-500ms
  • Hybrid queries (vector k=10 + spatial radius 5km): 10-50ms
  • Aggregate queries (GROUP BY with 1M rows): 100-1000ms

Optimization Overhead

  • Simple queries: 0.1-1ms
  • Complex queries (10+ predicates): 1-10ms
  • Adaptive learning update: 0.01-0.1ms per query

Cache Performance

  • Query cache hit: ~1μs (saves 10-1000ms+ execution time)
  • Semantic cache hit: ~100μs (includes embedding similarity)
  • CTE cache (in-memory): ~10μs per access
  • CTE cache (disk spill): ~5-50ms per access

Throughput

  • Cached queries: 50K-100K queries/sec
  • Simple queries: 1K-10K queries/sec
  • Complex queries: 100-1K queries/sec
  • Streaming results: 10K-100K rows/sec

Memory Usage

  • Parser/AST: 1-10KB per query
  • Query cache: ~10KB per cached query (configurable max size)
  • Semantic cache: ~20KB per cached query (includes embeddings)
  • CTE cache: 100MB default threshold (configurable)
  • Result stream buffer: 1MB per active stream (configurable)

Tuning Recommendations

  1. Cache Configuration:

    • Set query cache size based on workload repetition (1K-100K entries)
    • Enable semantic cache for analytics workloads with similar queries
    • Increase CTE memory threshold for complex queries with large intermediate results
  2. Optimizer Tuning:

    • Enable adaptive optimization for production workloads
    • Collect statistics on representative data samples
    • Monitor and adjust selectivity estimates periodically
  3. Streaming Configuration:

    • Increase batch size for high-throughput scenarios (1000+ rows)
    • Use CURSOR_BASED pagination for stateless web APIs
    • Use KEYSET pagination for stable, ordered results
  4. Predicate Ordering:

    • Ensure statistics are up-to-date for accurate selectivity estimates
    • Use EXPLAIN to verify predicate ordering in complex queries
    • Consider manual hints for queries where optimizer struggles

Known Limitations

  1. Parser Limitations:

    • No support for subqueries in SELECT clause (only in FILTER/LET)
    • Limited support for correlated subqueries
    • No support for lateral joins (planned for v1.6.0)
  2. Optimizer Limitations:

    • Cost model assumes uniform data distribution (can be inaccurate for skewed data)
    • No multi-query optimization (each query optimized independently)
    • Limited optimization for user-defined functions
  3. Execution Limitations:

    • Graph traversal depth limited to 20 hops (configurable, but memory-intensive)
    • Parallel execution is limited to the parallel-scan operator (parallel_scan.h); intra-operator parallelism within a single plan is planned for v1.6.0
    • Join algorithm selection is adaptive; hash-join and nested-loop variants are implemented in query_engine.cpp
  4. Caching Limitations:

    • Query cache uses exact string matching (whitespace-sensitive)
    • Semantic cache requires embedding model (adds latency)
    • CTE cache disk spill can be slow for very large CTEs (>1GB)
  5. Result Streaming Limitations:

    • OFFSET_LIMIT pagination can be inefficient for large offsets
    • CURSOR_BASED pagination requires stable ordering
    • No support for bidirectional streaming (forward-only)
  6. Function Limitations:

    • Some advanced functions require optional dependencies
    • WASM/Python UDF runtimes are planned; C++ UDFs can be registered via UDFRegistry in include/query/functions/udf_registry.h
    • Limited support for aggregate window functions
  7. Memory Constraints:

    • Large result sets can consume significant memory
    • CTE materialization can exceed memory limits (requires disk spill)
    • Query cache eviction can cause performance variability
  8. Thread Safety:

    • Parser/translator are not thread-safe (must create per-query or lock)
    • Result streams are single-consumer only

Status

Production Ready (as of v1.5.0)

Stable Features:

  • AQL parsing and AST generation
  • Cost-based query optimization
  • Multi-model query execution
  • Graph traversal (BFS/DFS, shortest path)
  • Hybrid queries (vector+geo, fulltext+geo)
  • Query caching (exact, semantic, CTE)
  • Result streaming and pagination
  • Expression evaluation and function registry
  • Subquery and CTE support
  • SQL dialect compatibility layer (SELECT/INSERT/UPDATE/DELETE)
  • SPARQL compatibility for RDF / knowledge-graph queries
  • Per-query resource limits (rows, memory, timeout)
  • Query cancellation via request ID (QueryCanceller)
  • Vectorized execution (column-store batch processing)
  • Cross-cluster federated AQL
  • UDF registration (UDFRegistry)
  • Query plan visualization (EXPLAIN / EXPLAIN ANALYZE)
  • Result type annotations (for SDK code generation)
  • Parallel collection scans (ParallelScanner)
  • Runtime re-optimization from execution statistics

⚠️ Beta Features:

  • Adaptive query optimization (v1.5.0+)
  • Semantic cache (v1.5.0+)
  • Query federation (v1.5.0+)
  • Window functions (v1.5.0+)

🔬 Experimental:

  • NLP-enhanced optimization (v1.5.0+)
  • Distributed query planning (v1.5.0+)

Related Documentation

Last Updated: March 2026
Module Version: v1.5.0
Next Review: v1.6.0 Release

Scientific References

  1. Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., & Price, T. G. (1979). Access Path Selection in a Relational Database Management System. Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, 23–34. https://doi.org/10.1145/582095.582099

  2. Graefe, G., & McKenna, W. J. (1993). The Volcano Model: An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering, 6(1), 120–135. https://doi.org/10.1109/69.273032

  3. Ioannidis, Y. E. (1996). Query Optimization. ACM Computing Surveys, 28(1), 121–123. https://doi.org/10.1145/234313.234367

  4. Graefe, G. (1993). Query Evaluation Techniques for Large Databases. ACM Computing Surveys, 25(2), 73–170. https://doi.org/10.1145/152610.152611

  5. Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., & Neumann, T. (2015). How Good Are Query Optimizers, Really? Proceedings of the VLDB Endowment, 9(3), 204–215. https://doi.org/10.14778/2850583.2850594