Build: cmake --preset release && cmake --build build/release --target <target>

ThemisDB Scheduler Module

The Scheduler module implements task scheduling and automation for ThemisDB. It runs AQL queries and custom functions on cron-like schedules for data processing, maintenance, backup, retention, and analytics workflows. The module includes a generic task scheduler and a specialized hybrid retention manager for time-series data lifecycle management.

Relevant Interfaces

| Interface / File | Role |
|---|---|
| `task_scheduler.cpp` | Task scheduling engine with thread pool, cron parsing, dynamic scaling, DAG execution |
| `hybrid_retention_manager.cpp` | Three-stage time-series data lifecycle |
| `distributed_task_coordinator.cpp` | Distributed leader election for scheduled tasks |
| `external_scheduler_adapter.cpp` | Integration with external schedulers (Kubernetes CronJob, Airflow) |
| `task_audit_manager.cpp` | Searchable task execution audit log |
| `task_anomaly_detector.cpp` | Anomaly detection for task execution patterns |
| `event_trigger.cpp` | CDC event-driven task triggers |
| `task_result_store.cpp` | Persistent task execution results |
| `../utils/cron_parser.cpp` | Full cron expression parsing (v1.5.0) |

Current Delivery Status

Maturity: 🟢 Production-Ready. Full cron expression parsing is complete as of v1.5.0. The thread pool task scheduler, hybrid retention manager, distributed task coordination, DAG execution with conditional branching, SLA alerts, audit history, and dynamic concurrency scaling are all production-ready.

Scope

In Scope:

Task scheduler implementation with thread pool
AQL query execution via QueryEngine integration
Custom function registration and execution
Task persistence and recovery from disk
Hybrid retention manager (3-stage data lifecycle)
Task statistics, monitoring, and audit logging
Security validation (AQL injection detection, resource limits)
Rate limiting and resource management
OpenTelemetry tracing integration
Full cron expression parsing (wildcards, ranges, lists with embedded ranges/steps, start/step syntax, month/weekday name aliases, @-specials, 6-field year constraint, timezone-aware scheduling)
Distributed task coordination across nodes with leader election
Task dependency DAG execution with conditional branching
Workflow engine (multi-step DAG with conditional branching)
Task retry policies (max attempts, exponential/linear/jitter/Fibonacci backoff)
Scheduled task output persistence (store results in ThemisDB)
Task execution history with searchable audit log
SLA monitoring (alert on task failure or SLA breach via Alertmanager)
Dynamic concurrency scaling based on queue depth
Integration with external schedulers (Kubernetes CronJob, Apache Airflow)
CDC event-driven task triggers
Authenticated user context propagation (RequestContext TLS API; audit events carry actual user_id/client_ip instead of hardcoded "system")
Sandbox execution (sandbox_execution config flag wraps task functions in ModuleSandbox for OS-level resource isolation)

Out of Scope:

Authentication/authorization logic (handled by auth module)
Query parsing (handled by query module)
Storage operations (handled by storage module)
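
The retry policies listed in scope (max attempts with exponential, linear, jitter, or Fibonacci backoff) can be illustrated with a small delay calculator. This is a minimal sketch under stated assumptions: the names `BackoffKind`, `RetryPolicy`, `shouldRetry`, `backoffDelay`, and `withFullJitter` are hypothetical and are not taken from `task_scheduler.h`, and the default values are placeholders.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical names for illustration; not taken from task_scheduler.h.
enum class BackoffKind { Linear, Exponential, Fibonacci };

struct RetryPolicy {
    BackoffKind kind = BackoffKind::Exponential;
    int max_attempts = 5;                    // total tries, including the first run
    std::chrono::milliseconds base{100};     // delay unit
    std::chrono::milliseconds cap{30000};    // upper bound for any single delay
};

// True while another attempt is still allowed.
inline bool shouldRetry(const RetryPolicy& p, int attempts_made) {
    return attempts_made < p.max_attempts;
}

// Delay before retry number `retry` (1 = first retry), capped at p.cap.
inline std::chrono::milliseconds backoffDelay(const RetryPolicy& p, int retry) {
    assert(retry >= 1);
    const int n = std::min(retry, 60);  // keeps the arithmetic far from overflow
    std::int64_t factor = 1;
    switch (p.kind) {
        case BackoffKind::Linear:       // 1, 2, 3, 4, ...
            factor = n;
            break;
        case BackoffKind::Exponential:  // 1, 2, 4, 8, ...
            factor = std::int64_t{1} << (n - 1);
            break;
        case BackoffKind::Fibonacci: {  // 1, 1, 2, 3, 5, ...
            std::int64_t a = 1, b = 1;
            for (int i = 1; i < n; ++i) {
                const std::int64_t next = a + b;
                a = b;
                b = next;
            }
            factor = a;
            break;
        }
    }
    const std::int64_t base = std::max<std::int64_t>(p.base.count(), 1);
    const std::int64_t cap = p.cap.count();
    if (factor > cap / base) return p.cap;  // product would exceed the cap
    return std::chrono::milliseconds(base * factor);
}

// "Full jitter": pick a uniformly random delay in [0, d] so that many
// failing tasks do not all retry at the same instant.
template <class Rng>
std::chrono::milliseconds withFullJitter(std::chrono::milliseconds d, Rng& rng) {
    std::uniform_int_distribution<std::int64_t> dist(0, d.count());
    return std::chrono::milliseconds(dist(rng));
}
```

With the placeholder defaults, the fourth exponential retry waits 800 ms, and late retries saturate at the 30 s cap. Jitter is kept as a separate wrapper so it can be combined with any of the three growth curves.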

Key Components

TaskScheduler

Location: task_scheduler.cpp, ../include/scheduler/task_scheduler.h

Core scheduler implementation providing periodic task execution with comprehensive security controls and distributed tracing integration.

Thread Safety: All operations are thread-safe with internal locking.

Performance: <1% CPU overhead, 50-200ms task startup latency.

See full documentation in README for implementation details.

HybridRetentionManager

Location: hybrid_retention_manager.cpp, ../include/scheduler/hybrid_retention_manager.h

Three-stage data lifecycle management achieving 99.9% storage reduction for time-series data.

Stages:

Gorilla compression (0-7 days): 10-20x reduction
Adaptive retention (7-365 days): Variance-based downsampling
Time-based retention (>1 year): Daily aggregates

See full documentation in README for configuration and usage.
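
As a rough illustration of the three stages, the sketch below maps the age of a data point to the stage that would handle it. The enum and function names are hypothetical and not part of `hybrid_retention_manager.h`. The boundary handling is also an assumption: the stage list does not say which stage owns a point that is exactly 7 or exactly 365 days old, so this sketch assigns both to the adaptive stage.

```cpp
#include <cassert>
#include <chrono>
#include <ratio>

// Hypothetical names for illustration only.
enum class RetentionStage {
    GorillaCompression,    // 0-7 days: lossless compression
    AdaptiveDownsampling,  // 7-365 days: variance-based downsampling
    DailyAggregates        // older than 1 year: daily aggregates
};

// Day-length duration that also works before C++20's std::chrono::days.
using Days = std::chrono::duration<long long, std::ratio<86400>>;

// Picks the lifecycle stage for a data point of the given age.
// Assumption: ages of exactly 7 and exactly 365 days fall into stage 2.
inline RetentionStage stageForAge(std::chrono::hours age) {
    assert(age.count() >= 0);
    if (age < Days{7}) return RetentionStage::GorillaCompression;
    if (age <= Days{365}) return RetentionStage::AdaptiveDownsampling;
    return RetentionStage::DailyAggregates;
}
```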

CronExpression Parser

Location: ../utils/cron_parser.cpp, ../include/utils/cron_parser.h

Full standard cron expression parser supporting all standard syntax elements:

Wildcards *, ranges 1-5, lists 1,3,5, steps */15, start/step 5/15
List items may contain ranges or steps: 1,3-5,*/10
Month name aliases: JAN–DEC (also full names, case-insensitive)
Weekday name aliases: SUN–SAT (also full names, case-insensitive)
Special expressions: @daily, @hourly, @monthly, @weekly, @yearly, @reboot
Optional 6-field form with year constraint: 0 9 * * MON 2025
Timezone-aware getNextExecution(from, tz_offset_seconds) overload
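
To make the field syntax above concrete, here is a minimal sketch that expands a single numeric cron field into the set of values it matches. `expandCronField` is a hypothetical helper, not the `CronExpression` API. It does not handle name aliases, @-specials, or timezones, and it assumes that start/step (`5/15`) means "from 5 to the field maximum in steps of 15".

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical helper, not the CronExpression API: expands one numeric cron
// field such as "1,3-5,*/10" into the matching values within [lo, hi].
inline std::set<int> expandCronField(const std::string& field, int lo, int hi) {
    assert(lo <= hi);
    const auto npos = std::string::npos;
    auto toInt = [](const std::string& s) -> int {
        if (s.empty() || s.size() > 4 ||
            s.find_first_not_of("0123456789") != std::string::npos) {
            throw std::invalid_argument("bad number in cron field: '" + s + "'");
        }
        return std::stoi(s);
    };

    std::set<int> out;
    std::size_t pos = 0;
    while (true) {
        // Take the next comma-separated list item.
        const std::size_t comma = field.find(',', pos);
        std::string item =
            field.substr(pos, comma == npos ? npos : comma - pos);

        // Optional "/step" suffix.
        int step = 1;
        bool hasStep = false;
        const std::size_t slash = item.find('/');
        if (slash != npos) {
            step = toInt(item.substr(slash + 1));
            if (step <= 0) throw std::invalid_argument("step must be positive");
            item = item.substr(0, slash);
            hasStep = true;
        }

        // Base: "*", "a-b", or a single value (open-ended when a step follows).
        int first = 0, last = 0;
        const std::size_t dash = item.find('-');
        if (item == "*") {
            first = lo;
            last = hi;
        } else if (dash != npos) {
            first = toInt(item.substr(0, dash));
            last = toInt(item.substr(dash + 1));
        } else {
            first = toInt(item);
            last = hasStep ? hi : first;
        }
        if (first < lo || last > hi || first > last) {
            throw std::invalid_argument("cron field value out of range");
        }
        for (int v = first; v <= last; v += step) out.insert(v);

        if (comma == npos) break;
        pos = comma + 1;
    }
    return out;
}
```

For a minute field (0 to 59), `*/15` expands to 0, 15, 30, 45 and `5/15` to 5, 20, 35, 50. A full parser would apply such a helper to each field and then search forward in time for the next instant that matches all of them.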

Scientific Foundations

The following peer-reviewed papers and specifications form the scientific foundation of the Scheduler module's core algorithms and design decisions.

[1] Gorilla: A Fast, Scalable, In-Memory Time Series Database

Used in: HybridRetentionManager – Stage 1 Gorilla compression (0–7 days, 10–20× reduction)

Pelkonen, T., Franklin, S., Teller, J., Cavallaro, P., Huang, Q., Meza, J., & Veeraraghavan, K. (2015). Gorilla: A Fast, Scalable, In-Memory Time Series Database. Proceedings of the VLDB Endowment, 8(12), 1816–1827. DOI: 10.14778/2824032.2824078 URL: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Introduces the Gorilla time-series compression scheme: delta-of-delta encoding with variable-length bit codes for timestamps, and XOR-based encoding for floating-point values. The paper reports an average of 1.37 bytes per data point, down from 16 bytes uncompressed. The GorillaEncoder in src/timeseries/gorilla.* directly implements this algorithm.
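
The two transforms at the heart of the Gorilla scheme can be sketched without the bit-packing layer. The code below is not the GorillaEncoder from src/timeseries/gorilla.*; the function names are hypothetical. It only shows why the scheme compresses well: regularly spaced timestamps turn into runs of zeros after delta-of-delta, and slowly changing values produce XOR results with many zero bits. The sorted-timestamp assertion is an assumption of this sketch (an append-only series).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(double) == sizeof(std::uint64_t), "expects 64-bit doubles");

// Timestamps: the first is kept raw, the second becomes a plain delta,
// and every later entry is the change in delta (delta-of-delta).
inline std::vector<std::int64_t> deltaOfDelta(const std::vector<std::int64_t>& ts) {
    assert(std::is_sorted(ts.begin(), ts.end()));  // sketch assumption
    std::vector<std::int64_t> out;
    out.reserve(ts.size());
    std::int64_t prev = 0, prevDelta = 0;
    for (std::size_t i = 0; i < ts.size(); ++i) {
        if (i == 0) {
            out.push_back(ts[0]);
        } else {
            const std::int64_t delta = ts[i] - prev;
            out.push_back(delta - prevDelta);
            prevDelta = delta;
        }
        prev = ts[i];
    }
    return out;
}

// Reinterprets a double as its IEEE 754 bit pattern.
inline std::uint64_t bitsOf(double v) {
    std::uint64_t b = 0;
    std::memcpy(&b, &v, sizeof b);
    return b;
}

// Values: the first is kept raw, every later entry is XORed with its
// predecessor. Equal neighbours yield 0; similar ones yield few set bits.
inline std::vector<std::uint64_t> xorWithPrevious(const std::vector<double>& values) {
    std::vector<std::uint64_t> out;
    out.reserve(values.size());
    std::uint64_t prev = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        const std::uint64_t bits = bitsOf(values[i]);
        out.push_back(i == 0 ? bits : (bits ^ prev));
        prev = bits;
    }
    return out;
}
```

For timestamps 1000, 1060, 1120, 1181 the transform yields 1000, 60, 0, 1. A real encoder then writes those small numbers with short variable-length bit codes, and stores only the meaningful bits of each XOR result.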

[2] Scheduling Multithreaded Computations by Work Stealing

Used in: TaskScheduler – thread pool with work-stealing deque for task dispatching

Blumofe, R. D., & Leiserson, C. E. (1999). Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM (JACM), 46(5), 720–748. DOI: 10.1145/324133.324234

Provides the theoretical foundation for work-stealing thread pool schedulers. Proves that, for fully strict multithreaded computations, a randomized work-stealing scheduler on P processors achieves expected execution time T1/P + O(T∞), using at most P times the space of a serial execution. The TaskScheduler's thread pool design follows the work-stealing pattern to maximise CPU utilisation across concurrent task execution.
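
The work-stealing pattern can be sketched with one queue per worker: the owner takes its newest task, and an idle worker takes the oldest task from another queue. This is a simplified, mutex-based illustration and not the TaskScheduler's actual thread pool. `WorkQueue` and `runOne` are hypothetical names, and victims are scanned round-robin here, whereas the paper's analysis assumes random victim selection.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// Hypothetical per-worker queue. A production pool would typically use a
// lock-free deque; a mutex keeps this sketch short and obviously correct.
class WorkQueue {
public:
    void push(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push_back(std::move(t));
    }
    // Owner side: newest task first (LIFO), good for cache locality.
    std::optional<Task> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        Task t = std::move(q_.back());
        q_.pop_back();
        return std::optional<Task>(std::move(t));
    }
    // Thief side: oldest task first (FIFO), away from the owner's end.
    std::optional<Task> steal() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        Task t = std::move(q_.front());
        q_.pop_front();
        return std::optional<Task>(std::move(t));
    }

private:
    std::mutex m_;
    std::deque<Task> q_;
};

// One scheduling step for worker `self`: run local work if there is any,
// otherwise try to steal from the other workers. Returns false when idle.
inline bool runOne(std::vector<WorkQueue>& queues, std::size_t self) {
    assert(self < queues.size());
    if (auto t = queues[self].pop()) {
        (*t)();
        return true;
    }
    for (std::size_t i = 1; i < queues.size(); ++i) {
        WorkQueue& victim = queues[(self + i) % queues.size()];
        if (auto t = victim.steal()) {
            (*t)();
            return true;
        }
    }
    return false;
}
```

In a real pool each worker thread would call a step like `runOne` in a loop and block on a condition variable when every queue is empty.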

[3] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Used in: TaskScheduler – OpenTelemetry distributed tracing integration (utils/tracing.h)

Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., & Shanbhag, C. (2010). Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google Technical Report. URL: https://research.google/pubs/pub36356/

The Dapper paper introduced the span/trace model that is the conceptual basis for all modern distributed tracing systems including OpenTelemetry. ThemisDB's Tracer::startSpan API directly follows the Dapper model of parent/child spans, baggage propagation, and sampling-based trace collection used throughout the scheduler and retention manager.

[4] POSIX crontab Utility Specification (IEEE Std 1003.1-2017)

Used in: CronExpression::parse() – 5-field cron syntax definition and field semantics

IEEE and The Open Group. (2018). IEEE Std 1003.1-2017: Standard for Information Technology — Portable Operating System Interface (POSIX), Base Specifications, Issue 7 — crontab Utility. The Open Group. URL: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html

Defines the portable cron expression grammar (minute, hour, day of month, month, day of week) and the semantics of wildcards, ranges, and lists. The CronExpression parser in src/utils/cron_parser.cpp accepts this 5-field syntax and extends it with step values (*/15), a start/step shorthand, name aliases (JAN–DEC, SUN–SAT), an optional year field, and @-special shorthand expressions.

[5] Downsampling Time Series for Visual Representation (LTTB)

Used in: HybridRetentionManager – Stage 2 variance-based adaptive retention (7–365 days)

Steinarsson, S. (2013). Downsampling Time Series for Visual Representation. M.Sc. thesis, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland. URL: http://skemman.is/stream/get/1946/15343/37285/3/SS_MSthesis.pdf

Introduces the Largest Triangle Three Buckets (LTTB) algorithm for time-series downsampling aimed at visual fidelity. The series is split into buckets, and from each bucket the algorithm keeps the point that forms the largest triangle with the previously selected point and the average of the next bucket. This tends to preserve visual features such as peaks and troughs. Stage 2 of the HybridRetentionManager applies variance-based point selection inspired by this approach to determine which data points to retain across the 7–365-day window.
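
For reference, a compact sketch of plain LTTB as described in the thesis follows. It is not the HybridRetentionManager's Stage 2 code, which is only described as inspired by LTTB. The `Point` type and the `lttb` function name are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point {
    double x;  // e.g. timestamp
    double y;  // measured value
};

// Downsamples `data` to `threshold` points with Largest Triangle Three
// Buckets. The first and last points are always kept.
inline std::vector<Point> lttb(const std::vector<Point>& data, std::size_t threshold) {
    assert(threshold >= 3);
    const std::size_t n = data.size();
    if (threshold >= n) return data;  // nothing to drop

    std::vector<Point> out;
    out.reserve(threshold);
    // Width of each of the (threshold - 2) interior buckets.
    const double every =
        static_cast<double>(n - 2) / static_cast<double>(threshold - 2);

    std::size_t a = 0;  // index of the previously selected point
    out.push_back(data[0]);

    for (std::size_t i = 0; i + 2 < threshold; ++i) {
        // Average point of the NEXT bucket (third triangle corner).
        const std::size_t avgStart =
            static_cast<std::size_t>(std::floor((i + 1) * every)) + 1;
        const std::size_t avgEnd = std::min(
            static_cast<std::size_t>(std::floor((i + 2) * every)) + 1, n);
        double avgX = 0.0, avgY = 0.0;
        for (std::size_t j = avgStart; j < avgEnd; ++j) {
            avgX += data[j].x;
            avgY += data[j].y;
        }
        const double len = static_cast<double>(avgEnd - avgStart);
        avgX /= len;
        avgY /= len;

        // Current bucket: keep the point with the largest triangle area.
        const std::size_t from =
            static_cast<std::size_t>(std::floor(i * every)) + 1;
        const std::size_t to =
            static_cast<std::size_t>(std::floor((i + 1) * every)) + 1;
        double maxArea = -1.0;
        std::size_t best = from;
        for (std::size_t j = from; j < to; ++j) {
            const double area =
                std::fabs((data[a].x - avgX) * (data[j].y - data[a].y) -
                          (data[a].x - data[j].x) * (avgY - data[a].y)) * 0.5;
            if (area > maxArea) {
                maxArea = area;
                best = j;
            }
        }
        out.push_back(data[best]);
        a = best;
    }

    out.push_back(data[n - 1]);
    return out;
}
```

On a flat ten-point series with a single spike, downsampling to four points keeps the endpoints and the spike, which is the behaviour that makes this family of methods attractive for retention.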

Future Enhancements

See FUTURE_ENHANCEMENTS.md for roadmap.

Installation

This module is built as part of ThemisDB. See the root CMakeLists.txt for build configuration.
