Build:
`cmake --preset release && cmake --build build/release --target <target>`
The Scheduler module implements task scheduling and automation for ThemisDB. It enables cron-like periodic execution of AQL queries and custom functions for data processing, maintenance, backup, retention, and analytics workflows. The module includes a generic task scheduler and a specialized hybrid retention manager for time-series data lifecycle management.
| Interface / File | Role |
|---|---|
| `task_scheduler.cpp` | Task scheduling engine with thread pool, cron parsing, dynamic scaling, DAG execution |
| `hybrid_retention_manager.cpp` | Three-stage time-series data lifecycle |
| `distributed_task_coordinator.cpp` | Distributed leader election for scheduled tasks |
| `external_scheduler_adapter.cpp` | Integration with external schedulers (Kubernetes CronJob, Airflow) |
| `task_audit_manager.cpp` | Searchable task execution audit log |
| `task_anomaly_detector.cpp` | Anomaly detection for task execution patterns |
| `event_trigger.cpp` | CDC event-driven task triggers |
| `task_result_store.cpp` | Persistent task execution results |
| `../utils/cron_parser.cpp` | Full cron expression parsing (v1.5.0) |
Maturity: 🟢 Production-Ready — Full cron expression parsing (v1.5.0) complete; thread pool task scheduler, hybrid retention manager, distributed task coordination, DAG execution with conditional branching, SLA alerts, audit history, and dynamic concurrency scaling all production-ready.
In Scope:
- Task scheduler implementation with thread pool
- AQL query execution via QueryEngine integration
- Custom function registration and execution
- Task persistence and recovery from disk
- Hybrid retention manager (3-stage data lifecycle)
- Task statistics, monitoring, and audit logging
- Security validation (AQL injection detection, resource limits)
- Rate limiting and resource management
- OpenTelemetry tracing integration
- Full cron expression parsing (wildcards, ranges, lists with embedded ranges/steps, start/step syntax, month/weekday name aliases, @-specials, 6-field year constraint, timezone-aware scheduling)
- Distributed task coordination across nodes with leader election
- Task dependency DAG execution with conditional branching
- Workflow engine (multi-step DAG with conditional branching)
- Task retry policies (max attempts, exponential/linear/jitter/Fibonacci backoff)
- Scheduled task output persistence (store results in ThemisDB)
- Task execution history with searchable audit log
- SLA monitoring (alert on task failure or SLA breach via Alertmanager)
- Dynamic concurrency scaling based on queue depth
- Integration with external schedulers (Kubernetes CronJob, Apache Airflow)
- CDC event-driven task triggers
- Authenticated user context propagation (`RequestContext` TLS API; audit events carry actual `user_id`/`client_ip` instead of hardcoded `"system"`)
- Sandbox execution (`sandbox_execution` config flag wraps task functions in `ModuleSandbox` for OS-level resource isolation)
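
The retry policies listed above (linear, exponential, Fibonacci, and jittered backoff) can be illustrated with a small delay calculator. This is a minimal sketch, not ThemisDB's retry API: `BackoffKind`, `backoffDelay`, and `withFullJitter` are hypothetical names, and the base delay and cap are assumed parameters.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical names for illustration; not ThemisDB's actual retry API.
enum class BackoffKind { Linear, Exponential, Fibonacci };

// Delay before retry number `attempt` (1-based), never exceeding `max_delay`.
inline std::chrono::milliseconds backoffDelay(BackoffKind kind, unsigned attempt,
                                              std::chrono::milliseconds base,
                                              std::chrono::milliseconds max_delay) {
    attempt = std::max(attempt, 1u);
    const std::int64_t unit = std::max<std::int64_t>(base.count(), 1);
    const std::int64_t cap = std::max<std::int64_t>(max_delay.count(), unit);
    const std::uint64_t max_factor = static_cast<std::uint64_t>(cap / unit);

    std::uint64_t factor = 1;
    switch (kind) {
        case BackoffKind::Linear:       // 1, 2, 3, 4, ...
            factor = attempt;
            break;
        case BackoffKind::Exponential:  // 1, 2, 4, 8, ... (shift clamped to avoid overflow)
            factor = std::uint64_t{1} << std::min(attempt - 1, 62u);
            break;
        case BackoffKind::Fibonacci: {  // 1, 1, 2, 3, 5, 8, ...
            std::uint64_t prev = 1, cur = 1;
            for (unsigned i = 2; i < attempt && cur <= max_factor; ++i) {
                const std::uint64_t next = prev + cur;
                prev = cur;
                cur = next;
            }
            factor = cur;
            break;
        }
    }
    if (factor > max_factor) return std::chrono::milliseconds(cap);
    return std::chrono::milliseconds(unit * static_cast<std::int64_t>(factor));
}

// "Full jitter": pick a uniformly random delay in [0, delay] so that many
// failing tasks do not retry at the same instant.
template <class Rng>
std::chrono::milliseconds withFullJitter(std::chrono::milliseconds delay, Rng& rng) {
    std::uniform_int_distribution<std::int64_t> dist(
        0, std::max<std::int64_t>(delay.count(), 0));
    return std::chrono::milliseconds(dist(rng));
}
```

Capping the multiplier before multiplying keeps the arithmetic overflow-free even for very large attempt counts.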
Out of Scope:
- Authentication/authorization logic (handled by auth module)
- Query parsing (handled by query module)
- Storage operations (handled by storage module)
Location: task_scheduler.cpp, ../include/scheduler/task_scheduler.h
Core scheduler implementation providing periodic task execution with comprehensive security controls and distributed tracing integration.
Thread Safety: All operations are thread-safe with internal locking.
Performance: <1% CPU overhead, 50-200ms task startup latency.
See full documentation in README for implementation details.
Location: hybrid_retention_manager.cpp, ../include/scheduler/hybrid_retention_manager.h
Three-stage data lifecycle management achieving 99.9% storage reduction for time-series data.
Stages:
- Gorilla compression (0-7 days): 10-20x reduction
- Adaptive retention (7-365 days): Variance-based downsampling
- Time-based retention (>1 year): Daily aggregates
See full documentation in README for configuration and usage.
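
As a rough illustration of the three stages, the sketch below maps the age of a data point to the stage that would handle it. The names `RetentionStage` and `stageForAge` are hypothetical, and the handling of the exact 7-day and 365-day boundaries is an assumption; the actual thresholds are configurable in the retention manager.

```cpp
#include <chrono>

// Hypothetical names for illustration; not the HybridRetentionManager API.
enum class RetentionStage { GorillaCompression, AdaptiveRetention, TimeBasedRetention };

// Maps the age of a data point to a lifecycle stage.
// Assumption: boundaries are half-open, i.e. [0, 7d), [7d, 365d), [365d, inf).
inline RetentionStage stageForAge(std::chrono::hours age) {
    const std::chrono::hours seven_days(7 * 24);
    const std::chrono::hours one_year(365 * 24);
    if (age < seven_days) return RetentionStage::GorillaCompression;
    if (age < one_year) return RetentionStage::AdaptiveRetention;
    return RetentionStage::TimeBasedRetention;
}
```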
Location: ../utils/cron_parser.cpp, ../include/utils/cron_parser.h
Full standard cron expression parser supporting all standard syntax elements:
- Wildcards `*`, ranges `1-5`, lists `1,3,5`, steps `*/15`, start/step `5/15`
- List items may contain ranges or steps: `1,3-5,*/10`
- Month name aliases: `JAN`–`DEC` (also full names, case-insensitive)
- Weekday name aliases: `SUN`–`SAT` (also full names, case-insensitive)
- Special expressions: `@daily`, `@hourly`, `@monthly`, `@weekly`, `@yearly`, `@reboot`
- Optional 6-field form with year constraint: `0 9 * * MON 2025`
- Timezone-aware `getNextExecution(from, tz_offset_seconds)` overload
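
To make the numeric syntax above concrete, the following sketch expands a single cron field into the set of values it matches. It is an illustration only and does not reproduce the `CronExpression` API: `expandCronField` and `parseInt` are hypothetical helpers, and name aliases, `@`-specials, and the year field are left out.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical helper: strict integer parse that rejects trailing characters.
inline int parseInt(const std::string& s) {
    std::size_t used = 0;
    const int v = std::stoi(s, &used);  // throws std::invalid_argument on non-numeric input
    if (used != s.size()) throw std::invalid_argument("bad number: " + s);
    return v;
}

// Hypothetical helper, not the CronExpression API: expands one cron field
// such as "1,3-5,*/10" into the matching values within [lo, hi].
inline std::set<int> expandCronField(const std::string& field, int lo, int hi) {
    std::set<int> out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t comma = field.find(',', pos);
        const std::string item = field.substr(
            pos, comma == std::string::npos ? std::string::npos : comma - pos);
        if (item.empty()) throw std::invalid_argument("empty list item");

        // Split off an optional "/step" suffix.
        std::string range = item;
        int step = 1;
        const std::size_t slash = item.find('/');
        const bool has_step = slash != std::string::npos;
        if (has_step) {
            range = item.substr(0, slash);
            step = parseInt(item.substr(slash + 1));
            if (step <= 0) throw std::invalid_argument("step must be positive");
        }

        int first = lo, last = hi;  // "*" covers the whole field range
        if (range != "*") {
            const std::size_t dash = range.find('-');
            if (dash == std::string::npos) {
                first = parseInt(range);
                last = has_step ? hi : first;  // "5/15" means 5, 20, 35, ... up to hi
            } else {
                first = parseInt(range.substr(0, dash));
                last = parseInt(range.substr(dash + 1));
            }
        }
        if (first < lo || last > hi || first > last)
            throw std::invalid_argument("value out of range: " + item);

        for (int v = first; v <= last; v += step) {
            out.insert(v);
            if (last - v < step) break;  // avoids integer overflow on huge steps
        }

        if (comma == std::string::npos) break;
        pos = comma + 1;
    }
    return out;
}
```

For a minute field (`lo = 0`, `hi = 59`), `5/15` expands to 5, 20, 35, 50 and `1,3-5,*/10` expands to 0, 1, 3, 4, 5, 10, 20, 30, 40, 50.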
The following peer-reviewed papers and specifications form the scientific foundation of the Scheduler module's core algorithms and design decisions.
Used in: HybridRetentionManager – Stage 1 Gorilla compression (0–7 days, 10–20× reduction)
Pelkonen, T., Franklin, S., Teller, J., Cavallaro, P., Huang, Q., Meza, J., & Veeraraghavan, K. (2015). Gorilla: A Fast, Scalable, In-Memory Time Series Database. Proceedings of the VLDB Endowment, 8(12), 1816–1827. DOI: 10.14778/2824032.2824078 URL: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
Introduces the Gorilla time-series compression algorithm, which uses delta-of-delta encoding for timestamps and XOR-based encoding for floating-point values, both written as variable-length bit sequences. Achieves 1.37 bytes per data point on average. The GorillaEncoder in src/timeseries/gorilla.* directly implements this algorithm.
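
In the Gorilla paper, timestamps are delta-of-delta encoded and each value is XORed with its predecessor. The sketch below shows only these two transforms, before any variable-length bit packing, to illustrate why regular series compress well (most outputs become zero). It is not the `GorillaEncoder` code; `deltaOfDelta` and `xorWithPrevious` are hypothetical names.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: out[0] is the first timestamp, out[1] the first delta,
// and out[i] for i >= 2 is (t[i] - t[i-1]) - (t[i-1] - t[i-2]).
inline std::vector<std::int64_t> deltaOfDelta(const std::vector<std::int64_t>& ts) {
    std::vector<std::int64_t> out;
    out.reserve(ts.size());
    std::int64_t prev = 0, prev_delta = 0;
    bool first = true;
    for (const std::int64_t t : ts) {
        if (first) {
            out.push_back(t);
            first = false;
        } else {
            const std::int64_t delta = t - prev;
            out.push_back(delta - prev_delta);  // zero when the interval is unchanged
            prev_delta = delta;
        }
        prev = t;
    }
    return out;
}

// Hypothetical helper: XOR of each value's IEEE 754 bit pattern with the
// previous one. out[0] is the raw bit pattern of the first value.
inline std::vector<std::uint64_t> xorWithPrevious(const std::vector<double>& values) {
    std::vector<std::uint64_t> out;
    out.reserve(values.size());
    std::uint64_t prev_bits = 0;
    for (const double v : values) {
        std::uint64_t bits = 0;
        std::memcpy(&bits, &v, sizeof bits);  // well-defined way to read the bits
        out.push_back(bits ^ prev_bits);      // zero when the value repeats
        prev_bits = bits;
    }
    return out;
}
```

For timestamps 1000, 1060, 1120, 1180, 1241 the delta-of-delta stream is 1000, 60, 0, 0, 1, so a fixed scrape interval costs almost nothing to store.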
Used in: TaskScheduler – thread pool with work-stealing deque for task dispatching
Blumofe, R. D., & Leiserson, C. E. (1999). Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM (JACM), 46(5), 720–748. DOI: 10.1145/324133.324234
Provides the theoretical foundation for work-stealing thread pool schedulers. Proves that a work-stealing scheduler achieves optimal time and space bounds for fully strict multithreaded computations. The TaskScheduler's thread pool design follows the work-stealing pattern to maximise CPU utilisation across concurrent task execution.
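
The core idea of a work-stealing deque can be sketched in a few lines: the owning worker pushes and pops at the back (LIFO, cache-friendly), while idle workers steal from the front (FIFO, oldest work first). The mutex-based class below is a simplified illustration under that assumption, not the TaskScheduler's implementation; `WorkStealingQueue` is a hypothetical name, and production designs typically use lock-free deques.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical, simplified work-stealing queue (mutex-based for clarity).
class WorkStealingQueue {
public:
    using Task = std::function<void()>;

    // Owner thread: add new work at the back.
    void push(Task task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
    }

    // Owner thread: take the most recently pushed task (LIFO).
    std::optional<Task> pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return std::nullopt;
        Task task = std::move(tasks_.back());
        tasks_.pop_back();
        return task;
    }

    // Other threads: take the oldest task from the opposite end (FIFO).
    std::optional<Task> steal() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return std::nullopt;
        Task task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }

private:
    std::mutex mutex_;
    std::deque<Task> tasks_;
};
```

A worker would first call `pop()` on its own queue and fall back to `steal()` on another worker's queue only when its own is empty, which keeps contention low while all workers are busy.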
Used in: TaskScheduler – OpenTelemetry distributed tracing integration (utils/tracing.h)
Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., & Shanbhag, C. (2010). Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google Technical Report. URL: https://research.google/pubs/pub36356/
The Dapper paper introduced the span/trace model that is the conceptual basis for all modern distributed tracing systems including OpenTelemetry. ThemisDB's Tracer::startSpan API directly follows the Dapper model of parent/child spans, baggage propagation, and sampling-based trace collection used throughout the scheduler and retention manager.
Used in: CronExpression::parse() – 5-field cron syntax definition and field semantics
IEEE and The Open Group. (2018). IEEE Std 1003.1-2017: Standard for Information Technology — Portable Operating System Interface (POSIX), Base Specifications, Issue 7 — crontab Utility. The Open Group. URL: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html
Defines the canonical cron expression grammar (minute hour day month weekday) and the semantics of wildcards, ranges, and lists. The CronExpression parser in src/utils/cron_parser.cpp implements this grammar and extends it with step values, name aliases (JAN–DEC, SUN–SAT), a start/step shorthand, an optional year field, and @-special shorthand expressions.
Used in: HybridRetentionManager – Stage 2 variance-based adaptive retention (7–365 days)
Steinarsson, S. (2013). Downsampling Time Series for Visual Representation. M.Sc. thesis, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland. URL: http://skemman.is/stream/get/1946/15343/37285/3/SS_MSthesis.pdf
Introduces the Largest Triangle Three Buckets (LTTB) algorithm for visually faithful time-series downsampling. For each bucket, the algorithm keeps the point that forms the largest triangle with the point selected from the previous bucket and the average of the next bucket, which preserves visual features such as peaks and troughs. Stage 2 of the HybridRetentionManager applies variance-based point selection inspired by this approach to determine which data points to retain across the 7–365-day window.
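
For reference, the classic LTTB algorithm from the thesis can be sketched as follows. This is the textbook algorithm, not the variance-based variant used in Stage 2; `Point` and `lttb` are hypothetical names chosen for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double t; double v; };  // hypothetical (timestamp, value) pair

// Classic LTTB: reduce `data` to `threshold` points, always keeping the first
// and last point. Inputs that cannot be reduced are returned unchanged.
inline std::vector<Point> lttb(const std::vector<Point>& data, std::size_t threshold) {
    const std::size_t n = data.size();
    if (threshold >= n || threshold < 3) return data;

    std::vector<Point> sampled;
    sampled.reserve(threshold);
    sampled.push_back(data.front());

    // The n - 2 interior points are split into threshold - 2 buckets.
    const double every = static_cast<double>(n - 2) / static_cast<double>(threshold - 2);
    std::size_t a = 0;  // index of the most recently selected point

    for (std::size_t i = 0; i + 2 < threshold; ++i) {
        // Current bucket is [start, end); the next bucket is [end, next_end).
        const std::size_t start = static_cast<std::size_t>(std::floor(i * every)) + 1;
        const std::size_t end = static_cast<std::size_t>(std::floor((i + 1) * every)) + 1;
        std::size_t next_end = static_cast<std::size_t>(std::floor((i + 2) * every)) + 1;
        if (next_end > n) next_end = n;

        // Average of the next bucket acts as the third triangle corner.
        double avg_t = 0.0, avg_v = 0.0;
        for (std::size_t j = end; j < next_end; ++j) {
            avg_t += data[j].t;
            avg_v += data[j].v;
        }
        const double count = static_cast<double>(next_end - end);
        avg_t /= count;
        avg_v /= count;

        // Keep the point in the current bucket that spans the largest triangle.
        double max_area = -1.0;
        std::size_t best = start;
        for (std::size_t j = start; j < end; ++j) {
            const double area = 0.5 * std::fabs(
                (data[a].t - avg_t) * (data[j].v - data[a].v) -
                (data[a].t - data[j].t) * (avg_v - data[a].v));
            if (area > max_area) {
                max_area = area;
                best = j;
            }
        }
        sampled.push_back(data[best]);
        a = best;
    }

    sampled.push_back(data.back());
    return sampled;
}
```

Because the triangle area grows with a point's deviation from its neighbours, isolated spikes tend to survive downsampling, which is the property the retention stage relies on.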
- Scheduler Headers - Public API
- Storage Module - Data persistence
- Query Module - AQL execution
See FUTURE_ENHANCEMENTS.md for roadmap.
- Liu, C. L., & Layland, J. W. (1973). Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM, 20(1), 46–61. https://doi.org/10.1145/321738.321743
- Silberschatz, A., Galvin, P. B., & Gagne, G. (2018). Operating System Concepts (10th ed.). Wiley. ISBN: 978-1-119-32091-3
- Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., … Woodford, D. (2013). Spanner: Google's Globally Distributed Database. ACM Transactions on Computer Systems, 31(3), 8:1–8:22. https://doi.org/10.1145/2491245
- Quartz Scheduler Development Team. (2023). Quartz Scheduler: Enterprise Job Scheduling. Terracotta. http://www.quartz-scheduler.org/
This module is built as part of ThemisDB. See the root CMakeLists.txt for build configuration.