Version: 1.4.0-alpha
Last Updated: January 2026
Purpose: Master index for production operations documentation
This operations guide provides comprehensive documentation for deploying, managing, and troubleshooting ThemisDB in production environments with GPU acceleration.
- DevOps Engineers: Deployment and infrastructure management
- Site Reliability Engineers: Monitoring, incident response, and reliability
- Security Engineers: Security hardening and compliance
- System Administrators: Day-to-day operations and maintenance
- ML Engineers: Training and inference workload optimization
- Pre-Deployment Checklist - Verify readiness before deployment
- Deployment Guide - Step-by-step deployment instructions
- Post-Deployment Checklist - Validation after deployment
- Kubernetes HPA Configuration - Horizontal Pod Autoscaler setup
- Scaling Runbook - Horizontal and vertical scaling procedures
- Load Balancer Integration - NGINX, AWS ALB, GCP LB, Istio, HAProxy
- Operational Runbooks - Standard operational procedures
- Upgrade Runbook - Zero-downtime upgrade procedures
- Restore Runbook - Backup restoration procedures
- Failover Runbook - Failover and recovery procedures
- Scaling Runbook - Horizontal and vertical scaling
- Monitoring Guide - Metrics, dashboards, and alerting
- Troubleshooting Guide - Common issues and solutions
- Disaster Recovery Plan - DR procedures, RTO/RPO, backup strategies
- SLA Monitoring - Service level agreements, Prometheus alerts, Grafana dashboards
- Auto-Scaling Guide - Kubernetes HPA/VPA, load balancer integration
- Security Hardening - Security best practices and configuration
- Compliance Checklists - SOC2, GDPR, HIPAA compliance
- Incident Response - Structured incident handling
- Operational Compliance Checklist - Monthly compliance verification
- Performance Tuning - Optimization techniques and best practices
- Load Balancer Integration - Load balancer configuration and setup
- Disaster Recovery Plan - Complete DR plan with RTO/RPO targets
docs/
├── OPERATIONS.md                       (this file)
└── production/
    ├── DEPLOYMENT.md                   # Installation and configuration
    ├── DISASTER_RECOVERY_PLAN.md       # Complete DR plan with RTO/RPO
    ├── LOAD_BALANCER_INTEGRATION.md    # Load balancer configuration
    ├── PERFORMANCE_TUNING.md           # Optimization guide
    ├── MONITORING.md                   # Observability setup
    ├── TROUBLESHOOTING.md              # Problem resolution
    ├── RUNBOOKS.md                     # Operational procedures
    ├── SECURITY.md                     # Security hardening
    ├── RUNBOOKS/
    │   ├── UPGRADE_RUNBOOK.md          # Upgrade procedures
    │   ├── RESTORE_RUNBOOK.md          # Backup restoration
    │   ├── FAILOVER_RUNBOOK.md         # Failover & recovery
    │   └── SCALING_RUNBOOK.md          # Scaling operations
    ├── CHECKLISTS/
    │   ├── pre_deployment.md           # Pre-deployment validation
    │   ├── post_deployment.md          # Post-deployment validation
    │   ├── incident_response.md        # Incident handling
    │   └── operational_compliance.md   # Monthly compliance check
    └── examples/
        ├── single_gpu_setup.yaml       # Single GPU configuration
        ├── multi_gpu_setup.yaml        # Multi-GPU configuration
        ├── distributed_training.yaml   # Distributed training
        └── raid_configuration.yaml     # High-availability storage

grafana/
└── dashboards/
    └── sla-monitoring.json             # SLA monitoring dashboard

prometheus/
└── alerts/
    └── sla-rules.yml                   # SLA alerting rules

helm/themisdb/
└── templates/
    ├── hpa.yaml                        # Horizontal Pod Autoscaler
    └── servicemonitor.yaml             # Prometheus ServiceMonitor
Use Case: Development, testing, small-scale inference
Documentation:
Hardware:
- 1x RTX 3090/4090 or A100
- 64GB RAM
- 500GB NVMe SSD
Use Case: Training workloads, high-throughput inference
Documentation:
Hardware:
- 4-8x A100 or H100 GPUs
- 256GB+ RAM
- 2TB+ NVMe RAID
Use Case: Large-scale distributed training, enterprise deployments
Documentation:
Hardware:
- Multiple nodes with 4-8 GPUs each
- InfiniBand or 100 GbE networking
- Shared storage (Lustre, BeeGFS)
- Pre-Deployment
  - Complete Pre-Deployment Checklist
  - Verify hardware compatibility
  - Install GPU drivers and CUDA
  - Configure networking and storage
- Deployment
  - Follow Deployment Guide
  - Apply appropriate configuration (see examples)
  - Configure security settings (Security Guide)
  - Set up monitoring (Monitoring Guide)
- Post-Deployment
  - Complete Post-Deployment Checklist
  - Run validation tests
  - Verify monitoring and alerting
  - Document deployment
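The validation step above can be scripted as a lightweight smoke-test harness. The sketch below is illustrative: the commented-out `themisdb-cli` checks come from this guide's quick reference, and the final example check is a stand-in for your environment's real validations.

```shell
#!/usr/bin/env sh
# Post-deployment smoke-test harness (sketch). Each check runs a command
# and reports PASS/FAIL without aborting the whole run.
run_check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
  fi
}

# Replace these with your deployment's real validations, e.g.:
#   run_check "service health" themisdb-cli health
#   run_check "gpu visible"    nvidia-smi
run_check "root filesystem mounted" df /
```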
Training Job:
# See detailed procedure in Runbooks
themisdb-cli job submit --config training-job.yaml
Inference Deployment:
# See detailed procedure in Runbooks
themisdb-cli inference deploy --model llama-2-7b
# Save checkpoint
themisdb-cli checkpoint save --job-id <job-id>
# Restore from checkpoint
themisdb-cli job restore --checkpoint <checkpoint-id>
# Deploy LoRA adapter
themisdb-cli lora deploy --adapter custom-adapter
View GPU Metrics:
- Grafana Dashboard: http://localhost:3000
- Prometheus: http://localhost:9090
- GPU Status: nvidia-smi dmon
SLA Monitoring:
- SLA Dashboard - Track availability, latency, and error budgets
- SLA Alerting Rules - Prometheus alerts for SLA breaches
- Target: 99.9% availability, P95 < 200ms, < 0.1% error rate
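As one way to encode the error-rate target, a Prometheus alerting rule along these lines could live alongside sla-rules.yml. This is a sketch only: the `http_requests_total` metric and the `job="themisdb"` label are assumptions about your instrumentation, not a documented ThemisDB export.

```yaml
groups:
  - name: themisdb-sla
    rules:
      - alert: ErrorBudgetBurn
        # Fires when the 5xx ratio exceeds the 0.1% error-rate target.
        # Metric/label names below are assumptions; match your exporter.
        expr: |
          sum(rate(http_requests_total{job="themisdb",code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="themisdb"}[5m])) > 0.001
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 0.1% SLA target"
```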
GPU Issues:
Performance Issues:
Training Issues:
For production incidents:
- Assess Severity (P0-Critical, P1-High, P2-Medium, P3-Low)
- Follow Incident Response Checklist: incident_response.md
- Use Troubleshooting Guide: TROUBLESHOOTING.md
- Execute Emergency Procedures: RUNBOOKS.md#emergency-procedures
| Incident | Response Guide |
|---|---|
| GPU Failure | Troubleshooting - GPU Errors |
| Out of Memory | Troubleshooting - OOM |
| Service Down | Runbooks - Emergency |
| Security Breach | Runbooks - Security Incident |
| Data Corruption | Troubleshooting - Data Corruption |
Daily:
- Monitor GPU health and utilization
- Review logs for errors
- Check disk space
- Verify backups completed
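The disk-space item above is easy to automate with a small threshold check. In this sketch, the 80% limit and the `/` mount point are illustrative defaults, not ThemisDB requirements.

```shell
#!/usr/bin/env sh
# Daily disk-space check (sketch): warn when usage on a mount point
# crosses a percentage threshold.
check_disk() {
  mount=$1
  limit=$2
  # Column 5 of POSIX `df -P` output is "Use%"; strip the % sign.
  used=$(df -P "$mount" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $mount at ${used}% (limit ${limit}%)"
  else
    echo "OK: $mount at ${used}% (limit ${limit}%)"
  fi
}

check_disk / 80
```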
Weekly:
- Review performance metrics
- Check for GPU driver updates
- Review security logs
- Update documentation
Monthly:
- RAID scrub (if applicable)
- Security patching
- Performance tuning review
- Disaster recovery test
See Runbooks - Maintenance Windows
- Schedule maintenance window
- Follow Maintenance Procedures
- Complete Post-Deployment Checklist
- TLS 1.3 configured
- mTLS enabled for inter-node communication
- Disk encryption enabled
- Audit logging configured
- GPU access controls configured
- Key rotation automated
- Security monitoring active
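One way to spot regressions against this checklist is a config audit script. The sketch below greps a key=value style config file for expected hardening keys; the file path and key names in the usage comment are hypothetical, so substitute the ones your deployment actually uses.

```shell
#!/usr/bin/env sh
# Security-config audit (sketch): verify that expected hardening keys
# are present in a key=value style config file.
audit_config() {
  conf=$1; shift
  missing=0
  for key in "$@"; do
    if grep -q "^${key}[[:space:]]*=" "$conf"; then
      echo "present: $key"
    else
      echo "MISSING: $key"
      missing=1
    fi
  done
  return $missing
}

# Hypothetical path and key names mirroring the checklist above:
#   audit_config /etc/themisdb/server.conf \
#     tls_min_version mtls_enabled disk_encryption audit_logging
```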
Supported Standards:
- SOC 2
- GDPR
- HIPAA
- Enable Mixed Precision: 2-3x speedup
- Optimize Batch Size: Maximize GPU utilization
- Enable Gradient Checkpointing: 60-80% memory savings
- Use Flash Attention: 15-25% speedup, 30% memory reduction
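These techniques are typically switched on through the training job configuration. The fragment below is a sketch only; every key name is hypothetical, so check the example configs under docs/production/examples/ for the actual schema.

```yaml
# Hypothetical training-job fragment illustrating the tuning knobs above.
training:
  mixed_precision: bf16        # ~2-3x speedup on recent GPU generations
  batch_size: auto             # tune upward until GPU utilization plateaus
  gradient_checkpointing: true # trades recompute for 60-80% memory savings
  attention_impl: flash        # 15-25% speedup, ~30% less memory
```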
| Metric | Target |
|---|---|
| GPU Utilization (Training) | >85% |
| Inference P95 Latency | <100ms |
| Training Throughput | >1000 samples/sec |
| GPU Memory Usage | <90% |
| Uptime | >99.9% |
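A quick way to compare live numbers against these targets is to read utilization from `nvidia-smi` and test it against the threshold. The sketch below keeps the comparison separate so it runs anywhere; the `nvidia-smi` query is left as a comment for machines without GPUs.

```shell
#!/usr/bin/env sh
# Compare an observed GPU utilization percentage against the >85%
# training target from the table above (sketch).
check_util() {
  util=$1
  target=$2
  if [ "$util" -gt "$target" ]; then
    echo "meets target: ${util}% > ${target}%"
  else
    echo "below target: ${util}% <= ${target}%"
  fi
}

# On a GPU host, feed in a live reading, e.g.:
#   util=$(nvidia-smi --query-gpu=utilization.gpu \
#          --format=csv,noheader,nounits | head -1)
#   check_util "$util" 85
check_util 92 85
```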
- Check Documentation: Search this operations guide
- Review Logs: sudo journalctl -u themisdb -f
- Run Diagnostics: themisdb-cli debug dump
- Search Issues: https://github.com/makr-code/ThemisDB/issues
- Community Forum: https://github.com/makr-code/ThemisDB/discussions
- GitHub Issues: Bug reports and feature requests
- Discussions: Questions and community support
- Documentation: This operations guide
- Emergency: Follow on-call procedures
# Collect diagnostic information
themisdb-cli support-bundle --output /tmp/support-bundle.tar.gz
# Include in support request
- Initial production operations documentation
- Comprehensive deployment guides
- Performance tuning guidelines
- Security hardening procedures
- Operational runbooks
- Troubleshooting guides
- Example configurations
We welcome feedback on this documentation:
- Submit issues: https://github.com/makr-code/ThemisDB/issues
- Contribute improvements: CONTRIBUTING.md
- Discuss: https://github.com/makr-code/ThemisDB/discussions
Document Version: 1.0
Last Updated: January 2026
Next Review: April 2026
Quick Reference:
# Health check
themisdb-cli health
# GPU status
nvidia-smi
# Submit job
themisdb-cli job submit --config job.yaml
# Monitor job
themisdb-cli job status <job-id>
# View logs
sudo journalctl -u themisdb -f
# Backup
themisdb-cli backup create --type full
# Emergency stop
sudo systemctl stop themisdb