Operations Guide

Version: 1.4.0-alpha
Last Updated: January 2026
Purpose: Master index for production operations documentation

Overview

This operations guide provides comprehensive documentation for deploying, managing, and troubleshooting ThemisDB in production environments with GPU acceleration.

Target Audience

DevOps Engineers: Deployment and infrastructure management
Site Reliability Engineers: Monitoring, incident response, and reliability
Security Engineers: Security hardening and compliance
System Administrators: Day-to-day operations and maintenance
ML Engineers: Training and inference workload optimization

Quick Links

Getting Started

Pre-Deployment Checklist - Verify readiness before deployment
Deployment Guide - Step-by-step deployment instructions
Post-Deployment Checklist - Validation after deployment

Auto-Scaling & High Availability

Kubernetes HPA Configuration - Horizontal Pod Autoscaler setup
Scaling Runbook - Horizontal and vertical scaling procedures
Load Balancer Integration - NGINX, AWS ALB, GCP LB, Istio, HAProxy

Day-to-Day Operations

Operational Runbooks - Standard operational procedures
- Upgrade Runbook - Zero-downtime upgrade procedures
- Restore Runbook - Backup restoration procedures
- Failover Runbook - Failover and recovery procedures
- Scaling Runbook - Horizontal and vertical scaling
Monitoring Guide - Metrics, dashboards, and alerting
Troubleshooting Guide - Common issues and solutions

Operational Excellence

Disaster Recovery Plan - DR procedures, RTO/RPO, backup strategies
SLA Monitoring - Service level agreements, Prometheus alerts, Grafana dashboards
Auto-Scaling Guide - Kubernetes HPA/VPA, load balancer integration

Security & Compliance

Security Hardening - Security best practices and configuration
Compliance Checklists - SOC2, GDPR, HIPAA compliance
Incident Response - Structured incident handling
Operational Compliance Checklist - Monthly compliance verification

Performance

Performance Tuning - Optimization techniques and best practices
Load Balancer Integration - Load balancer configuration and setup

Disaster Recovery

Disaster Recovery Plan - Complete DR plan with RTO/RPO targets

Documentation Structure

docs/
├── OPERATIONS.md (this file)
└── production/
    ├── DEPLOYMENT.md                  # Installation and configuration
    ├── DISASTER_RECOVERY_PLAN.md      # Complete DR plan with RTO/RPO
    ├── LOAD_BALANCER_INTEGRATION.md   # Load balancer configuration
    ├── PERFORMANCE_TUNING.md          # Optimization guide
    ├── MONITORING.md                  # Observability setup
    ├── TROUBLESHOOTING.md             # Problem resolution
    ├── RUNBOOKS.md                    # Operational procedures
    ├── SECURITY.md                    # Security hardening
    ├── RUNBOOKS/
    │   ├── UPGRADE_RUNBOOK.md         # Upgrade procedures
    │   ├── RESTORE_RUNBOOK.md         # Backup restoration
    │   ├── FAILOVER_RUNBOOK.md        # Failover & recovery
    │   └── SCALING_RUNBOOK.md         # Scaling operations
    ├── CHECKLISTS/
    │   ├── pre_deployment.md          # Pre-deployment validation
    │   ├── post_deployment.md         # Post-deployment validation
    │   ├── incident_response.md       # Incident handling
    │   └── operational_compliance.md  # Monthly compliance check
    └── examples/
        ├── single_gpu_setup.yaml      # Single GPU configuration
        ├── multi_gpu_setup.yaml       # Multi-GPU configuration
        ├── distributed_training.yaml  # Distributed training
        └── raid_configuration.yaml    # High-availability storage

grafana/
└── dashboards/
    └── sla-monitoring.json            # SLA monitoring dashboard

prometheus/
└── alerts/
    └── sla-rules.yml                  # SLA alerting rules

helm/themisdb/
└── templates/
    ├── hpa.yaml                       # Horizontal Pod Autoscaler
    └── servicemonitor.yaml            # Prometheus ServiceMonitor

Deployment Scenarios

Single GPU Development

Use Case: Development, testing, small-scale inference

Documentation: example configuration in docs/production/examples/single_gpu_setup.yaml

Hardware:

1x RTX 3090/4090 or A100
64GB RAM
500GB NVMe SSD

Multi-GPU Production

Use Case: Training workloads, high-throughput inference

Documentation: example configuration in docs/production/examples/multi_gpu_setup.yaml

Hardware:

4-8x A100 or H100 GPUs
256GB+ RAM
2TB+ NVMe RAID

Distributed Multi-Node

Use Case: Large-scale distributed training, enterprise deployments

Documentation: example configuration in docs/production/examples/distributed_training.yaml

Hardware:

Multiple nodes with 4-8 GPUs each
InfiniBand or 100 GbE networking
Shared storage (Lustre, BeeGFS)

Common Tasks

Initial Deployment

Pre-Deployment
- Complete Pre-Deployment Checklist
- Verify hardware compatibility
- Install GPU drivers and CUDA
- Configure networking and storage
Deployment
- Follow Deployment Guide
- Apply appropriate configuration (see examples)
- Configure security settings (Security Guide)
- Set up monitoring (Monitoring Guide)
Post-Deployment
- Complete Post-Deployment Checklist
- Run validation tests
- Verify monitoring and alerting
- Document deployment
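
The post-deployment steps above can be partly scripted. The following is a minimal sketch, not the project's validation suite. It assumes that `themisdb-cli health` exits non-zero when the service is unhealthy (this guide does not confirm that) and that Grafana and Prometheus listen on the localhost ports listed under Monitoring. The helper names `check` and `run_smoke_checks` are hypothetical. `/-/healthy` and `/api/health` are the standard health endpoints of Prometheus and Grafana.

```shell
#!/usr/bin/env bash
# Post-deployment smoke check (sketch; see assumptions in the text above).

failures=0

# Run one named check and count it as failed if the command exits non-zero.
check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    failures=$((failures + 1))
  fi
}

# Run all checks; returns non-zero if any of them failed.
run_smoke_checks() {
  check "ThemisDB health" themisdb-cli health   # assumed to signal failure via exit code
  check "GPU driver"      nvidia-smi
  check "Prometheus"      curl -fsS http://localhost:9090/-/healthy
  check "Grafana"         curl -fsS http://localhost:3000/api/health
  [ "$failures" -eq 0 ]
}

# Usage: run_smoke_checks || echo "$failures check(s) failed"
```

Each check is independent, so one failing component does not hide the state of the others.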

Job Submission

Training Job:

# See detailed procedure in Runbooks
themisdb-cli job submit --config training-job.yaml

Inference Deployment:

# See detailed procedure in Runbooks
themisdb-cli inference deploy --model llama-2-7b
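
It can be useful to gate submission on a health check so that jobs are not queued against an unhealthy service. The sketch below combines the `health` and `job submit` commands from this guide. The function name `submit_if_healthy` and the `THEMIS_CLI` override variable are hypothetical, and the script assumes that `themisdb-cli health` reports failure through its exit code.

```shell
#!/usr/bin/env bash
# Submit a job only if the config file exists and the health check passes (sketch).

# Allow the CLI binary to be overridden, e.g. for testing.
THEMIS_CLI="${THEMIS_CLI:-themisdb-cli}"

submit_if_healthy() {
  local config="$1"
  if [ ! -f "$config" ]; then
    echo "config not found: $config" >&2
    return 2
  fi
  if ! "$THEMIS_CLI" health; then
    echo "health check failed; job not submitted" >&2
    return 1
  fi
  "$THEMIS_CLI" job submit --config "$config"
}

# Usage: submit_if_healthy training-job.yaml
```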

Checkpoint Management

# Save checkpoint
themisdb-cli checkpoint save --job-id <job-id>

# Restore from checkpoint
themisdb-cli job restore --checkpoint <checkpoint-id>

See the Checkpoint Management Runbook.
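
A checkpoint is most valuable when taken right before a disruptive action. The sketch below chains three commands that appear in this guide: checkpoint save, full backup, and service stop. The ordering, the `DRY_RUN` switch, and the function names `run` and `checkpoint_and_stop` are assumptions for illustration, not a documented procedure. Follow the runbook for the authoritative steps.

```shell
#!/usr/bin/env bash
# Checkpoint a job, take a full backup, then stop the service (sketch).
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

checkpoint_and_stop() {
  local job_id="$1"
  if [ -z "$job_id" ]; then
    echo "usage: checkpoint_and_stop <job-id>" >&2
    return 2
  fi
  # Stop at the first failing step so the service is never stopped
  # without a checkpoint and a backup in place.
  run themisdb-cli checkpoint save --job-id "$job_id" || return 1
  run themisdb-cli backup create --type full || return 1
  run sudo systemctl stop themisdb
}

# Usage: DRY_RUN=1 checkpoint_and_stop <job-id>
```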

LoRA Adapter Deployment

# Deploy LoRA adapter
themisdb-cli lora deploy --adapter custom-adapter

See the LoRA Deployment Runbook.

Monitoring

View GPU Metrics:

Grafana Dashboard: http://localhost:3000
Prometheus: http://localhost:9090
GPU Status: nvidia-smi dmon
Monitoring Guide: Monitoring Setup, Key Metrics, Alerting Rules

SLA Monitoring:

SLA Dashboard - Track availability, latency, and error budgets
SLA Alerting Rules - Prometheus alerts for SLA breaches
Targets: 99.9% availability, P95 latency < 200 ms, error rate < 0.1%
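
To make the availability target concrete, it helps to convert it into an error budget: the downtime allowed within a window. The sketch below does that arithmetic. The 30-day window and the function name `error_budget_minutes` are assumptions; this guide does not state the SLA measurement window.

```shell
#!/usr/bin/env bash
# Allowed downtime in minutes for an availability target over a window.
#   $1 = availability target in percent (e.g. 99.9)
#   $2 = window length in days (e.g. 30)
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

error_budget_minutes 99.9 30   # prints 43.2
```

At 99.9% over 30 days the budget is 43.2 minutes, so a single long incident can consume most of a month's allowance.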

Troubleshooting

For GPU issues, performance issues, and training issues, see the Troubleshooting Guide (TROUBLESHOOTING.md).

Incident Response

Quick Response

For production incidents:

Assess Severity (P0-Critical, P1-High, P2-Medium, P3-Low)
Follow Incident Response Checklist: incident_response.md
Use Troubleshooting Guide: TROUBLESHOOTING.md
Execute Emergency Procedures: RUNBOOKS.md#emergency-procedures

Common Incidents

Incident	Response Guide
GPU Failure	Troubleshooting - GPU Errors
Out of Memory	Troubleshooting - OOM
Service Down	Runbooks - Emergency
Security Breach	Runbooks - Security Incident
Data Corruption	Troubleshooting - Data Corruption

Maintenance

Regular Maintenance

Daily:

Monitor GPU health and utilization
Review logs for errors
Check disk space
Verify backups completed

Weekly:

Review performance metrics
Check for GPU driver updates
Review security logs
Update documentation

Monthly:

RAID scrub (if applicable)
Security patching
Performance tuning review
Disaster recovery test

See Runbooks - Maintenance Windows
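
The daily disk space check is easy to script. The sketch below filters `df -P` output for file systems at or above a usage threshold. The 85% threshold in the usage line and the function name `disks_over` are assumptions, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Print mount points whose usage is at or above a threshold (in percent).
# Reads `df -P` output on stdin; $1 is the threshold.
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)              # strip the percent sign from the Capacity column
    if (use + 0 >= limit) print $6 " " use "%"
  }'
}

# Usage: df -P | disks_over 85
```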

Planned Maintenance

Schedule maintenance window
Follow Maintenance Procedures
Complete Post-Deployment Checklist

Security

Security Checklist

See Security Hardening Guide

Compliance

Supported Standards:

SOC 2
GDPR
HIPAA

See Security - Compliance

Performance Optimization

Quick Wins

Enable Mixed Precision: up to 2-3x speedup, depending on model and GPU
- See Performance Tuning - Mixed Precision
Optimize Batch Size: Maximize GPU utilization
- See Performance Tuning - Batch Size
Enable Gradient Checkpointing: 60-80% memory savings, at the cost of extra compute for recomputation
- See Performance Tuning - VRAM Optimization
Use Flash Attention: 15-25% speedup, 30% memory reduction
- See Performance Tuning - Flash Attention

Performance Targets

Metric	Target
GPU Utilization (Training)	>85%
Inference P95 Latency	<100ms
Training Throughput	>1000 samples/sec
GPU Memory Usage	<90%
Uptime	>99.9%
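
The two GPU targets in the table can be spot-checked from `nvidia-smi` output. The sketch below is one possible approach, not a replacement for the Grafana dashboards. It reads CSV lines produced by `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits` and labels each GPU against the >85% utilization and <90% memory targets. The function name `gpu_target_report` is hypothetical, and the utilization target only applies while a training job is running.

```shell
#!/usr/bin/env bash
# Compare per-GPU utilization and memory usage with the targets above.
# Input (stdin): lines of "index, utilization.gpu, memory.used, memory.total".
gpu_target_report() {
  awk -F', *' '{
    mem_pct = ($4 > 0) ? 100 * $3 / $4 : 0
    util_ok = ($2 > 85) ? "ok" : "LOW"       # target: utilization > 85% (training)
    mem_ok  = (mem_pct < 90) ? "ok" : "HIGH" # target: memory usage < 90%
    printf "gpu=%s util=%d%% (%s) mem=%.0f%% (%s)\n", $1, $2, util_ok, mem_pct, mem_ok
  }'
}

# Usage:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_target_report
```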

Support

Getting Help

Check Documentation: Search this operations guide
Review Logs: sudo journalctl -u themisdb -f
Run Diagnostics: themisdb-cli debug dump
Search Issues: https://github.com/makr-code/ThemisDB/issues
Community Forum: https://github.com/makr-code/ThemisDB/discussions

Support Channels

GitHub Issues: Bug reports and feature requests
Discussions: Questions and community support
Documentation: This operations guide
Emergency: Follow on-call procedures

Creating Support Bundle

# Collect diagnostic information
themisdb-cli support-bundle --output /tmp/support-bundle.tar.gz

# Attach /tmp/support-bundle.tar.gz to your support request

Additional Resources

External Resources

Changelog

Version 1.0 (January 2026)

Initial production operations documentation
Comprehensive deployment guides
Performance tuning guidelines
Security hardening procedures
Operational runbooks
Troubleshooting guides
Example configurations

Feedback

We welcome feedback on this documentation:

Submit issues: https://github.com/makr-code/ThemisDB/issues
Contribute improvements: CONTRIBUTING.md
Discuss: https://github.com/makr-code/ThemisDB/discussions

Document Version: 1.0
Last Updated: January 2026
Next Review: April 2026

Quick Reference:

# Health check
themisdb-cli health

# GPU status
nvidia-smi

# Submit job
themisdb-cli job submit --config job.yaml

# Monitor job
themisdb-cli job status <job-id>

# View logs
sudo journalctl -u themisdb -f

# Backup
themisdb-cli backup create --type full

# Emergency stop
sudo systemctl stop themisdb
