Priority: P2 (High - Operational Excellence)
Context
After fixing nested virtualization (Issue #8), we need comprehensive observability to detect and diagnose issues early.
Objective
Gain real-time visibility into Azure ML infrastructure and job health through monitoring, metrics, and alerting.
Implementation Plan: Week 2
Task 1: Integrate Azure Monitor [6 hours]
Collect infrastructure metrics from Azure ML compute:
New file: openadapt_evals/benchmarks/monitoring.py
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

class AzureMonitoringService:
    """Queries Azure Monitor for compute and job health (sketch)."""

    def __init__(self):
        self.client = MetricsQueryClient(DefaultAzureCredential())

    def get_compute_metrics(self, compute_name: str) -> dict:
        # Query the Azure Monitor API for CPU, memory, and disk
        # metrics for the compute instance and return them as a dict.
        ...

    def check_job_health(self, job_name: str) -> bool:
        # A job is healthy if it is making progress:
        # - fetch recent logs (last 5 minutes)
        # - scan for known error patterns
        # - verify progress indicators
        ...
Metrics to track:
- CPU utilization
- Memory utilization
- Disk utilization
- Container startup time
- Docker pull duration
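A minimal sketch of how raw utilization samples might be collapsed into the tracked metrics; the `summarize_samples` helper and its field names are illustrative, not part of the existing codebase:

```python
from statistics import mean

def summarize_samples(samples: list[dict]) -> dict:
    """Collapse per-minute utilization samples into a summary dict.

    Each sample is assumed to look like:
    {"cpu": 42.0, "memory": 61.5, "disk": 30.2}  # percent utilization
    """
    return {
        "cpu_utilization": mean(s["cpu"] for s in samples),
        "memory_utilization": mean(s["memory"] for s in samples),
        "disk_utilization": mean(s["disk"] for s in samples),
        "peak_cpu": max(s["cpu"] for s in samples),
    }
```

Container startup time and Docker pull duration would come from job timestamps rather than utilization samples, so they are omitted here.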
Task 2: Enhanced Live Dashboard [4 hours]
Extend the existing LiveEvaluationTracker with infrastructure metrics.
Updates to: openadapt_evals/benchmarks/live_tracker.py
Add infrastructure metrics, performance tracking, and cost monitoring to the live dashboard.
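One way the extension could look; since the existing LiveEvaluationTracker is not shown here, the base behavior, field names, and hourly rate below are all assumptions:

```python
import time

class InfraAwareTracker:
    """Hypothetical dashboard tracker that folds infrastructure
    metrics and a running cost estimate into its state."""

    def __init__(self, hourly_rate_usd: float = 0.90):  # placeholder VM rate
        self.hourly_rate_usd = hourly_rate_usd
        self.started_at = time.time()
        self.infra: dict = {}

    def record_infra(self, metrics: dict) -> None:
        # Keep the latest infrastructure snapshot for display.
        self.infra = metrics

    def snapshot(self) -> dict:
        # Combine infrastructure state with an elapsed-time cost estimate.
        elapsed_h = (time.time() - self.started_at) / 3600
        return {
            "infra": self.infra,
            "estimated_cost_usd": round(elapsed_h * self.hourly_rate_usd, 4),
        }
```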
Task 3: Alerting System [4 hours]
Implement critical and warning alerts:
Critical alerts (immediate action):
- Job stuck with no progress (>10 minutes)
- Container startup timeout (>10 minutes)
- Compute node unhealthy (CPU/memory/disk >95%)
Warning alerts (monitor and escalate):
- Slow task execution (>2x average)
- High retry rate (>20% of jobs)
- Slow Docker pull (>5 minutes)
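The thresholds above can be sketched as a pure function over a metrics snapshot; the snapshot field names are assumptions, and wiring to real Azure Monitor data is left out:

```python
def evaluate_alerts(m: dict) -> dict:
    """Classify a metrics snapshot into critical and warning alerts
    using the thresholds listed in the plan."""
    critical, warning = [], []
    if m.get("minutes_without_progress", 0) > 10:
        critical.append("job stuck: no progress for >10 minutes")
    if m.get("container_startup_minutes", 0) > 10:
        critical.append("container startup timeout (>10 minutes)")
    for res in ("cpu", "memory", "disk"):
        if m.get(f"{res}_utilization", 0) > 95:
            critical.append(f"compute node unhealthy: {res} >95%")
    # Missing average defaults to infinity, so no spurious warning fires.
    if m.get("task_seconds", 0) > 2 * m.get("avg_task_seconds", float("inf")):
        warning.append("slow task execution (>2x average)")
    if m.get("retry_rate", 0) > 0.20:
        warning.append("high retry rate (>20% of jobs)")
    if m.get("docker_pull_minutes", 0) > 5:
        warning.append("slow Docker pull (>5 minutes)")
    return {"critical": critical, "warning": warning}
```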
Deliverables
Success Criteria
Time Estimate
Total: 14 hours (2-3 days)
Multi-Layer Monitoring Architecture
Layer 1: Infrastructure Monitoring (Azure Monitor) - VM health, CPU, memory, disk
Layer 2: Job Lifecycle Monitoring - Job submission to completion
Layer 3: Application Monitoring (WAA tasks) - Task success/failure rates
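The three layers above could be captured as a simple routing table mapping each signal to the layer that collects it; this is an illustrative sketch, not existing code:

```python
MONITORING_LAYERS = {
    "infrastructure": {  # Layer 1: Azure Monitor
        "signals": ["cpu", "memory", "disk"],
        "source": "Azure Monitor",
    },
    "job_lifecycle": {  # Layer 2: submission to completion
        "signals": ["queued", "starting", "running", "completed"],
        "source": "Azure ML job status",
    },
    "application": {  # Layer 3: WAA task outcomes
        "signals": ["task_success_rate", "task_failure_rate"],
        "source": "WAA task results",
    },
}

def layer_for_signal(signal: str):
    # Return the monitoring layer that owns a given signal name.
    for layer, spec in MONITORING_LAYERS.items():
        if signal in spec["signals"]:
            return layer
    return None
```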
Implementation Guide
Reference: /tmp/AZURE_LONG_TERM_SOLUTION.md - Sections 5 and 7 (Phase 2)
Why This Matters
- Early detection of infrastructure issues
- Proactive alerting vs. reactive debugging
- Cost visibility and control
- Foundation for self-healing infrastructure
Dependencies
Related