Priority: P2 (High - Operational Excellence)
Context
After fixing nested virtualization (Issue #8), we need comprehensive observability to detect and diagnose issues early.
Objective
Gain real-time visibility into Azure ML infrastructure and job health through monitoring, metrics, and alerting.
Implementation Plan: Week 2
Task 1: Integrate Azure Monitor [6 hours]
Collect infrastructure metrics from Azure ML compute:
New file: openadapt_evals/benchmarks/monitoring.py
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

class AzureMonitoringService:
    """Queries Azure Monitor for compute and job health (sketch)."""

    def __init__(self):
        self.client = MetricsQueryClient(DefaultAzureCredential())

    def get_compute_metrics(self, compute_name: str) -> dict:
        # Query the Azure Monitor API for CPU, memory, and disk
        # metrics for the compute instance and return them as a dict.
        ...

    def check_job_health(self, job_name: str) -> bool:
        # A job is healthy if it is making progress:
        # - fetch recent logs (last 5 minutes)
        # - scan for known error patterns
        # - verify progress indicators
        ...
Metrics to track:
- CPU utilization
- Memory utilization
- Disk utilization
- Container startup time
- Docker pull duration
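A minimal sketch of how raw utilization samples might be collapsed into the tracked metrics; the `summarize_samples` helper and its field names are illustrative, not part of the existing codebase:

```python
from statistics import mean

def summarize_samples(samples: list[dict]) -> dict:
    """Collapse per-minute utilization samples into a summary dict.

    Each sample is assumed to look like:
    {"cpu": 42.0, "memory": 61.5, "disk": 30.2}  # percent utilization
    """
    return {
        "cpu_utilization": mean(s["cpu"] for s in samples),
        "memory_utilization": mean(s["memory"] for s in samples),
        "disk_utilization": mean(s["disk"] for s in samples),
        "peak_cpu": max(s["cpu"] for s in samples),
    }
```

Container startup time and Docker pull duration would come from job timestamps rather than utilization samples, so they are omitted here.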
Task 2: Enhanced Live Dashboard [4 hours]
Extend the existing LiveEvaluationTracker with infrastructure metrics.
Updates to: openadapt_evals/benchmarks/live_tracker.py
Add infrastructure metrics, performance tracking, and cost monitoring to the live dashboard.
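One way the extension could look; since the existing LiveEvaluationTracker is not shown here, the base behavior, field names, and hourly rate below are all assumptions:

```python
import time

class InfraAwareTracker:
    """Hypothetical dashboard tracker that folds infrastructure
    metrics and a running cost estimate into its state."""

    def __init__(self, hourly_rate_usd: float = 0.90):  # placeholder VM rate
        self.hourly_rate_usd = hourly_rate_usd
        self.started_at = time.time()
        self.infra: dict = {}

    def record_infra(self, metrics: dict) -> None:
        # Keep the latest infrastructure snapshot for display.
        self.infra = metrics

    def snapshot(self) -> dict:
        # Combine infrastructure state with an elapsed-time cost estimate.
        elapsed_h = (time.time() - self.started_at) / 3600
        return {
            "infra": self.infra,
            "estimated_cost_usd": round(elapsed_h * self.hourly_rate_usd, 4),
        }
```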
Task 3: Alerting System [4 hours]
Implement critical and warning alerts:
Critical alerts (immediate action):
- Job stuck with no progress (>10 minutes)
- Container startup timeout (>10 minutes)
- Compute node unhealthy (CPU/memory/disk >95%)
Warning alerts (monitor and escalate):
- Slow task execution (>2x average)
- High retry rate (>20% of jobs)
- Slow Docker pull (>5 minutes)
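The thresholds above can be sketched as a pure function over a metrics snapshot; the snapshot field names are assumptions, and wiring to real Azure Monitor data is left out:

```python
def evaluate_alerts(m: dict) -> dict:
    """Classify a metrics snapshot into critical and warning alerts
    using the thresholds listed in the plan."""
    critical, warning = [], []
    if m.get("minutes_without_progress", 0) > 10:
        critical.append("job stuck: no progress for >10 minutes")
    if m.get("container_startup_minutes", 0) > 10:
        critical.append("container startup timeout (>10 minutes)")
    for res in ("cpu", "memory", "disk"):
        if m.get(f"{res}_utilization", 0) > 95:
            critical.append(f"compute node unhealthy: {res} >95%")
    # Missing average defaults to infinity, so no spurious warning fires.
    if m.get("task_seconds", 0) > 2 * m.get("avg_task_seconds", float("inf")):
        warning.append("slow task execution (>2x average)")
    if m.get("retry_rate", 0) > 0.20:
        warning.append("high retry rate (>20% of jobs)")
    if m.get("docker_pull_minutes", 0) > 5:
        warning.append("slow Docker pull (>5 minutes)")
    return {"critical": critical, "warning": warning}
```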
Deliverables
Success Criteria
Time Estimate
Total: 14 hours (2-3 days)
Multi-Layer Monitoring Architecture
Layer 1: Infrastructure Monitoring (Azure Monitor) - VM health, CPU, memory, disk
Layer 2: Job Lifecycle Monitoring - Job submission to completion
Layer 3: Application Monitoring (WAA tasks) - Task success/failure rates
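The three layers above could be captured as a simple routing table mapping each signal to the layer that collects it; this is an illustrative sketch, not existing code:

```python
MONITORING_LAYERS = {
    "infrastructure": {  # Layer 1: Azure Monitor
        "signals": ["cpu", "memory", "disk"],
        "source": "Azure Monitor",
    },
    "job_lifecycle": {  # Layer 2: submission to completion
        "signals": ["queued", "starting", "running", "completed"],
        "source": "Azure ML job status",
    },
    "application": {  # Layer 3: WAA task outcomes
        "signals": ["task_success_rate", "task_failure_rate"],
        "source": "WAA task results",
    },
}

def layer_for_signal(signal: str):
    # Return the monitoring layer that owns a given signal name.
    for layer, spec in MONITORING_LAYERS.items():
        if signal in spec["signals"]:
            return layer
    return None
```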
Implementation Guide
Reference: /tmp/AZURE_LONG_TERM_SOLUTION.md - Sections 5 and 7 (Phase 2)
Why This Matters
- Early detection of infrastructure issues
- Proactive alerting vs. reactive debugging
- Cost visibility and control
- Foundation for self-healing infrastructure
Dependencies
Related