Binary incident prediction from multivariate cloud-monitoring metrics using a sliding-window formulation, trained with scikit-learn and PyTorch.
```bash
pip install -r requirements.txt
python train.py                              # trains all models, prints test-set results
jupyter notebook incident_prediction.ipynb   # full interactive walkthrough
```

```bash
python train.py --W 60 --H 15 --n_steps 6000 --seed 42 [--no-lstm]
```
| Flag | Default | Meaning |
|---|---|---|
| `--W` | 60 | Look-back window size (steps) |
| `--H` | 15 | Alert horizon: how far ahead to warn (steps) |
| `--n_steps` | 6000 | Dataset length |
```
.
├── README.md
├── requirements.txt
├── train.py                    ← CLI training + evaluation script
├── incident_prediction.ipynb   ← Primary deliverable (interactive notebook)
└── src/
    ├── data_generation.py      ← Synthetic time-series with incident injection
    ├── features.py             ← Sliding-window construction + feature extraction
    ├── models.py               ← LR, RF, GB, BiLSTM definitions
    └── evaluation.py           ← Metrics, threshold search, temporal split
```
Cloud-operations teams spend much of their time reacting to incidents after those incidents have already degraded the user experience. The goal is to shift from reactive to proactive alerting: warning teams before an incident begins, not after.
Success from a business perspective means:
- Reducing mean time to detection (MTTD) by issuing alerts earlier.
- Reducing false-alarm rate to prevent on-call fatigue.
- Maintaining high recall, so that real incidents are not missed.
Translate the business goal into a supervised binary classification problem:
Given the previous W timesteps of M monitoring metrics, output a probability score p̂ ∈ [0, 1] that an incident will begin within the next H timesteps.
An alert fires when p̂ ≥ τ, where threshold τ is tuned on a held-out validation set.
```
◄────── W steps (look-back) ──────►      ◄──── H steps (horizon) ────►
[t-W, ..., t-2, t-1, t]                  [t+1, t+2, ..., t+H]
        input features                   label = 1 if any incident here
```
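The windowing above can be sketched in a few lines (a minimal illustration, not the code in `src/features.py`; `make_windows` and the per-step `incident` flag are our names):

```python
import numpy as np

def make_windows(X, incident, W=60, H=15):
    """Build (window, label) pairs from a (T, M) metric array.

    X        : (T, M) array of metric values
    incident : (T,)   0/1 flag, 1 while an incident is active
    Returns windows (N, W, M) and labels (N,), where label = 1
    if any incident step falls inside the horizon (t, t+H].
    """
    windows, labels = [], []
    T = X.shape[0]
    for t in range(W - 1, T - H):
        windows.append(X[t - W + 1 : t + 1])                   # W look-back steps
        labels.append(int(incident[t + 1 : t + 1 + H].any()))  # H-step horizon
    return np.stack(windows), np.array(labels)

# Toy example: one metric, an incident active at steps 8-9
X = np.arange(12, dtype=float).reshape(-1, 1)
inc = np.zeros(12)
inc[8:10] = 1
w, y = make_windows(X, inc, W=4, H=3)
# w.shape == (6, 4, 1); y == [0, 0, 1, 1, 1, 1]
```

Note that a window ending at step t never sees the horizon steps it is labeled from, which is what makes the task predictive rather than detective.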
| Question | Decision |
|---|---|
| What counts as an incident? | Any period where at least one metric breaches a severity threshold for ≥1 step |
| What granularity? | 1-minute scrape interval (standard Grafana/Prometheus default) |
| How early must the alert fire? | Configurable via H; default H = 15 min |
| Single-service or multi-service? | Single service per model; multi-service via one-model-per-service |
| Real-time or batch? | Real-time: inference on each new scrape |
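The incident definition in the first row reduces to a one-line check (a sketch; `incident_flag` and the threshold values are illustrative, the real labels come from `src/data_generation.py`):

```python
import numpy as np

def incident_flag(X, thresholds):
    """Per-step 0/1 incident flag: 1 whenever at least one metric
    breaches its severity threshold at that step (the '>= 1 step'
    rule from the design table). X: (T, M); thresholds: (M,)."""
    return (X > np.asarray(thresholds)).any(axis=1).astype(int)

# Toy example: 2 metrics with thresholds 0.9 (cpu) and 200 (latency, ms)
X = np.array([[0.50, 120],
              [0.95, 130],   # cpu breach
              [0.60, 250],   # latency breach
              [0.40, 100]])
flag = incident_flag(X, [0.9, 200])
# flag == [0, 1, 1, 0]
```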
```
┌─────────────────────────────────────────────────────────────────────┐
│                            DATA PIPELINE                            │
│                                                                     │
│   Prometheus / Grafana                                              │
│   (cpu, memory, latency,    ->  Ring Buffer    ->  Feature          │
│    error_rate, disk_io,         (W = 60 steps)     Extractor        │
│    net_throughput, ...)                            (statistical     │
│                                                     descriptors or  │
│                                                     raw sequence)   │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                         ┌─────────▼──────────┐
                         │   Trained Model    │
                         │  (RF / GB / BiLSTM)│
                         └─────────┬──────────┘
                                   │  p̂ ∈ [0,1]
                         ┌─────────▼──────────┐
                         │   Threshold Gate   │
                         │      p̂ ≥ τ ?       │
                         └──────┬──────┬──────┘
                                │ YES  │ NO
                    ┌───────────▼┐   ┌─▼──────────┐
                    │   ALERT    │   │ No action  │
                    │ (PagerDuty,│   │ (cooldown  │
                    │   Slack)   │   │ if active) │
                    └────────────┘   └────────────┘
```
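The threshold gate plus cooldown from the diagram above could look like this (a sketch with assumed parameter names; the repository's alerting step may differ):

```python
class ThresholdGate:
    """Fire an alert when p_hat >= tau, then suppress further
    alerts for `cooldown` steps to avoid paging on every scrape
    of a sustained incident."""

    def __init__(self, tau=0.5, cooldown=15):
        self.tau = tau
        self.cooldown = cooldown
        self._step = 0
        self._quiet_until = 0  # alerts suppressed up to this step index

    def __call__(self, p_hat):
        self._step += 1
        if p_hat >= self.tau and self._step > self._quiet_until:
            self._quiet_until = self._step + self.cooldown
            return True   # -> page PagerDuty / Slack
        return False      # no action (or still cooling down)

# One call per scrape: fires at the first breach, stays quiet
# during the cooldown, then fires again once it expires.
gate = ThresholdGate(tau=0.7, cooldown=3)
fired = [gate(p) for p in [0.2, 0.8, 0.9, 0.95, 0.1, 0.85]]
# fired == [False, True, False, False, False, True]
```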
Training pipeline (offline):

```
Raw time series    ->   Sliding-window   ->   Feature matrix   ->   Train / Val / Test
+ incident labels       (W, H params)         (N × 40)              (chronological split)
                              │
                              ▼
                        Sequence tensor  ->   BiLSTM training
                        (N × W × M)
```
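The two branches above, a flat feature matrix for LR/RF/GB and a raw sequence tensor for the BiLSTM, can be sketched as follows (assuming M = 8 metrics and 5 descriptors per metric to reach 40 columns; the actual descriptor set lives in `src/features.py`):

```python
import numpy as np

def window_features(windows):
    """Collapse (N, W, M) windows into a flat feature matrix with
    5 statistical descriptors per metric: mean, std, min, max, and
    a simple last-minus-first trend. With M = 8 metrics this gives
    the N × 40 matrix shown above; only a sketch of the real code."""
    feats = [
        windows.mean(axis=1),
        windows.std(axis=1),
        windows.min(axis=1),
        windows.max(axis=1),
        windows[:, -1, :] - windows[:, 0, :],  # trend over the window
    ]
    return np.concatenate(feats, axis=1)       # (N, 5 * M)

# Same windows feed both model families:
X_seq = np.random.default_rng(0).normal(size=(100, 60, 8))  # BiLSTM input
X_tab = window_features(X_seq)                              # (100, 40)
```

Keeping both representations derived from one windowing step guarantees the tree models and the BiLSTM are evaluated on identical (window, label) pairs.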