Binary incident prediction from multivariate cloud-monitoring metrics using a sliding-window formulation, trained with scikit-learn and PyTorch.
```bash
pip install -r requirements.txt
python train.py                              # trains all models, prints test-set results
jupyter notebook incident_prediction.ipynb   # full interactive walkthrough
```

```bash
python train.py --W 60 --H 15 --n_steps 6000 --seed 42 [--no-lstm]
```
| Flag | Default | Meaning |
|---|---|---|
| `--W` | 60 | Look-back window size (steps) |
| `--H` | 15 | Alert horizon: how far ahead to warn (steps) |
| `--n_steps` | 6000 | Dataset length |
```
.
├── README.md
├── requirements.txt
├── train.py                    ← CLI training + evaluation script
├── incident_prediction.ipynb   ← Primary deliverable (interactive notebook)
└── src/
    ├── data_generation.py      ← Synthetic time-series with incident injection
    ├── features.py             ← Sliding-window construction + feature extraction
    ├── models.py               ← LR, RF, GB, BiLSTM definitions
    └── evaluation.py           ← Metrics, threshold search, temporal split
```
Cloud-operations teams spend much of their time reacting to incidents after those incidents have already degraded the user experience. The goal is to shift from reactive to proactive alerting: warning teams before an incident begins, not after.
Success from a business perspective means:
- Reducing mean time to detection (MTTD) by issuing alerts earlier.
- Reducing false-alarm rate to prevent on-call fatigue.
- Maintaining high recall, so that real incidents are not missed.
Translate the business goal into a supervised binary classification problem:
Given the previous W timesteps of M monitoring metrics, output a probability score p̂ ∈ [0, 1] that an incident will begin within the next H timesteps.
An alert fires when p̂ ≥ τ, where threshold τ is tuned on a held-out validation set.
```
◄────── W steps (look-back) ──────►      ◄──── H steps (horizon) ────►
[t-W, ..., t-2, t-1, t]                  [t+1, t+2, ..., t+H]
        input features                   label = 1 if any incident here
```
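The windowing above can be sketched in a few lines (a minimal illustration, not the code in `src/features.py`; `make_windows` and the per-step `incident` flag are our names):

```python
import numpy as np

def make_windows(X, incident, W=60, H=15):
    """Build (window, label) pairs from a (T, M) metric array.

    X        : (T, M) array of metric values
    incident : (T,)   0/1 flag, 1 while an incident is active
    Returns windows (N, W, M) and labels (N,), where label = 1
    if any incident step falls inside the horizon (t, t+H].
    """
    windows, labels = [], []
    T = X.shape[0]
    for t in range(W - 1, T - H):
        windows.append(X[t - W + 1 : t + 1])                   # W look-back steps
        labels.append(int(incident[t + 1 : t + 1 + H].any()))  # H-step horizon
    return np.stack(windows), np.array(labels)

# Toy example: one metric, an incident active at steps 8-9
X = np.arange(12, dtype=float).reshape(-1, 1)
inc = np.zeros(12)
inc[8:10] = 1
w, y = make_windows(X, inc, W=4, H=3)
# w.shape == (6, 4, 1); y == [0, 0, 1, 1, 1, 1]
```

Note that a window ending at step t never sees the horizon steps it is labeled from, which is what makes the task predictive rather than detective.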
| Question | Decision |
|---|---|
| What counts as an incident? | Any period where at least one metric breaches a severity threshold for ≥1 step |
| What granularity? | 1-minute scrape interval (standard Grafana/Prometheus default) |
| How early must the alert fire? | Configurable via H; default H = 15 min |
| Single-service or multi-service? | Single service per model; multi-service via one-model-per-service |
| Real-time or batch? | Real-time: inference on each new scrape |
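The incident definition in the first row reduces to a one-line check (a sketch; `incident_flag` and the threshold values are illustrative, the real labels come from `src/data_generation.py`):

```python
import numpy as np

def incident_flag(X, thresholds):
    """Per-step 0/1 incident flag: 1 whenever at least one metric
    breaches its severity threshold at that step (the '>= 1 step'
    rule from the design table). X: (T, M); thresholds: (M,)."""
    return (X > np.asarray(thresholds)).any(axis=1).astype(int)

# Toy example: 2 metrics with thresholds 0.9 (cpu) and 200 (latency, ms)
X = np.array([[0.50, 120],
              [0.95, 130],   # cpu breach
              [0.60, 250],   # latency breach
              [0.40, 100]])
flag = incident_flag(X, [0.9, 200])
# flag == [0, 1, 1, 0]
```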
```
┌─────────────────────────────────────────────────────────────────────┐
│                            DATA PIPELINE                            │
│                                                                     │
│   Prometheus / Grafana                                              │
│   (cpu, memory, latency,    ->  Ring Buffer    ->  Feature          │
│    error_rate, disk_io,         (W = 60 steps)     Extractor        │
│    net_throughput, ...)                            (statistical     │
│                                                     descriptors or  │
│                                                     raw sequence)   │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                         ┌─────────▼──────────┐
                         │   Trained Model    │
                         │  (RF / GB / BiLSTM)│
                         └─────────┬──────────┘
                                   │  p̂ ∈ [0,1]
                         ┌─────────▼──────────┐
                         │   Threshold Gate   │
                         │      p̂ ≥ τ ?       │
                         └──────┬──────┬──────┘
                                │ YES  │ NO
                    ┌───────────▼┐   ┌─▼──────────┐
                    │   ALERT    │   │ No action  │
                    │ (PagerDuty,│   │ (cooldown  │
                    │   Slack)   │   │ if active) │
                    └────────────┘   └────────────┘
```
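The threshold gate plus cooldown from the diagram above could look like this (a sketch with assumed parameter names; the repository's alerting step may differ):

```python
class ThresholdGate:
    """Fire an alert when p_hat >= tau, then suppress further
    alerts for `cooldown` steps to avoid paging on every scrape
    of a sustained incident."""

    def __init__(self, tau=0.5, cooldown=15):
        self.tau = tau
        self.cooldown = cooldown
        self._step = 0
        self._quiet_until = 0  # alerts suppressed up to this step index

    def __call__(self, p_hat):
        self._step += 1
        if p_hat >= self.tau and self._step > self._quiet_until:
            self._quiet_until = self._step + self.cooldown
            return True   # -> page PagerDuty / Slack
        return False      # no action (or still cooling down)

# One call per scrape: fires at the first breach, stays quiet
# during the cooldown, then fires again once it expires.
gate = ThresholdGate(tau=0.7, cooldown=3)
fired = [gate(p) for p in [0.2, 0.8, 0.9, 0.95, 0.1, 0.85]]
# fired == [False, True, False, False, False, True]
```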
Training pipeline (offline):

```
Raw time series    ->   Sliding-window   ->   Feature matrix   ->   Train / Val / Test
+ incident labels       (W, H params)         (N × 40)              (chronological split)
                              │
                              ▼
                        Sequence tensor  ->   BiLSTM training
                        (N × W × M)
```
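The two branches above, a flat feature matrix for LR/RF/GB and a raw sequence tensor for the BiLSTM, can be sketched as follows (assuming M = 8 metrics and 5 descriptors per metric to reach 40 columns; the actual descriptor set lives in `src/features.py`):

```python
import numpy as np

def window_features(windows):
    """Collapse (N, W, M) windows into a flat feature matrix with
    5 statistical descriptors per metric: mean, std, min, max, and
    a simple last-minus-first trend. With M = 8 metrics this gives
    the N × 40 matrix shown above; only a sketch of the real code."""
    feats = [
        windows.mean(axis=1),
        windows.std(axis=1),
        windows.min(axis=1),
        windows.max(axis=1),
        windows[:, -1, :] - windows[:, 0, :],  # trend over the window
    ]
    return np.concatenate(feats, axis=1)       # (N, 5 * M)

# Same windows feed both model families:
X_seq = np.random.default_rng(0).normal(size=(100, 60, 8))  # BiLSTM input
X_tab = window_features(X_seq)                              # (100, 40)
```

Keeping both representations derived from one windowing step guarantees the tree models and the BiLSTM are evaluated on identical (window, label) pairs.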