tuecoder/Monitoring-Mockup


Incident Prediction with Sliding-Window Time-Series Models

Binary incident prediction from multivariate cloud-monitoring metrics using a sliding-window formulation, trained with scikit-learn and PyTorch.


Quick Start

pip install -r requirements.txt
python train.py                          # trains all models, prints test-set results
jupyter notebook incident_prediction.ipynb   # full interactive walkthrough
python train.py --W 60 --H 15 --n_steps 6000 --seed 42 [--no-lstm]
| Flag        | Default | Meaning                                      |
|-------------|---------|----------------------------------------------|
| `--W`       | 60      | Look-back window size (steps)                |
| `--H`       | 15      | Alert horizon: how far ahead to warn (steps) |
| `--n_steps` | 6000    | Dataset length (steps)                       |
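The README does not show how `train.py` handles these flags; a minimal sketch of the argument parsing, assuming a plain `argparse` setup (flag names are from the table above, everything else is illustrative):

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI flags documented above (internals of train.py assumed)."""
    p = argparse.ArgumentParser(description="Train incident-prediction models")
    p.add_argument("--W", type=int, default=60, help="look-back window size (steps)")
    p.add_argument("--H", type=int, default=15, help="alert horizon (steps)")
    p.add_argument("--n_steps", type=int, default=6000, help="dataset length (steps)")
    p.add_argument("--seed", type=int, default=42, help="RNG seed")
    p.add_argument("--no-lstm", action="store_true", help="skip BiLSTM training")
    return p.parse_args(argv)

# argparse turns "--no-lstm" into the attribute args.no_lstm
args = parse_args(["--W", "30", "--no-lstm"])
print(args.W, args.H, args.no_lstm)   # 30 15 True
```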

Project Structure

.
├── README.md
├── requirements.txt
├── train.py                       ← CLI training + evaluation script
├── incident_prediction.ipynb      ← Primary deliverable (interactive notebook)
└── src/
    ├── data_generation.py         ← Synthetic time-series with incident injection
    ├── features.py                ← Sliding-window construction + feature extraction
    ├── models.py                  ← LR, RF, GB, BiLSTM definitions
    └── evaluation.py              ← Metrics, threshold search, temporal split

Problem Framing

Cloud-operations teams spend much of their time reacting to incidents after those incidents have already degraded user experience. The goal is to shift from reactive to proactive alerting, giving teams advance warning before an incident begins.

Success from a business perspective means:

  • Reducing mean time to detection (MTTD) by issuing alerts earlier.
  • Reducing false-alarm rate to prevent on-call fatigue.
  • Missing as few incidents as possible (keeping recall high).

ML Objective

Translate the business goal into a supervised binary classification problem:

Given the previous W timesteps of M monitoring metrics, output a probability score p̂ ∈ [0, 1] that an incident will begin within the next H timesteps.

An alert fires when p̂ ≥ τ, where threshold τ is tuned on a held-out validation set.
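The threshold tuning can be sketched as a small grid search over τ on held-out validation predictions. This sketch maximizes F1; the criterion actually used in `evaluation.py` is not stated in this README, so treat the objective and function name as assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_val, p_val, grid=None):
    """Grid-search the alert threshold τ on validation data (F1 assumed)."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)   # candidate thresholds
    scores = [f1_score(y_val, p_val >= t, zero_division=0) for t in grid]
    return float(grid[int(np.argmax(scores))])  # first τ with the best score

# toy validation set: probabilities cleanly separate the classes
y_val = np.array([0, 0, 0, 1, 1])
p_val = np.array([0.11, 0.22, 0.38, 0.72, 0.91])
tau = tune_threshold(y_val, p_val)
```

In practice F1 is only one choice; a team optimizing for on-call fatigue might instead fix a maximum false-alarm rate and maximize recall subject to it.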

  ◄────── W steps (look-back) ──────►   ◄── H steps (horizon) ──►
  ──────────────────────────────────    ─────────────────────────
  [t-W, ..., t-2, t-1, t]               [t+1, t+2, ..., t+H]
        input features                  label = 1 if any incident here
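The windowing above can be made concrete with a short NumPy sketch. This is illustrative, not the repo's `features.py` API, and the boundary conventions are simplified (each sample's input is W consecutive steps; its label looks at the H steps immediately after):

```python
import numpy as np

def make_windows(X, incident, W=60, H=15):
    """Turn a (T, M) metric matrix into sliding-window samples.

    Sample i uses X[i : i+W] as input; its label is 1 if any incident
    step falls in the following horizon incident[i+W : i+W+H].
    """
    T = X.shape[0]
    n = T - W - H + 1                       # number of complete samples
    windows = np.stack([X[i:i + W] for i in range(n)])
    labels = np.array([int(incident[i + W:i + W + H].any()) for i in range(n)])
    return windows, labels

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # 200 steps of 4 metrics
incident = np.zeros(200, dtype=int)
incident[120:125] = 1                       # one 5-step incident
Xw, y = make_windows(X, incident)
print(Xw.shape, int(y.sum()))               # (126, 60, 4) 19
```

Note that windows overlap heavily: 19 of the 126 samples see the single injected incident inside their horizon, which is exactly why the split must be chronological rather than shuffled.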

Key Clarifications & Scope

| Question | Decision |
|----------|----------|
| What counts as an incident? | Any period where at least one metric breaches a severity threshold for ≥ 1 step |
| What granularity? | 1-minute scrape interval (standard Grafana/Prometheus default) |
| How early must the alert fire? | Configurable via `H`; default `H` = 15 min |
| Single-service or multi-service? | Single service per model; multi-service via one model per service |
| Real-time or batch? | Real-time: inference on each new scrape |
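The incident definition in the first row reduces to a simple rule over per-metric severity thresholds. A sketch, with purely illustrative threshold values (the repo's actual thresholds live in `data_generation.py` and may differ):

```python
import numpy as np

# Illustrative severity thresholds per metric (assumed, not from the repo).
SEVERITY = {"cpu": 0.95, "error_rate": 0.05, "latency_ms": 500.0}

def incident_mask(metrics: dict) -> np.ndarray:
    """1 where at least one metric breaches its severity threshold (>= 1 step)."""
    breach = np.zeros_like(np.asarray(next(iter(metrics.values()))), dtype=bool)
    for name, series in metrics.items():
        breach |= np.asarray(series) >= SEVERITY[name]
    return breach.astype(int)

m = {"cpu":        [0.50, 0.97, 0.60],
     "error_rate": [0.00, 0.00, 0.20],
     "latency_ms": [100,  120,  90]}
print(incident_mask(m))   # [0 1 1]
```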

High-Level Design

 ┌─────────────────────────────────────────────────────────────────────┐
 │                        DATA PIPELINE                                │
 │                                                                     │
 │  Prometheus / Grafana                                               │
 │  (cpu, memory, latency,  ->  Ring Buffer  ->  Feature               │
 │   error_rate, disk_io,       (W = 60 steps)   Extractor             │
 │   net_throughput, ...)                         (statistical         │
 │                                                descriptors or       │
 │                                                raw sequence)        │
 └──────────────────────────────────┬──────────────────────────────────┘
                                    │
                          ┌─────────▼──────────┐
                          │   Trained Model    │
                          │  (RF / GB / BiLSTM)│
                          └─────────┬──────────┘
                                    │  p̂ ∈ [0,1]
                          ┌─────────▼──────────┐
                          │  Threshold Gate    │
                          │      p̂ ≥ τ ?       │
                          └──────┬──────┬──────┘
                                 │ YES  │ NO
                          ┌──────▼───┐ ┌▼───────────┐
                          │  ALERT   │ │ No action  │
                          │(PagerDuty│ │ (cooldown  │
                          │  Slack)  │ │ if active) │
                          └──────────┘ └────────────┘
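The threshold gate with its cooldown branch might look like the sketch below. The cooldown length and the class shape are assumptions; the repo does not specify how (or whether) suppression is implemented:

```python
class AlertGate:
    """Fires an alert when p_hat >= tau, then stays silent during a cooldown."""

    def __init__(self, tau: float = 0.5, cooldown: int = 15):
        self.tau = tau
        self.cooldown = cooldown      # steps to suppress repeat alerts
        self._remaining = 0           # cooldown steps left

    def step(self, p_hat: float) -> bool:
        """Return True if an alert should fire for this scrape."""
        if self._remaining > 0:       # cooldown active -> no action
            self._remaining -= 1
            return False
        if p_hat >= self.tau:         # here one would notify PagerDuty / Slack
            self._remaining = self.cooldown
            return True
        return False

gate = AlertGate(tau=0.5, cooldown=3)
print([gate.step(p) for p in (0.9, 0.9, 0.9, 0.9, 0.9)])
# [True, False, False, False, True]
```

Without the cooldown, a sustained high p̂ would page once per scrape, which is precisely the on-call fatigue the business goals warn about.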

Training pipeline (offline):

  Raw time series  ->  Sliding-window  ->  Feature matrix  ->  Train / Val / Test
  + incident labels       (W, H params)       (N × 40)             (chronological split)
                                │
                        Sequence tensor  ->  BiLSTM training
                          (N × W × M)
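The chronological split in the last stage matters because shuffling would leak future windows into training. A minimal sketch with assumed 70/15/15 ratios (the repo's `evaluation.py` may use different fractions):

```python
import numpy as np

def temporal_split(X, y, val_frac=0.15, test_frac=0.15):
    """Split windowed data chronologically: earliest -> train, latest -> test."""
    n = len(X)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    n_train = n - n_val - n_test
    return ((X[:n_train],                 y[:n_train]),
            (X[n_train:n_train + n_val],  y[n_train:n_train + n_val]),
            (X[n_train + n_val:],         y[n_train + n_val:]))

X = np.arange(100).reshape(100, 1)
y = np.arange(100)
train, val, test = temporal_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 70 15 15
```

Because adjacent windows overlap by up to W - 1 steps, a stricter variant also drops W windows on each side of a split boundary so no look-back data is shared across splits.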

About

JetBrains task
