From 9f9eea079790fba91a6144fe320405e2471b8afc Mon Sep 17 00:00:00 2001
From: Manas Srivastava
Date: Fri, 5 Jun 2026 19:43:52 +0530
Subject: [PATCH] feat(observability): scale-to-zero metric alert + tile + catalog (Task #54)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Rule 25 artifacts for instant_deploy_scaled_to_zero_total{outcome} + instant_deploy_idle_apps (emitted by worker/internal/jobs/deploy_idle_scaler.go).

- Prom rules (instant-worker-deploy-scale-to-zero group): DeployScaleToZeroWakeFailed (wake_failed > 0 / 15m → P1: app stuck asleep), DeployScaleToZeroScaleDownFailures (scale_failed >= 5 / 30m → P2: savings not landing, no customer impact).
- NR alert: deploy-scale-to-zero-fail.json (wake_failed → CRITICAL).
- Dashboard: two tiles on instanode-reliability.json — asleep-apps billboard + scale-outcome stacked-bar.
- METRICS-CATALOG.md: rows for the counter (lazy, 4 outcomes primed) + the gauge (eager).

All INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED — the series stay 0 and the alerts stay quiet until the feature is canaried on.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 k8s/prometheus-rules.yaml | 42 +++++++
 .../alerts/deploy-scale-to-zero-fail.json | 31 +++++
 .../dashboards/instanode-reliability.json | 108 +++++++++++++-----
 observability/METRICS-CATALOG.md | 2 +
 4 files changed, 154 insertions(+), 29 deletions(-)
 create mode 100644 newrelic/alerts/deploy-scale-to-zero-fail.json

diff --git a/k8s/prometheus-rules.yaml b/k8s/prometheus-rules.yaml
index 14d1661..98a1a51 100644
--- a/k8s/prometheus-rules.yaml
+++ b/k8s/prometheus-rules.yaml
@@ -917,6 +917,48 @@ spec:
 build pod logs in instant-deploy-* namespaces. Source: worker/internal/jobs/deploy_status_reconcile.go.
+ # instant-worker — scale-to-zero idle-scaler (Task #54).
+ # instant_deploy_scaled_to_zero_total{outcome} is emitted by
+ # worker/internal/jobs/deploy_idle_scaler.go.
INERT until an operator sets + # DEPLOY_SCALE_TO_ZERO_ENABLED (default off), so these rules sit quiet until + # the feature is canaried on. wake_failed = a user's app may be stuck asleep + # (P1, user-visible recoverable); scale_failed = a scale-DOWN k8s patch / DB + # flip failed (P2, the row is left untouched + retried next tick). + - name: instant-worker-deploy-scale-to-zero + rules: + - alert: DeployScaleToZeroWakeFailed + expr: | + sum(increase(instant_deploy_scaled_to_zero_total{outcome="wake_failed"}[15m])) > 0 + for: 10m + labels: + severity: critical + service: worker + annotations: + summary: "scale-to-zero wake failures > 0 (15m) — an app may be stuck asleep" + description: | + instant_deploy_scaled_to_zero_total{outcome="wake_failed"} increased over the + last 15m, sustained 10m. A scaled-to-zero app failed to wake (k8s scale-up + error), so a customer request to a sleeping app is returning the ingress + upstream-down response with no recovery. Check instant-worker logs for + `jobs.deploy_idle_scaler` and the api `deploy.wake.scale_failed` lines, and + the health of the instant-deploy-* namespaces. Source: + worker/internal/jobs/deploy_idle_scaler.go + api deploy_wake.go. + - alert: DeployScaleToZeroScaleDownFailures + expr: | + sum(increase(instant_deploy_scaled_to_zero_total{outcome="scale_failed"}[30m])) >= 5 + for: 30m + labels: + severity: warning + service: worker + annotations: + summary: "scale-to-zero scale-DOWN failures >= 5 in 30m" + description: | + instant_deploy_scaled_to_zero_total{outcome="scale_failed"} reached >= 5 in + 30m. The idle-scaler is repeatedly failing to deschedule idle apps (k8s patch + or DB flip error). Apps stay running (no customer impact) but the compute + savings are not landing. Check instant-worker RBAC for deployments/patch and + the k8s API health. Source: worker/internal/jobs/deploy_idle_scaler.go. + # instant-worker — billing reconciler gap detected (Rule 25 sweep 2026-06-04). 
# instant_billing_reconciler_gap_detected_total is the PRIMARY signal for a # dropped Razorpay webhook: each gap = a team whose teams.plan_tier disagrees diff --git a/newrelic/alerts/deploy-scale-to-zero-fail.json b/newrelic/alerts/deploy-scale-to-zero-fail.json new file mode 100644 index 0000000..b2736ed --- /dev/null +++ b/newrelic/alerts/deploy-scale-to-zero-fail.json @@ -0,0 +1,31 @@ +{ + "name": "instant-worker — scale-to-zero wake failures (15m) [Task #54]", + "type": "NRQL", + "description": "Fires when the scale-to-zero idle-scaler reports wake failures. instant_deploy_scaled_to_zero_total{outcome=\"wake_failed\"} is emitted by worker/internal/jobs/deploy_idle_scaler.go (and the api wake endpoint records its own failures). A wake_failed event means a scaled-to-zero app could not be brought back to replicas=1 — a customer request to a sleeping app is returning the ingress upstream-down response with no recovery, so this is P1 (user-visible, recoverable). The whole feature is INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED (default off), so this alert sits quiet until scale-to-zero is canaried on. The query is derivative(...,15 minutes) per outcome (NR ingests the counter as a cumulative monotonic OTLP sum); ABOVE 0 over a 15m window pages on any wake failure. A separate scale_failed outcome (scale-DOWN failures, no customer impact) is a P2 surfaced on the dashboard tile only. 
Source: worker/internal/jobs/deploy_idle_scaler.go; counter DeployScaledToZeroTotal in worker/internal/metrics/metrics.go.", + "enabled": true, + "nrql": { + "query": "SELECT derivative(instant_deploy_scaled_to_zero_total, 15 minutes) FROM Metric WHERE service = 'worker' AND outcome = 'wake_failed' FACET outcome" + }, + "terms": [ + { + "priority": "CRITICAL", + "operator": "ABOVE", + "threshold": 0, + "thresholdDuration": 600, + "thresholdOccurrences": "AT_LEAST_ONCE" + } + ], + "signal": { + "aggregationWindow": 300, + "aggregationMethod": "EVENT_FLOW", + "aggregationDelay": 180, + "fillOption": "STATIC", + "fillValue": 0 + }, + "expiration": { + "expirationDuration": 3600, + "openViolationOnExpiration": false, + "closeViolationsOnExpiration": true + }, + "violationTimeLimitSeconds": 86400 +} diff --git a/newrelic/dashboards/instanode-reliability.json b/newrelic/dashboards/instanode-reliability.json index 71391fd..a4d0475 100644 --- a/newrelic/dashboards/instanode-reliability.json +++ b/newrelic/dashboards/instanode-reliability.json @@ -1,5 +1,5 @@ { - "name": "instanode — reliability", + "name": "instanode \u2014 reliability", "description": "Single-pane reliability view rolling up every new metric shipped in the 2026-05-20 observability sweep: /readyz across all 3 services, propagation_runner queue health, orphan_sweep activity, magic-link rate-limiting, Brevo send/error/webhook funnel, provisioner circuit-breaker state, missing-renderer events, and storage presign throughput. Source: same Metric + Log feeds as the per-surface dashboards. 
Apply via newrelic/apply.sh.", "permissions": "PUBLIC_READ_WRITE", "pages": [ @@ -149,7 +149,7 @@ } }, { - "title": "Orphan sweep — reaped by reason (24h)", + "title": "Orphan sweep \u2014 reaped by reason (24h)", "layout": { "column": 1, "row": 7, @@ -174,7 +174,7 @@ } }, { - "title": "Orphan sweep — reap failures by reason (24h)", + "title": "Orphan sweep \u2014 reap failures by reason (24h)", "layout": { "column": 7, "row": 7, @@ -334,7 +334,7 @@ } }, { - "title": "Email failover outcomes (1h) — fallback_ok=primary degraded (P1), all_failed=email lost (P0)", + "title": "Email failover outcomes (1h) \u2014 fallback_ok=primary degraded (P1), all_failed=email lost (P0)", "layout": { "column": 1, "row": 62, @@ -409,7 +409,7 @@ } }, { - "title": "Billing charge undeliverable (paid, NOT upgraded) — must be 0", + "title": "Billing charge undeliverable (paid, NOT upgraded) \u2014 must be 0", "layout": { "column": 1, "row": 19, @@ -440,7 +440,7 @@ } }, { - "title": "Tier-upgrade TTL promote outcomes (24h) — error must be 0", + "title": "Tier-upgrade TTL promote outcomes (24h) \u2014 error must be 0", "layout": { "column": 1, "row": 25, @@ -490,7 +490,7 @@ } }, { - "title": "Idempotency replay refunds by route (1h) — FINDING API-1", + "title": "Idempotency replay refunds by route (1h) \u2014 FINDING API-1", "layout": { "column": 1, "row": 22, @@ -515,7 +515,7 @@ } }, { - "title": "AUTH-004 synthetic prober — outcomes per leg (1h)", + "title": "AUTH-004 synthetic prober \u2014 outcomes per leg (1h)", "layout": { "column": 1, "row": 25, @@ -540,7 +540,7 @@ } }, { - "title": "AUTH-004 synthetic prober — fails (last 1h, must be 0)", + "title": "AUTH-004 synthetic prober \u2014 fails (last 1h, must be 0)", "layout": { "column": 7, "row": 25, @@ -571,7 +571,7 @@ } }, { - "title": "AUTH-004 synthetic prober — P95 latency per leg (1h)", + "title": "AUTH-004 synthetic prober \u2014 P95 latency per leg (1h)", "layout": { "column": 10, "row": 25, @@ -596,7 +596,7 @@ } }, { - 
"title": "Hourly deploy prober — outcomes per leg (6h)", + "title": "Hourly deploy prober \u2014 outcomes per leg (6h)", "layout": { "column": 1, "row": 28, @@ -621,7 +621,7 @@ } }, { - "title": "Hourly deploy prober — fails (last 6h, must be 0)", + "title": "Hourly deploy prober \u2014 fails (last 6h, must be 0)", "layout": { "column": 7, "row": 28, @@ -652,7 +652,7 @@ } }, { - "title": "Hourly deploy prober — P95 latency per leg (6h)", + "title": "Hourly deploy prober \u2014 P95 latency per leg (6h)", "layout": { "column": 10, "row": 28, @@ -677,7 +677,7 @@ } }, { - "title": "GitHub webhook — received by event+result (6h) [P4 push-to-deploy]", + "title": "GitHub webhook \u2014 received by event+result (6h) [P4 push-to-deploy]", "layout": { "column": 1, "row": 31, @@ -702,7 +702,7 @@ } }, { - "title": "GitHub webhook — bad_signature count (1h, must be 0 in steady state)", + "title": "GitHub webhook \u2014 bad_signature count (1h, must be 0 in steady state)", "layout": { "column": 7, "row": 31, @@ -733,7 +733,7 @@ } }, { - "title": "GitHub push-to-deploy — enqueued vs errors (6h)", + "title": "GitHub push-to-deploy \u2014 enqueued vs errors (6h)", "layout": { "column": 10, "row": 31, @@ -764,7 +764,7 @@ } }, { - "title": "GitHub push-to-deploy — result breakdown (6h)", + "title": "GitHub push-to-deploy \u2014 result breakdown (6h)", "layout": { "column": 1, "row": 34, @@ -789,7 +789,7 @@ } }, { - "title": "GitHub App token mint — result breakdown (6h)", + "title": "GitHub App token mint \u2014 result breakdown (6h)", "layout": { "column": 7, "row": 34, @@ -951,7 +951,7 @@ } }, { - "title": "Postgres pool saturation ratio by service+pool (3h) — alert > 0.8", + "title": "Postgres pool saturation ratio by service+pool (3h) \u2014 alert > 0.8", "layout": { "column": 1, "row": 43, @@ -976,7 +976,7 @@ } }, { - "title": "Postgres pool peak saturation (1h, in_use/max) — alert > 0.8", + "title": "Postgres pool peak saturation (1h, in_use/max) \u2014 alert > 0.8", "layout": 
{ "column": 7, "row": 43, @@ -1038,7 +1038,7 @@ } }, { - "title": "Redis maxmemory regrade failures (6h) — quota not enforced when > 0", + "title": "Redis maxmemory regrade failures (6h) \u2014 quota not enforced when > 0", "layout": { "column": 1, "row": 46, @@ -1063,7 +1063,7 @@ } }, { - "title": "Expiry deprovision failures (6h) — orphan infra accumulating when > 0", + "title": "Expiry deprovision failures (6h) \u2014 orphan infra accumulating when > 0", "layout": { "column": 7, "row": 46, @@ -1113,7 +1113,7 @@ } }, { - "title": "Pool reaper — actions by status/outcome (24h)", + "title": "Pool reaper \u2014 actions by status/outcome (24h)", "layout": { "column": 7, "row": 49, @@ -1138,7 +1138,7 @@ } }, { - "title": "Pool items stuck 'assigned' past 30m — leaked shared infra (must be 0)", + "title": "Pool items stuck 'assigned' past 30m \u2014 leaked shared infra (must be 0)", "layout": { "column": 1, "row": 52, @@ -1163,7 +1163,7 @@ } }, { - "title": "Flow matrix — latest result per flow×actor (synthetic; red=fail)", + "title": "Flow matrix \u2014 latest result per flow\u00d7actor (synthetic; red=fail)", "layout": { "column": 1, "row": 56, @@ -1188,7 +1188,7 @@ } }, { - "title": "Flow matrix — fails by flow (1h, must be 0)", + "title": "Flow matrix \u2014 fails by flow (1h, must be 0)", "layout": { "column": 7, "row": 56, @@ -1213,7 +1213,7 @@ } }, { - "title": "Flow synthetic — leaked reaps (1h, must be 0)", + "title": "Flow synthetic \u2014 leaked reaps (1h, must be 0)", "layout": { "column": 10, "row": 56, @@ -1238,7 +1238,7 @@ } }, { - "title": "Flow matrix — P95 latency per flow (6h)", + "title": "Flow matrix \u2014 P95 latency per flow (6h)", "layout": { "column": 1, "row": 59, @@ -1263,7 +1263,7 @@ } }, { - "title": "Flow matrix — distinct flows reporting (15m; silent-death watch)", + "title": "Flow matrix \u2014 distinct flows reporting (15m; silent-death watch)", "layout": { "column": 7, "row": 59, @@ -1286,6 +1286,56 @@ "ignoreTimeRange": false } } + 
}, + { + "title": "Scale-to-zero \u2014 apps currently asleep (replicas=0)", + "layout": { + "column": 1, + "row": 62, + "width": 3, + "height": 3 + }, + "visualization": { + "id": "viz.billboard" + }, + "rawConfiguration": { + "nrqlQueries": [ + { + "accountIds": [ + 0 + ], + "query": "SELECT latest(instant_deploy_idle_apps) AS 'apps asleep' FROM Metric WHERE service = 'worker' SINCE 10 minutes ago" + } + ], + "platformOptions": { + "ignoreTimeRange": false + } + } + }, + { + "title": "Scale-to-zero actions by outcome (6h; wake_failed/scale_failed must be 0)", + "layout": { + "column": 4, + "row": 62, + "width": 9, + "height": 3 + }, + "visualization": { + "id": "viz.stacked-bar" + }, + "rawConfiguration": { + "nrqlQueries": [ + { + "accountIds": [ + 0 + ], + "query": "SELECT rate(sum(instant_deploy_scaled_to_zero_total), 1 minute) FROM Metric WHERE service = 'worker' FACET outcome TIMESERIES SINCE 6 hours ago" + } + ], + "platformOptions": { + "ignoreTimeRange": false + } + } } ] } diff --git a/observability/METRICS-CATALOG.md b/observability/METRICS-CATALOG.md index 7c830a9..15e536c 100644 --- a/observability/METRICS-CATALOG.md +++ b/observability/METRICS-CATALOG.md @@ -54,6 +54,8 @@ fires. Operators need this so they don't panic when a fresh deploy looks | `instant_entitlement_regraded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts resources successfully re-graded to the entitled cap, provisioner applied=true) | `entitlement-drift-outpacing-regrade.json` (denominator: detected - regraded) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" | | `instant_deploy_job_failed_detected_total` | worker | `reason` | lazy (CounterVec — first observation is a real Kaniko build-Job Failed detection; reason ∈ {DeadlineExceeded, BackoffLimitExceeded, ...}. 
metrics_test forces a label so the metric registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-05-30 incident) | `deploy-job-failed-detected.json` | `DeployJobFailedDetected` (instant-worker-deploy-job-failed group) | "Deploy build-Job failures by reason (6h)", "Deploy build-Job failures (1h, detected; must be 0 in steady state)" | | `instant_billing_reconciler_gap_detected_total` | worker | `direction` | lazy (CounterVec — direction ∈ {upgrade, downgrade}; series materialise on the first detected mismatch between Razorpay subscription state and teams.plan_tier. The primary signal for a dropped Razorpay webhook) | `billing-reconciler-gap-detected.json` | `BillingReconcilerGapDetected` (instant-worker-billing-gap group) | "Billing reconciler gap detected by direction (6h)" | +| `instant_deploy_scaled_to_zero_total` | worker | `outcome` | lazy (CounterVec — outcome ∈ {scaled_down, woke_up, wake_failed, scale_failed}; all four primed in metrics_test so the series register at boot. INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED. scaled_down = idle app descheduled to replicas=0 (~$0 compute, the savings path); wake_failed = app stuck asleep (P1, user-visible); scale_failed = scale-DOWN k8s/DB error, row untouched + retried (P2). Task #54) | `deploy-scale-to-zero-fail.json` (wake_failed) | `DeployScaleToZeroWakeFailed` + `DeployScaleToZeroScaleDownFailures` (instant-worker-deploy-scale-to-zero group) | "Scale-to-zero actions by outcome (6h; wake_failed/scale_failed must be 0)" | +| `instant_deploy_idle_apps` | worker | (none) | **eager** (Gauge — sampled at the end of every idle-scaler tick; the count of deployments currently scaled_to_zero=true. Headline "how much compute scale-to-zero is reclaiming" signal. Stays 0 until DEPLOY_SCALE_TO_ZERO_ENABLED is on. 
Task #54) | (no standalone alert — capacity signal, not a fault) | (no standalone rule) | "Scale-to-zero — apps currently asleep (replicas=0)" | | `instant_pg_pool_in_use` / `instant_pg_pool_max` | api + worker + provisioner | `pool` | **eager** (GaugeVec — sampled every 5s by each process's pool-stats exporter; `pool` label e.g. `platform_db`. Saturation ratio = in_use/max. Wave-3 chaos-verify 2026-05-21) | `pg-pool-saturation.json` | `PGPoolSaturation` (instant-pg-pool group) | "Postgres pool saturation ratio by service+pool (3h) — alert > 0.8", "Postgres pool peak saturation (1h, in_use/max) — alert > 0.8" | | `instant_redis_maxmemory_failed_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; increments when the A4 reconciler's provisioner RegradeResource gRPC call fails. Distinct from `_skipped_total` soft-skips. Quota not enforced on dedicated Redis pods when > 0) | `redis-maxmemory-regrade-failed.json` | `RedisMaxmemoryRegradeFailed` (instant-worker-redis-maxmemory group) | "Redis maxmemory regrade failures (6h) — quota not enforced when > 0" | | `instant_expire_deprovision_failed_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; increments when the 24h-TTL reaper's provisioner DeprovisionResource call errors. Per MR-P0-1a the row is left reapable for retry — orphan infra accumulating when sustained > 0) | `expire-deprovision-failed.json` | `ExpireDeprovisionFailed` (instant-worker-expire-deprovision group) | "Expiry deprovision failures (6h) — orphan infra accumulating when > 0" |