From 9f9eea079790fba91a6144fe320405e2471b8afc Mon Sep 17 00:00:00 2001
From: Manas Srivastava <mastermanas805@gmail.com>
Date: Fri, 5 Jun 2026 19:43:52 +0530
Subject: [PATCH] feat(observability): scale-to-zero metric alert + tile +
 catalog (Task #54)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Rule 25 artifacts for instant_deploy_scaled_to_zero_total{outcome} +
instant_deploy_idle_apps (emitted by worker/internal/jobs/deploy_idle_scaler.go).

- Prom rules (instant-worker-deploy-scale-to-zero group):
  DeployScaleToZeroWakeFailed (wake_failed > 0 / 15m → P1: app stuck asleep),
  DeployScaleToZeroScaleDownFailures (scale_failed >= 5 / 30m → P2: savings not
  landing, no customer impact).
- NR alert: deploy-scale-to-zero-fail.json (wake_failed → CRITICAL).
- Dashboard: two tiles on instanode-reliability.json — asleep-apps billboard +
  scale-outcome stacked-bar.
- METRICS-CATALOG.md: rows for the counter (lazy, 4 outcomes primed) + the
  gauge (eager).

All INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED — the series stay
0 and the alerts stay quiet until the feature is canaried on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 k8s/prometheus-rules.yaml                     |  42 +++++++
 .../alerts/deploy-scale-to-zero-fail.json     |  31 +++++
 .../dashboards/instanode-reliability.json     | 108 +++++++++++++-----
 observability/METRICS-CATALOG.md              |   2 +
 4 files changed, 154 insertions(+), 29 deletions(-)
 create mode 100644 newrelic/alerts/deploy-scale-to-zero-fail.json

diff --git a/k8s/prometheus-rules.yaml b/k8s/prometheus-rules.yaml
index 14d1661..98a1a51 100644
--- a/k8s/prometheus-rules.yaml
+++ b/k8s/prometheus-rules.yaml
@@ -917,6 +917,48 @@ spec:
               build pod logs in instant-deploy-* namespaces. Source:
               worker/internal/jobs/deploy_status_reconcile.go.
 
+    # instant-worker — scale-to-zero idle-scaler (Task #54).
+    # instant_deploy_scaled_to_zero_total{outcome} is emitted by
+    # worker/internal/jobs/deploy_idle_scaler.go. INERT until an operator sets
+    # DEPLOY_SCALE_TO_ZERO_ENABLED (default off), so these rules sit quiet until
+    # the feature is canaried on. wake_failed = a user's app may be stuck asleep
+    # (P1, user-visible, recoverable); scale_failed = a scale-DOWN k8s patch / DB
+    # flip failed (P2, the row is left untouched + retried next tick).
+    - name: instant-worker-deploy-scale-to-zero
+      rules:
+        - alert: DeployScaleToZeroWakeFailed
+          expr: |
+            sum(increase(instant_deploy_scaled_to_zero_total{outcome="wake_failed"}[15m])) > 0
+          for: 10m
+          labels:
+            severity: critical
+            service: worker
+          annotations:
+            summary: "scale-to-zero wake failures > 0 (15m) — an app may be stuck asleep"
+            description: |
+              instant_deploy_scaled_to_zero_total{outcome="wake_failed"} increased over the
+              last 15m, sustained 10m. A scaled-to-zero app failed to wake (k8s scale-up
+              error), so a customer request to a sleeping app is returning the ingress
+              upstream-down response with no recovery. Check instant-worker logs for
+              `jobs.deploy_idle_scaler` and the api `deploy.wake.scale_failed` lines, and
+              the health of the instant-deploy-* namespaces. Source:
+              worker/internal/jobs/deploy_idle_scaler.go + api deploy_wake.go.
+        - alert: DeployScaleToZeroScaleDownFailures
+          expr: |
+            sum(increase(instant_deploy_scaled_to_zero_total{outcome="scale_failed"}[30m])) >= 5
+          for: 30m
+          labels:
+            severity: warning
+            service: worker
+          annotations:
+            summary: "scale-to-zero scale-DOWN failures >= 5 in 30m"
+            description: |
+              instant_deploy_scaled_to_zero_total{outcome="scale_failed"} stayed >= 5 per
+              30m window for 30m. The idle-scaler keeps failing to deschedule idle apps (k8s patch
+              or DB flip error). Apps stay running (no customer impact) but the compute
+              savings are not landing. Check instant-worker RBAC for deployments/patch and
+              the k8s API health. Source: worker/internal/jobs/deploy_idle_scaler.go.
+
     # instant-worker — billing reconciler gap detected (Rule 25 sweep 2026-06-04).
     # instant_billing_reconciler_gap_detected_total is the PRIMARY signal for a
     # dropped Razorpay webhook: each gap = a team whose teams.plan_tier disagrees
diff --git a/newrelic/alerts/deploy-scale-to-zero-fail.json b/newrelic/alerts/deploy-scale-to-zero-fail.json
new file mode 100644
index 0000000..b2736ed
--- /dev/null
+++ b/newrelic/alerts/deploy-scale-to-zero-fail.json
@@ -0,0 +1,31 @@
+{
+  "name": "instant-worker — scale-to-zero wake failures (15m) [Task #54]",
+  "type": "NRQL",
+  "description": "Fires when the scale-to-zero idle-scaler reports wake failures. instant_deploy_scaled_to_zero_total{outcome=\"wake_failed\"} is emitted by worker/internal/jobs/deploy_idle_scaler.go (and the api wake endpoint records its own failures). A wake_failed event means a scaled-to-zero app could not be brought back to replicas=1 — a customer request to a sleeping app is returning the ingress upstream-down response with no recovery, so this is P1 (user-visible, recoverable). The whole feature is INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED (default off), so this alert sits quiet until scale-to-zero is canaried on. The query is derivative(...,15 minutes) per outcome (NR ingests the counter as a cumulative monotonic OTLP sum); ABOVE 0 over a 15m window pages on any wake failure. A separate scale_failed outcome (scale-DOWN failures, no customer impact) is a P2 surfaced on the dashboard tile only. Source: worker/internal/jobs/deploy_idle_scaler.go; counter DeployScaledToZeroTotal in worker/internal/metrics/metrics.go.",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT derivative(instant_deploy_scaled_to_zero_total, 15 minutes) FROM Metric WHERE service = 'worker' AND outcome = 'wake_failed' FACET outcome"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "ABOVE",
+      "threshold": 0,
+      "thresholdDuration": 600,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 300,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 180,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 3600,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
diff --git a/newrelic/dashboards/instanode-reliability.json b/newrelic/dashboards/instanode-reliability.json
index 71391fd..a4d0475 100644
--- a/newrelic/dashboards/instanode-reliability.json
+++ b/newrelic/dashboards/instanode-reliability.json
@@ -1,5 +1,5 @@
 {
-  "name": "instanode — reliability",
+  "name": "instanode \u2014 reliability",
   "description": "Single-pane reliability view rolling up every new metric shipped in the 2026-05-20 observability sweep: /readyz across all 3 services, propagation_runner queue health, orphan_sweep activity, magic-link rate-limiting, Brevo send/error/webhook funnel, provisioner circuit-breaker state, missing-renderer events, and storage presign throughput. Source: same Metric + Log feeds as the per-surface dashboards. Apply via newrelic/apply.sh.",
   "permissions": "PUBLIC_READ_WRITE",
   "pages": [
@@ -149,7 +149,7 @@
           }
         },
         {
-          "title": "Orphan sweep — reaped by reason (24h)",
+          "title": "Orphan sweep \u2014 reaped by reason (24h)",
           "layout": {
             "column": 1,
             "row": 7,
@@ -174,7 +174,7 @@
           }
         },
         {
-          "title": "Orphan sweep — reap failures by reason (24h)",
+          "title": "Orphan sweep \u2014 reap failures by reason (24h)",
           "layout": {
             "column": 7,
             "row": 7,
@@ -334,7 +334,7 @@
           }
         },
         {
-          "title": "Email failover outcomes (1h) — fallback_ok=primary degraded (P1), all_failed=email lost (P0)",
+          "title": "Email failover outcomes (1h) \u2014 fallback_ok=primary degraded (P1), all_failed=email lost (P0)",
           "layout": {
             "column": 1,
             "row": 62,
@@ -409,7 +409,7 @@
           }
         },
         {
-          "title": "Billing charge undeliverable (paid, NOT upgraded) — must be 0",
+          "title": "Billing charge undeliverable (paid, NOT upgraded) \u2014 must be 0",
           "layout": {
             "column": 1,
             "row": 19,
@@ -440,7 +440,7 @@
           }
         },
         {
-          "title": "Tier-upgrade TTL promote outcomes (24h) — error must be 0",
+          "title": "Tier-upgrade TTL promote outcomes (24h) \u2014 error must be 0",
           "layout": {
             "column": 1,
             "row": 25,
@@ -490,7 +490,7 @@
           }
         },
         {
-          "title": "Idempotency replay refunds by route (1h) — FINDING API-1",
+          "title": "Idempotency replay refunds by route (1h) \u2014 FINDING API-1",
           "layout": {
             "column": 1,
             "row": 22,
@@ -515,7 +515,7 @@
           }
         },
         {
-          "title": "AUTH-004 synthetic prober — outcomes per leg (1h)",
+          "title": "AUTH-004 synthetic prober \u2014 outcomes per leg (1h)",
           "layout": {
             "column": 1,
             "row": 25,
@@ -540,7 +540,7 @@
           }
         },
         {
-          "title": "AUTH-004 synthetic prober — fails (last 1h, must be 0)",
+          "title": "AUTH-004 synthetic prober \u2014 fails (last 1h, must be 0)",
           "layout": {
             "column": 7,
             "row": 25,
@@ -571,7 +571,7 @@
           }
         },
         {
-          "title": "AUTH-004 synthetic prober — P95 latency per leg (1h)",
+          "title": "AUTH-004 synthetic prober \u2014 P95 latency per leg (1h)",
           "layout": {
             "column": 10,
             "row": 25,
@@ -596,7 +596,7 @@
           }
         },
         {
-          "title": "Hourly deploy prober — outcomes per leg (6h)",
+          "title": "Hourly deploy prober \u2014 outcomes per leg (6h)",
           "layout": {
             "column": 1,
             "row": 28,
@@ -621,7 +621,7 @@
           }
         },
         {
-          "title": "Hourly deploy prober — fails (last 6h, must be 0)",
+          "title": "Hourly deploy prober \u2014 fails (last 6h, must be 0)",
           "layout": {
             "column": 7,
             "row": 28,
@@ -652,7 +652,7 @@
           }
         },
         {
-          "title": "Hourly deploy prober — P95 latency per leg (6h)",
+          "title": "Hourly deploy prober \u2014 P95 latency per leg (6h)",
           "layout": {
             "column": 10,
             "row": 28,
@@ -677,7 +677,7 @@
           }
         },
         {
-          "title": "GitHub webhook — received by event+result (6h) [P4 push-to-deploy]",
+          "title": "GitHub webhook \u2014 received by event+result (6h) [P4 push-to-deploy]",
           "layout": {
             "column": 1,
             "row": 31,
@@ -702,7 +702,7 @@
           }
         },
         {
-          "title": "GitHub webhook — bad_signature count (1h, must be 0 in steady state)",
+          "title": "GitHub webhook \u2014 bad_signature count (1h, must be 0 in steady state)",
           "layout": {
             "column": 7,
             "row": 31,
@@ -733,7 +733,7 @@
           }
         },
         {
-          "title": "GitHub push-to-deploy — enqueued vs errors (6h)",
+          "title": "GitHub push-to-deploy \u2014 enqueued vs errors (6h)",
           "layout": {
             "column": 10,
             "row": 31,
@@ -764,7 +764,7 @@
           }
         },
         {
-          "title": "GitHub push-to-deploy — result breakdown (6h)",
+          "title": "GitHub push-to-deploy \u2014 result breakdown (6h)",
           "layout": {
             "column": 1,
             "row": 34,
@@ -789,7 +789,7 @@
           }
         },
         {
-          "title": "GitHub App token mint — result breakdown (6h)",
+          "title": "GitHub App token mint \u2014 result breakdown (6h)",
           "layout": {
             "column": 7,
             "row": 34,
@@ -951,7 +951,7 @@
           }
         },
         {
-          "title": "Postgres pool saturation ratio by service+pool (3h) — alert > 0.8",
+          "title": "Postgres pool saturation ratio by service+pool (3h) \u2014 alert > 0.8",
           "layout": {
             "column": 1,
             "row": 43,
@@ -976,7 +976,7 @@
           }
         },
         {
-          "title": "Postgres pool peak saturation (1h, in_use/max) — alert > 0.8",
+          "title": "Postgres pool peak saturation (1h, in_use/max) \u2014 alert > 0.8",
           "layout": {
             "column": 7,
             "row": 43,
@@ -1038,7 +1038,7 @@
           }
         },
         {
-          "title": "Redis maxmemory regrade failures (6h) — quota not enforced when > 0",
+          "title": "Redis maxmemory regrade failures (6h) \u2014 quota not enforced when > 0",
           "layout": {
             "column": 1,
             "row": 46,
@@ -1063,7 +1063,7 @@
           }
         },
         {
-          "title": "Expiry deprovision failures (6h) — orphan infra accumulating when > 0",
+          "title": "Expiry deprovision failures (6h) \u2014 orphan infra accumulating when > 0",
           "layout": {
             "column": 7,
             "row": 46,
@@ -1113,7 +1113,7 @@
           }
         },
         {
-          "title": "Pool reaper — actions by status/outcome (24h)",
+          "title": "Pool reaper \u2014 actions by status/outcome (24h)",
           "layout": {
             "column": 7,
             "row": 49,
@@ -1138,7 +1138,7 @@
           }
         },
         {
-          "title": "Pool items stuck 'assigned' past 30m — leaked shared infra (must be 0)",
+          "title": "Pool items stuck 'assigned' past 30m \u2014 leaked shared infra (must be 0)",
           "layout": {
             "column": 1,
             "row": 52,
@@ -1163,7 +1163,7 @@
           }
         },
         {
-          "title": "Flow matrix — latest result per flow×actor (synthetic; red=fail)",
+          "title": "Flow matrix \u2014 latest result per flow\u00d7actor (synthetic; red=fail)",
           "layout": {
             "column": 1,
             "row": 56,
@@ -1188,7 +1188,7 @@
           }
         },
         {
-          "title": "Flow matrix — fails by flow (1h, must be 0)",
+          "title": "Flow matrix \u2014 fails by flow (1h, must be 0)",
           "layout": {
             "column": 7,
             "row": 56,
@@ -1213,7 +1213,7 @@
           }
         },
         {
-          "title": "Flow synthetic — leaked reaps (1h, must be 0)",
+          "title": "Flow synthetic \u2014 leaked reaps (1h, must be 0)",
           "layout": {
             "column": 10,
             "row": 56,
@@ -1238,7 +1238,7 @@
           }
         },
         {
-          "title": "Flow matrix — P95 latency per flow (6h)",
+          "title": "Flow matrix \u2014 P95 latency per flow (6h)",
           "layout": {
             "column": 1,
             "row": 59,
@@ -1263,7 +1263,7 @@
           }
         },
         {
-          "title": "Flow matrix — distinct flows reporting (15m; silent-death watch)",
+          "title": "Flow matrix \u2014 distinct flows reporting (15m; silent-death watch)",
           "layout": {
             "column": 7,
             "row": 59,
@@ -1286,6 +1286,56 @@
               "ignoreTimeRange": false
             }
           }
+        },
+        {
+          "title": "Scale-to-zero \u2014 apps currently asleep (replicas=0)",
+          "layout": {
+            "column": 1,
+            "row": 62,
+            "width": 3,
+            "height": 3
+          },
+          "visualization": {
+            "id": "viz.billboard"
+          },
+          "rawConfiguration": {
+            "nrqlQueries": [
+              {
+                "accountIds": [
+                  0
+                ],
+                "query": "SELECT latest(instant_deploy_idle_apps) AS 'apps asleep' FROM Metric WHERE service = 'worker' SINCE 10 minutes ago"
+              }
+            ],
+            "platformOptions": {
+              "ignoreTimeRange": false
+            }
+          }
+        },
+        {
+          "title": "Scale-to-zero actions by outcome (6h; wake_failed/scale_failed must be 0)",
+          "layout": {
+            "column": 4,
+            "row": 62,
+            "width": 9,
+            "height": 3
+          },
+          "visualization": {
+            "id": "viz.stacked-bar"
+          },
+          "rawConfiguration": {
+            "nrqlQueries": [
+              {
+                "accountIds": [
+                  0
+                ],
+                "query": "SELECT rate(sum(instant_deploy_scaled_to_zero_total), 1 minute) FROM Metric WHERE service = 'worker' FACET outcome TIMESERIES SINCE 6 hours ago"
+              }
+            ],
+            "platformOptions": {
+              "ignoreTimeRange": false
+            }
+          }
         }
       ]
     }
diff --git a/observability/METRICS-CATALOG.md b/observability/METRICS-CATALOG.md
index 7c830a9..15e536c 100644
--- a/observability/METRICS-CATALOG.md
+++ b/observability/METRICS-CATALOG.md
@@ -54,6 +54,8 @@ fires. Operators need this so they don't panic when a fresh deploy looks
 | `instant_entitlement_regraded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts resources successfully re-graded to the entitled cap, provisioner applied=true) | `entitlement-drift-outpacing-regrade.json` (denominator: detected - regraded) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" |
 | `instant_deploy_job_failed_detected_total` | worker | `reason` | lazy (CounterVec — first observation is a real Kaniko build-Job Failed detection; reason ∈ {DeadlineExceeded, BackoffLimitExceeded, ...}. metrics_test forces a label so the metric registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-05-30 incident) | `deploy-job-failed-detected.json` | `DeployJobFailedDetected` (instant-worker-deploy-job-failed group) | "Deploy build-Job failures by reason (6h)", "Deploy build-Job failures (1h, detected; must be 0 in steady state)" |
 | `instant_billing_reconciler_gap_detected_total` | worker | `direction` | lazy (CounterVec — direction ∈ {upgrade, downgrade}; series materialise on the first detected mismatch between Razorpay subscription state and teams.plan_tier. The primary signal for a dropped Razorpay webhook) | `billing-reconciler-gap-detected.json` | `BillingReconcilerGapDetected` (instant-worker-billing-gap group) | "Billing reconciler gap detected by direction (6h)" |
+| `instant_deploy_scaled_to_zero_total` | worker | `outcome` | lazy (CounterVec — outcome ∈ {scaled_down, woke_up, wake_failed, scale_failed}; all four primed in metrics_test so the series register at boot. INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED. scaled_down = idle app descheduled to replicas=0 (~$0 compute, the savings path); wake_failed = app stuck asleep (P1, user-visible); scale_failed = scale-DOWN k8s/DB error, row untouched + retried (P2). Task #54) | `deploy-scale-to-zero-fail.json` (wake_failed) | `DeployScaleToZeroWakeFailed` + `DeployScaleToZeroScaleDownFailures` (instant-worker-deploy-scale-to-zero group) | "Scale-to-zero actions by outcome (6h; wake_failed/scale_failed must be 0)" |
+| `instant_deploy_idle_apps` | worker | (none) | **eager** (Gauge — sampled at the end of every idle-scaler tick; the count of deployments currently scaled_to_zero=true. Headline "how much compute scale-to-zero is reclaiming" signal. Stays 0 until DEPLOY_SCALE_TO_ZERO_ENABLED is on. Task #54) | (no standalone alert — capacity signal, not a fault) | (no standalone rule) | "Scale-to-zero — apps currently asleep (replicas=0)" |
 | `instant_pg_pool_in_use` / `instant_pg_pool_max` | api + worker + provisioner | `pool` | **eager** (GaugeVec — sampled every 5s by each process's pool-stats exporter; `pool` label e.g. `platform_db`. Saturation ratio = in_use/max. Wave-3 chaos-verify 2026-05-21) | `pg-pool-saturation.json` | `PGPoolSaturation` (instant-pg-pool group) | "Postgres pool saturation ratio by service+pool (3h) — alert > 0.8", "Postgres pool peak saturation (1h, in_use/max) — alert > 0.8" |
 | `instant_redis_maxmemory_failed_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; increments when the A4 reconciler's provisioner RegradeResource gRPC call fails. Distinct from `_skipped_total` soft-skips. Quota not enforced on dedicated Redis pods when > 0) | `redis-maxmemory-regrade-failed.json` | `RedisMaxmemoryRegradeFailed` (instant-worker-redis-maxmemory group) | "Redis maxmemory regrade failures (6h) — quota not enforced when > 0" |
 | `instant_expire_deprovision_failed_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; increments when the 24h-TTL reaper's provisioner DeprovisionResource call errors. Per MR-P0-1a the row is left reapable for retry — orphan infra accumulating when sustained > 0) | `expire-deprovision-failed.json` | `ExpireDeprovisionFailed` (instant-worker-expire-deprovision group) | "Expiry deprovision failures (6h) — orphan infra accumulating when > 0" |