InstaNode-dev · mastermanas805 · Jun 8, 2026 · Jun 8, 2026
diff --git a/k8s/prometheus-rules.yaml b/k8s/prometheus-rules.yaml
@@ -981,6 +981,42 @@ spec:
               build pod logs in instant-deploy-* namespaces. Source:
               worker/internal/jobs/deploy_status_reconcile.go.
 
+    # instant-worker — runtime rollout-failure detector (2026-06-08). Twin of the
+    # build-Job detector above, for the RUNTIME side: the build SUCCEEDED but the
+    # produced image can't start (CreateContainerError "no command specified",
+    # ImagePullBackOff, CrashLoopBackOff). deploy_status_reconcile now maps a
+    # ProgressDeadlineExceeded rollout with no available replica to 'failed'
+    # (was 'deploying' forever). instant_deploy_runtime_failed_detected_total
+    # {reason} is emitted by worker/internal/jobs/deploy_status_reconcile.go.
+    - name: instant-worker-deploy-runtime-failed
+      rules:
+        - alert: DeployRuntimeFailedDetected
+          # Absolute count over a 30m window (same posture as DeployJobFailedDetected).
+          # A single broken-image deploy is per-customer and expected at a low rate;
+          # >= 3 in 30m means a PLATFORM build defect is producing unstartable images
+          # (e.g. empty 474-byte images with no CMD — the 2026-06-08 live symptom)
+          # or a registry/pull-secret regression. Mirrors
+          # newrelic/alerts/deploy-runtime-failed-detected.json (per-reason FACET there).
+          expr: |
+            sum(increase(instant_deploy_runtime_failed_detected_total[30m])) >= 3
+          for: 5m
+          labels:
+            severity: critical
+            service: worker
+          annotations:
+            summary: "deploys failing to start at runtime — >= 3 in 30m, likely a platform image defect"
+            description: |
+              sum(increase(instant_deploy_runtime_failed_detected_total[30m])) >= 3.
+              deploy_status_reconcile is flipping rollouts to 'failed' because they
+              exceeded their progress deadline with no available replica — the pods were
+              created but their containers can't start (broken built image: no
+              CMD/ENTRYPOINT, ImagePullBackOff, or CrashLoopBackOff). A platform-wide
+              spike means the build pipeline is producing unstartable images (check for
+              empty/near-zero-byte images in ghcr.io/.../instant-userapp/*) or a
+              registry/pull-secret regression. Per-customer this is a one-off bad image.
+              Source: worker/internal/jobs/deploy_status_reconcile.go (counter
+              DeployRuntimeFailedDetectedTotal); user-facing reason=StartFailed.
+
     # instant-worker — scale-to-zero idle-scaler (Task #54).
     # instant_deploy_scaled_to_zero_total{outcome} is emitted by
     # worker/internal/jobs/deploy_idle_scaler.go. INERT until an operator sets
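
Reviewer note: the rule comment above describes the reconcile change as "a ProgressDeadlineExceeded rollout with no available replica maps to 'failed'". A minimal Go sketch of that mapping follows. It is not the code in worker/internal/jobs/deploy_status_reconcile.go: the `Condition` and `Rollout` types, the `statusFor` function, the `runtimeFailed` map and the "healthy" status name are all hypothetical stand-ins, and only the condition values (Progressing=False, reason ProgressDeadlineExceeded), the 'failed'/'deploying' statuses and the `progress_deadline_exceeded` label value come from this PR.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down stand-in for a Kubernetes
// Deployment condition (type, status, reason).
type Condition struct {
	Type   string
	Status string
	Reason string
}

// Rollout is a hypothetical snapshot of what a reconciler would read.
type Rollout struct {
	AvailableReplicas int
	Conditions        []Condition
}

// runtimeFailed stands in for the
// instant_deploy_runtime_failed_detected_total{reason} counter.
var runtimeFailed = map[string]int{}

// statusFor returns "failed" when the rollout exceeded its progress deadline
// and no replica is available. Otherwise it returns "healthy" (a hypothetical
// name) if any replica is available, or "deploying" if none is yet.
func statusFor(r Rollout) string {
	for _, c := range r.Conditions {
		if c.Type == "Progressing" && c.Status == "False" &&
			c.Reason == "ProgressDeadlineExceeded" && r.AvailableReplicas == 0 {
			runtimeFailed["progress_deadline_exceeded"]++
			return "failed"
		}
	}
	if r.AvailableReplicas > 0 {
		return "healthy"
	}
	return "deploying"
}

func main() {
	stuck := Rollout{Conditions: []Condition{
		{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"},
	}}
	fmt.Println(statusFor(stuck), runtimeFailed["progress_deadline_exceeded"])
}
```

In this sketch a rollout that exceeded its deadline while an older replica is still available is not marked failed, which matches the "no available replica" wording in the comment; whether the real reconciler treats that case the same way is an assumption.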

diff --git a/newrelic/alerts/deploy-runtime-failed-detected.json b/newrelic/alerts/deploy-runtime-failed-detected.json
@@ -0,0 +1,31 @@
+{
+  "name": "instant-worker — deploy runtime start-failures detected (30m) [silent-deploy-failure backstop]",
+  "type": "NRQL",
+  "description": "Fires when deploy_status_reconcile flips rollouts to 'failed' because they exceeded their progress deadline with no available replica — the RUNTIME twin of deploy-job-failed-detected.json (CLAUDE.md rule 27). The build SUCCEEDED but the produced image can't start: CreateContainerError ('no command specified' from an empty/near-zero-byte image), ImagePullBackOff, or CrashLoopBackOff. Before the 2026-06-08 fix these deploys reported 'deploying' forever (deploymentStatusFromK8s only checked DeploymentReplicaFailure + replica counts), so no autopsy ran and the user got no failure email; now a Progressing=False/ProgressDeadlineExceeded rollout maps to 'failed' (user-facing reason=StartFailed). Per-customer this is a one-off bad image (low rate, expected). A platform-wide spike means the build pipeline is producing unstartable images (the 2026-06-08 live symptom was 6 deploys at once pulling 474-byte empty images) or a registry/pull-secret regression. P1: user-visible, recoverable. The query is derivative(...,30 minutes) per reason (NR ingests the counter as a cumulative monotonic OTLP sum). ABOVE_OR_EQUALS 3 over a 30m window pages on a small cluster of runtime start-failures while tolerating one or two individual bad images. Source: worker/internal/jobs/deploy_status_reconcile.go; counter DeployRuntimeFailedDetectedTotal in worker/internal/metrics/metrics.go.",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT derivative(instant_deploy_runtime_failed_detected_total, 30 minutes) FROM Metric WHERE service = 'worker' FACET reason"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "ABOVE_OR_EQUALS",
+      "threshold": 3,
+      "thresholdDuration": 300,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 300,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 180,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 3600,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
diff --git a/newrelic/dashboards/instanode-reliability.json b/newrelic/dashboards/instanode-reliability.json
@@ -925,6 +925,62 @@
             }
           }
         },
+        {
+          "title": "Deploy runtime start-failures (1h, detected; must be 0 in steady state)",
+          "layout": {
+            "column": 1,
+            "row": 78,
+            "width": 3,
+            "height": 3
+          },
+          "visualization": {
+            "id": "viz.billboard"
+          },
+          "rawConfiguration": {
+            "nrqlQueries": [
+              {
+                "accountIds": [
+                  0
+                ],
+                "query": "SELECT derivative(instant_deploy_runtime_failed_detected_total, 1 hour) AS 'runtime start-failures (1h)' FROM Metric WHERE service = 'worker' SINCE 1 hour ago"
+              }
+            ],
+            "platformOptions": {
+              "ignoreTimeRange": false
+            },
+            "thresholds": [
+              {
+                "alertSeverity": "WARNING",
+                "value": 1
+              }
+            ]
+          }
+        },
+        {
+          "title": "Deploy runtime start-failures by reason (6h)",
+          "layout": {
+            "column": 4,
+            "row": 78,
+            "width": 6,
+            "height": 3
+          },
+          "visualization": {
+            "id": "viz.stacked-bar"
+          },
+          "rawConfiguration": {
+            "nrqlQueries": [
+              {
+                "accountIds": [
+                  0
+                ],
+                "query": "SELECT rate(sum(instant_deploy_runtime_failed_detected_total), 1 minute) FROM Metric WHERE service = 'worker' FACET reason TIMESERIES SINCE 6 hours ago"
+              }
+            ],
+            "platformOptions": {
+              "ignoreTimeRange": false
+            }
+          }
+        },
         {
           "title": "Billing reconciler gap detected by direction (6h)",
           "layout": {

diff --git a/observability/METRICS-CATALOG.md b/observability/METRICS-CATALOG.md
@@ -55,6 +55,7 @@ fires. Operators need this so they don't panic when a fresh deploy looks
 | `instant_entitlement_drift_detected_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts Postgres resources found drifted below their team's plan tier per sweep) | `entitlement-drift-outpacing-regrade.json` (paired with `_regraded_total`) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" |
 | `instant_entitlement_regraded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts resources successfully re-graded to the entitled cap, provisioner applied=true) | `entitlement-drift-outpacing-regrade.json` (denominator: detected - regraded) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" |
 | `instant_deploy_job_failed_detected_total` | worker | `reason` | lazy (CounterVec — first observation is a real Kaniko build-Job Failed detection; reason ∈ {DeadlineExceeded, BackoffLimitExceeded, ...}. metrics_test forces a label so the metric registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-05-30 incident) | `deploy-job-failed-detected.json` | `DeployJobFailedDetected` (instant-worker-deploy-job-failed group) | "Deploy build-Job failures by reason (6h)", "Deploy build-Job failures (1h, detected; must be 0 in steady state)" |
+| `instant_deploy_runtime_failed_detected_total` | worker | `reason` | lazy (CounterVec — runtime twin of `_job_failed_detected_total`; first observation is a rollout flipped to failed on ProgressDeadlineExceeded with no available replica (broken image can't start: CreateContainerError "no command specified" / ImagePullBackOff / CrashLoopBackOff). reason currently only "progress_deadline_exceeded"; metrics_test primes it so it registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-06-08) | `deploy-runtime-failed-detected.json` | `DeployRuntimeFailedDetected` (instant-worker-deploy-runtime-failed group) | "Deploy runtime start-failures by reason (6h)", "Deploy runtime start-failures (1h, detected; must be 0 in steady state)" |
 | `instant_billing_reconciler_gap_detected_total` | worker | `direction` | lazy (CounterVec — direction ∈ {upgrade, downgrade}; series materialise on the first detected mismatch between Razorpay subscription state and teams.plan_tier. The primary signal for a dropped Razorpay webhook) | `billing-reconciler-gap-detected.json` | `BillingReconcilerGapDetected` (instant-worker-billing-gap group) | "Billing reconciler gap detected by direction (6h)" |
 | `instant_deploy_scaled_to_zero_total` | worker | `outcome` | lazy (CounterVec — outcome ∈ {scaled_down, woke_up, wake_failed, scale_failed}; all four primed in metrics_test so the series register at boot. INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED. scaled_down = idle app descheduled to replicas=0 (~$0 compute, the savings path); wake_failed = app stuck asleep (P1, user-visible); scale_failed = scale-DOWN k8s/DB error, row untouched + retried (P2). Task #54) | `deploy-scale-to-zero-fail.json` (wake_failed) | `DeployScaleToZeroWakeFailed` + `DeployScaleToZeroScaleDownFailures` (instant-worker-deploy-scale-to-zero group) | "Scale-to-zero actions by outcome (6h; wake_failed/scale_failed must be 0)" |
 | `instant_deploy_idle_apps` | worker | (none) | **eager** (Gauge — sampled at the end of every idle-scaler tick; the count of deployments currently scaled_to_zero=true. Headline "how much compute scale-to-zero is reclaiming" signal. Stays 0 until DEPLOY_SCALE_TO_ZERO_ENABLED is on. Task #54) | (no standalone alert — capacity signal, not a fault) | (no standalone rule) | "Scale-to-zero — apps currently asleep (replicas=0)" |