diff --git a/k8s/prometheus-rules.yaml b/k8s/prometheus-rules.yaml index 52ea0b6..6fb500a 100644 --- a/k8s/prometheus-rules.yaml +++ b/k8s/prometheus-rules.yaml @@ -981,6 +981,42 @@ spec: build pod logs in instant-deploy-* namespaces. Source: worker/internal/jobs/deploy_status_reconcile.go. + # instant-worker — runtime rollout-failure detector (2026-06-08). Twin of the + # build-Job detector above, for the RUNTIME side: the build SUCCEEDED but the + # produced image can't start (CreateContainerError "no command specified", + # ImagePullBackOff, CrashLoopBackOff). deploy_status_reconcile now maps a + # ProgressDeadlineExceeded rollout with no available replica to 'failed' + # (was 'deploying' forever). instant_deploy_runtime_failed_detected_total + # {reason} is emitted by worker/internal/jobs/deploy_status_reconcile.go. + - name: instant-worker-deploy-runtime-failed + rules: + - alert: DeployRuntimeFailedDetected + # Absolute count over a 30m window (same posture as DeployJobFailedDetected). + # A single broken-image deploy is per-customer and expected at a low rate; + # >= 3 in 30m means a PLATFORM build defect is producing unstartable images + # (e.g. empty 474-byte images with no CMD — the 2026-06-08 live symptom) + # or a registry/pull-secret regression. Matches + # newrelic/alerts/deploy-runtime-failed-detected.json. + expr: | + sum(increase(instant_deploy_runtime_failed_detected_total[30m])) >= 3 + for: 5m + labels: + severity: critical + service: worker + annotations: + summary: "deploys failing to start at runtime — >= 3 in 30m, likely a platform image defect" + description: | + sum(increase(instant_deploy_runtime_failed_detected_total[30m])) >= 3. + deploy_status_reconcile is flipping rollouts to 'failed' because they + exceeded their progress deadline with no available replica — the pods were + created but their containers can't start (broken built image: no + CMD/ENTRYPOINT, ImagePullBackOff, or CrashLoopBackOff). A platform-wide + spike means the build pipeline is producing unstartable images (check for + empty/near-zero-byte images in ghcr.io/.../instant-userapp/*) or a + registry/pull-secret regression. Per-customer this is a one-off bad image. + Source: worker/internal/jobs/deploy_status_reconcile.go (counter + DeployRuntimeFailedDetectedTotal); user-facing reason=StartFailed. + # instant-worker — scale-to-zero idle-scaler (Task #54). # instant_deploy_scaled_to_zero_total{outcome} is emitted by # worker/internal/jobs/deploy_idle_scaler.go. INERT until an operator sets diff --git a/newrelic/alerts/deploy-runtime-failed-detected.json b/newrelic/alerts/deploy-runtime-failed-detected.json new file mode 100644 index 0000000..af5731b --- /dev/null +++ b/newrelic/alerts/deploy-runtime-failed-detected.json @@ -0,0 +1,31 @@ +{ + "name": "instant-worker — deploy runtime start-failures detected (30m) [silent-deploy-failure backstop]", + "type": "NRQL", + "description": "Fires when deploy_status_reconcile flips rollouts to 'failed' because they exceeded their progress deadline with no available replica — the RUNTIME twin of deploy-job-failed-detected.json (CLAUDE.md rule 27). The build SUCCEEDED but the produced image can't start: CreateContainerError ('no command specified' from an empty/near-zero-byte image), ImagePullBackOff, or CrashLoopBackOff. Before the 2026-06-08 fix these deploys reported 'deploying' forever (deploymentStatusFromK8s only checked DeploymentReplicaFailure + replica counts), so no autopsy ran and the user got no failure email; now a Progressing=False/ProgressDeadlineExceeded rollout maps to 'failed' (user-facing reason=StartFailed). Per-customer this is a one-off bad image (low rate, expected). A platform-wide spike means the build pipeline is producing unstartable images (the 2026-06-08 live symptom was 6 deploys at once pulling 474-byte empty images) or a registry/pull-secret regression. P1: user-visible, recoverable. The query is derivative(...,30 minutes) per reason (NR ingests the counter as a cumulative monotonic OTLP sum). ABOVE_OR_EQUALS 3 over a 30m window pages on a small cluster of runtime start-failures while tolerating one or two individual bad images. Source: worker/internal/jobs/deploy_status_reconcile.go; counter DeployRuntimeFailedDetectedTotal in worker/internal/metrics/metrics.go.", + "enabled": true, + "nrql": { + "query": "SELECT derivative(instant_deploy_runtime_failed_detected_total, 30 minutes) FROM Metric WHERE service = 'worker' FACET reason" + }, + "terms": [ + { + "priority": "CRITICAL", + "operator": "ABOVE_OR_EQUALS", + "threshold": 3, + "thresholdDuration": 300, + "thresholdOccurrences": "AT_LEAST_ONCE" + } + ], + "signal": { + "aggregationWindow": 300, + "aggregationMethod": "EVENT_FLOW", + "aggregationDelay": 180, + "fillOption": "STATIC", + "fillValue": 0 + }, + "expiration": { + "expirationDuration": 3600, + "openViolationOnExpiration": false, + "closeViolationsOnExpiration": true + }, + "violationTimeLimitSeconds": 86400 +} diff --git a/newrelic/dashboards/instanode-reliability.json b/newrelic/dashboards/instanode-reliability.json index f50a654..5a6276e 100644 --- a/newrelic/dashboards/instanode-reliability.json +++ b/newrelic/dashboards/instanode-reliability.json @@ -925,6 +925,62 @@ } } }, + { + "title": "Deploy runtime start-failures (1h, detected; must be 0 in steady state)", + "layout": { + "column": 1, + "row": 78, + "width": 3, + "height": 3 + }, + "visualization": { + "id": "viz.billboard" + }, + "rawConfiguration": { + "nrqlQueries": [ + { + "accountIds": [ + 0 + ], + "query": "SELECT derivative(instant_deploy_runtime_failed_detected_total, 1 hour) AS 'runtime start-failures (1h)' FROM Metric WHERE service = 'worker' SINCE 1 hour ago" + } + ], + "platformOptions": { + "ignoreTimeRange": false + }, + "thresholds": [ + { + "alertSeverity": "WARNING", + "value": 1 + } + ] + } + }, + { + "title": "Deploy runtime start-failures by reason (6h)", + "layout": { + "column": 4, + "row": 78, + "width": 6, + "height": 3 + }, + "visualization": { + "id": "viz.stacked-bar" + }, + "rawConfiguration": { + "nrqlQueries": [ + { + "accountIds": [ + 0 + ], + "query": "SELECT rate(sum(instant_deploy_runtime_failed_detected_total), 1 minute) FROM Metric WHERE service = 'worker' FACET reason TIMESERIES SINCE 6 hours ago" + } + ], + "platformOptions": { + "ignoreTimeRange": false + } + } + }, { "title": "Billing reconciler gap detected by direction (6h)", "layout": { diff --git a/observability/METRICS-CATALOG.md b/observability/METRICS-CATALOG.md index 7bc0369..8db9574 100644 --- a/observability/METRICS-CATALOG.md +++ b/observability/METRICS-CATALOG.md @@ -55,6 +55,7 @@ fires. Operators need this so they don't panic when a fresh deploy looks | `instant_entitlement_drift_detected_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts Postgres resources found drifted below their team's plan tier per sweep) | `entitlement-drift-outpacing-regrade.json` (paired with `_regraded_total`) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" | | `instant_entitlement_regraded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts resources successfully re-graded to the entitled cap, provisioner applied=true) | `entitlement-drift-outpacing-regrade.json` (denominator: detected - regraded) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" | | `instant_deploy_job_failed_detected_total` | worker | `reason` | lazy (CounterVec — first observation is a real Kaniko build-Job Failed detection; reason ∈ {DeadlineExceeded, BackoffLimitExceeded, ...}. metrics_test forces a label so the metric registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-05-30 incident) | `deploy-job-failed-detected.json` | `DeployJobFailedDetected` (instant-worker-deploy-job-failed group) | "Deploy build-Job failures by reason (6h)", "Deploy build-Job failures (1h, detected; must be 0 in steady state)" | +| `instant_deploy_runtime_failed_detected_total` | worker | `reason` | lazy (CounterVec — runtime twin of `_job_failed_detected_total`; first observation is a rollout flipped to failed on ProgressDeadlineExceeded with no available replica (broken image can't start: CreateContainerError "no command specified" / ImagePullBackOff / CrashLoopBackOff). reason currently only "progress_deadline_exceeded"; metrics_test primes it so it registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-06-08) | `deploy-runtime-failed-detected.json` | `DeployRuntimeFailedDetected` (instant-worker-deploy-runtime-failed group) | "Deploy runtime start-failures by reason (6h)", "Deploy runtime start-failures (1h, detected; must be 0 in steady state)" | | `instant_billing_reconciler_gap_detected_total` | worker | `direction` | lazy (CounterVec — direction ∈ {upgrade, downgrade}; series materialise on the first detected mismatch between Razorpay subscription state and teams.plan_tier. The primary signal for a dropped Razorpay webhook) | `billing-reconciler-gap-detected.json` | `BillingReconcilerGapDetected` (instant-worker-billing-gap group) | "Billing reconciler gap detected by direction (6h)" | | `instant_deploy_scaled_to_zero_total` | worker | `outcome` | lazy (CounterVec — outcome ∈ {scaled_down, woke_up, wake_failed, scale_failed}; all four primed in metrics_test so the series register at boot. INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED. scaled_down = idle app descheduled to replicas=0 (~$0 compute, the savings path); wake_failed = app stuck asleep (P1, user-visible); scale_failed = scale-DOWN k8s/DB error, row untouched + retried (P2). Task #54) | `deploy-scale-to-zero-fail.json` (wake_failed) | `DeployScaleToZeroWakeFailed` + `DeployScaleToZeroScaleDownFailures` (instant-worker-deploy-scale-to-zero group) | "Scale-to-zero actions by outcome (6h; wake_failed/scale_failed must be 0)" | | `instant_deploy_idle_apps` | worker | (none) | **eager** (Gauge — sampled at the end of every idle-scaler tick; the count of deployments currently scaled_to_zero=true. Headline "how much compute scale-to-zero is reclaiming" signal. Stays 0 until DEPLOY_SCALE_TO_ZERO_ENABLED is on. Task #54) | (no standalone alert — capacity signal, not a fault) | (no standalone rule) | "Scale-to-zero — apps currently asleep (replicas=0)" |