36 changes: 36 additions & 0 deletions k8s/prometheus-rules.yaml
@@ -981,6 +981,42 @@ spec:
              build pod logs in instant-deploy-* namespaces. Source:
              worker/internal/jobs/deploy_status_reconcile.go.

    # instant-worker — runtime rollout-failure detector (2026-06-08). Twin of the
    # build-Job detector above, for the RUNTIME side: the build SUCCEEDED but the
    # produced image can't start (CreateContainerError "no command specified",
    # ImagePullBackOff, CrashLoopBackOff). deploy_status_reconcile now maps a
    # ProgressDeadlineExceeded rollout with no available replica to 'failed'
    # (was 'deploying' forever). instant_deploy_runtime_failed_detected_total
    # {reason} is emitted by worker/internal/jobs/deploy_status_reconcile.go.
    - name: instant-worker-deploy-runtime-failed
      rules:
        - alert: DeployRuntimeFailedDetected
          # Absolute count over a 30m window (same posture as DeployJobFailedDetected).
          # A single broken-image deploy is per-customer and expected at a low rate;
          # >= 3 in 30m means a PLATFORM build defect is producing unstartable images
          # (e.g. empty 474-byte images with no CMD — the 2026-06-08 live symptom)
          # or a registry/pull-secret regression. Matches
          # newrelic/alerts/deploy-runtime-failed-detected.json.
          expr: |
            sum(increase(instant_deploy_runtime_failed_detected_total[30m])) >= 3
          for: 5m
          labels:
            severity: critical
            service: worker
          annotations:
            summary: "deploys failing to start at runtime — >= 3 in 30m, likely a platform image defect"
            description: |
              sum(increase(instant_deploy_runtime_failed_detected_total[30m])) >= 3.
              deploy_status_reconcile is flipping rollouts to 'failed' because they
              exceeded their progress deadline with no available replica — the pods were
              created but their containers can't start (broken built image: no
              CMD/ENTRYPOINT, ImagePullBackOff, or CrashLoopBackOff). A platform-wide
              spike means the build pipeline is producing unstartable images (check for
              empty/near-zero-byte images in ghcr.io/.../instant-userapp/*) or a
              registry/pull-secret regression. Per-customer this is a one-off bad image.
              Source: worker/internal/jobs/deploy_status_reconcile.go (counter
              DeployRuntimeFailedDetectedTotal); user-facing reason=StartFailed.

    # instant-worker — scale-to-zero idle-scaler (Task #54).
    # instant_deploy_scaled_to_zero_total{outcome} is emitted by
    # worker/internal/jobs/deploy_idle_scaler.go. INERT until an operator sets
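
Review note: the rule's header comment describes the reconcile change in words: a rollout whose progress deadline was exceeded and that has no available replica now maps to 'failed' instead of staying 'deploying'. The sketch below restates that decision in Python so the condition can be read at a glance. It is illustrative only. The real logic lives in the Go file worker/internal/jobs/deploy_status_reconcile.go, which this diff does not show. `Condition`, `status_from_rollout`, and the 'healthy' status name are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Subset of a Kubernetes Deployment condition (hypothetical helper type)."""
    type: str
    status: str
    reason: str = ""


def status_from_rollout(conditions, available_replicas):
    """Map rollout state to a deploy status (illustrative, not the worker's code)."""
    deadline_exceeded = any(
        c.type == "Progressing"
        and c.status == "False"
        and c.reason == "ProgressDeadlineExceeded"
        for c in conditions
    )
    if deadline_exceeded and available_replicas == 0:
        # Pods were created but no container became available:
        # report 'failed' rather than 'deploying' forever.
        return "failed"
    if available_replicas > 0:
        return "healthy"  # assumed status name, not confirmed by the source
    return "deploying"
```

Under these assumptions, a stuck rollout with zero available replicas returns 'failed', while the same condition with at least one available replica does not, which matches the "no available replica" qualifier in the comment.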
31 changes: 31 additions & 0 deletions newrelic/alerts/deploy-runtime-failed-detected.json
@@ -0,0 +1,31 @@
{
  "name": "instant-worker — deploy runtime start-failures detected (30m) [silent-deploy-failure backstop]",
  "type": "NRQL",
  "description": "Fires when deploy_status_reconcile flips rollouts to 'failed' because they exceeded their progress deadline with no available replica — the RUNTIME twin of deploy-job-failed-detected.json (CLAUDE.md rule 27). The build SUCCEEDED but the produced image can't start: CreateContainerError ('no command specified' from an empty/near-zero-byte image), ImagePullBackOff, or CrashLoopBackOff. Before the 2026-06-08 fix these deploys reported 'deploying' forever (deploymentStatusFromK8s only checked DeploymentReplicaFailure + replica counts), so no autopsy ran and the user got no failure email; now a Progressing=False/ProgressDeadlineExceeded rollout maps to 'failed' (user-facing reason=StartFailed). Per-customer this is a one-off bad image (low rate, expected). A platform-wide spike means the build pipeline is producing unstartable images (the 2026-06-08 live symptom was 6 deploys at once pulling 474-byte empty images) or a registry/pull-secret regression. P1: user-visible, recoverable. The query is derivative(...,30 minutes) per reason (NR ingests the counter as a cumulative monotonic OTLP sum). ABOVE_OR_EQUALS 3 over a 30m window pages on a small cluster of runtime start-failures while tolerating one or two individual bad images. Source: worker/internal/jobs/deploy_status_reconcile.go; counter DeployRuntimeFailedDetectedTotal in worker/internal/metrics/metrics.go.",
  "enabled": true,
  "nrql": {
    "query": "SELECT derivative(instant_deploy_runtime_failed_detected_total, 30 minutes) FROM Metric WHERE service = 'worker' FACET reason"
  },
  "terms": [
    {
      "priority": "CRITICAL",
      "operator": "ABOVE_OR_EQUALS",
      "threshold": 3,
      "thresholdDuration": 300,
      "thresholdOccurrences": "AT_LEAST_ONCE"
    }
  ],
  "signal": {
    "aggregationWindow": 300,
    "aggregationMethod": "EVENT_FLOW",
    "aggregationDelay": 180,
    "fillOption": "STATIC",
    "fillValue": 0
  },
  "expiration": {
    "expirationDuration": 3600,
    "openViolationOnExpiration": false,
    "closeViolationsOnExpiration": true
  },
  "violationTimeLimitSeconds": 86400
}
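
Review note: the Prometheus rule comment says it "Matches" this New Relic alert, so the window and threshold in the two definitions should stay in step. A small parity check along these lines could guard against drift. This is a sketch under stated assumptions, not part of the PR: the helper names are hypothetical, the JSON is a trimmed copy of the file above, and the regular expressions only handle the exact query shapes used here.

```python
import json
import re

# Trimmed copy of the alert definition above; only the fields the check reads.
ALERT_JSON = """
{
  "nrql": {"query": "SELECT derivative(instant_deploy_runtime_failed_detected_total, 30 minutes) FROM Metric WHERE service = 'worker' FACET reason"},
  "terms": [{"priority": "CRITICAL", "operator": "ABOVE_OR_EQUALS", "threshold": 3, "thresholdDuration": 300}]
}
"""

# Expression from the DeployRuntimeFailedDetected Prometheus rule.
PROM_EXPR = "sum(increase(instant_deploy_runtime_failed_detected_total[30m])) >= 3"


def nr_window_minutes(query):
    """Extract N from 'derivative(<metric>, N minutes)' (hypothetical helper)."""
    match = re.search(r"derivative\([^,]+,\s*(\d+)\s+minutes?\)", query)
    if match is None:
        raise ValueError("no derivative(..., N minutes) call in query")
    return int(match.group(1))


def prom_window_and_threshold(expr):
    """Extract (window minutes, threshold) from '...[Nm])) >= T' (hypothetical helper)."""
    match = re.search(r"\[(\d+)m\]\)\)\s*>=\s*(\d+)", expr)
    if match is None:
        raise ValueError("unexpected Prometheus expression shape")
    return int(match.group(1)), int(match.group(2))


alert = json.loads(ALERT_JSON)
term = alert["terms"][0]
nr = (nr_window_minutes(alert["nrql"]["query"]), term["threshold"], term["operator"])
prom = prom_window_and_threshold(PROM_EXPR)

# Both definitions should agree on a 30-minute window and a threshold of 3.
assert nr[:2] == prom, f"alert drift: New Relic {nr[:2]} vs Prometheus {prom}"
```

One difference the check does not cover: the New Relic query uses FACET reason, so its threshold applies per reason, while the Prometheus expression sums across all reasons. The two are equivalent only while `reason` has a single value, which the metrics catalog says is currently the case ("progress_deadline_exceeded").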
56 changes: 56 additions & 0 deletions newrelic/dashboards/instanode-reliability.json
@@ -925,6 +925,62 @@
            }
          }
        },
        {
          "title": "Deploy runtime start-failures (1h, detected; must be 0 in steady state)",
          "layout": {
            "column": 1,
            "row": 78,
            "width": 3,
            "height": 3
          },
          "visualization": {
            "id": "viz.billboard"
          },
          "rawConfiguration": {
            "nrqlQueries": [
              {
                "accountIds": [
                  0
                ],
                "query": "SELECT derivative(instant_deploy_runtime_failed_detected_total, 1 hour) AS 'runtime start-failures (1h)' FROM Metric WHERE service = 'worker' SINCE 1 hour ago"
              }
            ],
            "platformOptions": {
              "ignoreTimeRange": false
            },
            "thresholds": [
              {
                "alertSeverity": "WARNING",
                "value": 1
              }
            ]
          }
        },
        {
          "title": "Deploy runtime start-failures by reason (6h)",
          "layout": {
            "column": 4,
            "row": 78,
            "width": 6,
            "height": 3
          },
          "visualization": {
            "id": "viz.stacked-bar"
          },
          "rawConfiguration": {
            "nrqlQueries": [
              {
                "accountIds": [
                  0
                ],
                "query": "SELECT rate(sum(instant_deploy_runtime_failed_detected_total), 1 minute) FROM Metric WHERE service = 'worker' FACET reason TIMESERIES SINCE 6 hours ago"
              }
            ],
            "platformOptions": {
              "ignoreTimeRange": false
            }
          }
        },
        {
          "title": "Billing reconciler gap detected by direction (6h)",
          "layout": {
1 change: 1 addition & 0 deletions observability/METRICS-CATALOG.md
@@ -55,6 +55,7 @@ fires. Operators need this so they don't panic when a fresh deploy looks
| `instant_entitlement_drift_detected_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts Postgres resources found drifted below their team's plan tier per sweep) | `entitlement-drift-outpacing-regrade.json` (paired with `_regraded_total`) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" |
| `instant_entitlement_regraded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts resources successfully re-graded to the entitled cap, provisioner applied=true) | `entitlement-drift-outpacing-regrade.json` (denominator: detected - regraded) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" |
| `instant_deploy_job_failed_detected_total` | worker | `reason` | lazy (CounterVec — first observation is a real Kaniko build-Job Failed detection; reason ∈ {DeadlineExceeded, BackoffLimitExceeded, ...}. metrics_test forces a label so the metric registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-05-30 incident) | `deploy-job-failed-detected.json` | `DeployJobFailedDetected` (instant-worker-deploy-job-failed group) | "Deploy build-Job failures by reason (6h)", "Deploy build-Job failures (1h, detected; must be 0 in steady state)" |
| `instant_deploy_runtime_failed_detected_total` | worker | `reason` | lazy (CounterVec — runtime twin of `_job_failed_detected_total`; first observation is a rollout flipped to failed on ProgressDeadlineExceeded with no available replica (broken image can't start: CreateContainerError "no command specified" / ImagePullBackOff / CrashLoopBackOff). reason currently only "progress_deadline_exceeded"; metrics_test primes it so it registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-06-08) | `deploy-runtime-failed-detected.json` | `DeployRuntimeFailedDetected` (instant-worker-deploy-runtime-failed group) | "Deploy runtime start-failures by reason (6h)", "Deploy runtime start-failures (1h, detected; must be 0 in steady state)" |
| `instant_billing_reconciler_gap_detected_total` | worker | `direction` | lazy (CounterVec — direction ∈ {upgrade, downgrade}; series materialise on the first detected mismatch between Razorpay subscription state and teams.plan_tier. The primary signal for a dropped Razorpay webhook) | `billing-reconciler-gap-detected.json` | `BillingReconcilerGapDetected` (instant-worker-billing-gap group) | "Billing reconciler gap detected by direction (6h)" |
| `instant_deploy_scaled_to_zero_total` | worker | `outcome` | lazy (CounterVec — outcome ∈ {scaled_down, woke_up, wake_failed, scale_failed}; all four primed in metrics_test so the series register at boot. INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED. scaled_down = idle app descheduled to replicas=0 (~$0 compute, the savings path); wake_failed = app stuck asleep (P1, user-visible); scale_failed = scale-DOWN k8s/DB error, row untouched + retried (P2). Task #54) | `deploy-scale-to-zero-fail.json` (wake_failed) | `DeployScaleToZeroWakeFailed` + `DeployScaleToZeroScaleDownFailures` (instant-worker-deploy-scale-to-zero group) | "Scale-to-zero actions by outcome (6h; wake_failed/scale_failed must be 0)" |
| `instant_deploy_idle_apps` | worker | (none) | **eager** (Gauge — sampled at the end of every idle-scaler tick; the count of deployments currently scaled_to_zero=true. Headline "how much compute scale-to-zero is reclaiming" signal. Stays 0 until DEPLOY_SCALE_TO_ZERO_ENABLED is on. Task #54) | (no standalone alert — capacity signal, not a fault) | (no standalone rule) | "Scale-to-zero — apps currently asleep (replicas=0)" |
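
Review note: the catalog marks `instant_deploy_runtime_failed_detected_total` as lazy and says it is primed so the series registers before the first real detection. The toy model below shows why priming matters for a labelled counter: until a label value is touched, no series is exported, so dashboards and alerts see "no data" rather than 0. `CounterVec` here is a hypothetical stand-in written for this note, not the Prometheus client library and not the worker's metrics code.

```python
class CounterVec:
    """Toy labelled counter: a series exists only once its label value is touched."""

    def __init__(self, name, label):
        self.name = name
        self.label = label
        self._series = {}

    def add(self, label_value, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self._series[label_value] = self._series.get(label_value, 0) + amount

    def exposed(self):
        """Series a scrape would see, as {label_value: count}."""
        return dict(self._series)


runtime_failed = CounterVec("instant_deploy_runtime_failed_detected_total", "reason")
before_priming = runtime_failed.exposed()  # no series exported yet

# Priming: touch the label with 0 so the series is exported with value 0.
runtime_failed.add("progress_deadline_exceeded", 0)
after_priming = runtime_failed.exposed()

# First real detection increments the already-visible series.
runtime_failed.add("progress_deadline_exceeded")
after_detection = runtime_failed.exposed()
```

In this model the series is absent before priming, present at 0 afterwards, and at 1 after the first detection, which is the behaviour the "lazy" versus "eager" column in the table is describing.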