42 changes: 42 additions & 0 deletions k8s/prometheus-rules.yaml
@@ -917,6 +917,48 @@ spec:
build pod logs in instant-deploy-* namespaces. Source:
worker/internal/jobs/deploy_status_reconcile.go.

# instant-worker — scale-to-zero idle-scaler (Task #54).
# instant_deploy_scaled_to_zero_total{outcome} is emitted by
# worker/internal/jobs/deploy_idle_scaler.go. INERT until an operator sets
# DEPLOY_SCALE_TO_ZERO_ENABLED (default off), so these rules sit quiet until
# the feature is canaried on. wake_failed = a user's app may be stuck asleep
# (P1, user-visible, recoverable); scale_failed = a scale-DOWN k8s patch / DB
# flip failed (P2, the row is left untouched + retried next tick).
- name: instant-worker-deploy-scale-to-zero
rules:
- alert: DeployScaleToZeroWakeFailed
expr: |
sum(increase(instant_deploy_scaled_to_zero_total{outcome="wake_failed"}[15m])) > 0
for: 10m
labels:
severity: critical
service: worker
annotations:
summary: "scale-to-zero wake failures > 0 (15m) — an app may be stuck asleep"
description: |
instant_deploy_scaled_to_zero_total{outcome="wake_failed"} increased over the
last 15m, sustained 10m. A scaled-to-zero app failed to wake (k8s scale-up
error), so a customer request to a sleeping app is returning the ingress
upstream-down response with no recovery. Check instant-worker logs for
`jobs.deploy_idle_scaler` and the api `deploy.wake.scale_failed` lines, and
the health of the instant-deploy-* namespaces. Source:
worker/internal/jobs/deploy_idle_scaler.go + api deploy_wake.go.
- alert: DeployScaleToZeroScaleDownFailures
expr: |
sum(increase(instant_deploy_scaled_to_zero_total{outcome="scale_failed"}[30m])) >= 5
for: 30m
labels:
severity: warning
service: worker
annotations:
summary: "scale-to-zero scale-DOWN failures >= 5 in 30m"
description: |
instant_deploy_scaled_to_zero_total{outcome="scale_failed"} reached >= 5 in
30m. The idle-scaler is repeatedly failing to deschedule idle apps (k8s patch
or DB flip error). Apps stay running (no customer impact) but the compute
savings are not landing. Check instant-worker RBAC for deployments/patch and
the k8s API health. Source: worker/internal/jobs/deploy_idle_scaler.go.
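
To see how the two rule shapes above behave over time, here is a minimal Python sketch that approximates `increase(...[window])` plus the `for:` clause over a list of counter samples. It is an illustration only: `window_increase` and `alert_fires` are hypothetical helper names, the evaluation step is an assumed parameter, and the sketch does not extrapolate to the window edges the way PromQL's `increase()` does.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (unix seconds, cumulative counter value)


def window_increase(samples: List[Sample], now: float, window_s: float) -> float:
    """Counter increase inside (now - window_s, now], tolerating resets.

    Simplified on purpose: no edge extrapolation, unlike PromQL increase().
    """
    inside = [(t, v) for t, v in samples if now - window_s < t <= now]
    total = 0.0
    for (_, prev), (_, cur) in zip(inside, inside[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def alert_fires(
    samples: List[Sample],
    now: float,
    window_s: float,
    for_s: float,
    step_s: float,
    breached: Callable[[float], bool],
) -> bool:
    """True when `breached(increase)` holds at every evaluation step across
    the trailing `for_s` seconds (a rough model of the `for:` clause)."""
    steps = int(for_s // step_s)
    return all(
        breached(window_increase(samples, now - i * step_s, window_s))
        for i in range(steps + 1)
    )
```

Under this simplified model, a single `wake_failed` increment stays inside a 15m window long enough to satisfy `for: 10m`, whereas a single burst of five `scale_failed` increments does not stay inside a 30m window for a full `for: 30m`, so the second rule needs failures that keep recurring.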

# instant-worker — billing reconciler gap detected (Rule 25 sweep 2026-06-04).
# instant_billing_reconciler_gap_detected_total is the PRIMARY signal for a
# dropped Razorpay webhook: each gap = a team whose teams.plan_tier disagrees
31 changes: 31 additions & 0 deletions newrelic/alerts/deploy-scale-to-zero-fail.json
@@ -0,0 +1,31 @@
{
"name": "instant-worker — scale-to-zero wake failures (15m) [Task #54]",
"type": "NRQL",
"description": "Fires when the scale-to-zero idle-scaler reports wake failures. instant_deploy_scaled_to_zero_total{outcome=\"wake_failed\"} is emitted by worker/internal/jobs/deploy_idle_scaler.go (and the api wake endpoint records its own failures). A wake_failed event means a scaled-to-zero app could not be brought back to replicas=1 — a customer request to a sleeping app is returning the ingress upstream-down response with no recovery, so this is P1 (user-visible, recoverable). The whole feature is INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED (default off), so this alert sits quiet until scale-to-zero is canaried on. The query is derivative(...,15 minutes) per outcome (NR ingests the counter as a cumulative monotonic OTLP sum); ABOVE 0 over a 15m window pages on any wake failure. A separate scale_failed outcome (scale-DOWN failures, no customer impact) is a P2 surfaced on the dashboard tile only. Source: worker/internal/jobs/deploy_idle_scaler.go; counter DeployScaledToZeroTotal in worker/internal/metrics/metrics.go.",
"enabled": true,
"nrql": {
"query": "SELECT derivative(instant_deploy_scaled_to_zero_total, 15 minutes) FROM Metric WHERE service = 'worker' AND outcome = 'wake_failed' FACET outcome"
},
"terms": [
{
"priority": "CRITICAL",
"operator": "ABOVE",
"threshold": 0,
"thresholdDuration": 600,
"thresholdOccurrences": "AT_LEAST_ONCE"
}
],
"signal": {
"aggregationWindow": 300,
"aggregationMethod": "EVENT_FLOW",
"aggregationDelay": 180,
"fillOption": "STATIC",
"fillValue": 0
},
"expiration": {
"expirationDuration": 3600,
"openViolationOnExpiration": false,
"closeViolationsOnExpiration": true
},
"violationTimeLimitSeconds": 86400
}
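
The `terms` and `signal` blocks above interact: `thresholdDuration` is read in units of `aggregationWindow`, and `thresholdOccurrences` decides how many of those windows must breach. The Python sketch below models that reading. It is an assumption-laden illustration, not New Relic's evaluator: `windows_in_threshold` and `term_violated` are hypothetical names, the "multiple of the aggregation window" check reflects my understanding of the platform's constraint, and only the `ABOVE` operator is modelled.

```python
from typing import Sequence


def windows_in_threshold(threshold_duration_s: int, aggregation_window_s: int) -> int:
    """Number of aggregation windows covered by one threshold duration."""
    if threshold_duration_s % aggregation_window_s != 0:
        # Assumption: the duration is expected to be a whole number of windows.
        raise ValueError("thresholdDuration should be a multiple of aggregationWindow")
    return threshold_duration_s // aggregation_window_s


def term_violated(
    window_values: Sequence[float],
    threshold: float,
    n_windows: int,
    occurrences: str,
) -> bool:
    """Evaluate an ABOVE term over the most recent `n_windows` aggregated values."""
    recent = list(window_values)[-n_windows:]
    if len(recent) < n_windows:
        return False  # not enough data yet to cover the threshold duration
    breaches = [v > threshold for v in recent]
    if occurrences == "AT_LEAST_ONCE":
        return any(breaches)
    return all(breaches)  # treat any other value as "ALL" in this sketch
```

With the values from this condition, `windows_in_threshold(600, 300)` gives 2, so under this model one window with a positive derivative among the last two is enough to open a violation.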
108 changes: 79 additions & 29 deletions newrelic/dashboards/instanode-reliability.json
@@ -1,5 +1,5 @@
{
"name": "instanode reliability",
"name": "instanode \u2014 reliability",
"description": "Single-pane reliability view rolling up every new metric shipped in the 2026-05-20 observability sweep: /readyz across all 3 services, propagation_runner queue health, orphan_sweep activity, magic-link rate-limiting, Brevo send/error/webhook funnel, provisioner circuit-breaker state, missing-renderer events, and storage presign throughput. Source: same Metric + Log feeds as the per-surface dashboards. Apply via newrelic/apply.sh.",
"permissions": "PUBLIC_READ_WRITE",
"pages": [
@@ -149,7 +149,7 @@
}
},
{
"title": "Orphan sweep reaped by reason (24h)",
"title": "Orphan sweep \u2014 reaped by reason (24h)",
"layout": {
"column": 1,
"row": 7,
@@ -174,7 +174,7 @@
}
},
{
"title": "Orphan sweep reap failures by reason (24h)",
"title": "Orphan sweep \u2014 reap failures by reason (24h)",
"layout": {
"column": 7,
"row": 7,
@@ -334,7 +334,7 @@
}
},
{
"title": "Email failover outcomes (1h) fallback_ok=primary degraded (P1), all_failed=email lost (P0)",
"title": "Email failover outcomes (1h) \u2014 fallback_ok=primary degraded (P1), all_failed=email lost (P0)",
"layout": {
"column": 1,
"row": 62,
@@ -409,7 +409,7 @@
}
},
{
"title": "Billing charge undeliverable (paid, NOT upgraded) must be 0",
"title": "Billing charge undeliverable (paid, NOT upgraded) \u2014 must be 0",
"layout": {
"column": 1,
"row": 19,
@@ -440,7 +440,7 @@
}
},
{
"title": "Tier-upgrade TTL promote outcomes (24h) error must be 0",
"title": "Tier-upgrade TTL promote outcomes (24h) \u2014 error must be 0",
"layout": {
"column": 1,
"row": 25,
@@ -490,7 +490,7 @@
}
},
{
"title": "Idempotency replay refunds by route (1h) FINDING API-1",
"title": "Idempotency replay refunds by route (1h) \u2014 FINDING API-1",
"layout": {
"column": 1,
"row": 22,
@@ -515,7 +515,7 @@
}
},
{
"title": "AUTH-004 synthetic prober outcomes per leg (1h)",
"title": "AUTH-004 synthetic prober \u2014 outcomes per leg (1h)",
"layout": {
"column": 1,
"row": 25,
@@ -540,7 +540,7 @@
}
},
{
"title": "AUTH-004 synthetic prober fails (last 1h, must be 0)",
"title": "AUTH-004 synthetic prober \u2014 fails (last 1h, must be 0)",
"layout": {
"column": 7,
"row": 25,
@@ -571,7 +571,7 @@
}
},
{
"title": "AUTH-004 synthetic prober P95 latency per leg (1h)",
"title": "AUTH-004 synthetic prober \u2014 P95 latency per leg (1h)",
"layout": {
"column": 10,
"row": 25,
@@ -596,7 +596,7 @@
}
},
{
"title": "Hourly deploy prober outcomes per leg (6h)",
"title": "Hourly deploy prober \u2014 outcomes per leg (6h)",
"layout": {
"column": 1,
"row": 28,
@@ -621,7 +621,7 @@
}
},
{
"title": "Hourly deploy prober fails (last 6h, must be 0)",
"title": "Hourly deploy prober \u2014 fails (last 6h, must be 0)",
"layout": {
"column": 7,
"row": 28,
@@ -652,7 +652,7 @@
}
},
{
"title": "Hourly deploy prober P95 latency per leg (6h)",
"title": "Hourly deploy prober \u2014 P95 latency per leg (6h)",
"layout": {
"column": 10,
"row": 28,
@@ -677,7 +677,7 @@
}
},
{
"title": "GitHub webhook received by event+result (6h) [P4 push-to-deploy]",
"title": "GitHub webhook \u2014 received by event+result (6h) [P4 push-to-deploy]",
"layout": {
"column": 1,
"row": 31,
@@ -702,7 +702,7 @@
}
},
{
"title": "GitHub webhook bad_signature count (1h, must be 0 in steady state)",
"title": "GitHub webhook \u2014 bad_signature count (1h, must be 0 in steady state)",
"layout": {
"column": 7,
"row": 31,
@@ -733,7 +733,7 @@
}
},
{
"title": "GitHub push-to-deploy enqueued vs errors (6h)",
"title": "GitHub push-to-deploy \u2014 enqueued vs errors (6h)",
"layout": {
"column": 10,
"row": 31,
@@ -764,7 +764,7 @@
}
},
{
"title": "GitHub push-to-deploy result breakdown (6h)",
"title": "GitHub push-to-deploy \u2014 result breakdown (6h)",
"layout": {
"column": 1,
"row": 34,
@@ -789,7 +789,7 @@
}
},
{
"title": "GitHub App token mint result breakdown (6h)",
"title": "GitHub App token mint \u2014 result breakdown (6h)",
"layout": {
"column": 7,
"row": 34,
@@ -951,7 +951,7 @@
}
},
{
"title": "Postgres pool saturation ratio by service+pool (3h) alert > 0.8",
"title": "Postgres pool saturation ratio by service+pool (3h) \u2014 alert > 0.8",
"layout": {
"column": 1,
"row": 43,
@@ -976,7 +976,7 @@
}
},
{
"title": "Postgres pool peak saturation (1h, in_use/max) alert > 0.8",
"title": "Postgres pool peak saturation (1h, in_use/max) \u2014 alert > 0.8",
"layout": {
"column": 7,
"row": 43,
@@ -1038,7 +1038,7 @@
}
},
{
"title": "Redis maxmemory regrade failures (6h) quota not enforced when > 0",
"title": "Redis maxmemory regrade failures (6h) \u2014 quota not enforced when > 0",
"layout": {
"column": 1,
"row": 46,
@@ -1063,7 +1063,7 @@
}
},
{
"title": "Expiry deprovision failures (6h) orphan infra accumulating when > 0",
"title": "Expiry deprovision failures (6h) \u2014 orphan infra accumulating when > 0",
"layout": {
"column": 7,
"row": 46,
@@ -1113,7 +1113,7 @@
}
},
{
"title": "Pool reaper actions by status/outcome (24h)",
"title": "Pool reaper \u2014 actions by status/outcome (24h)",
"layout": {
"column": 7,
"row": 49,
@@ -1138,7 +1138,7 @@
}
},
{
"title": "Pool items stuck 'assigned' past 30m leaked shared infra (must be 0)",
"title": "Pool items stuck 'assigned' past 30m \u2014 leaked shared infra (must be 0)",
"layout": {
"column": 1,
"row": 52,
@@ -1163,7 +1163,7 @@
}
},
{
"title": "Flow matrix latest result per flow×actor (synthetic; red=fail)",
"title": "Flow matrix \u2014 latest result per flow\u00d7actor (synthetic; red=fail)",
"layout": {
"column": 1,
"row": 56,
@@ -1188,7 +1188,7 @@
}
},
{
"title": "Flow matrix fails by flow (1h, must be 0)",
"title": "Flow matrix \u2014 fails by flow (1h, must be 0)",
"layout": {
"column": 7,
"row": 56,
@@ -1213,7 +1213,7 @@
}
},
{
"title": "Flow synthetic leaked reaps (1h, must be 0)",
"title": "Flow synthetic \u2014 leaked reaps (1h, must be 0)",
"layout": {
"column": 10,
"row": 56,
@@ -1238,7 +1238,7 @@
}
},
{
"title": "Flow matrix P95 latency per flow (6h)",
"title": "Flow matrix \u2014 P95 latency per flow (6h)",
"layout": {
"column": 1,
"row": 59,
@@ -1263,7 +1263,7 @@
}
},
{
"title": "Flow matrix distinct flows reporting (15m; silent-death watch)",
"title": "Flow matrix \u2014 distinct flows reporting (15m; silent-death watch)",
"layout": {
"column": 7,
"row": 59,
@@ -1286,6 +1286,56 @@
"ignoreTimeRange": false
}
}
},
{
"title": "Scale-to-zero \u2014 apps currently asleep (replicas=0)",
"layout": {
"column": 1,
"row": 62,
"width": 3,
"height": 3
},
"visualization": {
"id": "viz.billboard"
},
"rawConfiguration": {
"nrqlQueries": [
{
"accountIds": [
0
],
"query": "SELECT latest(instant_deploy_idle_apps) AS 'apps asleep' FROM Metric WHERE service = 'worker' SINCE 10 minutes ago"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
},
{
"title": "Scale-to-zero actions by outcome (6h; wake_failed/scale_failed must be 0)",
"layout": {
"column": 4,
"row": 62,
"width": 9,
"height": 3
},
"visualization": {
"id": "viz.stacked-bar"
},
"rawConfiguration": {
"nrqlQueries": [
{
"accountIds": [
0
],
"query": "SELECT rate(sum(instant_deploy_scaled_to_zero_total), 1 minute) FROM Metric WHERE service = 'worker' FACET outcome TIMESERIES SINCE 6 hours ago"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
}
]
}
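
Because the two new tiles are placed by hand with `column`, `row`, `width`, and `height`, a quick overlap check can catch layout collisions before `newrelic/apply.sh` is run. The Python sketch below is a hypothetical lint, not part of this PR: `Tile`, `overlaps`, and `layout_problems` are illustrative names, and the 12-column grid is an assumption about the dashboard layout model.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List

GRID_COLUMNS = 12  # assumption: dashboard pages use a 12-column grid


@dataclass(frozen=True)
class Tile:
    title: str
    column: int  # 1-based left edge
    row: int     # 1-based top edge
    width: int
    height: int


def overlaps(a: Tile, b: Tile) -> bool:
    """True when the two tiles share at least one grid cell."""
    return (
        a.column < b.column + b.width
        and b.column < a.column + a.width
        and a.row < b.row + b.height
        and b.row < a.row + a.height
    )


def layout_problems(tiles: List[Tile]) -> List[str]:
    """Report tiles that leave the grid or collide with another tile."""
    problems = []
    for t in tiles:
        if t.column < 1 or t.column + t.width - 1 > GRID_COLUMNS:
            problems.append(f"{t.title!r} spills outside the {GRID_COLUMNS}-column grid")
    for a, b in combinations(tiles, 2):
        if overlaps(a, b):
            problems.append(f"{a.title!r} overlaps {b.title!r}")
    return problems


# Layout values copied from the two tiles added in this diff.
new_tiles = [
    Tile("Scale-to-zero apps currently asleep", column=1, row=62, width=3, height=3),
    Tile("Scale-to-zero actions by outcome", column=4, row=62, width=9, height=3),
]
```

The two new tiles fill columns 1 to 12 of row 62 without touching each other. The diff also shows an existing tile ("Email failover outcomes (1h) ...") at column 1, row 62; its width, height, and page are not visible in this view, so whether it shares a page with the new tiles cannot be determined from here.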