Merged
18 changes: 18 additions & 0 deletions k8s/prometheus-rules.yaml
@@ -46,6 +46,24 @@ spec:
          annotations:
            summary: "P99 provision latency > 5s (instant_http_request_duration_seconds)"

        - alert: ResourceCountCapBlocked
          # Task #55: per-service resource-COUNT cap rejections. INERT until an
          # operator sets RESOURCE_COUNT_CAPS_ENABLED (default off), so this rule
          # stays quiet until enforcement is enabled. P2 (abuse/observability):
          # a sustained block rate after enable is either a tenant hammering a
          # cap (upsell/abuse signal) or a too-low cap (revisit plans.yaml).
          # Lazy CounterVec: the {service,team_tier} series only appears after
          # the first block.
          expr: |
            sum by (service, team_tier) (
              rate(instant_resource_count_limit_blocked_total[1h])
            ) * 3600 > 20
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Resource-count cap blocking > 20 provisions/h for a tier+service (instant_resource_count_limit_blocked_total)"

        - alert: APIDown
          expr: up{job="instant-api"} == 0
          for: 1m
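
To make the threshold arithmetic in `ResourceCountCapBlocked` concrete, here is a minimal Python sketch of what `rate(...[1h]) * 3600 > 20` works out to for a single `{service, team_tier}` series. It is a simplified model, not Prometheus's implementation: it ignores counter resets and the window-edge extrapolation that `rate()` applies. The helper names `blocks_per_hour` and `expr_is_true` are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, cumulative counter value)

THRESHOLD_PER_HOUR = 20  # from the rule: "... * 3600 > 20"


def blocks_per_hour(samples: List[Sample]) -> float:
    """Approximate rate(counter[1h]) * 3600 for one series.

    Assumes samples are sorted by time, all fall inside the 1h window,
    and the counter never resets (a simplification of the real rate()).
    """
    if len(samples) < 2:
        # Real rate() yields no data with fewer than two samples; modelled as 0 here.
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    # Increase over the window, scaled from per-second to per-hour.
    return (v1 - v0) * 3600 / (t1 - t0)


def expr_is_true(samples: List[Sample]) -> bool:
    """True when the modelled expression exceeds the rule's threshold."""
    return blocks_per_hour(samples) > THRESHOLD_PER_HOUR


# 30 blocks over a full hour is 30/h, above the threshold.
busy = [(0.0, 100.0), (1800.0, 115.0), (3600.0, 130.0)]
# 20 blocks over a full hour is exactly 20/h, which is not strictly above it.
edge = [(0.0, 0.0), (3600.0, 20.0)]
```

Because the rule also sets `for: 1h`, the expression has to stay true across a further hour of evaluations before the alert fires, so a single burst of blocks is not enough on its own.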
31 changes: 31 additions & 0 deletions newrelic/alerts/resource-count-limit-blocked.json
@@ -0,0 +1,31 @@
{
"name": "instant-api — resource-count cap blocks (1h) [Task #55]",
"type": "NRQL",
"description": "P2 (abuse/observability). Fires when the per-service resource-COUNT cap rejects provisions. instant_resource_count_limit_blocked_total{service,team_tier} is emitted by api/internal/handlers/resource_count_cap.go when a team at its per-tier count cap (postgres/vector/redis/mongodb/storage) attempts another provision and gets 402. The cap closes the strict-≥80%-margin hole where only queue_count was capped — a tenant could otherwise create MANY resources each at the per-resource size cap and blow the saturated-COGS bound (Redis the binding constraint at $6.50/GB). The whole feature is INERT until an operator sets RESOURCE_COUNT_CAPS_ENABLED (default off), so this alert sits quiet until enforcement is enabled. A non-trivial, sustained rate after enable means either (a) a tenant is hammering against a cap (upsell/abuse signal — point sales/support at the team) or (b) the cap is set too low for legitimate use (revisit plans.yaml). P2 because it is not data-loss and not a user-blocking outage — the 402 is the intended, recoverable behaviour with an agent_action telling the user to upgrade. Query is derivative(...,1 hour) per service+tier (NR ingests the counter as a cumulative monotonic OTLP sum). Source: api/internal/handlers/resource_count_cap.go; counter ResourceCountLimitBlocked in api/internal/metrics/metrics.go.",
"enabled": true,
"nrql": {
"query": "SELECT derivative(instant_resource_count_limit_blocked_total, 1 hour) FROM Metric WHERE service = 'api' FACET service, team_tier"
},
"terms": [
{
"priority": "WARNING",
"operator": "ABOVE",
"threshold": 20,
"thresholdDuration": 3600,
"thresholdOccurrences": "ALL"
}
],
"signal": {
"aggregationWindow": 300,
"aggregationMethod": "EVENT_FLOW",
"aggregationDelay": 180,
"fillOption": "STATIC",
"fillValue": 0
},
"expiration": {
"expirationDuration": 3600,
"openViolationOnExpiration": false,
"closeViolationsOnExpiration": true
},
"violationTimeLimitSeconds": 86400
}
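
As a rough illustration of how the `terms` and `signal` settings above combine: a `thresholdDuration` of 3600 over an `aggregationWindow` of 300 spans 12 consecutive windows, and `thresholdOccurrences: "ALL"` requires every one of them to be above 20. The Python sketch below models that, together with the `STATIC` gap fill of 0. It is a simplified model built on those assumptions, not New Relic's evaluator, and `should_open_violation` is a hypothetical name.

```python
from typing import List, Optional

THRESHOLD = 20             # terms[0].threshold
THRESHOLD_DURATION = 3600  # terms[0].thresholdDuration, in seconds
AGGREGATION_WINDOW = 300   # signal.aggregationWindow, in seconds
FILL_VALUE = 0.0           # signal.fillValue, used because fillOption is STATIC


def should_open_violation(windows: List[Optional[float]]) -> bool:
    """Decide whether one facet (service, team_tier) breaches the condition.

    windows holds one query result per aggregation window, oldest first;
    None marks a window in which no data arrived.
    """
    needed = THRESHOLD_DURATION // AGGREGATION_WINDOW  # 12 windows
    if len(windows) < needed:
        return False
    recent = windows[-needed:]
    # Gap filling: a window without data is treated as the static fill value.
    filled = [FILL_VALUE if w is None else w for w in recent]
    # "ALL" occurrences with operator ABOVE: every window must be strictly above.
    return all(value > THRESHOLD for value in filled)


sustained = [25.0] * 12
with_gap = [25.0] * 6 + [None] + [25.0] * 5
```

In this model `sustained` breaches the condition and `with_gap` does not, because the filled 0 breaks the run of windows above the threshold.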
25 changes: 25 additions & 0 deletions newrelic/dashboards/instanode-reliability.json
@@ -1336,6 +1336,31 @@
"ignoreTimeRange": false
}
}
},
{
"title": "Resource-count cap blocks by service+tier (6h; Task #55, inert until RESOURCE_COUNT_CAPS_ENABLED)",
"layout": {
"column": 1,
"row": 66,
"width": 6,
"height": 3
},
"visualization": {
"id": "viz.stacked-bar"
},
"rawConfiguration": {
"nrqlQueries": [
{
"accountIds": [
0
],
"query": "SELECT rate(sum(instant_resource_count_limit_blocked_total), 1 minute) FROM Metric WHERE service = 'api' FACET service, team_tier TIMESERIES SINCE 6 hours ago"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
}
]
}
1 change: 1 addition & 0 deletions observability/METRICS-CATALOG.md
@@ -65,6 +65,7 @@ fires. Operators need this so they don't panic when a fresh deploy looks
| `instant_flow_test_total` | worker | `flow,actor,tier,layer,result` | lazy (CounterVec — INERT until `FLOW_SYNTHETIC_ENABLED=true`; once on, `pass`/`degraded` materialise on the first happy tick and `fail` only on a real regression. Continuous-monitoring synthetic flow runner (`flow_synthetic.go`): every 5 min runs the P0 flow matrix (healthz / auth_me / provision→reap) against prod. The matrix dashboard FACETs this into the green/red grid, one cell per flow×actor) | `flow-test-p0-fail.json`, `flow-test-silent-death.json` | `FlowTestP0Fail`, `FlowTestSilentDeath` (instant-worker-flow-synthetic group) | "Flow matrix — latest result per flow×actor (grid)", "Flow matrix — fails by flow (1h, must be 0)" |
| `instant_flow_test_latency_seconds` | worker | `flow,actor,tier,layer` | lazy (HistogramVec — observation only on a real HTTP response; DNS/TCP errors omit it so the histogram isn't polluted with 0s timeouts. INERT until `FLOW_SYNTHETIC_ENABLED=true`) | `flow-test-latency-regression.json` | `FlowTestLatencyRegression` (instant-worker-flow-synthetic group) | "Flow matrix — P95 latency per flow (6h)" |
| `instant_flow_synthetic_reaped_total` | worker | `flow,outcome` | lazy (CounterVec — rule-24 cleanup ledger; `reaped` materialises on the first provision→reap tick, `leaked` ONLY on a failed reap (a real DO/k8s resource leak — must stay 0), `skip` when a flow created nothing. INERT until `FLOW_SYNTHETIC_ENABLED=true`) | `flow-synthetic-leak.json` | `FlowSyntheticLeak` (instant-worker-flow-synthetic group) | "Flow synthetic — leaked reaps (1h, must be 0)" |
| `instant_resource_count_limit_blocked_total` | api | `service,team_tier` | lazy (CounterVec — Task #55. INERT until `RESOURCE_COUNT_CAPS_ENABLED=true`; once on, a `{service,team_tier}` series materialises the first time a team at its per-tier count cap (postgres/vector/redis/mongodb/storage) is rejected with 402. Closes the strict-≥80%-margin hole where only queue_count was capped — Redis the binding constraint at $6.50/GB. A sustained rate after enable = tenant hammering a cap (upsell/abuse) or a too-low cap. P2.) | `resource-count-limit-blocked.json` | `ResourceCountCapBlocked` (instant-api group) | "Resource-count cap blocks by service+tier (6h; Task #55, inert until RESOURCE_COUNT_CAPS_ENABLED)" |
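
The "lazy" behaviour described for `instant_resource_count_limit_blocked_total` can be sketched with a toy labelled counter in Python. This is an illustration only, not the code in `resource_count_cap.go`: the class `LazyCounterVec`, the function `handle_provision`, its parameters, and the tier name `hobby` are all hypothetical. It shows why no series is visible while `RESOURCE_COUNT_CAPS_ENABLED` is off, and why a `{service,team_tier}` series appears only after the first rejection.

```python
from typing import Dict, Tuple

LabelSet = Tuple[str, str]  # (service, team_tier)


class LazyCounterVec:
    """Toy stand-in for a labelled counter: a series exists only after its first increment."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._series: Dict[LabelSet, float] = {}

    def inc(self, service: str, team_tier: str) -> None:
        key = (service, team_tier)
        self._series[key] = self._series.get(key, 0.0) + 1.0

    def exposed_series(self) -> Dict[LabelSet, float]:
        """What a scrape would see: only label sets that were incremented."""
        return dict(self._series)


def handle_provision(counter: LazyCounterVec, caps_enabled: bool, at_cap: bool,
                     service: str, team_tier: str) -> bool:
    """Return True when the provision is blocked (the 402 path), else False."""
    if caps_enabled and at_cap:
        counter.inc(service, team_tier)
        return True
    return False


blocked_total = LazyCounterVec("instant_resource_count_limit_blocked_total")

# Feature flag off: nothing is blocked, so no series materialises.
handle_provision(blocked_total, caps_enabled=False, at_cap=True,
                 service="redis", team_tier="hobby")
series_while_disabled = blocked_total.exposed_series()

# Feature flag on and the team is at its cap: the first block creates the series.
handle_provision(blocked_total, caps_enabled=True, at_cap=True,
                 service="redis", team_tier="hobby")
series_after_block = blocked_total.exposed_series()
```

An empty result from the alert or dashboard query is therefore the expected state before enforcement is enabled, and also afterwards until some team actually hits a cap.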

## Lazy-emit gotcha — what operators should expect
