feat(observability): make the stack production-ready by devantler · Pull Request #1604 · devantler-tech/platform

devantler · 2026-05-27T21:49:25Z

Why

The observability stack was deliberately minimal — alerting-only, no Grafana, no remote-write, single ephemeral Prometheus, in-cluster-only alerting (documented in docs/dr/alerting.md). This makes it production-ready while staying self-hosted (no SaaS metrics tier). Direction was agreed interactively; the work reverses several of those documented choices, so docs/dr/alerting.md is rewritten to match.

Done in phases; all phases are in this PR. Happy to split into separate PRs if preferred.

Phase 0 — Capacity

ksail.prod.yaml: static workers: 3 → 4. The new always-on tier (Grafana/Loki/persistent Prometheus+Alertmanager) belongs on guaranteed static capacity, not on autoscaled nodes that the cluster autoscaler can reclaim. The 4th worker auto-joins Longhorn via the uniform worker node-label; longhorn_replica_count stays 3.
Restore alertmanager_replicas 1 → 2 (was a #1585 stabilization trim; prod now has headroom per #1601).

Phase 1 — Resilient alerting

Alerts → Slack via native slack_configs (api_url_file).
Dead-man's-switch: the always-firing Watchdog is routed to a heartbeat receiver that POSTs to an external monitor every ~50s. If the cluster/alerting pipeline dies, the monitor (e.g. healthchecks.io) notifies Slack out-of-band — the one failure in-cluster alerting can't cover. URL is Flux-substituted with an invalid default, so local/CI stay quiet.
Re-enable curated defaultRules (Watchdog + self-monitoring + kubernetesApps/storage/node), disabling groups for unscraped control-plane components and the noisy overcommit alerts. platform-critical.yaml (Velero/CNPG/Flux/cert/autoscaler) is unchanged.
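As a rough illustration of the routing described in this phase, the Alertmanager configuration could look like the sketch below. The receiver names, the mounted secret path, the grouping labels, and the `.invalid` default host are assumptions made for illustration and are not taken from this PR. Only `slack_configs` with `api_url_file`, the Watchdog heartbeat route, the #platform-alerts channel, and the `alertmanager_heartbeat_url` variable come from the description.

```yaml
# Sketch of an Alertmanager config fragment (not the PR's actual file).
route:
  receiver: slack                    # hypothetical receiver name
  group_by: [alertname, namespace]   # assumed grouping
  routes:
    # The always-firing Watchdog alert goes to the heartbeat receiver only.
    - receiver: heartbeat            # hypothetical receiver name
      matchers:
        - alertname = "Watchdog"
      group_wait: 0s
      group_interval: 30s            # value set by a later commit in this PR
      repeat_interval: 50s           # the ~50s target; effective cadence also depends on group_interval
receivers:
  - name: slack
    slack_configs:
      # The webhook URL is read from a mounted file, so it never appears in the config.
      - api_url_file: /etc/alertmanager/secrets/alertmanager-slack/webhook-url  # hypothetical path
        channel: "#platform-alerts"
        send_resolved: true
  - name: heartbeat
    webhook_configs:
      # Flux post-build substitution with an invalid default keeps local/CI quiet.
      - url: ${alertmanager_heartbeat_url:=http://heartbeat.invalid}  # default host is an assumption
        send_resolved: false
```

Keeping the Watchdog on its own child route means the heartbeat cadence can be tuned without changing how real alerts are grouped or repeated.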

Phase 2 — Durability

hetzner overlay: persistent hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus memory limit → 1.5Gi (VPA manages requests only, so the limit is the real OOM guard).
Off-cluster backup: Velero's daily all-namespace backup already covers the monitoring namespace, so the new PVCs ship to R2 at a 24h RPO with no Velero change.
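The durability settings above could map onto kube-prometheus-stack values roughly as in the following sketch. The sizes, the 1.5Gi limit, and the `hcloud` storage class come from this PR; the access mode and the exact shape of the overlay patch are assumptions, since the patch itself is not shown here.

```yaml
# Sketch of kube-prometheus-stack values for the hetzner overlay (layout assumed).
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud
          accessModes: [ReadWriteOnce]   # assumed access mode
          resources:
            requests:
              storage: 20Gi
    resources:
      limits:
        memory: 1.5Gi   # VPA manages requests only, so this limit is the OOM guard
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud
          accessModes: [ReadWriteOnce]   # assumed access mode
          resources:
            requests:
              storage: 2Gi
```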

Phase 3 — Centralized logs

Loki (single-binary, 7d retention; hcloud PVC in prod, ephemeral local) + Alloy DaemonSet shipper (node-local discovery → no log duplication; tails via the API, no privileged hostPath).
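For orientation, a single-binary Loki release with 7d retention might use chart values along these lines. This is a partial sketch under assumptions: the key names follow the Loki Helm chart as commonly documented for the 6.x line, the schema configuration that the chart also requires is omitted, and none of it is copied from the PR's helm-release.yaml.

```yaml
# Partial sketch of Loki chart values (schema config omitted; not the PR's file).
deploymentMode: SingleBinary
loki:
  commonConfig:
    replication_factor: 1          # single instance, so no replication
  storage:
    type: filesystem
  limits_config:
    retention_period: 168h         # 7 days
  compactor:
    retention_enabled: true        # retention is enforced by the compactor
    delete_request_store: filesystem
singleBinary:
  replicas: 1
# The scalable components are not used in single-binary mode.
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
```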

Phase 4 — Visibility & access

Grafana enabled (anonymous Admin behind the SSO gate, Prometheus + Loki datasources, provisioned dashboards, ephemeral).
Expose Grafana / Prometheus / Alertmanager / OpenCost behind oauth2-proxy via HTTPRoutes + auth-proxy routers; extend the ReferenceGrant and the monitoring/opencost/auth-proxy CiliumNetworkPolicies.
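A minimal sketch of the cross-namespace wiring this phase implies: an HTTPRoute in the `monitoring` namespace that points at the oauth2-proxy Service, plus the ReferenceGrant that permits that reference. The Gateway name and namespace, the hostname, the oauth2-proxy namespace, Service name and port, and the resource names are all hypothetical; only the `monitoring` and `opencost` namespaces and the HTTPRoute/ReferenceGrant approach come from the PR.

```yaml
# Sketch only: names, hostname, and port below are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  parentRefs:
    - name: platform-gateway        # hypothetical Gateway
      namespace: gateway-system     # hypothetical namespace
  hostnames:
    - grafana.example.com           # hypothetical hostname
  rules:
    - backendRefs:
        # Traffic goes to oauth2-proxy first, which enforces SSO.
        - name: oauth2-proxy        # hypothetical Service name
          namespace: oauth2-proxy   # hypothetical namespace
          port: 80                  # hypothetical port
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-httproutes            # hypothetical name
  namespace: oauth2-proxy           # must live in the namespace of the referenced Service
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: monitoring
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: opencost
  to:
    - group: ""                     # core API group, i.e. Service
      kind: Service
```

Without the grant, a Gateway API implementation rejects the cross-namespace `backendRefs`, which is why extending the ReferenceGrant is part of this phase.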

Manual steps required before this fully works in prod

The agent cannot edit *.enc.yaml. After merge, set the per-cluster secrets (see docs/dr/alerting.md):

sops --set '["stringData"]["alertmanager_webhook_url"] "<slack-incoming-webhook>"' \
  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml
sops --set '["stringData"]["alertmanager_heartbeat_url"] "<healthchecks.io-ping-url>"' \
  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml

Both values come from resources that have to exist first: the Slack #platform-alerts incoming webhook and the healthchecks.io check (period ~5m, grace ~10m, Slack integration). Until the secrets are set, alerts and the heartbeat are sent to invalid URLs, so delivery fails without breaking anything else.

Validation

kubectl kustomize k8s/clusters/local/ and …/prod/ build.
ksail workload validate and ksail --config ksail.prod.yaml workload validate → 259 files validated.
Loki 6.55.0 / Alloy 1.8.2 values helm template-verified (service names/ports, RBAC for pods/log).
Full Talos+Docker system test runs in CI.

Risks

Memory: heaviest additions (Grafana/Loki/Alloy) land on the new 4th worker; CI's system test will surface scheduling pressure.
healthchecks.io is the one external dependency (in-cluster Grafana can't cover full-cluster-down). A self-hosted GitHub Actions probe is noted as an alternative in the docs.

🤖 Generated with Claude Code

Harden the per-cluster observability stack across alerting, durability, logs and access, replacing the previous alerting-only / no-Grafana / no-remote-write posture documented in docs/dr/alerting.md.

Capacity:
- ksail.prod.yaml: 4th static worker for the always-on observability tier.
- Restore alertmanager_replicas 1 -> 2 (was a stabilization trim).

Alerting:
- Route alerts to Slack via native slack_configs (api_url_file).
- Add a Watchdog dead-man's-switch: the always-firing alert is pushed to an external heartbeat monitor (Flux-substituted URL, invalid default so local/CI stay quiet) that notifies Slack if the cluster goes down.
- Re-enable curated chart defaultRules (Watchdog + self-monitoring + workload health), disabling groups for unscraped control-plane components and the two overcommit alerts; keep platform-critical.yaml.

Durability (hetzner overlay):
- Persistent hcloud PVCs for Prometheus (20Gi) and Alertmanager (2Gi); raise the Prometheus memory limit to 1.5Gi. Velero's daily all-namespace backup already ships the monitoring namespace to R2 (24h RPO).

Logs:
- Add Loki (single-binary, 7d retention; hcloud PVC in prod) and Alloy (DaemonSet log shipper, node-local discovery to avoid duplication).

Visibility & access:
- Enable Grafana (anonymous Admin behind SSO, Prometheus + Loki datasources, provisioned dashboards).
- Expose Grafana/Prometheus/Alertmanager/OpenCost behind oauth2-proxy via HTTPRoutes + auth-proxy routers; extend the ReferenceGrant and the monitoring/opencost/auth-proxy CiliumNetworkPolicies.

Docs:
- Rewrite docs/dr/alerting.md for the new architecture, incl. on-call and the manual SOPS steps for the Slack webhook + heartbeat URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Converts the previously alerting-only observability stack into a production-ready self-hosted tier: adds Grafana, Loki + Alloy, persistent storage for Prometheus/Alertmanager/Loki on Hetzner, Slack alerts via native slack_configs, an external dead-man's-switch heartbeat (Watchdog → healthchecks.io), oauth2-proxy SSO HTTPRoutes for Grafana/Prometheus/Alertmanager/OpenCost, capacity bump to 4 static workers, and a full rewrite of docs/dr/alerting.md. Reverses several deliberately-minimal choices documented earlier.

Changes:

Phase 0/1 — capacity (workers: 3 → 4, alertmanager_replicas 1 → 2) and resilient alerting (Slack slack_configs, Watchdog heartbeat route, curated defaultRules).
Phase 2/3 — durable storage for Prometheus/Alertmanager/Loki via hetzner overlay PVCs; new Loki single-binary + Alloy DaemonSet log pipeline.
Phase 4 — Grafana enabled (anonymous Admin behind SSO, Loki datasource); HTTPRoutes + auth-proxy router entries + ReferenceGrant + NetworkPolicy expansions for Grafana/Prometheus/Alertmanager/OpenCost.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| ksail.prod.yaml | Bumps static workers 3→4 to host always-on observability tier. |
| k8s/clusters/prod/variables/variables-cluster-config-map.yaml | Restores `alertmanager_replicas` to `2`. |
| k8s/bases/infrastructure/external-secrets/external-secrets.yaml | Comment clarifying split between webhook URL (ESO) and heartbeat URL (Flux substitution). |
| k8s/bases/infrastructure/controllers/kustomization.yaml | Registers new `alloy/` and `loki/` bases. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/helm-release.yaml | Enables Grafana, switches receivers to Slack + heartbeat, enables curated `defaultRules`. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/httproute.yaml | New HTTPRoutes for Grafana/Prometheus/Alertmanager via oauth2-proxy. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/kustomization.yaml | Includes new `httproute.yaml`. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/networkpolicy.yaml | Allows oauth2-proxy ingress on 3000/9090/9093. |
| k8s/bases/infrastructure/controllers/loki/{helm-release,helm-repository,kustomization}.yaml | New Loki single-binary release (7d retention, ServiceMonitor on). |
| k8s/bases/infrastructure/controllers/alloy/{helm-release,kustomization}.yaml | New Alloy DaemonSet with node-local pod-log discovery, push to Loki. |
| k8s/bases/infrastructure/controllers/opencost/{httproute,kustomization,networkpolicy}.yaml | Exposes OpenCost UI via SSO and allows oauth2-proxy ingress. |
| k8s/bases/infrastructure/controllers/oauth2-proxy/reference-grant.yaml | Extends grant to `monitoring` and `opencost` HTTPRoutes. |
| k8s/bases/infrastructure/controllers/auth-proxy/config-map.yaml | Adds Traefik routers/services for grafana/prometheus/alertmanager/opencost. |
| k8s/bases/infrastructure/controllers/auth-proxy/networkpolicy.yaml | Adds egress from auth-proxy to monitoring/opencost upstreams. |
| k8s/providers/hetzner/infrastructure/controllers/kustomization.yaml | Wires in new kube-prometheus-stack and loki patches. |
| k8s/providers/hetzner/infrastructure/controllers/kube-prometheus-stack/patches/helm-release-patch.yaml | hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus mem limit 1.5Gi. |
| k8s/providers/hetzner/infrastructure/controllers/loki/patches/helm-release-patch.yaml | hcloud 10Gi PVC for Loki. |
| docs/dr/alerting.md | Rewrites docs to match the new production-ready posture. |

Replace the REPLACE_ME placeholder in alertmanager_webhook_url with a real Slack #platform-alerts incoming webhook so prod alerting actually delivers (it never did under the prior Discord-aspirational design). Also set alertmanager_heartbeat_url to the healthchecks.io ping URL that backs the Watchdog dead-man's-switch. The check's Slack integration plus its period/grace are configured on the healthchecks.io side. Both values are SOPS-encrypted in place; no other keys in the secret are touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

…ski-8cb4ed # Conflicts: # k8s/bases/infrastructure/controllers/opencost/networkpolicy.yaml

…epeat_interval)

Alertmanager re-notifies only when both group_interval AND repeat_interval have elapsed, so group_interval is a floor on the effective cadence. With group_interval: 1m + repeat_interval: 50s the heartbeat was silently throttled to >=60s, contradicting the stated ~50s target. Drop group_interval to 30s so the 50s cadence actually applies. Functionally moot against a healthchecks.io 5m period / 10m grace, but makes the configuration match the comment / docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated no new comments.

`singleBinary.persistence.enabled: false` is a Loki-chart footgun: it means *no* volume at the configured `path_prefix: /var/loki`, not an emptyDir. With the chart's default `readOnlyRootFilesystem: true` the loki container then crashes writing chunks/WAL, the StatefulSet stalls, and Flux marks the HelmRelease InstallFailed -- exactly what the system test saw on this PR.

Switch the base to `persistence.enabled: true` with a small 5 GiB PVC against the cluster's default storage class (local-path on the docker provider, mirroring how OpenBao runs locally). The hetzner overlay patch already overrides this to `storageClass: hcloud` + 10 GiB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
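The change described in this commit could be sketched as the two value fragments below: the base values first, then the hetzner overlay. The sizes and the `hcloud` storage class come from the commit message; how the fragments are split across files and patched together is an assumption.

```yaml
# Base values (sketch): a small PVC on the cluster's default storage class.
singleBinary:
  persistence:
    enabled: true
    size: 5Gi
    # storageClass is left unset so the default class is used
    # (local-path on the docker provider, per the commit message).
---
# Hetzner overlay values (sketch): override class and size for prod.
singleBinary:
  persistence:
    storageClass: hcloud
    size: 10Gi
```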

…ski-8cb4ed # Conflicts: # ksail.prod.yaml

Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 27, 2026 21:49

github-project-automation Bot added this to 🌊 Project Board May 27, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 27, 2026

Copilot started reviewing on behalf of devantler May 27, 2026 21:49

Copilot AI reviewed May 27, 2026

devantler marked this pull request as ready for review May 28, 2026 05:56

Copilot AI review requested due to automatic review settings May 28, 2026 05:56

Copilot started reviewing on behalf of devantler May 28, 2026 05:56

Copilot AI reviewed May 28, 2026

Comment thread k8s/bases/infrastructure/controllers/kube-prometheus-stack/helm-release.yaml Outdated

Merge remote-tracking branch 'origin/main' into claude/musing-kowalev…

0069bfb

…ski-8cb4ed # Conflicts: # k8s/bases/infrastructure/controllers/opencost/networkpolicy.yaml

botantler Bot approved these changes May 28, 2026

botantler Bot enabled auto-merge May 28, 2026 05:59

devantler had a problem deploying to ci May 28, 2026 05:59 — with GitHub Actions Error

Copilot AI review requested due to automatic review settings May 28, 2026 06:01

Copilot started reviewing on behalf of devantler May 28, 2026 06:02

botantler Bot approved these changes May 28, 2026

devantler had a problem deploying to ci May 28, 2026 06:02 — with GitHub Actions Failure

Copilot AI reviewed May 28, 2026

botantler Bot approved these changes May 28, 2026

devantler temporarily deployed to ci May 28, 2026 07:06 — with GitHub Actions Inactive

botantler Bot added this pull request to the merge queue May 28, 2026