feat(observability): make the stack production-ready #1604

Open
devantler wants to merge 6 commits into main from claude/musing-kowalevski-8cb4ed

@devantler

Why

The observability stack was deliberately minimal: alerting-only, no Grafana, no remote-write, a single ephemeral Prometheus, and in-cluster-only alerting (documented in docs/dr/alerting.md). This PR makes it production-ready while staying self-hosted (no SaaS metrics tier). The direction was agreed interactively; the work reverses several of those documented choices, so docs/dr/alerting.md is rewritten to match.

Done in phases; all phases are in this PR. Happy to split into separate PRs if preferred.

Phase 0 — Capacity

  • ksail.prod.yaml: static workers: 3 → 4. The new always-on tier (Grafana/Loki/persistent Prometheus+Alertmanager) belongs on guaranteed static capacity, not on autoscaled nodes that the autoscaler can reclaim. The 4th worker auto-joins Longhorn via the uniform worker node-label; longhorn_replica_count stays 3.
  • Restore alertmanager_replicas 1 → 2 (was a #1585 stabilization trim; prod now has headroom per #1601).

Phase 1 — Resilient alerting

  • Alerts → Slack via native slack_configs (api_url_file).
  • Dead-man's-switch: the always-firing Watchdog is routed to a heartbeat receiver that POSTs to an external monitor every ~50s. If the cluster/alerting pipeline dies, the monitor (e.g. healthchecks.io) notifies Slack out-of-band — the one failure in-cluster alerting can't cover. URL is Flux-substituted with an invalid default, so local/CI stay quiet.
  • Re-enable curated defaultRules (Watchdog + self-monitoring + kubernetesApps/storage/node), disabling groups for unscraped control-plane components and the noisy overcommit alerts. platform-critical.yaml (Velero/CNPG/Flux/cert/autoscaler) is unchanged.
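
As a rough illustration of the routing described in Phase 1, the fragment below sketches how the `alertmanager.config` values of kube-prometheus-stack might look. The receiver names, the Secret name and its mounted path are assumptions for illustration, not the values in this PR. The `${...}` placeholder stands for the Flux-substituted heartbeat URL; the form of its invalid default is not shown here.

```yaml
# Sketch only: Helm values for kube-prometheus-stack (names and paths are assumed).
alertmanager:
  alertmanagerSpec:
    # Mounts the Secret under /etc/alertmanager/secrets/alertmanager-slack/
    # ("alertmanager-slack" is a hypothetical Secret name).
    secrets:
      - alertmanager-slack
  config:
    route:
      receiver: slack
      routes:
        # The always-firing Watchdog alert doubles as the heartbeat.
        - receiver: heartbeat
          matchers:
            - alertname = "Watchdog"
          group_wait: 0s
          group_interval: 30s   # value named in a later commit on this PR
          repeat_interval: 50s  # target cadence stated in the description
    receivers:
      - name: slack
        slack_configs:
          # The webhook URL is read from a file, so it never appears in the config.
          - api_url_file: /etc/alertmanager/secrets/alertmanager-slack/webhook-url
            channel: "#platform-alerts"
            send_resolved: true
      - name: heartbeat
        webhook_configs:
          # Replaced by Flux post-build substitution with the external monitor's ping URL.
          - url: "${alertmanager_heartbeat_url}"
            send_resolved: false
```

Keeping the heartbeat on its own route means its short intervals do not affect how ordinary alerts are grouped and repeated.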

Phase 2 — Durability

  • hetzner overlay: persistent hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus memory limit → 1.5Gi (VPA manages requests only, so the limit is the real OOM guard).
  • Off-cluster backup: Velero's daily all-namespace backup already covers monitoring, so the new PVCs ship to R2 at a 24h RPO with no Velero change.
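
For orientation, the persistence settings above could be expressed roughly as the following kube-prometheus-stack values. This is a sketch, not the actual patch: the access mode is an assumption, and the storage class name `hcloud` is taken from a later commit message on this PR.

```yaml
# Sketch only: hetzner overlay values for kube-prometheus-stack.
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud
          accessModes: ["ReadWriteOnce"]   # assumed
          resources:
            requests:
              storage: 20Gi
    resources:
      limits:
        # 1.5Gi; VPA manages requests only, so this limit is the OOM guard.
        memory: 1536Mi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud
          accessModes: ["ReadWriteOnce"]   # assumed
          resources:
            requests:
              storage: 2Gi
```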

Phase 3 — Centralized logs

  • Loki (single-binary, 7d retention; hcloud PVC in prod, ephemeral local) + Alloy DaemonSet shipper (node-local discovery → no log duplication; tails via the API, no privileged hostPath).
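
A minimal sketch of what a single-binary Loki with 7-day retention might look like in the Loki Helm chart's values. It is illustrative rather than a copy of this PR's release: schema configuration and persistence are omitted, and the filesystem storage type is an assumption.

```yaml
# Sketch only: Loki Helm chart values for a single-binary install.
deploymentMode: SingleBinary
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1    # one instance, so nothing to replicate to
  storage:
    type: filesystem         # assumed; chunks live on the local volume
  compactor:
    retention_enabled: true  # retention is enforced by the compactor
    delete_request_store: filesystem
  limits_config:
    retention_period: 168h   # 7d
singleBinary:
  replicas: 1
# Zero out the scalable-mode components so only the single binary runs.
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
```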

Phase 4 — Visibility & access

  • Grafana enabled (anonymous Admin behind the SSO gate, Prometheus + Loki datasources, provisioned dashboards, ephemeral).
  • Expose Grafana / Prometheus / Alertmanager / OpenCost behind oauth2-proxy via HTTPRoutes + auth-proxy routers; extend the ReferenceGrant and the monitoring/opencost/auth-proxy CiliumNetworkPolicies.
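
The HTTPRoute and ReferenceGrant relationship can be sketched as follows. All names, namespaces, the hostname and the port are placeholders chosen for illustration; only the resource kinds and the cross-namespace pattern come from the description above.

```yaml
# Sketch only: names, namespaces, hostname and port are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  parentRefs:
    - name: gateway              # hypothetical Gateway
      namespace: gateway
  hostnames:
    - grafana.example.com
  rules:
    - backendRefs:
        # Traffic goes to the proxy, not to Grafana directly.
        - name: oauth2-proxy
          namespace: oauth2-proxy
          port: 80
---
# Lives in the backend's namespace and permits the cross-namespace backendRef above.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-monitoring-httproutes
  namespace: oauth2-proxy
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: monitoring
  to:
    - group: ""
      kind: Service
```

Without the grant, a Gateway API implementation is expected to reject a backendRef that points into another namespace, which is why the grant has to be extended whenever a new namespace gains a route.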

Manual steps required before this fully works in prod

The agent cannot edit *.enc.yaml. After merge, set the per-cluster secrets (see docs/dr/alerting.md):

sops --set '["stringData"]["alertmanager_webhook_url"] "<slack-incoming-webhook>"' \
  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml
sops --set '["stringData"]["alertmanager_heartbeat_url"] "<healthchecks.io-ping-url>"' \
  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml

Then create the Slack #platform-alerts incoming webhook and the healthchecks.io check (period ~5m, grace ~10m, Slack integration). Until set, alerts/heartbeat degrade gracefully to invalid URLs (no breakage).

Validation

  • kubectl kustomize k8s/clusters/local/ and …/prod/ build.
  • ksail workload validate and ksail --config ksail.prod.yaml workload validate → 259 files validated.
  • Loki 6.55.0 / Alloy 1.8.2 values helm template-verified (service names/ports, RBAC for pods/log).
  • Full Talos+Docker system test runs in CI.

Risks

  • Memory: heaviest additions (Grafana/Loki/Alloy) land on the new 4th worker; CI's system test will surface scheduling pressure.
  • healthchecks.io is the one external dependency (in-cluster Grafana can't cover full-cluster-down). A self-hosted GitHub Actions probe is noted as an alternative in the docs.

🤖 Generated with Claude Code

Harden the per-cluster observability stack across alerting, durability,
logs and access, replacing the previous alerting-only / no-Grafana /
no-remote-write posture documented in docs/dr/alerting.md.

Capacity:
- ksail.prod.yaml: 4th static worker for the always-on observability tier.
- Restore alertmanager_replicas 1 -> 2 (was a stabilization trim).

Alerting:
- Route alerts to Slack via native slack_configs (api_url_file).
- Add a Watchdog dead-man's-switch: the always-firing alert is pushed to
  an external heartbeat monitor (Flux-substituted URL, invalid default so
  local/CI stay quiet) that notifies Slack if the cluster goes down.
- Re-enable curated chart defaultRules (Watchdog + self-monitoring +
  workload health), disabling groups for unscraped control-plane
  components and the two overcommit alerts; keep platform-critical.yaml.

Durability (hetzner overlay):
- Persistent hcloud PVCs for Prometheus (20Gi) and Alertmanager (2Gi);
  raise the Prometheus memory limit to 1.5Gi. Velero's daily all-namespace
  backup already ships the monitoring namespace to R2 (24h RPO).

Logs:
- Add Loki (single-binary, 7d retention; hcloud PVC in prod) and Alloy
  (DaemonSet log shipper, node-local discovery to avoid duplication).

Visibility & access:
- Enable Grafana (anonymous Admin behind SSO, Prometheus + Loki
  datasources, provisioned dashboards).
- Expose Grafana/Prometheus/Alertmanager/OpenCost behind oauth2-proxy via
  HTTPRoutes + auth-proxy routers; extend the ReferenceGrant and the
  monitoring/opencost/auth-proxy CiliumNetworkPolicies.
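
For readers unfamiliar with the policy shape, an ingress rule of the kind described for the monitoring namespace might look like the sketch below. The selector labels and namespace names are assumptions; the ports 3000/9090/9093 are the ones listed for this PR's monitoring network policy.

```yaml
# Sketch only: labels and namespaces are assumptions.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-oauth2-proxy-ingress
  namespace: monitoring
spec:
  endpointSelector: {}           # every pod in the monitoring namespace
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: oauth2-proxy
            app.kubernetes.io/name: oauth2-proxy
      toPorts:
        - ports:
            - port: "3000"       # Grafana
              protocol: TCP
            - port: "9090"       # Prometheus
              protocol: TCP
            - port: "9093"       # Alertmanager
              protocol: TCP
```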

Docs:
- Rewrite docs/dr/alerting.md for the new architecture, incl. on-call and
  the manual SOPS steps for the Slack webhook + heartbeat URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 21:49

Copilot AI left a comment

Pull request overview

Converts the previously alerting-only observability stack into a production-ready self-hosted tier: adds Grafana, Loki + Alloy, persistent storage for Prometheus/Alertmanager/Loki on Hetzner, Slack alerts via native slack_configs, an external dead-man's-switch heartbeat (Watchdog → healthchecks.io), oauth2-proxy SSO HTTPRoutes for Grafana/Prometheus/Alertmanager/OpenCost, capacity bump to 4 static workers, and a full rewrite of docs/dr/alerting.md. Reverses several deliberately-minimal choices documented earlier.

Changes:

  • Phase 0/1 — capacity (workers: 3 → 4, alertmanager_replicas 1 → 2) and resilient alerting (Slack slack_configs, Watchdog heartbeat route, curated defaultRules).
  • Phase 2/3 — durable storage for Prometheus/Alertmanager/Loki via hetzner overlay PVCs; new Loki single-binary + Alloy DaemonSet log pipeline.
  • Phase 4 — Grafana enabled (anonymous Admin behind SSO, Loki datasource); HTTPRoutes + auth-proxy router entries + ReferenceGrant + NetworkPolicy expansions for Grafana/Prometheus/Alertmanager/OpenCost.
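
As an illustration of the Grafana settings summarized above, the values might resemble this sketch. The Loki URL is an assumed in-cluster address, not one confirmed by this PR.

```yaml
# Sketch only: kube-prometheus-stack values for the bundled Grafana.
grafana:
  enabled: true
  grafana.ini:
    auth.anonymous:
      enabled: true
      org_role: Admin   # acceptable only because SSO sits in front of Grafana
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki.monitoring.svc.cluster.local:3100   # assumed service address
```

Anonymous Admin removes a second login behind the SSO gate, at the cost that anyone who passes the gate can edit dashboards.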

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| ksail.prod.yaml | Bumps static workers 3→4 to host always-on observability tier. |
| k8s/clusters/prod/variables/variables-cluster-config-map.yaml | Restores alertmanager_replicas to 2. |
| k8s/bases/infrastructure/external-secrets/external-secrets.yaml | Comment clarifying split between webhook URL (ESO) and heartbeat URL (Flux substitution). |
| k8s/bases/infrastructure/controllers/kustomization.yaml | Registers new alloy/ and loki/ bases. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/helm-release.yaml | Enables Grafana, switches receivers to Slack + heartbeat, enables curated defaultRules. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/httproute.yaml | New HTTPRoutes for Grafana/Prometheus/Alertmanager via oauth2-proxy. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/kustomization.yaml | Includes new httproute.yaml. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/networkpolicy.yaml | Allows oauth2-proxy ingress on 3000/9090/9093. |
| k8s/bases/infrastructure/controllers/loki/{helm-release,helm-repository,kustomization}.yaml | New Loki single-binary release (7d retention, ServiceMonitor on). |
| k8s/bases/infrastructure/controllers/alloy/{helm-release,kustomization}.yaml | New Alloy DaemonSet with node-local pod-log discovery, push to Loki. |
| k8s/bases/infrastructure/controllers/opencost/{httproute,kustomization,networkpolicy}.yaml | Exposes OpenCost UI via SSO and allows oauth2-proxy ingress. |
| k8s/bases/infrastructure/controllers/oauth2-proxy/reference-grant.yaml | Extends grant to monitoring and opencost HTTPRoutes. |
| k8s/bases/infrastructure/controllers/auth-proxy/config-map.yaml | Adds Traefik routers/services for grafana/prometheus/alertmanager/opencost. |
| k8s/bases/infrastructure/controllers/auth-proxy/networkpolicy.yaml | Adds egress from auth-proxy to monitoring/opencost upstreams. |
| k8s/providers/hetzner/infrastructure/controllers/kustomization.yaml | Wires in new kube-prometheus-stack and loki patches. |
| k8s/providers/hetzner/infrastructure/controllers/kube-prometheus-stack/patches/helm-release-patch.yaml | hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus mem limit 1.5Gi. |
| k8s/providers/hetzner/infrastructure/controllers/loki/patches/helm-release-patch.yaml | hcloud 10Gi PVC for Loki. |
| docs/dr/alerting.md | Rewrites docs to match the new production-ready posture. |

Replace the REPLACE_ME placeholder in alertmanager_webhook_url with a
real Slack #platform-alerts incoming webhook so prod alerting actually
delivers (it never did under the prior Discord-aspirational design).

Also set alertmanager_heartbeat_url to the healthchecks.io ping URL that
backs the Watchdog dead-man's-switch. The check's Slack integration plus
its period/grace are configured on the healthchecks.io side.

Both values are SOPS-encrypted in place; no other keys in the secret are
touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@devantler devantler marked this pull request as ready for review May 28, 2026 05:56
Copilot AI review requested due to automatic review settings May 28, 2026 05:56

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

Comment thread k8s/bases/infrastructure/controllers/kube-prometheus-stack/helm-release.yaml Outdated
…ski-8cb4ed

# Conflicts:
#	k8s/bases/infrastructure/controllers/opencost/networkpolicy.yaml
…epeat_interval)

Alertmanager re-notifies only when both group_interval AND repeat_interval
have elapsed, so group_interval is a floor on the effective cadence. With
group_interval: 1m + repeat_interval: 50s the heartbeat was silently
throttled to >=60s, contradicting the stated ~50s target. Drop
group_interval to 30s so the 50s cadence actually applies.

Functionally moot against a healthchecks.io 5m period / 10m grace, but
makes the configuration match the comment / docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 06:01

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated no new comments.

`singleBinary.persistence.enabled: false` is a Loki-chart footgun: it
means *no* volume at the configured `path_prefix: /var/loki`, not an
emptyDir. With the chart's default `readOnlyRootFilesystem: true` the
loki container then crashes writing chunks/WAL, the StatefulSet stalls,
and Flux marks the HelmRelease InstallFailed -- exactly what the system
test saw on this PR.

Switch the base to `persistence.enabled: true` with a small 5 GiB PVC
against the cluster's default storage class (local-path on the docker
provider, mirroring how OpenBao runs locally). The hetzner overlay
patch already overrides this to `storageClass: hcloud` + 10 GiB.
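
The base-level change described above amounts to something like the following Loki chart values. This is a sketch under the assumption that the storage class is simply left unset in the base.

```yaml
# Sketch only: base Loki values after the fix.
singleBinary:
  persistence:
    enabled: true   # false would leave /var/loki with no volume at all
    size: 5Gi
    # storageClass is left unset so the cluster default applies
    # (local-path on the docker provider); the hetzner overlay
    # overrides it with storageClass: hcloud and size: 10Gi.
```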

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@botantler botantler Bot added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@devantler devantler added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@devantler re-added this pull request to the merge queue 10 more times on May 28, 2026; each time @github-merge-queue removed it again due to failed status checks.
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
Copilot AI review requested due to automatic review settings May 29, 2026 15:39
@botantler botantler Bot enabled auto-merge May 29, 2026 15:39

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated no new comments.

@botantler botantler Bot added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026