From 177d0936fbb72cc954555f5caa528546601deb81 Mon Sep 17 00:00:00 2001 From: Nikolai Emil Damm Date: Fri, 29 May 2026 08:06:46 +0200 Subject: [PATCH] fix(openbao): switch readinessProbe to HTTP /sys/health to unblock bootstrap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Root cause of the System Test "Job/openbao/vault-config status: 'InProgress' (HealthCheckFailed)" flake (PR #1636 run 26603473269): The openbao-helm 0.28.3 chart's default readinessProbe is `exec: bao status -tls-skip-verify` (server-statefulset.yaml:157-178), which returns exit code 2 on a sealed server. On a fresh cluster, the StatefulSet pod therefore stays NotReady until something unseals it. That "something" is the vault-config Job, which lives in the downstream 'infrastructure' Flux Kustomization and is gated on 'infrastructure-controllers' (which contains this HelmRelease) becoming Ready first. Flux's HelmController uses --wait by default (install.disableWait: false), so the HelmRelease cannot converge while the pod is NotReady; install.remediation.retries: -1 then drives an endless install -> wait timeout -> uninstall -> reinstall churn for the full bootstrap window. Bootstrap only escapes via a fragile race between Flux retries and the Job pod eventually catching a transient window where the OpenBao server is listening — historically 20-40 min, as the Job's backoffLimit=30 comment notes. Setting readinessProbe.path makes the chart template render the httpGet branch instead of the exec branch: {{- if .Values.server.readinessProbe.path }} httpGet: path: {{ .Values.server.readinessProbe.path | quote }} port: {{ .Values.server.readinessProbe.port }} scheme: {{ include "openbao.scheme" . | upper }} With sealedcode=204 and uninitcode=204, the /sys/health endpoint returns HTTP 204 even on a sealed-and-uninitialized server, so the Pod reports Ready as soon as the listener is up. The HelmRelease then converges Ready on first install, infrastructure-controllers becomes Ready, the infrastructure layer runs, and the vault-config Job completes in ~1-2 min instead of waiting 20-40 min for the deadlock to self-resolve. Scheme handling: 'openbao.scheme' returns 'http' when global.tlsDisable: true (chart default; matches our 'tls_disable = 1' listener), so the probe stays HTTP — no TLS plumbing required. The chart's livenessProbe defaults to enabled: false, so no parallel liveness fix is needed. This is the same pattern HashiCorp's official Vault Helm chart uses for the same reason (see vault-helm/values.yaml: readinessProbe.path defaults to '/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204'). Co-Authored-By: Claude Opus 4.7 (1M context) --- .../controllers/openbao/helm-release.yaml | 20 +++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/k8s/bases/infrastructure/controllers/openbao/helm-release.yaml b/k8s/bases/infrastructure/controllers/openbao/helm-release.yaml index b9658d150..76934e72c 100644 --- a/k8s/bases/infrastructure/controllers/openbao/helm-release.yaml +++ b/k8s/bases/infrastructure/controllers/openbao/helm-release.yaml @@ -27,6 +27,26 @@ spec: server: enabled: true replicas: ${openbao_replicas:=1} + # The chart's default readinessProbe is `exec: bao status`, which + # returns exit code 2 on a sealed server -- so the Pod stays NotReady + # until something unseals it. On a fresh cluster that "something" is + # the vault-config Job in the downstream `infrastructure` Flux layer, + # which can't run until this HelmRelease (in `infrastructure-controllers`) + # is Ready. The install therefore deadlocks: Helm waits up to `timeout`, + # `install.remediation.retries: -1` uninstalls + reinstalls, repeat -- + # cold-cluster bootstrap historically took 20-40 min waiting for that + # race to resolve (see vault-config Job comment + PR #1636 system-test + # failure). + # + # Switching to the HTTP health probe with `sealedcode=204` and + # `uninitcode=204` makes a sealed-but-running server Ready immediately. + # The chart template renders the httpGet branch as soon as `path` is + # set (server-statefulset.yaml: `{{- if .Values.server.readinessProbe.path }}`) + # and `openbao.scheme` returns "http" when `global.tlsDisable: true` + # (chart default; matches the `tls_disable = 1` listener below). + # The HashiCorp Vault chart uses the same pattern for the same reason. + readinessProbe: + path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204" # Mount the unseal key Secret (created by the vault-config Job on first # init, restored by Velero on cluster rebuild) so the postStart hook # can auto-unseal the server after every pod restart.