From 177d0936fbb72cc954555f5caa528546601deb81 Mon Sep 17 00:00:00 2001
From: Nikolai Emil Damm <nikolaiemildamm@icloud.com>
Date: Fri, 29 May 2026 08:06:46 +0200
Subject: [PATCH] fix(openbao): switch readinessProbe to HTTP /sys/health to
 unblock bootstrap
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Root cause of the System Test "Job/openbao/vault-config status:
'InProgress' (HealthCheckFailed)" flake (PR #1636 run 26603473269):

The openbao-helm 0.28.3 chart's default readinessProbe is
`exec: bao status -tls-skip-verify` (server-statefulset.yaml:157-178),
which returns exit code 2 on a sealed server. On a fresh cluster, the
StatefulSet pod therefore stays NotReady until something unseals it.
That "something" is the vault-config Job, which lives in the
downstream 'infrastructure' Flux Kustomization and is gated on
'infrastructure-controllers' (which contains this HelmRelease) becoming
Ready first. Flux's HelmController uses --wait by default
(install.disableWait: false), so the HelmRelease cannot converge while
the pod is NotReady; install.remediation.retries: -1 then drives an
endless install -> wait timeout -> uninstall -> reinstall churn for the
full bootstrap window. Bootstrap only escapes via a fragile race
between Flux retries and the Job pod eventually catching a transient
window where the OpenBao server is listening — historically 20-40 min,
as the Job's backoffLimit=30 comment notes.

Setting readinessProbe.path makes the chart template render the
httpGet branch instead of the exec branch:

  {{- if .Values.server.readinessProbe.path }}
  httpGet:
    path: {{ .Values.server.readinessProbe.path | quote }}
    port: {{ .Values.server.readinessProbe.port }}
    scheme: {{ include "openbao.scheme" . | upper }}

With sealedcode=204 and uninitcode=204, the /sys/health endpoint
returns HTTP 204 even on a sealed-and-uninitialized server, so the
Pod reports Ready as soon as the listener is up. The HelmRelease then
converges Ready on first install, infrastructure-controllers becomes
Ready, the infrastructure layer runs, and the vault-config Job
completes in ~1-2 min instead of waiting 20-40 min for the deadlock to
self-resolve.

Scheme handling: 'openbao.scheme' returns 'http' when
global.tlsDisable: true (chart default; matches our 'tls_disable = 1'
listener), so the probe stays HTTP — no TLS plumbing required. The
chart's livenessProbe defaults to enabled: false, so no parallel
liveness fix is needed.

This is the same pattern HashiCorp's official Vault Helm chart uses
for the same reason
(see vault-helm/values.yaml: readinessProbe.path defaults to
'/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../controllers/openbao/helm-release.yaml     | 20 +++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/k8s/bases/infrastructure/controllers/openbao/helm-release.yaml b/k8s/bases/infrastructure/controllers/openbao/helm-release.yaml
index b9658d150..76934e72c 100644
--- a/k8s/bases/infrastructure/controllers/openbao/helm-release.yaml
+++ b/k8s/bases/infrastructure/controllers/openbao/helm-release.yaml
@@ -27,6 +27,26 @@ spec:
     server:
       enabled: true
       replicas: ${openbao_replicas:=1}
+      # The chart's default readinessProbe is `exec: bao status`, which
+      # returns exit code 2 on a sealed server -- so the Pod stays NotReady
+      # until something unseals it. On a fresh cluster that "something" is
+      # the vault-config Job in the downstream `infrastructure` Flux layer,
+      # which can't run until this HelmRelease (in `infrastructure-controllers`)
+      # is Ready. The install therefore deadlocks: Helm waits up to `timeout`,
+      # `install.remediation.retries: -1` uninstalls + reinstalls, repeat --
+      # cold-cluster bootstrap historically took 20-40 min waiting for that
+      # race to resolve (see vault-config Job comment + PR #1636 system-test
+      # failure).
+      #
+      # Switching to the HTTP health probe with `sealedcode=204` and
+      # `uninitcode=204` makes a sealed-but-running server Ready immediately.
+      # The chart template renders the httpGet branch as soon as `path` is
+      # set (server-statefulset.yaml: `{{- if .Values.server.readinessProbe.path }}`)
+      # and `openbao.scheme` returns "http" when `global.tlsDisable: true`
+      # (chart default; matches the `tls_disable = 1` listener below).
+      # The HashiCorp Vault chart uses the same pattern for the same reason.
+      readinessProbe:
+        path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
       # Mount the unseal key Secret (created by the vault-config Job on first
       # init, restored by Velero on cluster rebuild) so the postStart hook
       # can auto-unseal the server after every pod restart.