fix(openbao): switch readinessProbe to HTTP /sys/health to unblock bootstrap#1656
Open
devantler wants to merge 1 commit into
…otstrap

Root cause of the System Test "Job/openbao/vault-config status: 'InProgress' (HealthCheckFailed)" flake (PR #1636 run 26603473269):

The openbao-helm 0.28.3 chart's default readinessProbe is `exec: bao status -tls-skip-verify` (server-statefulset.yaml:157-178), which returns exit code 2 on a sealed server. On a fresh cluster, the StatefulSet pod therefore stays NotReady until something unseals it. That "something" is the vault-config Job, which lives in the downstream 'infrastructure' Flux Kustomization and is gated on 'infrastructure-controllers' (which contains this HelmRelease) becoming Ready first.

Flux's HelmController uses --wait by default (install.disableWait: false), so the HelmRelease cannot converge while the pod is NotReady; install.remediation.retries: -1 then drives an endless install -> wait timeout -> uninstall -> reinstall churn for the full bootstrap window. Bootstrap only escapes via a fragile race between Flux retries and the Job pod eventually catching a transient window where the OpenBao server is listening (historically 20-40 min, as the Job's backoffLimit=30 comment notes).

Setting readinessProbe.path makes the chart template render the httpGet branch instead of the exec branch:

    {{- if .Values.server.readinessProbe.path }}
    httpGet:
      path: {{ .Values.server.readinessProbe.path | quote }}
      port: {{ .Values.server.readinessProbe.port }}
      scheme: {{ include "openbao.scheme" . | upper }}

With sealedcode=204 and uninitcode=204, the /sys/health endpoint returns HTTP 204 even on a sealed-and-uninitialized server, so the Pod reports Ready as soon as the listener is up. The HelmRelease then converges Ready on first install, infrastructure-controllers becomes Ready, the infrastructure layer runs, and the vault-config Job completes in ~1-2 min instead of waiting 20-40 min for the deadlock to self-resolve.

Scheme handling: 'openbao.scheme' returns 'http' when global.tlsDisable: true (chart default; matches our 'tls_disable = 1' listener), so the probe stays HTTP and no TLS plumbing is required. The chart's livenessProbe defaults to enabled: false, so no parallel liveness fix is needed.

This is the same pattern HashiCorp's official Vault Helm chart uses for the same reason (see vault-helm/values.yaml: readinessProbe.path defaults to '/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR updates the OpenBao HelmRelease values to avoid a bootstrap deadlock caused by the chart’s default exec-based readiness probe failing while the server is sealed/uninitialized. By switching readiness to an HTTP /v1/sys/health endpoint that returns a 2xx/204 during sealed/uninitialized states, Flux/Helm can mark the release ready and allow the downstream vault-config Job to run promptly.
Changes:
- Override `server.readinessProbe` to use an HTTP health endpoint (`/v1/sys/health`) with `sealedcode=204` and `uninitcode=204`.
- Add in-file rationale documenting the Flux dependency deadlock this prevents.
The bug
System Test on PR #1636 (run 26603473269) failed with `Job/openbao/vault-config status: 'InProgress' (HealthCheckFailed)`.

PR #1636 only modified homepage/headlamp/actual-budget HelmReleases, so the failure was unrelated to its diff. The `Job/openbao/vault-config` regularly takes a long time to bootstrap on fresh clusters; the `vault-config` Job's own comment claims "openbao bootstrap can take 20-40 min on cold CI runners" and sets `backoffLimit: 30, activeDeadlineSeconds: 3600` accordingly.

My first attempt at fixing this was a per-cluster timeout patch (PR #1648, now closed) that bumped the local `infrastructure` Flux Kustomization timeout to 20m. Reviewer push-back was correct: 30 min is extreme, not a normal cold-start time; this is a symptom, not the disease. This PR fixes the disease.

Root cause
The openbao-helm 0.28.3 chart's default readinessProbe is `exec: bao status -tls-skip-verify` (server-statefulset.yaml:157-178). `bao status` exits 2 when sealed, so on a fresh cluster the StatefulSet pod stays NotReady until something unseals it.

That "something" is the `vault-config` Job at k8s/bases/infrastructure/vault-config/job.yaml, which runs in the downstream `infrastructure` Flux Kustomization, gated on `infrastructure-controllers` (which contains this HelmRelease) becoming Ready first. The chain:

- `infrastructure-controllers`: `wait: true`
- `infrastructure`: `wait: true`, `dependsOn: infrastructure-controllers`

Flux's HelmController uses `--wait` by default (`install.disableWait: false`), so the HelmRelease cannot converge while the pod is NotReady; `install.remediation.retries: -1` (helm-release.yaml:11-13) drives an endless `install → wait timeout → uninstall → reinstall` churn for the entire bootstrap window. Bootstrap only escapes via a fragile race where the Job pod eventually catches a transient window during the chart's install/uninstall thrash (historically 20-40 min).

The Job's `vault-init` init container also has an unbounded `until bao status … ; sleep 3; done` loop (job.yaml:90-95) with no timeout, so a single Pod can sit in init for up to `activeDeadlineSeconds: 3600` (1 hour) before getting killed and retried, amplifying the race window.

The fix
Setting `server.readinessProbe.path` makes the chart template render the `httpGet` branch instead of the `exec` branch.

The HashiCorp Vault Helm chart uses exactly this pattern for the same reason: `/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204` makes the `/sys/health` endpoint return HTTP 204 even when the server is sealed and uninitialized, so the Pod reports Ready as soon as the listener is up. The HelmRelease then converges Ready on first install, `infrastructure-controllers` becomes Ready, the `infrastructure` layer runs, and the vault-config Job completes in ~1-2 min instead of waiting for the deadlock to self-resolve.

Scheme handling
`openbao.scheme` returns `http` when `global.tlsDisable: true` (_helpers.tpl; chart default, matching our `tls_disable = 1` listener config in helm-release.yaml:80-83), so the probe stays HTTP and no TLS plumbing is required.

LivenessProbe

The chart's `server.livenessProbe.enabled` defaults to `false` (values.yaml:643), so there is no parallel liveness fix needed. (If liveness were enabled with the same exec, the kubelet would kill sealed pods, but it isn't.)

Validation
Static checks only — no cluster (per AGENTS.md):
Rendered openbao HelmRelease shows the path landing correctly:
The full Talos+Docker system-test will run in CI on this PR — it should now pass in normal time without the 3m flake.
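For readers checking the rendered output, the override described in this PR would look roughly like the excerpt below. This is a sketch, not the actual diff: it assumes the PR sets the same query string that the description attributes to the vault-helm default, and the surrounding `spec.values` nesting is the usual Flux HelmRelease layout rather than a copy of helm-release.yaml.

```yaml
# Hypothetical excerpt of the OpenBao HelmRelease; only the
# server.readinessProbe.path key comes from the PR description.
spec:
  values:
    server:
      readinessProbe:
        # A non-empty path makes the chart render the httpGet branch
        # instead of the default `exec: bao status` probe.
        # sealedcode=204 and uninitcode=204 make /sys/health answer 204
        # while the server is sealed or uninitialized, so the Pod turns
        # Ready as soon as the listener is up.
        path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
```

The probe port and scheme are left to the chart (`server.readinessProbe.port` and the `openbao.scheme` helper), so the override stays a single key.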
What this means for the existing safety nets
Once this fix is in, the Job's own `backoffLimit: 30` and `activeDeadlineSeconds: 3600` become massively oversized: they were sized for the 20-40 min race that no longer exists. I'm not lowering them in this PR (keeps the diff surgical, oversized safety nets are harmless), but a follow-up can bring them down to ~`backoffLimit: 10, activeDeadlineSeconds: 600` once a few CI runs confirm the new bootstrap time. The misleading "20-40 min on cold CI runners" comment in job.yaml:37-40 is also worth rewriting at that point; I'm leaving it alone here to keep the diff a single, easily-reviewed file.

Similarly, the local `infrastructure` Kustomization's 3m timeout should now be comfortable (the rest of the layer is fast-converging; the only slow resource was vault-config-as-a-symptom-of-OpenBao-not-Ready). No timeout bump needed.

What this does NOT change
Only the chart's probe handler is switched from `exec` to `httpGet`.

🤖 Generated with Claude Code