fix(ci): raise local infrastructure Flux timeout to 20m for vault-config bootstrap#1648
Closed
devantler wants to merge 1 commit into
…fig bootstrap

The vault-config Job in k8s/bases/infrastructure/vault-config/job.yaml bootstraps OpenBao (KV engine, Kubernetes auth, policies, roles) and is depended on intra-Kustomization by every consumer of the 'openbao' ClusterSecretStore: the ExternalSecrets and vault-seed PushSecrets in the same 'infrastructure' Flux Kustomization.

On cold Docker CI runners the Job legitimately needs longer than the 3m base health-check budget (the Job's own activeDeadlineSeconds is 3600s by design), causing HealthCheckFailed flakes:

  Kustomization/flux-system/infrastructure — health check failed after 3m0.4s: timeout waiting for: [Job/openbao/vault-config status: 'InProgress'] (HealthCheckFailed)

(observed on PR #1636 system-test run 26603473269; the PR itself only touched homepage/headlamp/actual-budget HelmReleases.)

Patch the local cluster overlay's 'infrastructure' Kustomization timeout to 20m, matching prod's existing patch value for the same Kustomization and following the established overlay-patch pattern in this repo (local already patches apps->20m and infrastructure-controllers->12m for the same cold-Docker-CI rationale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing while investigating whether the underlying slowness is fixable instead of worked around.
Problem
The base `infrastructure` Flux Kustomization at k8s/bases/cluster/infrastructure-flux-kustomization.yaml has `timeout: 3m`, `wait: true`. One of its child resources is `Job/openbao/vault-config` (defined at k8s/bases/infrastructure/vault-config/job.yaml), whose own settings include `backoffLimit: 30` and `activeDeadlineSeconds: 3600` with the explicit comment that "openbao bootstrap can take 20-40 min on cold clusters".

On cold Docker CI runners the System Test sporadically fails with the `HealthCheckFailed` error quoted in the commit message: the health check on `Kustomization/flux-system/infrastructure` times out after about 3m while `Job/openbao/vault-config` is still `InProgress`. This was observed on PR #1636, system-test run 26603473269; that PR only modified homepage/headlamp/actual-budget HelmReleases, which are unrelated to the failure.
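To make the mismatch concrete, here is a minimal sketch of the two resources side by side. Only `timeout`, `wait`, `retryInterval`, `backoffLimit`, `activeDeadlineSeconds`, the `force` annotation, and the resource names come from this PR; every other field (`interval`, `prune`, `sourceRef`, the Job's pod template) is a placeholder assumption and may not match the real manifests.

```yaml
# Sketch of the base Flux Kustomization (fields marked "assumed" are not from the PR).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m          # assumed
  retryInterval: 2m      # stated in the PR
  timeout: 3m            # base health-check budget
  wait: true             # health-check all applied resources, including the Job
  prune: true            # assumed
  sourceRef:             # assumed placeholder source
    kind: GitRepository
    name: flux-system
---
# Sketch of the vault-config Job (pod template is a hypothetical placeholder).
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-config
  namespace: openbao
  annotations:
    kustomize.toolkit.fluxcd.io/force: enabled
spec:
  backoffLimit: 30
  activeDeadlineSeconds: 3600   # 60 minutes, versus the 3m Flux wait above
  template:
    spec:
      restartPolicy: Never      # assumed
      containers:
        - name: vault-config    # assumed
          image: example.invalid/bootstrap:placeholder   # hypothetical
```

With `wait: true`, the Kustomization is only reported healthy once the Job completes, so a Job that is allowed to run for up to an hour can easily outlive a 3m wait window.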
Dependency model — who waits on vault-config?
The vault-config Job bootstraps OpenBao: enables the KV v2 engine, configures Kubernetes auth, writes policies, and creates the `external-secrets` and `fleetdm-mysql-rotated` auth roles. Everything that consumes the `openbao` `ClusterSecretStore` (k8s/bases/infrastructure/cluster-secret-stores/cluster-secret-store.yaml) depends on that role existing: concretely, the vault-seed PushSecrets (k8s/bases/infrastructure/vault-seed/) and every OpenBao-backed ExternalSecret under k8s/bases/infrastructure/external-secrets/ and k8s/bases/infrastructure/vault-config/external-secret.yaml. All of these live in the same `infrastructure` Flux Kustomization, so they share one health-check budget.
Options considered
1. […] `infrastructure` Kustomization timeout to accommodate the Job's worst-case cold start.
2. […] `infrastructure` on the tighter 3m budget.
3. […] `activeDeadlineSeconds: 3600` was set generously on purpose.

Chose option 1, applied as a per-cluster patch. It follows the established overlay-patch convention in this repo:
- […] `infrastructure` → 20m with the same rationale ("3m base timeout" insufficient for heavier bootstrap components).
- […] `apps` → 20m and `infrastructure-controllers` → 12m for the same cold-Docker-CI reason; this PR adds the missing peer patch for `infrastructure`.

Option 2 would mean per-provider vault-config overlays (Docker + Hetzner), a new Flux Kustomization, and rewired `dependsOn` edges: significant overlay surface for one Job, when a single timeout knob expresses the same intent and prod has been running with that value reliably.
Change
A patch on the local cluster overlay that sets the `infrastructure` timeout to 20m, with a comment linking the patch to the Job's `activeDeadlineSeconds` and the observed CI failure. `retryInterval: 2m` is unchanged: a genuinely broken resource still surfaces on the next retry, and only the initial wait window expands. The Job is idempotent (the `kustomize.toolkit.fluxcd.io/force: enabled` annotation triggers Flux to delete-and-recreate on spec changes; the script checks state before every write), so longer waits are safe.
Validation
Static checks only — does not require a cluster (per AGENTS.md):
Rendered Flux Kustomization timeouts after the change:
The full system-test will run in CI on this PR.
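For reviewers who want to picture the patch described under Change, the following is a rough sketch of how such an overlay patch could be written with a Kustomize `patches` entry. The overlay file location, the `resources` path, and the comment wording are assumptions, not the contents of the actual diff; only the target name `infrastructure` and the `20m` value come from this PR.

```yaml
# Hypothetical local cluster overlay kustomization.yaml (location assumed).
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../bases/cluster   # assumed relative path to the base
patches:
  - target:
      group: kustomize.toolkit.fluxcd.io
      kind: Kustomization
      name: infrastructure
    patch: |-
      # vault-config may run up to activeDeadlineSeconds: 3600 on cold
      # Docker CI runners; the 3m base timeout caused HealthCheckFailed flakes.
      - op: replace
        path: /spec/timeout
        value: 20m
```

A JSON 6902 `replace` is used here because the base already defines `spec.timeout`; a strategic-merge patch that sets the same field would express the same intent.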
Root cause vs symptom
This patch addresses the symptom (health-check timeout too tight). The underlying root cause, cold-start latency for `vault-config`, is intrinsic to bootstrapping OpenBao from zero (waiting for the OpenBao Pod to schedule, the image to pull, and the server to listen, then running a fast idempotent script). The Job is already designed around that with `activeDeadlineSeconds: 3600` and `backoffLimit: 30`; the Flux Kustomization just needs to match the realistic wait window.
🤖 Generated with Claude Code