fix(ci): raise local infrastructure Flux timeout to 20m for vault-config bootstrap #1648

Closed
devantler wants to merge 1 commit into main from claude/jovial-wescoff-5aa478

@devantler
Contributor

🤖 Generated by the Daily AI Assistant

Problem

The base infrastructure Flux Kustomization at k8s/bases/cluster/infrastructure-flux-kustomization.yaml has timeout: 3m, wait: true. One of its child resources is Job/openbao/vault-config (defined at k8s/bases/infrastructure/vault-config/job.yaml) whose own settings include backoffLimit: 30 and activeDeadlineSeconds: 3600 with the explicit comment that "openbao bootstrap can take 20-40 min on cold clusters".
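For orientation, the base Kustomization described above might look roughly like the sketch below. Only the name, namespace, `timeout`, `wait`, and `retryInterval` values are taken from this PR; `interval`, `prune`, `path`, and `sourceRef` are assumptions added so the manifest is complete, and the real file may differ.

```yaml
# Sketch of k8s/bases/cluster/infrastructure-flux-kustomization.yaml.
# Fields marked "assumed" are illustrative and not copied from the repo.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m              # assumed
  retryInterval: 2m          # stated in this PR as unchanged
  timeout: 3m                # the health-check budget that CI runs exceed
  wait: true                 # health-check every applied resource, including Jobs
  prune: true                # assumed
  path: ./k8s/clusters/local/infrastructure   # assumed, hypothetical path
  sourceRef:                 # assumed; the repo may use a different source kind
    kind: GitRepository
    name: flux-system
```

With `wait: true`, Flux treats a Job as healthy only once it completes, so a Job that is still `InProgress` when `timeout` expires fails the whole Kustomization.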

On cold Docker CI runners the System Test sporadically fails with:

Kustomization/flux-system/infrastructure — health check failed after 3m0.432401941s:
  timeout waiting for: [Job/openbao/vault-config status: 'InProgress'] (HealthCheckFailed)

(observed on PR #1636 system-test run 26603473269; that PR only modified homepage/headlamp/actual-budget HelmReleases — unrelated to the failure.)

Dependency model — who waits on vault-config?

The vault-config Job bootstraps OpenBao: enables the KV v2 engine, configures Kubernetes auth, writes policies, and creates the external-secrets and fleetdm-mysql-rotated auth roles. Everything that consumes the openbao ClusterSecretStore (k8s/bases/infrastructure/cluster-secret-stores/cluster-secret-store.yaml) depends on that role existing — concretely the vault-seed PushSecrets (k8s/bases/infrastructure/vault-seed/) and every OpenBao-backed ExternalSecret under k8s/bases/infrastructure/external-secrets/ and k8s/bases/infrastructure/vault-config/external-secret.yaml. All of these live in the same infrastructure Flux Kustomization, so they share one health-check budget.
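To make the dependency concrete, here is a minimal sketch of what the openbao ClusterSecretStore could look like, assuming it uses the External Secrets Operator Vault provider with Kubernetes auth. The store name and the `external-secrets` role name come from this PR; the API version, server URL, mount paths, and service account are hypothetical placeholders.

```yaml
# Sketch only: not copied from cluster-secret-store.yaml.
apiVersion: external-secrets.io/v1       # assumed; older ESO releases use v1beta1
kind: ClusterSecretStore
metadata:
  name: openbao
spec:
  provider:
    vault:
      server: http://openbao.openbao.svc:8200   # hypothetical address
      path: secret                               # hypothetical KV mount
      version: v2                                # KV v2, enabled by vault-config
      auth:
        kubernetes:
          mountPath: kubernetes                  # hypothetical auth mount
          role: external-secrets                 # role created by vault-config
          serviceAccountRef:
            name: external-secrets               # hypothetical
            namespace: external-secrets          # hypothetical
```

Until the Job has created that role, a login against the store would be rejected, which is why every ExternalSecret and PushSecret behind it has to wait for the Job.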

Options considered

  1. Raise infrastructure Kustomization timeout to accommodate the Job's worst-case cold start.
  2. Split vault-config into its own Flux Kustomization with a longer timeout, leaving the rest of infrastructure on the tighter 3m budget.
  3. Reduce vault-config bootstrap time — explicitly rejected: activeDeadlineSeconds: 3600 was set generously on purpose.

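Chose option 1, applied as a per-cluster patch. It follows the established overlay-patch convention in this repo: the local overlay already patches apps to 20m and infrastructure-controllers to 12m for the same cold-Docker-CI rationale, and prod already patches infrastructure to 20m.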

Option 2 would mean per-provider vault-config overlays (Docker + Hetzner), a new Flux Kustomization, and rewired dependsOn edges — significant overlay surface for one Job, when a single timeout knob expresses the same intent and prod has been running with that value reliably.
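For comparison, the rejected option 2 would need at least one extra Flux Kustomization along these lines. This is a hypothetical sketch: the name, `path`, `sourceRef`, and the `dependsOn` edge are assumptions, and the actual wiring would depend on where OpenBao itself is deployed, which this PR does not state.

```yaml
# Hypothetical: a dedicated Kustomization for the vault-config Job (option 2).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: vault-config                 # hypothetical name
  namespace: flux-system
spec:
  interval: 10m                      # assumed
  retryInterval: 2m                  # assumed to mirror the other layers
  timeout: 20m                       # the longer budget lives only here
  wait: true
  prune: true                        # assumed
  path: ./k8s/bases/infrastructure/vault-config   # assumed
  sourceRef:                         # assumed
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: infrastructure-controllers   # assumed edge
```

Every consumer of the ClusterSecretStore would then need its own `dependsOn` entry pointing at this Kustomization, repeated per provider overlay, which is the extra surface the paragraph above refers to.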

Change

A patch on the local cluster overlay setting infrastructure timeout to 20m, with a comment linking the patch to the Job's activeDeadlineSeconds and the observed CI failure.
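A patch of that kind can be expressed as a JSON 6902 entry in the overlay's `kustomization.yaml`. The sketch below is an assumption about its shape rather than the diff from this PR; the `resources` path is hypothetical.

```yaml
# Sketch of a local cluster overlay kustomization.yaml (paths are hypothetical).
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../bases/cluster              # hypothetical relative path to the base
patches:
  # vault-config (activeDeadlineSeconds: 3600) can outlast the 3m base budget
  # on cold Docker CI runners, causing HealthCheckFailed on infrastructure.
  - target:
      group: kustomize.toolkit.fluxcd.io
      kind: Kustomization
      name: infrastructure
    patch: |-
      - op: replace
        path: /spec/timeout
        value: 20m
```

A `replace` operation requires `/spec/timeout` to already exist in the base, which holds here because the base sets `timeout: 3m`.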

retryInterval: 2m is unchanged: a genuinely broken resource still surfaces on the next retry; only the initial wait window expands. The Job is idempotent (the kustomize.toolkit.fluxcd.io/force: enabled annotation triggers Flux to delete-and-recreate on spec changes; the script checks state before every write), so longer waits are safe.
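As an illustration of the Job settings mentioned here, a trimmed-down manifest might look as follows. The name, namespace, annotation, `backoffLimit`, and `activeDeadlineSeconds` are taken from this PR; the restart policy, container, image, and command are hypothetical stand-ins for the real bootstrap script.

```yaml
# Sketch of k8s/bases/infrastructure/vault-config/job.yaml (container is hypothetical).
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-config
  namespace: openbao
  annotations:
    # Job specs are mostly immutable, so Flux recreates the Job on spec changes.
    kustomize.toolkit.fluxcd.io/force: enabled
spec:
  backoffLimit: 30                 # tolerate many failed attempts while OpenBao starts
  activeDeadlineSeconds: 3600      # hard ceiling of one hour for the whole Job
  template:
    spec:
      restartPolicy: Never         # assumed
      containers:
        - name: vault-config
          image: example.invalid/openbao-cli:latest          # hypothetical image
          command: ["/bin/sh", "/scripts/configure.sh"]      # hypothetical script path
```

Because the Job can legitimately run for up to an hour, any Flux `timeout` below that is a trade-off between fast failure reporting and tolerance for cold starts.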

Validation

Static checks only — does not require a cluster (per AGENTS.md):

$ ksail workload validate
✔ 256 files validated

$ ksail --config ksail.prod.yaml workload validate
✔ 256 files validated

Rendered Flux Kustomization timeouts after the change:

| Layer | local (before) | local (after) | prod |
| --- | --- | --- | --- |
| variables | 3m | 3m | 3m |
| infrastructure-controllers | 12m | 12m | 25m |
| infrastructure | 3m | 20m | 20m |
| apps | 20m | 20m | 20m |

The full system-test will run in CI on this PR.

Root cause vs symptom

This patch addresses the symptom (health-check timeout too tight). The underlying root cause — cold-start latency for vault-config — is intrinsic to bootstrapping OpenBao from zero (waiting for the OpenBao Pod to schedule + image to pull + server to listen, then a fast idempotent script). The Job is already designed around that with activeDeadlineSeconds: 3600 and backoffLimit: 30; the Flux Kustomization just needs to match the realistic wait window.

🤖 Generated with Claude Code

…fig bootstrap

The vault-config Job in k8s/bases/infrastructure/vault-config/job.yaml
bootstraps OpenBao (KV engine, Kubernetes auth, policies, roles) and is
depended on intra-Kustomization by every consumer of the 'openbao'
ClusterSecretStore — the ExternalSecrets and vault-seed PushSecrets in
the same 'infrastructure' Flux Kustomization. On cold Docker CI runners
the Job legitimately needs longer than the 3m base health-check budget
(the Job's own activeDeadlineSeconds is 3600s by design), causing
HealthCheckFailed flakes:

  Kustomization/flux-system/infrastructure — health check failed after
  3m0.4s: timeout waiting for: [Job/openbao/vault-config status:
  'InProgress'] (HealthCheckFailed)

(observed on PR #1636 system-test run 26603473269; the PR itself only
touched homepage/headlamp/actual-budget HelmReleases.)

Patch the local cluster overlay's 'infrastructure' Kustomization
timeout to 20m, matching prod's existing patch value for the same
Kustomization and following the established overlay-patch pattern in
this repo (local already patches apps->20m and
infrastructure-controllers->12m for the same cold-Docker-CI rationale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@devantler
Contributor Author

Closing while investigating whether the underlying slowness is fixable instead of worked around.

@devantler devantler closed this May 29, 2026
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 29, 2026
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
