feat(openbao): enable file audit device, shorten DB role rotation to 7 d #1626
Open
devantler wants to merge 9 commits into
Two related OpenBao hardenings.
1. Enable the file audit device on the auditStorage PV.
The OpenBao chart provisions `auditStorage.enabled: true` (mounted at
/vault/audit per chart defaults), but the vault-config Job never ran
`bao audit enable` -- so the PV exists, is mounted, and is empty.
Every read of a Secret to date has been unrecorded.
Adds an idempotent block (section 7 in the Job) that runs:
bao audit enable file file_path=/vault/audit/audit.log
guarded by a `bao audit list | grep -q file/` check so re-runs don't
error. Now every OpenBao API call writes one JSON record per request.
The audit log can be tailed from the openbao pod today, and shipped
to Loki by a sidecar / promtail once the observability stack lands
(per the observability-production-ready memory).
Note: OpenBao blocks all requests, not only writes, if its audit log
path is unwritable (behaviour inherited from HashiCorp Vault). The PV
mount handles this in practice; the chart-managed PV is bound for the
lifetime of the StatefulSet.
2. Shorten the fleetdm MySQL static-role rotation default from 90 d to 7 d.
The 2160h default was chosen during initial bootstrap as a low-churn
value. With ESO + Reloader already validated to pick up rotations
without disruption, 168h (7 d) is HashiCorp's documented sweet spot:
long enough that ESO cache misses are rare, short enough that a
leaked credential's half-life is bounded.
Operator override is preserved via the fleetdm_mysql_rotation_period
cluster variable; only the *default* shifts.
3. Doc-only: top-of-file step list grows to cover step 6 (OIDC), step 7
(audit), step 8 (DB engine) -- the steps existed but the doc block
was stale.
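The default-with-override behaviour from item 2 can be sketched in shell. This is illustrative only: `rotation_period` and `hours_to_days` are hypothetical helper names, and the repository most likely applies the default through variable substitution in the manifest rather than a script like this.

```shell
#!/bin/sh
# Hypothetical helper: resolve the static-role rotation period.
# An operator-supplied fleetdm_mysql_rotation_period wins; otherwise
# the new 168h default applies.
rotation_period() {
  printf '%s\n' "${fleetdm_mysql_rotation_period:-168h}"
}

# Convert an "<N>h" duration to whole days for a quick sanity check.
hours_to_days() {
  hours="${1%h}"
  echo $(( hours / 24 ))
}

hours_to_days "$(rotation_period)"   # prints 7 when no override is set
hours_to_days 2160h                  # old default: prints 90
```

Only the fallback value changes between the old and new behaviour; an exported `fleetdm_mysql_rotation_period` is passed through untouched.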
Validation:
$ ksail workload validate → 255 files validated
$ ksail --config ksail.prod.yaml workload validate → 255 files validated
Deferred from this PR (left for separate work):
- Transit engine for app-layer encryption (no consumer in-flight to
drive the design; would be premature).
- Audit-log shipping to Loki (depends on the observability rollout).
Closes Phase 3.4 (partial — audit + rotation) of the public-repo
hardening series.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR hardens OpenBao operations by enabling file-based audit logging and reducing the default FleetDM MySQL static-role rotation period.
Changes:
- Adds a vault-config step to enable the OpenBao file audit device at /vault/audit/audit.log.
- Updates the FleetDM MySQL static-role default rotation period from 90 days to 7 days.
- Refreshes the top-level job comments to match the script’s current configuration steps.
The audit-enable block I added in the previous commit pointed at
/vault/audit/audit.log — the upstream HashiCorp Vault chart's
convention. The OpenBao chart instead standardised on /openbao/*
paths to match its rebrand (the data path is /openbao/data, the
auditStorage mount defaults to /openbao/audit).
The vault-config Job's bao CLI runs against the server pod, and the
file_path is resolved on the *server* pod's filesystem — so the wrong
path caused the audit-enable step to fail (no such directory), and the
Job hit CrashLoopBackOff because the script aborts on error. The Flux
Kustomization 'infrastructure' then could not reconcile because its
health check waited indefinitely for the Job to reach Complete.
CI log excerpt:
openbao vault-config-mcbbr 0/1 Error 5
openbao vault-config-q92kx 0/1 CrashLoopBackOff 10
Kustomization/flux-system/infrastructure
timeout waiting for: [Job/openbao/vault-config status: 'InProgress']
Fix: change file_path to /openbao/audit/audit.log. Comment block
explains the /openbao vs /vault path convention so the next operator
doesn't re-introduce the bug. The chart's auditStorage PV already
provisions and mounts the directory; no chart-level changes needed.
Reviewed-by: Copilot review on PR #1626 (it caught this before CI did).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on this branch was failing with:
Error enabling audit device: Error making API request.
URL: PUT http://openbao.openbao.svc.cluster.local:8200/v1/sys/audit/file
Code: 400. Errors:
* cannot enable audit device via API; use declarative, config-based audit device management instead
OpenBao does not allow enabling the audit device at runtime via the
sys/audit API -- it requires the device to be declared in the server's
HCL config alongside listener/storage. The vault-config Job's
'bao audit enable' call was therefore wrong by design and would never
have worked against this OpenBao build.
Fix:
1. openbao HelmRelease (standalone.config): add a declarative
audit "file" { file_path = "/openbao/audit/audit.log" }
stanza. /openbao/audit is the chart's auditStorage PV mount path
(matches the /openbao/data data path). OpenBao reads this on startup;
no API call needed. Every API request is logged to
/openbao/audit/audit.log as one JSON record per line.
2. vault-config Job: drop the now-dead 'bao audit enable' block.
Replace it with a comment explaining why this is declarative-only.
Renumber the trailing 'Database secrets engine' section from 8 -> 7
in both the body and the top-of-file step list.
The previous commit (1ade5f5) fixed the path from /vault to /openbao
based on the chart default; this commit moves the configuration to the
correct place (HCL config) so it actually takes effect.
Validation:
$ ksail workload validate → 256 files validated
$ ksail --config ksail.prod.yaml workload validate → 256 files validated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenBao's file audit backend does not rotate, and OpenBao fails CLOSED
on audit-write errors (every API request blocks once the volume is
full). The chart default of 1Gi would silently degrade to a cluster
that rejects every request after a few months at this cluster's request volume.
Changes:
- auditStorage.size: 1Gi -> 10Gi.
10Gi gives multi-year headroom for this cluster's traffic
(~700 KB/day from current ESO + vault-snapshot use). Variable
override matches the dataStorage idiom so fork operators can tune
per-cluster.
- Inline comment documents:
* the failure mode (fail-closed, blocks API);
* the rotation strategy until the observability stack ships the
audit stream off-PVC (a manual SIGHUP rotate from the openbao
pod);
* the metric to monitor while we're still file-backed.
This is a tactical sizing/documentation fix. Proper rotation +
shipping happens in the observability rollout (per the
observability-production-ready memory) -- promtail will consume
audit.log and the PVC sizing becomes irrelevant. Tracked as a
follow-up to this PR.
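The headroom claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes 1 Gi = 1024^3 bytes and 1 KB = 1000 bytes, a constant ~700 KB/day write rate, and no filesystem overhead; `days_until_full` is a hypothetical helper name.

```shell
#!/bin/sh
# Days until a volume of SIZE_GIB fills at DAILY_KB per day (integer maths).
# Assumes a shell with 64-bit arithmetic, as is usual on 64-bit systems.
days_until_full() {
  size_gib="$1"
  daily_kb="$2"
  echo $(( size_gib * 1024 * 1024 * 1024 / (daily_kb * 1000) ))
}

days_until_full 10 700   # prints 15339 (roughly 42 years)
days_until_full 1 700    # prints 1533 (roughly 4 years)
```

At the quoted rate even the 1Gi chart default works out to about four years, so a time-to-full measured in months would imply traffic well above ~700 KB/day. The estimate scales linearly, so it is easy to redo once real request volume is known.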
Validation:
$ ksail workload validate → 256 files validated
$ ksail --config ksail.prod.yaml workload validate → 256 files validated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenBao parsed the audit stanza I added in 685bbfe and rejected it:
error loading configuration from /tmp/storageconfig.hcl: error parsing 'audit': audit.0: audit type must be specified
I had written it as if 'file' were the audit type:
audit "file" { file_path = "/openbao/audit/audit.log" }
But per the OpenBao docs (https://openbao.org/docs/configuration/audit)
the label after 'audit' is an arbitrary identifier (it becomes the
device's path under /sys/audit/<label>) and the backend type goes in a
'type' field. Backend-specific settings live under 'options'.
Correct shape:
audit "file" {
  type = "file"
  options = {
    file_path = "/openbao/audit/audit.log"
  }
}
The label happens to still be "file" here; that's the conventional name
when you have only one file device, not a reserved keyword. Inline
comment block links the docs and explains the shape so the next
operator doesn't repeat the mistake.
Without this fix the openbao server would refuse to start (config parse
failure at boot), which then cascades:
- vault-config Job's init container can't reach openbao;
- every PushSecret / ExternalSecret stays in InProgress;
- flux-system/infrastructure Kustomization health-check times out.
CI was failing for exactly this reason.
Validation:
$ ksail workload validate → 256 files validated
$ ksail --config ksail.prod.yaml workload validate → 256 files validated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The NOTE block in vault-config/job.yaml still showed the old (broken)
HCL form:
audit "file" { file_path = "/openbao/audit/audit.log" }
which the openbao server rejects with
audit type must be specified
Updated the comment to show the correct shape that actually lives in
the HelmRelease and to link the OpenBao docs:
audit "file" {
  type = "file"
  options = {
    file_path = "/openbao/audit/audit.log"
  }
}
See https://openbao.org/docs/configuration/audit
This prevents the next operator from copying the wrong syntax while
troubleshooting.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f9dc086 fixed the missing 'type' field, but OpenBao then raised the
next parse error:
error parsing 'audit': audit.0: audit path must be specified
OpenBao's audit block requires BOTH 'type' and 'path' at config-parse
time:
audit "<id>" {
  type = "<device>"
  path = "<api mount path>"
  options = { ... }
}
'path' is the OpenBao mount path used to address the device via
/sys/audit/<path>/ and what 'bao audit list' reports. We use "file/" to
match the conventional default of 'bao audit enable file'.
The misleading label-vs-type-vs-path triple is exactly the thing
Copilot caught in #3320939677, so updated the comments in both files to
spell out that 'type' and 'path' are required and the label is just a
stable identifier.
Validation:
$ ksail workload validate → 256 files validated
$ ksail --config ksail.prod.yaml workload validate → 256 files validated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
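Both parse errors in this series would have been caught by a check on the config before rollout. What follows is a minimal sketch of such a pre-flight check, not something that exists in the repository: `audit_block` and `check_audit_stanza` are hypothetical names, and the check is a naive line-based scan (one key per line), not a real HCL parser.

```shell
#!/bin/sh
# Print the first audit "<label>" { ... } block in an HCL file by
# tracking brace depth from the opening line to its matching close.
audit_block() {
  awk '
    /^[[:space:]]*audit[[:space:]]+"/ { inside = 1 }
    inside {
      print
      depth += gsub(/[{]/, "{") - gsub(/[}]/, "}")
      if (depth == 0) inside = 0
    }
  ' "$1"
}

# Succeed only if the audit block carries both required keys.
# 'file_path' does not satisfy the 'path' check because the key must
# start the line (after optional whitespace).
check_audit_stanza() {
  block="$(audit_block "$1")"
  [ -n "$block" ] || { echo "no audit block found in $1" >&2; return 1; }
  for key in type path; do
    printf '%s\n' "$block" |
      grep -Eq "^[[:space:]]*${key}[[:space:]]*=" || {
        echo "audit block in $1 is missing '${key}'" >&2
        return 1
      }
  done
}

# Usage (the file name is an assumption):
#   check_audit_stanza storageconfig.hcl && echo "audit stanza looks complete"
```

Restricting the scan to the audit block matters because other stanzas in the same config, such as storage, can carry their own `path` key and would otherwise mask a missing one.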
Part of the security-hardening series (Phase 3.4, audit + rotation portion).
What
Three related changes to OpenBao's audit + DB engine setup.
1. Declare the file audit device in HCL config
The chart provisions an auditStorage PVC (mounted at /openbao/audit per chart defaults) but had no audit device configured to write to it.
OpenBao explicitly blocks runtime audit-device enables via the API ("cannot enable audit device via API; use declarative, config-based audit device management instead").
So the device is declared in the openbao/helm-release.yaml standalone.config HCL alongside listener and storage.
Now every OpenBao API call writes one JSON record per request to /openbao/audit/audit.log. Tail from the openbao pod today; ship via promtail once the observability stack lands.
2. Bump auditStorage to 10Gi + document fail-closed mode
OpenBao's file audit backend does not rotate, and the server fails closed on audit-write errors — every API request blocks once the volume is full. The chart default of 1Gi was inadequate.
10Gi gives multi-year headroom at this cluster's ~700 KB/day traffic. The inline comment documents the failure mode and a manual SIGHUP-based rotation procedure for use until promtail ships the stream off-PVC (planned in the observability rollout).
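The manual rotation mentioned above could look roughly like the following. This is a sketch under assumptions, not the procedure from the inline comment: it relies on the file audit device reopening its log file on SIGHUP (the behaviour inherited from Vault), and `rotate_audit_log` and `BAO_PID` are hypothetical names.

```shell
#!/bin/sh
# Rename the live audit log, then signal the server so it reopens
# (and thereby recreates) the file at the original path.
rotate_audit_log() {
  log="$1"
  pid="$2"
  [ -f "$log" ] || { echo "no audit log at $log" >&2; return 1; }
  mv "$log" "$log.$(date -u +%Y%m%dT%H%M%SZ)"
  kill -HUP "$pid"
}

# Usage from inside the openbao pod (BAO_PID must be the server's PID):
#   rotate_audit_log /openbao/audit/audit.log "$BAO_PID"
```

Renaming within the same volume keeps the server's open file handle valid, so records written between the rename and the signal still land in the rotated file. Rotated files still occupy the PVC until they are removed or shipped elsewhere.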
3. Shorten fleetdm MySQL static-role rotation: 90 d → 7 d
Operator override via the fleetdm_mysql_rotation_period cluster variable is preserved; only the default shifts.
4. Doc-only: refresh the top-of-file step listing in vault-config Job
Headers + step numbers updated to match the now-current shape (the audit-enable step is gone, leaving sections 1–7; what was section 8 is now 7).
Out of scope (deferred to a separate PR)
Validation
The HCL audit device is parsed by OpenBao on startup (no API call); the auditStorage PV provisions and mounts /openbao/audit per chart defaults. The DB rotation block is already guarded by a bao read database/static-roles/fleet check that skips re-creation of existing static roles, so the new default only applies to fresh clusters / explicit role re-creation.
Change history on this PR
bao audit enable runtime block at /vault/audit/audit.log
/vault/... → /openbao/... (chart default)
🤖 Generated with Claude Code