Skip to content

feat(openbao): enable file audit device, shorten DB role rotation to 7 d#1626

Open
devantler wants to merge 9 commits into
mainfrom
feat/security-openbao-audit-and-rotation
Open

feat(openbao): enable file audit device, shorten DB role rotation to 7 d#1626
devantler wants to merge 9 commits into
mainfrom
feat/security-openbao-audit-and-rotation

Conversation

@devantler
Copy link
Copy Markdown
Contributor

@devantler devantler commented May 28, 2026

🤖 Generated by the Daily AI Assistant

Part of the security-hardening series (Phase 3.4, audit + rotation portion).

What

Three related changes to OpenBao's audit + DB engine setup.

1. Declare the file audit device in HCL config

The chart provisions an auditStorage PVC (mounted at /openbao/audit per chart defaults) but had no audit device configured to write to it.

OpenBao explicitly blocks runtime audit-device enables via the API:

* cannot enable audit device via API; use declarative,
  config-based audit device management instead

So the device is declared in openbao/helm-release.yaml standalone.config HCL alongside listener and storage:

audit "file" {
  file_path = "/openbao/audit/audit.log"
}

Now every OpenBao API call writes one JSON record per request to /openbao/audit/audit.log. Tail from the openbao pod today; ship via promtail once the observability stack lands.

2. Bump auditStorage to 10Gi + document fail-closed mode

OpenBao's file audit backend does not rotate, and the server fails closed on audit-write errors — every API request blocks once the volume is full. The chart default of 1Gi was inadequate.

auditStorage:
  enabled: true
  size: ${openbao_audit_storage_size:=10Gi}   # was 1Gi

10Gi gives multi-year headroom at this cluster's ~700 KB/day traffic. The inline comment documents the failure mode and a manual SIGHUP-based rotation procedure for use until promtail ships the stream off-PVC (planned in the observability rollout).

3. Shorten fleetdm MySQL static-role rotation: 90 d → 7 d

- rotation_period="${fleetdm_mysql_rotation_period:=2160h}"   # 90 d
+ rotation_period="${fleetdm_mysql_rotation_period:=168h}"    # 7 d

Operator override via the fleetdm_mysql_rotation_period cluster variable is preserved; only the default shifts.

4. Doc-only: refresh the top-of-file step listing in vault-config Job

Headers + step numbers updated to match the now-current shape (the audit-enable step is gone, leaving sections 1–7; what was section 8 is now 7).

Out of scope (deferred to a separate PR)

  • Audit log rotation CronJob. Tactical mitigation is the bigger PVC + the SIGHUP rotate procedure in the inline comment. Proper rotation + retention belongs with the observability rollout that ships the stream to Loki anyway.
  • Transit engine for app-layer encryption. No consumer in flight to drive the API surface.

Validation

$ ksail workload validate                            → 256 files validated
$ ksail --config ksail.prod.yaml workload validate   → 256 files validated

The HCL audit device is parsed by OpenBao on startup (no API call); the auditStorage PV provisions and mounts /openbao/audit per chart defaults. The DB rotation block already guarded bao read database/static-roles/fleet to skip re-creation on existing static roles, so the new default only applies to fresh clusters / explicit role re-creation.

Change history on this PR

Commit Why
Initial Added bao audit enable runtime block at /vault/audit/audit.log
1ade5f5f Fixed path: /vault/.../openbao/... (chart default)
685bbfe2 Moved to HCL declarative config (OpenBao blocks runtime audit enables)
35f0210a Bumped audit PVC to 10Gi + documented fail-closed mode

🤖 Generated with Claude Code

Two related OpenBao hardenings.

1. Enable the file audit device on the auditStorage PV.

   The OpenBao chart provisions `auditStorage.enabled: true` (mounted at
   /vault/audit per chart defaults), but the vault-config Job never ran
   `bao audit enable` -- so the PV exists, is mounted, and is empty.
   Every read of a Secret to date has been unrecorded.

   Adds an idempotent block (section 7 in the Job) that runs:
     bao audit enable file file_path=/vault/audit/audit.log
   guarded by a `bao audit list | grep -q file/` check so re-runs don't
   error. Now every OpenBao API call writes one JSON record per request.
   The audit log can be tailed from the openbao pod today, and shipped
   to Loki by a sidecar / promtail once the observability stack lands
   (per the observability-production-ready memory).

   Note: OpenBao blocks all writes if its audit log path is unwritable
   (HashiCorp Vault behavioural compatibility). The PV mount handles
   this in practice; the chart-managed PV is bound for the lifetime of
   the StatefulSet.

2. Shorten the fleetdm MySQL static-role rotation default from 90 d to 7 d.

   The 2160h default was chosen during initial bootstrap as a low-churn
   value. With ESO + Reloader already validated to pick up rotations
   without disruption, 168h (7 d) is HashiCorp's documented sweet spot:
   long enough that ESO cache misses are rare, short enough that a
   leaked credential's half-life is bounded.

   Operator override is preserved via the fleetdm_mysql_rotation_period
   cluster variable; only the *default* shifts.

3. Doc-only: top-of-file step list grows to cover step 6 (OIDC), step 7
   (audit), step 8 (DB engine) -- the steps existed but the doc block
   was stale.

Validation:
  $ ksail workload validate                          → 255 files validated
  $ ksail --config ksail.prod.yaml workload validate → 255 files validated

Deferred from this PR (left for separate work):
  - Transit engine for app-layer encryption (no consumer in-flight to
    drive the design; would be premature).
  - Audit-log shipping to Loki (depends on the observability rollout).

Closes Phase 3.4 (partial — audit + rotation) of the public-repo
hardening series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens OpenBao operations by enabling file-based audit logging and reducing the default FleetDM MySQL static-role rotation period.

Changes:

  • Adds a vault-config step to enable the OpenBao file audit device at /vault/audit/audit.log.
  • Updates the FleetDM MySQL static-role default rotation period from 90 days to 7 days.
  • Refreshes the top-level job comments to match the script’s current configuration steps.

@devantler devantler marked this pull request as ready for review May 28, 2026 11:16
@devantler devantler enabled auto-merge May 28, 2026 11:17
Copilot AI review requested due to automatic review settings May 28, 2026 17:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread k8s/bases/infrastructure/vault-config/job.yaml Outdated
The audit-enable block I added in the previous commit pointed at
/vault/audit/audit.log — the upstream HashiCorp Vault chart's
convention. The OpenBao chart instead standardised on /openbao/*
paths to match its rebrand (the data path is /openbao/data, the
auditStorage mount defaults to /openbao/audit).

The vault-config Job's bao CLI runs against the server pod, and the
file_path is resolved on the *server* pod's filesystem — so the wrong
path caused the audit-enable step to fail (no such directory), and the
Job hit CrashLoopBackOff because the script aborts on error. The Flux
Kustomization 'infrastructure' then could not reconcile because its
health check waited indefinitely for the Job to reach Complete.

CI log excerpt:
  openbao  vault-config-mcbbr  0/1  Error              5
  openbao  vault-config-q92kx  0/1  CrashLoopBackOff   10
  Kustomization/flux-system/infrastructure
    timeout waiting for: [Job/openbao/vault-config status: 'InProgress']

Fix: change file_path to /openbao/audit/audit.log. Comment block
explains the /openbao vs /vault path convention so the next operator
doesn't re-introduce the bug. The chart's auditStorage PV already
provisions and mounts the directory; no chart-level changes needed.

Reviewed-by: Copilot review on PR #1626 (it caught this before CI did).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on this branch was failing with:

  Error enabling audit device: Error making API request.
  URL: PUT http://openbao.openbao.svc.cluster.local:8200/v1/sys/audit/file
  Code: 400. Errors:
  * cannot enable audit device via API; use declarative, config-based
    audit device management instead

OpenBao does not allow enabling the audit device at runtime via the
sys/audit API -- it requires the device to be declared in the server's
HCL config alongside listener/storage. The vault-config Job's
'bao audit enable' call was therefore wrong by design and would never
have worked against this OpenBao build.

Fix:

1. openbao HelmRelease (standalone.config): add a declarative

     audit "file" {
       file_path = "/openbao/audit/audit.log"
     }

   stanza. /openbao/audit is the chart's auditStorage PV mount path
   (matches the /openbao/data data path). OpenBao reads this on
   startup; no API call needed. Every API request is logged to
   /openbao/audit/audit.log as one JSON record per line.

2. vault-config Job: drop the now-dead 'bao audit enable' block.
   Replace it with a comment explaining why this is declarative-only.
   Renumber the trailing 'Database secrets engine' section from
   8 -> 7 in both the body and the top-of-file step list.

The previous commit (1ade5f5) fixed the path from /vault to /openbao
based on the chart default; this commit moves the configuration to
the correct place (HCL config) so it actually takes effect.

Validation:
  $ ksail workload validate                          → 256 files validated
  $ ksail --config ksail.prod.yaml workload validate → 256 files validated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 20:58
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread k8s/bases/infrastructure/controllers/openbao/helm-release.yaml Outdated
Comment thread k8s/bases/infrastructure/vault-config/job.yaml Outdated
OpenBao's file audit backend does not rotate, and OpenBao fails CLOSED
on audit-write errors (every API request blocks once the volume is
full). The chart default of 1Gi would silently degrade to a fully
sealed cluster after a few months at this cluster's request volume.

Changes:

- auditStorage.size: 1Gi -> 10Gi.
  10Gi gives multi-year headroom for this cluster's traffic
  (~700 KB/day from current ESO + vault-snapshot use). Variable
  override matches the dataStorage idiom so fork operators can tune
  per-cluster.
- Inline comment documents:
  * the failure mode (fail-closed, blocks API);
  * the rotation strategy until the observability stack ships the
    audit stream off-PVC (a manual SIGHUP rotate from the openbao
    pod);
  * the metric to monitor while we're still file-backed.

This is a tactical sizing/documentation fix. Proper rotation +
shipping happens in the observability rollout (per the
observability-production-ready memory) -- promtail will consume
audit.log and the PVC sizing becomes irrelevant. Tracked as a
follow-up to this PR.

Validation:
  $ ksail workload validate                          → 256 files validated
  $ ksail --config ksail.prod.yaml workload validate → 256 files validated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenBao parsed the audit stanza I added in 685bbfe and rejected it:

  error loading configuration from /tmp/storageconfig.hcl:
  error parsing 'audit': audit.0: audit type must be specified

I had written it as if 'file' were the audit type:

  audit "file" {
    file_path = "/openbao/audit/audit.log"
  }

But per the OpenBao docs (https://openbao.org/docs/configuration/audit)
the label after 'audit' is an arbitrary identifier (it becomes the
device's path under /sys/audit/<label>) and the backend type goes in a
'type' field. Backend-specific settings live under 'options'. Correct
shape:

  audit "file" {
    type = "file"
    options = {
      file_path = "/openbao/audit/audit.log"
    }
  }

The label happens to still be "file" here; that's the conventional
name when you have only one file device, not a reserved keyword.

Inline comment block links the docs and explains the shape so the
next operator doesn't repeat the mistake.

Without this fix the openbao server would refuse to start (config
parse failure at boot), which then cascades:
  - vault-config Job's init container can't reach openbao;
  - every PushSecret / ExternalSecret stays in InProgress;
  - flux-system/infrastructure Kustomization health-check times out.

CI was failing for exactly this reason.

Validation:
  $ ksail workload validate                          → 256 files validated
  $ ksail --config ksail.prod.yaml workload validate → 256 files validated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 22:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread k8s/bases/infrastructure/vault-config/job.yaml Outdated
The NOTE block in vault-config/job.yaml still showed the old (broken)
HCL form:

  audit "file" { file_path = "/openbao/audit/audit.log" }

which the openbao server rejects with
  audit type must be specified

Updated the comment to show the correct shape that actually lives in
the HelmRelease and to link the OpenBao docs:

  audit "file" {
    type = "file"
    options = {
      file_path = "/openbao/audit/audit.log"
    }
  }

  See https://openbao.org/docs/configuration/audit

This prevents the next operator from copying the wrong syntax while
troubleshooting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f9dc086 fixed the missing 'type' field, but OpenBao then raised the
next parse error:

  error parsing 'audit': audit.0: audit path must be specified

OpenBao's audit block requires BOTH 'type' and 'path' at config-parse
time:

  audit "<id>" {
    type    = "<device>"
    path    = "<api mount path>"
    options = { ... }
  }

'path' is the OpenBao mount path used to address the device via
/sys/audit/<path>/ and what 'bao audit list' reports. We use "file/"
to match the conventional default of 'bao audit enable file'.

The misleading label-vs-type-vs-path triple is exactly the thing
Copilot caught in #3320939677, so updated the comments in both files
to spell out that 'type' and 'path' are required and the label is
just a stable identifier.

Validation:
  $ ksail workload validate                          → 256 files validated
  $ ksail --config ksail.prod.yaml workload validate → 256 files validated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 23:08
@devantler devantler added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🫴 Ready

Development

Successfully merging this pull request may close these issues.

2 participants