fix(cilium): keep spire-server off the Flux-controller node (soft anti-affinity)#1660

Closed
devantler wants to merge 1 commit into main from fix/spire-server-antiaffinity

@devantler
Contributor

Problem

spire-server is a single replica and the cluster's identity root. If its node fails, every spire-agent loses its upstream:

create attestation client: failed to dial dns:///spire-server:8081 ... dial tcp 10.96.193.18:8081: i/o timeout

…all six spire-agent pods crash-loop and Cilium mutual auth degrades cluster-wide.

During the 2026-05-28/29 incident, spire-server shared prod-worker-2 with kustomize-controller. When that node's Cilium ClusterIP datapath degraded after an OOMKill, workload identity and GitOps reconciliation went down together — and reconciliation was exactly what was needed to apply the fix (#1649). One node loss took out two critical subsystems at once, which is what made recovery a deadlock.

Fix

Add a soft (preferredDuringSchedulingIgnoredDuringExecution, weight 100) podAntiAffinity so spire-server prefers a worker that is not running the Flux controllers (app.kubernetes.io/part-of=flux), decorrelating the identity SPOF from the GitOps control plane.

The rule is soft so that the single replica still schedules even when every node hosts a Flux pod; it never risks leaving the cluster with no identity. It pairs with #1659 (which spreads the Flux controllers): together they push identity and reconciliation onto different workers.
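
For illustration, the values described above might look roughly like the fragment below. This is a sketch, not the exact diff from this PR: the `authentication.mutual.spire.install.server.affinity` path, the weight, and the `app.kubernetes.io/part-of=flux` label come from the description, while the `kubernetes.io/hostname` topology key and the `flux-system` namespace are assumptions. The `namespaces` entry is included because a pod affinity term matches only pods in the scheduled pod's own namespace unless told otherwise, and spire-server normally runs in a different namespace from the Flux controllers.

```yaml
# Sketch of Cilium Helm values; how they are nested in this repo's manifests is not shown here.
authentication:
  mutual:
    spire:
      install:
        server:
          affinity:
            podAntiAffinity:
              # Soft rule: the scheduler prefers, but does not require, a Flux-free node.
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchLabels:
                        app.kubernetes.io/part-of: flux
                    # Assumption: Flux controllers run in flux-system.
                    namespaces:
                      - flux-system
                    # Assumption: "same node" is expressed via the hostname topology key.
                    topologyKey: kubernetes.io/hostname
```

A hard rule (`requiredDuringSchedulingIgnoredDuringExecution`) would give a stronger guarantee, but it could leave spire-server Pending when every worker hosts a Flux pod, which is the failure mode the soft rule avoids.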

Validation

  • Confirmed the Cilium 1.19.4 chart renders authentication.mutual.spire.install.server.affinity into the spire-server StatefulSet (templates/spire/server/statefulset.yaml line 114), so the value takes effect (not a silent no-op).
  • kubectl kustomize of the base cilium dir renders the podAntiAffinity; the docker controllers overlay still builds with spire.enabled: false (prod-only, inert locally — no merge conflict).
  • Both k8s/clusters/local/ and k8s/clusters/prod/ build.

Scope

This change is preventative (decorrelation and blast-radius reduction) and is placed in the base alongside the SPIRE server config added in #1649. It does not resolve the active outage on its own; that requires rebuilding prod-worker-2's datapath so reconciliation can recover.

…i-affinity)

spire-server is a single replica and the cluster's identity root: if its
node fails, every spire-agent loses its upstream (spire-server ClusterIP
-> i/o timeout) and Cilium mutual auth degrades cluster-wide.

On 2026-05-28 spire-server shared prod-worker-2 with kustomize-controller;
when that node's Cilium ClusterIP datapath degraded after an OOMKill,
workload identity AND GitOps reconciliation went down together — and
reconciliation was exactly what was needed to apply the fix, so the
cluster could not self-heal.

Add a soft (preferred) podAntiAffinity so spire-server prefers a worker
without app.kubernetes.io/part-of=flux pods, decorrelating the identity
SPOF from the GitOps controllers. Soft so the single replica always
schedules even when every node hosts a Flux pod. Verified the Cilium
1.19.4 chart renders authentication.mutual.spire.install.server.affinity
into the StatefulSet. SPIRE is disabled in the Docker overlay, so this is
prod-only and inert for local/CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a soft podAntiAffinity to the SPIRE server (deployed via Cilium's mutual auth chart values) so it prefers nodes not running Flux controllers, decorrelating the workload-identity SPOF from the GitOps control plane.

Changes:

  • Inject affinity.podAntiAffinity (preferred, weight 100) on authentication.mutual.spire.install.server keyed on app.kubernetes.io/part-of=flux.

@devantler devantler marked this pull request as ready for review May 29, 2026 13:56
@devantler devantler added this pull request to the merge queue May 29, 2026
@devantler devantler removed this pull request from the merge queue due to a manual request May 29, 2026
@devantler devantler closed this May 29, 2026
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 29, 2026
@devantler devantler deleted the fix/spire-server-antiaffinity branch May 29, 2026 15:02