Skip to content

feat(cluster-policies): require workloads to spread across nodes#1661

Open
devantler wants to merge 2 commits into
mainfrom
claude/friendly-poincare-67f404
Open

feat(cluster-policies): require workloads to spread across nodes#1661
devantler wants to merge 2 commits into
mainfrom
claude/friendly-poincare-67f404

Conversation

@devantler
Copy link
Copy Markdown
Contributor

What

The spread-pods ClusterPolicy is synced verbatim from upstream kyverno/policies by .github/workflows/sync-cluster-policies.yaml and adapted for this cluster via a Kustomize patch in k8s/bases/infrastructure/cluster-policies/kustomization.yaml. This change hardens that patch so spreading workloads across nodes is strictly required rather than best-effort:

  • whenUnsatisfiable: ScheduleAnywayDoNotSchedule — every Deployment/StatefulSet (outside kube-system, carrying an app.kubernetes.io/name label) must spread across nodes (topologyKey: kubernetes.io/hostname, maxSkew: 1).
  • Added matchLabelKeys: [pod-template-hash].

Why

DoNotSchedule + maxSkew: 1 on its own deadlocks rolling updates: during a rollout the new revision's surge pod is counted against the old revision's pods, so it can't be placed. matchLabelKeys: [pod-template-hash] scopes the skew calculation per ReplicaSet revision, keeping rollouts flowing while enforcement stays strict.

  • Requires Kubernetes ≥ 1.27 (GA in 1.34). This cluster runs k8s ~1.34 via Talos v1.12.4, so it is on by default.
  • StatefulSets have no pod-template-hash label; Kubernetes ignores absent keys and falls back to the label selector — fine, since StatefulSet updates roll one pod at a time.

Reviewer notes

  • The change is confined to the Kustomize patch; the upstream-synced sample (samples/other/spread-pods-across-topology/) is untouched, so it keeps receiving upstream updates for free.
  • Behavioral impact: pods can now go Pending if they genuinely cannot spread (e.g. a worker is drained/down on the 3-worker local cluster and a multi-replica app needs a distinct node). Single-replica apps are unaffected; rollouts are protected by matchLabelKeys.
  • Workloads without an app.kubernetes.io/name label are skipped (the policy needs it to group replicas) — unchanged behavior.
  • Companion insert-pod-antiaffinity policy left soft (preferred) — topology spread is now the hard enforcer.

Validation

  • ksail workload validate — local + prod, 256 files each, exit 0
  • kubectl kustomize k8s/clusters/local/ and k8s/clusters/prod/ build cleanly

🤖 Generated with Claude Code

The spread-pods ClusterPolicy (synced from upstream kyverno/policies and patched in the cluster-policies base) previously mutated workloads with a soft topology spread constraint (whenUnsatisfiable: ScheduleAnyway). Harden it to DoNotSchedule so spreading every Deployment/StatefulSet across nodes (topologyKey: kubernetes.io/hostname, maxSkew: 1) is strictly required rather than best-effort.

Also add matchLabelKeys: [pod-template-hash] so per-node skew is computed per ReplicaSet revision; without it, DoNotSchedule + maxSkew:1 deadlocks rolling updates because the new revision's surge pod counts against the old revision. Requires k8s >=1.27 (GA in 1.34); the cluster runs k8s ~1.34 via Talos v1.12.4. StatefulSets have no pod-template-hash, so the key is ignored there and falls back to the label selector.

Change is confined to the kustomize patch; the synced upstream sample is untouched so it keeps receiving updates. Validated with 'ksail workload validate' (local + prod, 256 files each) and kubectl kustomize builds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 29, 2026 14:39
@devantler devantler marked this pull request as ready for review May 29, 2026 14:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Hardens the local Kustomize patch over the upstream-synced spread-pods ClusterPolicy so workloads are required (not merely encouraged) to spread across nodes, while keeping rollouts unblocked via per-revision skew scoping.

Changes:

  • Switch injected topologySpreadConstraints[0].whenUnsatisfiable from ScheduleAnyway to DoNotSchedule.
  • Add matchLabelKeys: [pod-template-hash] so rolling updates don't deadlock under the strict skew rule.
  • Update the section header comment to reflect the new enforcement mode.

@devantler
Copy link
Copy Markdown
Contributor Author

🧪 System Test failure — analysis: pre-existing environmental flake, not caused by this change

TL;DR: The System Test failure is unrelated to this PR's topology-spread change. It was tripped by the strict FAIL_ON_WARNING "new event warnings" gate catching a single app cold-start readiness warning, and the same System Test is currently red across several unrelated merge-queue PRs in the same window.

What actually failed the build (job log):

##[error]1 distinct Warning event(s) still firing at steady state (by reason, ×=occurrences): Unhealthy ×2.
New Warning events since marker (1):
  [actual-budget] Pod/actual-budget-...  Unhealthy (x2): Readiness probe failed:
  Get "http://10.244.2.105:5006/": connect: connection refused

The Check for new event warnings step (FAIL_ON_WARNING: true) failed because actual-budget — a KEDA scale-to-zero app — was still cold-starting (process not yet listening on :5006) when the post-deploy warning sample was taken. "connection refused" is an app-readiness timing issue, not a scheduling one.

Why it isn't this change:

Action: re-running the System Test. No manifest change is warranted (per the repo convention of not pushing code for infra-only failures). If actual-budget's cold-start keeps tripping the steady-state warning gate, that's a separate pre-existing item (its readiness/startup timing vs. the gate's sample window) and can be tracked on its own.

🤖 Generated with Claude Code

@botantler botantler Bot added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🫴 Ready

Development

Successfully merging this pull request may close these issues.

2 participants