OCPBUGS-86719: Use zero-downtime rollout strategy for console pods #1168
asadawar wants to merge 1 commit
When a console plugin is installed or removed, the console deployment rolls out new pods. With the previous strategy (maxSurge=3, maxUnavailable=1), the rollout controller was allowed to terminate an old pod before its replacement was ready, causing a brief period of reduced availability visible as console flapping.

Change the HA rollout strategy to maxSurge=1, maxUnavailable=0 on 3+ node topologies (HighlyAvailable, External+HA). This ensures no old pod is terminated until its replacement passes readiness checks, maintaining full capacity throughout the rollout.

On 2-node topologies (DualReplica, HighlyAvailableArbiter), the required pod anti-affinity prevents scheduling a surge pod when every eligible node already runs a console pod. These topologies keep maxUnavailable=1 to avoid rollout deadlock, with maxSurge reduced from 3 to 1.

Bug: https://issues.redhat.com/browse/OCPBUGS-86719
Assisted-by: Claude Code
Summary
- Change the console deployment rollout strategy from maxSurge=3, maxUnavailable=1 to maxSurge=1, maxUnavailable=0 on 3+ node topologies (HighlyAvailable, External+HA), ensuring no old pod is terminated until its replacement passes readiness checks
- On 2-node topologies (DualReplica, HighlyAvailableArbiter), keep maxUnavailable=1 with maxSurge reduced from 3 to 1 to avoid rollout deadlock caused by required pod anti-affinity

Why this approach
Three approaches were considered:

1. maxUnavailable=0 for all HA topologies (rejected)

On DualReplica (2 masters, 2 replicas) and HighlyAvailableArbiter (2 full masters + 1 arbiter) clusters, the console deployment uses RequiredDuringSchedulingIgnoredDuringExecution pod anti-affinity on kubernetes.io/hostname. When every eligible node already runs a console pod, the scheduler cannot place a surge pod. With maxUnavailable=0, no old pod can be terminated either, causing a rollout deadlock that stalls until ProgressDeadlineExceeded (10 minutes). This approach was rejected because it would break recently added DualReplica support (PR #1151, merged 2026-05-07).

2. Keep maxUnavailable=1 for all topologies, only reduce maxSurge (rejected)

Reducing maxSurge from 3 to 1 aligns with other operators (CMO monitoring-plugin uses maxUnavailable=1 with default maxSurge) but does not fix the reported bug. With maxUnavailable=1, Kubernetes is still allowed to terminate one old pod before its replacement is ready, causing the console flap. This approach was rejected because it does not address the root cause.

3. Topology-aware strategy (chosen)

Use maxUnavailable=0 on topologies where a free node is available for the surge pod (HighlyAvailable with 3+ masters, External+HA with multiple workers), and maxUnavailable=1 on constrained topologies (DualReplica, HighlyAvailableArbiter) where rollout deadlock is possible. This fixes the bug for the most common topology while preserving correct behavior on constrained clusters.

For the HighlyAvailableArbiter case, the conservative choice (maxUnavailable=1) was made because arbiter nodes may have taints or resource constraints that prevent scheduling console pods, effectively making it a 2-node topology for console scheduling. Maintainers familiar with arbiter node scheduling can adjust this if arbiter nodes are known to be eligible.

Root cause
The withStrategy function in pkg/console/subresource/deployment/deployment.go:184 set maxSurge=3, maxUnavailable=1 for all HA topologies. These values were introduced in PR #1107 (OCPBUGS-74872) as part of a refactor that moved deployment construction from bindata to Go code, without specific rationale for the strategy values.

With maxUnavailable=1 and 2 replicas, the Kubernetes deployment controller is allowed to terminate one old pod immediately when a rollout starts, even before any new pod is ready. This creates a window (approximately 10-15 seconds based on observed pod startup times) where only one pod serves traffic. During this window, the reduced availability is visible as console flapping.

Cluster verification
Verified on a live OCP 4.22.0-rc.4 vSphere IPI cluster:
Cluster topology:
Current strategy (before fix):
Pod distribution (2 pods on 2 of 3 masters, 3rd master free for surge):
With the fix applied (maxSurge=1, maxUnavailable=0), the rollout would start one surge pod on the free master and terminate an old pod only after the new pod passes readiness checks. At no point does availability drop below 2 (full capacity).
Test plan
- make test-unit: all deployment strategy tests updated and passing
- gofmt and govet clean (make check)

OWNERS
/cc @spadgett @jhadvig @TheRealJon
Bug: https://issues.redhat.com/browse/OCPBUGS-86719