fix: OCPBUGS-86073: validate duplicate failureDomain names and topology (Nutanix, vSphere) #10561
…gy (Nutanix, vSphere)

The installer does not detect when multiple failure domains point to the same underlying infrastructure. A user can configure two failure domains with different names but identical topology (same Prism Element and subnets on Nutanix, or same server/datacenter/computeCluster/datastore/networks/resourcePool on vSphere). The installer accepts this without any warning, giving users a false sense of zone-level fault tolerance when none exists.

Additionally, Nutanix failure domain validation did not check for duplicate names, unlike vSphere which already had this check.

This commit adds:
- Nutanix: reject failure domains with duplicate names
- Nutanix: reject failure domains with identical topology (same prismElement UUID and subnet UUIDs)
- vSphere: reject failure domains with identical topology (same server, datacenter, computeCluster, datastore, networks, and resourcePool)

All three scenarios were confirmed on live 4.22 RC3 clusters before this fix.

Co-authored-by: Cursor <cursoragent@cursor.com>
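The two Nutanix checks listed in the commit message can be sketched as follows. This is a minimal illustration, not the installer's code: the `failureDomain` struct, its field names, and the functions `topologyKey` and `validateFailureDomains` are hypothetical stand-ins, and plain strings are used for errors instead of the installer's field error type.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// failureDomain is a hypothetical, simplified stand-in for the installer's
// Nutanix failure-domain type. Only the fields named in the commit message
// are modelled.
type failureDomain struct {
	Name             string
	PrismElementUUID string
	SubnetUUIDs      []string
}

// topologyKey builds a key from the Prism Element UUID and the subnet UUIDs.
// The subnets are sorted on a copy so that list order does not affect the key
// and the caller's slice is left untouched.
func topologyKey(fd failureDomain) string {
	subnets := append([]string(nil), fd.SubnetUUIDs...)
	sort.Strings(subnets)
	return "pe=" + fd.PrismElementUUID + ";subnets=" + strings.Join(subnets, ",")
}

// validateFailureDomains returns one message for every repeated name and one
// for every failure domain whose topology matches an earlier one.
func validateFailureDomains(fds []failureDomain) []string {
	var errs []string
	seenNames := map[string]bool{}
	firstByTopology := map[string]string{} // topology key -> first name seen

	for _, fd := range fds {
		if seenNames[fd.Name] {
			errs = append(errs, fmt.Sprintf("duplicate failure domain name %q", fd.Name))
		}
		seenNames[fd.Name] = true

		key := topologyKey(fd)
		if prev, ok := firstByTopology[key]; ok {
			errs = append(errs, fmt.Sprintf("failure domain %q has identical topology as %q", fd.Name, prev))
		} else {
			firstByTopology[key] = fd.Name
		}
	}
	return errs
}

func main() {
	fds := []failureDomain{
		{Name: "fd-1", PrismElementUUID: "pe-a", SubnetUUIDs: []string{"s1", "s2"}},
		{Name: "fd-2", PrismElementUUID: "pe-a", SubnetUUIDs: []string{"s2", "s1"}},
	}
	for _, e := range validateFailureDomains(fds) {
		fmt.Println(e)
	}
}
```

In this sketch `fd-2` is reported even though its subnets are listed in a different order, because the key is built from a sorted copy.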
@chdeshpa-hue: This pull request references Jira Issue OCPBUGS-86073, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Walkthrough
This PR extends platform validation to detect and reject failure domains with duplicate topology. Nutanix and vSphere implementations independently compute stable topology keys from infrastructure-defining fields and reject configurations where multiple failure domains share identical topology, since they provide no additional fault tolerance. Tests verify duplicate-name and duplicate-topology rejection.
Changes
Failure-domain topology deduplication
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 11 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.12.2)
Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
Hi @chdeshpa-hue. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. Regular contributors should join the org to skip this step.
[APPROVALNOTIFIER] This PR is NOT APPROVED
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/types/vsphere/validation/platform.go`:
- Around line 176-182: The duplicate-topology check uses raw strings so
equivalent vSphere inventory paths can slip through; modify
vsphereFailureDomainTopologyKey (and the other occurrence around lines 356-367)
to canonicalize computeCluster, datastore, resourcePool (and any other inventory
path fields used to build topoKey) before composing the topoKey: add or call a
helper like canonicalizeVSpherePath(path string) that normalizes
trailing/leading slashes, resolves redundant segments, and lowercases or
otherwise normalizes case where appropriate, then use those canonicalized values
when populating fdTopologies and when comparing prevName; update references to
fdTopologies[topoKey] and the key construction in both places
(vsphereFailureDomainTopologyKey and its duplicate) so comparisons use the
canonical form.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c8b9cce1-2c49-45ff-b4fb-b457a2b859b4
📒 Files selected for processing (4)
- pkg/types/nutanix/validation/platform.go
- pkg/types/nutanix/validation/platform_test.go
- pkg/types/vsphere/validation/platform.go
- pkg/types/vsphere/validation/platform_test.go
topoKey := vsphereFailureDomainTopologyKey(failureDomain)
if prevName, exists := fdTopologies[topoKey]; exists {
    allErrs = append(allErrs, field.Invalid(fldPath.Index(index), failureDomain.Name,
        fmt.Sprintf("failure domain %q has identical topology (same server, datacenter, computeCluster, datastore, networks, resourcePool) as %q; this provides no additional fault tolerance", failureDomain.Name, prevName)))
} else {
    fdTopologies[topoKey] = failureDomain.Name
}
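The loop fragment above depends on installer types that are not shown here. A self-contained sketch of the same first-seen-wins pattern, under stated assumptions, might look like the following. The `topology` and `failureDomain` structs and the `findDuplicateTopologies` function are hypothetical simplifications; only the key format is taken from the PR's `vsphereFailureDomainTopologyKey`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// topology and failureDomain are hypothetical, trimmed-down stand-ins for the
// installer's vSphere types.
type topology struct {
	Datacenter     string
	ComputeCluster string
	Datastore      string
	ResourcePool   string
	Networks       []string
}

type failureDomain struct {
	Name     string
	Server   string
	Topology topology
}

// topologyKey mirrors the key format used in the PR: every
// infrastructure-defining field is included, and networks are sorted on a
// copy so their order in the config does not matter.
func topologyKey(fd failureDomain) string {
	networks := make([]string, len(fd.Topology.Networks))
	copy(networks, fd.Topology.Networks)
	sort.Strings(networks)
	return fmt.Sprintf("server=%s;dc=%s;cluster=%s;ds=%s;nets=%s;rp=%s",
		fd.Server,
		fd.Topology.Datacenter,
		fd.Topology.ComputeCluster,
		fd.Topology.Datastore,
		strings.Join(networks, ","),
		fd.Topology.ResourcePool)
}

// findDuplicateTopologies maps the name of each later failure domain to the
// name of the first failure domain that has the same topology key.
func findDuplicateTopologies(fds []failureDomain) map[string]string {
	first := map[string]string{} // topology key -> first name seen
	dups := map[string]string{}
	for _, fd := range fds {
		key := topologyKey(fd)
		if prev, ok := first[key]; ok {
			dups[fd.Name] = prev
			continue
		}
		first[key] = fd.Name
	}
	return dups
}

func main() {
	base := topology{
		Datacenter:     "dc1",
		ComputeCluster: "/dc1/host/c1",
		Datastore:      "/dc1/datastore/ds1",
		Networks:       []string{"n1", "n2"},
	}
	swapped := base
	swapped.Networks = []string{"n2", "n1"}

	fds := []failureDomain{
		{Name: "fd-a", Server: "vc1", Topology: base},
		{Name: "fd-b", Server: "vc1", Topology: swapped},
	}
	for name, prev := range findDuplicateTopologies(fds) {
		fmt.Printf("%s duplicates %s\n", name, prev)
	}
}
```

Because only the first name per key is stored, a third failure domain with the same topology would also be reported against the first one rather than the second.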
Canonicalize topology paths before duplicate-key comparison.
Duplicate-topology detection uses raw computeCluster / datastore / resourcePool strings, so syntactically different-but-equivalent paths can bypass this check.
Proposed fix
func vsphereFailureDomainTopologyKey(fd vsphere.FailureDomain) string {
    networks := make([]string, len(fd.Topology.Networks))
    copy(networks, fd.Topology.Networks)
    sort.Strings(networks)
+
+   normalizePath := func(p string) string {
+       if p == "" {
+           return ""
+       }
+       return filepath.Clean(p)
+   }
+
    return fmt.Sprintf("server=%s;dc=%s;cluster=%s;ds=%s;nets=%s;rp=%s",
        fd.Server,
        fd.Topology.Datacenter,
-       fd.Topology.ComputeCluster,
-       fd.Topology.Datastore,
+       normalizePath(fd.Topology.ComputeCluster),
+       normalizePath(fd.Topology.Datastore),
        strings.Join(networks, ","),
-       fd.Topology.ResourcePool)
+       normalizePath(fd.Topology.ResourcePool))
}
Also applies to: 356-367
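One detail of the proposed fix is worth illustrating: `filepath.Clean` follows the host operating system's path rules, so on Windows it rewrites forward slashes to backslashes. Inventory paths use forward slashes, so a sketch based on the standard library's `path.Clean`, which always treats `/` as the separator, may be a closer fit. The helper name `canonicalizeInventoryPath` below is hypothetical, and the sketch deliberately leaves case untouched, since it makes no assumption about whether the paths are case-sensitive.

```go
package main

import (
	"fmt"
	"path"
)

// canonicalizeInventoryPath is a hypothetical helper that normalizes a
// slash-separated inventory path: it removes trailing slashes, collapses
// repeated slashes, and resolves "." and ".." segments.
func canonicalizeInventoryPath(p string) string {
	if p == "" {
		// path.Clean("") returns ".", which would turn an unset field into
		// a non-empty value, so empty input is passed through unchanged.
		return ""
	}
	return path.Clean(p)
}

func main() {
	inputs := []string{
		"/dc1/host/cluster1",
		"/dc1/host/cluster1/",
		"/dc1//host/./cluster1",
		"",
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, canonicalizeInventoryPath(in))
	}
}
```

The first three inputs canonicalize to the same string, so they would produce the same topology key. This kind of normalization does not make a bare name such as `cluster1` equal to a full path such as `/dc1/host/cluster1`; handling that case would need extra logic beyond this sketch.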
@chdeshpa-hue can we split this into two bugs and PRs?

@jcpowermac will do

Closing in favor of platform-specific PRs per reviewer feedback:
Each PR is now independently reviewable and tracks its own Jira bug.
@chdeshpa-hue: This pull request references Jira Issue OCPBUGS-86073. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
Summary
Fixes OCPBUGS-86073
The installer does not detect when multiple failure domains point to the same underlying infrastructure. This gives users a false sense of zone-level fault tolerance when none exists.
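Three gaps were identified and fixed:
- Nutanix: failure domains with duplicate names were accepted.
- Nutanix: failure domains with identical topology (same Prism Element UUID and subnet UUIDs) were accepted.
- vSphere: failure domains with identical topology (same server, datacenter, computeCluster, datastore, networks, and resourcePool) were accepted.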
This is especially impactful on Nutanix where failure domains are optional and entirely manual (copy-paste YAML is the only configuration method).
Changes
- pkg/types/nutanix/validation/platform.go: Added duplicate name check and duplicate topology check (comparing Prism Element UUID + subnet UUIDs) across failure domains.
- pkg/types/vsphere/validation/platform.go: Added duplicate topology check (comparing server, datacenter, computeCluster, datastore, networks, resourcePool) across failure domains.
Manual Test Results (4.22 RC3 custom binary)
Test 1 — vSphere: duplicate topology, different names:
Test 2 — vSphere: duplicate topology + duplicate name:
Test 3 — Nutanix: duplicate topology (same PE + subnet):
Test Plan
Made with Cursor
Summary by CodeRabbit
Release Notes
New Features
Tests