Bug 86073: validate duplicate vSphere failureDomain topology #10563
chdeshpa-hue wants to merge 2 commits into
Adds a check in validateFailureDomains that detects when two or more failure domains have identical topology (same server, datacenter, computeCluster, datastore, networks, and resourcePool). Copy-pasted failure domains that differ only in name/region/zone labels provide no additional fault tolerance and can cause subtle scheduling issues.

Inventory paths (computeCluster, datastore, resourcePool) are canonicalized with filepath.Clean before comparison so that syntactically different but semantically equivalent paths (e.g. trailing slashes, double separators) are normalized.

Bug: https://redhat.atlassian.net/browse/OCPBUGS-86074
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Walkthrough
Adds canonicalization for vSphere failure-domain topology (paths cleaned, networks sorted) and validation that rejects multiple failure domains with identical resulting topology. Tests are updated to assert that duplicate topology is invalid and to use unique resource-pool paths.
Changes
Failure domain duplicate topology detection
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.12.2)
Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
|
/ok-to-test |
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/types/vsphere/validation/platform.go (1)
361-366: ⚖️ Poor tradeoff
Consider using `path.Clean` instead of `filepath.Clean` for vSphere paths.
vSphere inventory paths always use forward slashes (URL-style), but `filepath.Clean` uses OS-specific path separators (backslashes on Windows). The `path.Clean` function from the `"path"` package is designed for slash-separated paths and would be more semantically correct.
That said, this pattern already exists throughout this file (lines 262, 317, 336, 346), so changing it here alone wouldn't be consistent. This is noted for potential future refactoring across the file.
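To make the normalization point concrete, here is a minimal sketch of slash-based cleaning for inventory paths. The helper name `normalizeInventoryPath` and the sample paths are hypothetical and not taken from the PR; the only library call used is `path.Clean` from the Go standard library.

```go
package main

import (
	"fmt"
	"path"
)

// normalizeInventoryPath is a hypothetical helper, not the PR's code.
// It canonicalizes a slash-separated inventory path and keeps the empty
// string empty, because path.Clean("") would otherwise return ".".
func normalizeInventoryPath(p string) string {
	if p == "" {
		return ""
	}
	return path.Clean(p)
}

func main() {
	// Sample paths are made up for illustration.
	for _, p := range []string{
		"/dc1/host/cluster1/", // trailing slash
		"/dc1//host/cluster1", // double separator
		"/dc1/host/cluster1",  // already canonical
		"",                    // unset field
	} {
		fmt.Printf("%q -> %q\n", p, normalizeInventoryPath(p))
	}
}
```

Because `path.Clean` always treats `/` as the separator, the result does not depend on the operating system the installer is built for.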
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/types/vsphere/validation/platform.go` around lines 361 - 366, The normalizePath closure uses filepath.Clean which applies OS-specific separators; replace filepath.Clean with path.Clean (from the "path" package) so vSphere inventory paths (slash-separated) are normalized correctly; update imports to include "path" if missing and consider aligning the same change for the other occurrences of filepath.Clean in this file (e.g., the similar closures/uses around the functions referenced at lines near the existing normalizePath instances).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/types/vsphere/validation/platform_test.go`:
- Around line 555-564: Update the duplicate-topology detection so HostGroup
failure domains are allowed to share all topology fields except
Topology.HostGroup: modify the topology-key construction function
vsphereFailureDomainTopologyKey (and any code that uses it to detect duplicates)
to include Topology.HostGroup when the failure domain ZoneType is
HostGroupFailureDomain (so keys differ for different HostGroup values), and add
the new test "Valid HostGroup failure domains with same topology but different
HostGroups" to pkg/types/vsphere/validation/platform_test.go to assert no error
is produced.
In `@pkg/types/vsphere/validation/platform.go`:
- Around line 353-375: The topology key builder vsphereFailureDomainTopologyKey
omits Topology.HostGroup so HostGroup-based failure domains can be treated as
duplicates; update vsphereFailureDomainTopologyKey to include
fd.Topology.HostGroup (normalized like other path-like fields, e.g., via
normalizePath or directly) into the fmt.Sprintf key (e.g., add a hostGroup=%s
segment) and ensure networks sorting/copying remains unchanged; also update the
related validation/error message that lists topology fields (the one referencing
hostGroup currently missing) to include hostGroup so error text correctly
reflects the fields compared for duplicate detection.
📒 Files selected for processing (2)
pkg/types/vsphere/validation/platform.go
pkg/types/vsphere/validation/platform_test.go
// vsphereFailureDomainTopologyKey builds a comparable key from the infrastructure-defining
// fields of a failure domain topology. Inventory paths are canonicalized with filepath.Clean
// so that syntactically different but equivalent paths (e.g. trailing slashes) are normalized.
func vsphereFailureDomainTopologyKey(fd vsphere.FailureDomain) string {
This will absolutely break vm-host zonal
I am not entirely excited with this function
	if p == "" {
		return ""
	}
	return filepath.Clean(p)
this is wrong, paths within govc/govmomi are not file paths, use path.Clean()
		return filepath.Clean(p)
	}

	return fmt.Sprintf("server=%s;dc=%s;cluster=%s;ds=%s;nets=%s;rp=%s",
There has to be a better way to determine if failure domain / topology is colliding.
This was Claude's review:
Feedback on vsphereFailureDomainTopologyKey
The string-key approach with `fmt.Sprintf("server=%s;dc=%s;cluster=%s;...")` is fragile:
- Relies on field values never containing the delimiters (`;`, `=`, `,`)
- Must be manually kept in sync if `Topology` fields change
- Ambiguous network join: `"net-a,b" + "net-c"` vs `"net-a" + "b,net-c"` produce the same comma-joined string
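The join ambiguity in the last bullet can be reproduced with a few lines of Go. This is only an illustrative sketch: `joinKey` is a hypothetical helper, and the network names are the ones used in the review comment rather than real port group names.

```go
package main

import (
	"fmt"
	"strings"
)

// joinKey is a hypothetical helper that collapses a network list into a
// single string, the way a string-based topology key would.
func joinKey(networks []string, sep string) string {
	return strings.Join(networks, sep)
}

func main() {
	a := []string{"net-a,b", "net-c"}
	b := []string{"net-a", "b,net-c"}

	// With a comma separator, two different lists yield the same key.
	fmt.Println("comma keys equal:", joinKey(a, ",") == joinKey(b, ","))

	// With a NUL separator, the keys differ as long as no name contains NUL.
	fmt.Println("NUL keys equal:", joinKey(a, "\x00") == joinKey(b, "\x00"))
}
```

The same reasoning applies to any delimiter that can legally appear inside a field value, which is why the comment goes on to suggest avoiding a hand-built string key altogether.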
Suggested approach: use a comparable struct as the map key
Go supports struct map keys as long as all fields are comparable. Define a small struct with only the fields that define physical topology, and use it directly:
type failureDomainTopology struct {
	server         string
	datacenter     string
	computeCluster string
	datastore      string
	networks       string
	resourcePool   string
}

func normalizedTopology(fd vsphere.FailureDomain) failureDomainTopology {
	networks := make([]string, len(fd.Topology.Networks))
	copy(networks, fd.Topology.Networks)
	sort.Strings(networks)
	return failureDomainTopology{
		server:         fd.Server,
		datacenter:     fd.Topology.Datacenter,
		computeCluster: filepath.Clean(fd.Topology.ComputeCluster),
		datastore:      filepath.Clean(fd.Topology.Datastore),
		networks:       strings.Join(networks, "\x00"),
		resourcePool:   filepath.Clean(fd.Topology.ResourcePool),
	}
}

Then the validation simplifies to:

fdTopologies := make(map[failureDomainTopology]string)
// in the loop:
topo := normalizedTopology(failureDomain)
if prevName, exists := fdTopologies[topo]; exists {
	allErrs = append(allErrs, field.Invalid(...))
} else {
	fdTopologies[topo] = failureDomain.Name
}

Why not reuse FailureDomain or Topology directly? Both contain fields we want to exclude from comparison. FailureDomain has identity/label fields (Name, Region, Zone, RegionType, ZoneType). Topology has fields like Folder, Template, TagIDs, HostGroup that aren't relevant to the duplicate-infrastructure check. A dedicated struct makes the "these are the fields that define physical topology" decision explicit and reviewable, and it won't silently include new fields added later.
Update: I think Claude is wrong here, if we are installing vm-host zonal, hostgroup would need to be checked.
Benefits over the string key:
- Type-safe: no delimiter injection risk
- Compiler-checked: field additions are obvious at the struct definition, not buried in a format string
- Uses `\x00` as network separator, which can't appear in vSphere inventory paths, eliminating join ambiguity
|
Thanks for the review @jcpowermac, all valid points. Here's how I'll address them:
CI failures: Investigated both; they're infra flakes unrelated to this change:
Will fix in V2:
Will push V2 shortly. |
- Refactor from fmt.Sprintf string key to comparable struct map key (type-safe, no delimiter injection risk, compiler-checked)
- Include Topology.HostGroup in comparison so vm-host zonal failure domains with different HostGroups are not falsely rejected
- Switch from filepath.Clean to path.Clean for vSphere inventory paths (URL-style, always forward slashes, not OS paths)
- Use \x00 as network separator to eliminate join ambiguity
- Add positive test: HostGroup FDs with same topology but different HostGroups must pass validation
- Update error message to include hostGroup in the list of compared fields

Bug: https://redhat.atlassian.net/browse/OCPBUGS-86073
Co-authored-by: Cursor <cursoragent@cursor.com>
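The changes listed in this commit message can be sketched as a small standalone program. This is not the PR's code: `topology`, `failureDomain`, `topologyKey`, `cleanPath`, `keyFor` and `findDuplicates` are hypothetical stand-ins for the installer's types and helpers, modeling only the compared fields, and the sample values are made up.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// Simplified stand-ins for vsphere.Topology and vsphere.FailureDomain.
type topology struct {
	Datacenter, ComputeCluster, Datastore, ResourcePool, HostGroup string
	Networks                                                       []string
}

type failureDomain struct {
	Name, Server string
	Topology     topology
}

// topologyKey has only string fields, so it is comparable and can be
// used directly as a map key.
type topologyKey struct {
	server, datacenter, computeCluster, datastore string
	networks, resourcePool, hostGroup             string
}

// cleanPath normalizes a slash-separated path but leaves "" untouched.
func cleanPath(p string) string {
	if p == "" {
		return ""
	}
	return path.Clean(p)
}

func keyFor(fd failureDomain) topologyKey {
	nets := append([]string(nil), fd.Topology.Networks...) // copy before sorting
	sort.Strings(nets)
	return topologyKey{
		server:         fd.Server,
		datacenter:     fd.Topology.Datacenter,
		computeCluster: cleanPath(fd.Topology.ComputeCluster),
		datastore:      cleanPath(fd.Topology.Datastore),
		networks:       strings.Join(nets, "\x00"),
		resourcePool:   cleanPath(fd.Topology.ResourcePool),
		hostGroup:      fd.Topology.HostGroup,
	}
}

// findDuplicates reports every failure domain whose normalized topology
// matches one seen earlier in the slice.
func findDuplicates(fds []failureDomain) []string {
	seen := make(map[topologyKey]string)
	var dups []string
	for _, fd := range fds {
		k := keyFor(fd)
		if prev, ok := seen[k]; ok {
			dups = append(dups, fmt.Sprintf("%s duplicates %s", fd.Name, prev))
			continue
		}
		seen[k] = fd.Name
	}
	return dups
}

func main() {
	base := topology{Datacenter: "dc1", ComputeCluster: "/dc1/host/c1",
		Datastore: "/dc1/datastore/ds1", Networks: []string{"net-a", "net-b"}}
	copied := base
	copied.ComputeCluster = "/dc1/host/c1/"      // same path, trailing slash
	copied.Networks = []string{"net-b", "net-a"} // same networks, other order
	fmt.Println(findDuplicates([]failureDomain{
		{Name: "fd1", Server: "vc1", Topology: base},
		{Name: "fd2", Server: "vc1", Topology: copied},
	}))
}
```

In this sketch two failure domains that differ only in `HostGroup` get different keys, which corresponds to the positive test mentioned in the commit message.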
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/types/vsphere/validation/platform.go`:
- Around line 389-397: The current vsphereTopologyKey construction always sets
hostGroup from fd.Topology.HostGroup which lets non-HostGroup failure domains
evade duplicate-topology detection; modify the return so hostGroup is populated
only when the failure domain is a HostGroup zone (e.g., check fd.Type or
equivalent type field for the HostGroup enum/value) and otherwise set hostGroup
to an empty string (or nil-equivalent). Update the code around the
vsphereTopologyKey return (referencing vsphereTopologyKey, hostGroup,
fd.Topology.HostGroup, and fd.Type) so only HostGroup-type failure domains
contribute their HostGroup into the dedupe key.
	return vsphereTopologyKey{
		server:         fd.Server,
		datacenter:     fd.Topology.Datacenter,
		computeCluster: normalizePath(fd.Topology.ComputeCluster),
		datastore:      normalizePath(fd.Topology.Datastore),
		networks:       strings.Join(networks, "\x00"),
		resourcePool:   normalizePath(fd.Topology.ResourcePool),
		hostGroup:      fd.Topology.HostGroup,
	}
Gate hostGroup participation in dedupe to HostGroup zones only.
As written, non-HostGroup failure domains can evade duplicate-topology detection by setting different Topology.HostGroup values, even though hostGroup is not topology-defining for those zone types.
🔧 Proposed fix
 func normalizedTopologyKey(fd vsphere.FailureDomain) vsphereTopologyKey {
 	networks := make([]string, len(fd.Topology.Networks))
 	copy(networks, fd.Topology.Networks)
 	sort.Strings(networks)
@@
+	hostGroup := ""
+	if fd.ZoneType == vsphere.HostGroupFailureDomain {
+		hostGroup = fd.Topology.HostGroup
+	}
+
 	return vsphereTopologyKey{
 		server:         fd.Server,
 		datacenter:     fd.Topology.Datacenter,
 		computeCluster: normalizePath(fd.Topology.ComputeCluster),
 		datastore:      normalizePath(fd.Topology.Datastore),
 		networks:       strings.Join(networks, "\x00"),
 		resourcePool:   normalizePath(fd.Topology.ResourcePool),
-		hostGroup:      fd.Topology.HostGroup,
+		hostGroup:      hostGroup,
 	}
 }
|
@chdeshpa-hue: The following tests failed, say
Summary
- Adds a check in `validateFailureDomains` to detect when two or more vSphere failure domains have identical topology (same server, datacenter, computeCluster, datastore, networks, and resourcePool). Copy-pasted failure domains that differ only in name/region/zone labels provide no additional fault tolerance and can cause subtle scheduling issues.
- Inventory paths (`computeCluster`, `datastore`, `resourcePool`) are canonicalized with `filepath.Clean` before comparison so that syntactically different but semantically equivalent paths (e.g. trailing slashes, double separators) are normalized.

Split from #10561 per reviewer feedback; this PR contains the vSphere portion only.
Bug: https://redhat.atlassian.net/browse/OCPBUGS-86073
Manual Test Results
Tested with a custom-built `openshift-install` binary against a live vSphere environment. When two failure domains share identical topology but different names, the installer now correctly rejects:
Test Plan
- `go test ./pkg/types/vsphere/validation/`
- `Multi-zone platform failureDomain duplicate topology`
- `Resources in customized folders` updated to use distinct ResourcePool paths

/cc @jcpowermac
Made with Cursor