rename to vsphere-passthrough-node-controller + add crash-fence controller#7
Conversation
…rash) Adds fence.py — a second, optional Deployment (fence.enabled, off by default) sharing this image and reusing the vCenter client + node↔VM map. Automates non-graceful node shutdown for passthrough-GPU workers that vSphere HA can't restart elsewhere during a host crash: applies the node.kubernetes.io/out-of-service taint to a node confirmed dead by BOTH k8s (NotReady) and vCenter (VM connectionState disconnected/inaccessible/ orphaned), sustained graceSeconds, so RWO volumes force-detach and pods reschedule; removes it on recovery (connected + Ready). Disjoint from the maintenance controller (clean power-off stays 'connected', only a crash goes 'disconnected') — no coordination needed. Taint/un-taint only; power-on stays with vSphere HA. Own SA + least-privilege ClusterRole (nodes only) + kill switch + dryRun. test_fence.py covers the two-gate guard, grace hold, no-double-apply, un-fence on recovery, and the partition/notfound guards (8/8). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 53 minutes and 16 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: ⛔ Files ignored due to path filters (2)
📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe pull request adds a new optional crash-fence controller ( ChangesCrash-fence controller feature
Sequence Diagram(s)sequenceDiagram
participant FenceController
participant K8sClient
participant VSphereClient
FenceController->>K8sClient: list nodes (GPU label)
K8sClient-->>FenceController: node list
loop for each node
FenceController->>K8sClient: get node Ready status
K8sClient-->>FenceController: Ready state
FenceController->>VSphereClient: get_vm_connection_state
VSphereClient-->>FenceController: "connected" or "disconnected"/"inaccessible"/"orphaned"
alt both gates active + grace elapsed
FenceController->>K8sClient: apply_out_of_service_taint
K8sClient-->>FenceController: taint applied
else both gates inactive + taint present
FenceController->>K8sClient: remove_out_of_service_taint
K8sClient-->>FenceController: taint removed
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHANGELOG.md`:
- Line 10: The changelog declares "## [0.5.0]" but the reference links at the
bottom still treat 0.4.3 as the latest; add a "[0.5.0]:" link definition
pointing to the compare URL for v0.4.3...v0.5.0 (or the repo tag URL for v0.5.0)
and update the "[Unreleased]:" reference to compare against v0.5.0 (change its
right-hand/latest tag from v0.4.3 to v0.5.0); locate the "## [0.5.0]" header and
the link definitions named "[0.5.0]:" and "[Unreleased]:" to make these edits so
the release compare navigation is correct.
In `@chart/templates/fence.yaml`:
- Around line 27-30: The ClusterRole rules in fence.yaml grant both "patch" and
"update" on the "nodes" resource; remove the "update" verb and keep only "patch"
(and any existing verbs like "get","list","watch") to enforce least-privilege
for taint operations—locate the rules block (apiGroups: [""], resources:
["nodes"], verbs: [...]) and delete "update" from the verbs array.
- Line 5: The conditional in the template uses the redundant Go template "and"
operator with a single argument ({{- if and .Values.serviceAccount.create }});
change it to a direct boolean check by replacing the use of "and" with a simple
if on .Values.serviceAccount.create (i.e., use {{- if
.Values.serviceAccount.create }}), ensuring any matching {{ end }} remains
unchanged.
In `@controller.py`:
- Around line 484-488: has_out_of_service_taint currently matches taints by key
only which can erroneously detect/remove taints; update it and the related
functions apply_out_of_service_taint and remove_out_of_service_taint to match
the full taint identity (key, value, effect) — or at minimum key+effect —
instead of key-only. Locate uses of OUT_OF_SERVICE_TAINT_KEY and change the
predicate in has_out_of_service_taint to compare (t.key, t.value, t.effect) to
the controller's OUT_OF_SERVICE taint triple, ensure apply_out_of_service_taint
creates the taint with the same (key,value,effect) tuple, and change
remove_out_of_service_taint to filter/remove only taints matching that full
tuple so you don't touch taints the controller didn't create.
In `@README.md`:
- Around line 121-129: Update the README examples to reference the 0.5.0
chart/tag wherever examples show 0.4.3 so the documented fence.* fields (dryRun,
graceSeconds, pollSeconds) match the supported release, and update the
source-layout description that currently says “single controller.py” to reflect
the new split (mention both controller.py and fence.py) so readers know fence.py
exists; search for occurrences of the chart tag/version, the phrase "single
controller.py", and the fence.* example block to make the changes consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 8030bd06-6a49-48c2-bef3-c134b0815e0d
📒 Files selected for processing (10)
.gitignoreCHANGELOG.mdDockerfileREADME.mdchart/Chart.yamlchart/templates/fence.yamlchart/values.yamlcontroller.pyfence.pytest_fence.py
The product now covers two passthrough-node host-lifecycle concerns — planned maintenance and (new) crash fencing — so "maintenance-controller" no longer fits. Renames repo/image/chart/templates/helpers/docs to vsphere-passthrough-node-controller (image + chart at ghcr.io/varashi/[charts/]vsphere-passthrough-node-controller). GitHub repo renamed (redirects cover old git/web refs). The internal maintenance-state annotation domain (vsphere-maintenance.boeye.net/*) is left unchanged — it's controller state, not user-facing, and no nodes carry it currently. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Folded in the product rename gpu-node-vsphere-maintenance-controller → vsphere-passthrough-node-controller (GitHub repo renamed; redirects cover old refs). The product now spans two passthrough-node host-lifecycle concerns (planned maintenance + crash fencing), so the old name no longer fit. New artifacts on release: |
There was a problem hiding this comment.
♻️ Duplicate comments (3)
chart/templates/fence.yaml (2)
27-30: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winConsider removing "update" verb for stricter least-privilege.
The ClusterRole includes both "patch" and "update" verbs, but Kubernetes controllers performing taint operations typically only need "patch". The "update" verb enables full object replacement and is more permissive than necessary for this use case.
🔒 Proposed change for stricter RBAC
rules: - apiGroups: [""] resources: ["nodes"] - verbs: ["get", "list", "watch", "patch", "update"] + verbs: ["get", "list", "watch", "patch"]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/templates/fence.yaml` around lines 27 - 30, The ClusterRole rules for resource "nodes" currently list both "patch" and "update"; remove the "update" verb to adhere to least-privilege (keep "patch" for taint operations) by editing the rules block that targets resources: ["nodes"] (the ClusterRole/rules stanza in the fence template) so it no longer includes "update" while retaining "get", "list", "watch", and "patch".
5-5:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRemove redundant
andoperator from conditional.The
andoperator with a single argument is non-idiomatic Go template syntax. This should be simplified to a direct boolean check.🔧 Proposed fix
-{{- if and .Values.serviceAccount.create }} +{{- if .Values.serviceAccount.create }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/templates/fence.yaml` at line 5, The conditional uses the redundant Go template "and" operator with a single argument; replace the expression that references "and .Values.serviceAccount.create" with a direct boolean check of ".Values.serviceAccount.create" in the template conditional (i.e., remove the "and" operator) so the if statement becomes a simple "{{- if .Values.serviceAccount.create }}" style check.CHANGELOG.md (1)
189-200:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd the
0.5.0and0.4.4reference links and move Unreleased baseline forward.The changelog declares
## [0.5.0]and## [0.4.4]but the link definitions still treat0.4.3as latest. This breaks compare navigation for the new releases.🔗 Suggested link fixes
-[Unreleased]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...HEAD +[Unreleased]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.5.0...HEAD +[0.5.0]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.4...v0.5.0 +[0.4.4]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.4.4 [0.4.3]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.2...v0.4.3🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CHANGELOG.md` around lines 189 - 200, Update the changelog link definitions so the new releases are referenced and Unreleased compares against the newest tag: add link definitions for [0.4.4] and [0.5.0] (e.g. [0.4.4]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.4.4 and [0.5.0]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.5.0), then change the [Unreleased] baseline to compare from v0.5.0 (https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.5.0...HEAD) so the "Unreleased" link points to the correct latest released tag.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@CHANGELOG.md`:
- Around line 189-200: Update the changelog link definitions so the new releases
are referenced and Unreleased compares against the newest tag: add link
definitions for [0.4.4] and [0.5.0] (e.g. [0.4.4]:
https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.4.4
and [0.5.0]:
https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.5.0),
then change the [Unreleased] baseline to compare from v0.5.0
(https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.5.0...HEAD)
so the "Unreleased" link points to the correct latest released tag.
In `@chart/templates/fence.yaml`:
- Around line 27-30: The ClusterRole rules for resource "nodes" currently list
both "patch" and "update"; remove the "update" verb to adhere to least-privilege
(keep "patch" for taint operations) by editing the rules block that targets
resources: ["nodes"] (the ClusterRole/rules stanza in the fence template) so it
no longer includes "update" while retaining "get", "list", "watch", and "patch".
- Line 5: The conditional uses the redundant Go template "and" operator with a
single argument; replace the expression that references "and
.Values.serviceAccount.create" with a direct boolean check of
".Values.serviceAccount.create" in the template conditional (i.e., remove the
"and" operator) so the if statement becomes a simple "{{- if
.Values.serviceAccount.create }}" style check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 714fa2e1-53cb-49df-8080-13f3f34992ac
📒 Files selected for processing (16)
.github/workflows/ci.yaml.github/workflows/release.yamlCHANGELOG.mdDockerfileREADME.mdchart/Chart.yamlchart/templates/NOTES.txtchart/templates/_helpers.tplchart/templates/clusterrole.yamlchart/templates/clusterrolebinding.yamlchart/templates/configmap.yamlchart/templates/deployment.yamlchart/templates/fence.yamlchart/templates/secret.yamlchart/templates/serviceaccount.yamlchart/values.yaml
ruff format controller.py (line-length wrapping); Dockerfile COPY multi-arg destination must end with / (DL3021). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Dockerfile (1)
1-1:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winPin Docker base image to digest (floating tag)
Dockerfileline 1 usesFROM python:3.13-slim(floating), violating “Pin base image digest if floating”. This affects reproducibility/supply-chain drift.🔒 Proposed fix
-FROM python:3.13-slim +FROM python:3.13-slim@sha256:<digest>🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Dockerfile` at line 1, The FROM instruction in Dockerfile uses a floating tag ("python:3.13-slim"); replace it with a pinned digest to ensure reproducible builds by resolving the exact image digest (e.g., change the FROM line to reference the image by its sha256 digest such as python@sha256:...); obtain the correct digest for python:3.13-slim (via your registry or docker pull/inspect) and update the Dockerfile's FROM to that digest so the build uses an immutable base image.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@Dockerfile`:
- Line 1: The FROM instruction in Dockerfile uses a floating tag
("python:3.13-slim"); replace it with a pinned digest to ensure reproducible
builds by resolving the exact image digest (e.g., change the FROM line to
reference the image by its sha256 digest such as python@sha256:...); obtain the
correct digest for python:3.13-slim (via your registry or docker pull/inspect)
and update the Dockerfile's FROM to that digest so the build uses an immutable
base image.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: f59dac8e-118e-46d2-82a9-30c4b8736a4d
📒 Files selected for processing (2)
Dockerfilecontroller.py
- controller.py: match out-of-service taint by full (key,value,effect) identity, not key-only, so a same-key taint with a different value/effect isn't mistaken for ours on fence/un-fence - chart fence.yaml: drop redundant `and` in SA conditional; remove `update` verb from fence ClusterRole (patch suffices for taint ops) - CHANGELOG: add 0.5.0/0.4.4 link defs, advance Unreleased baseline - README: bump chart examples 0.4.3 -> 0.5.0; note fence.py in layout Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a second, optional controller (
fence.py,fence.enabled— off by default) sharing this image and reusing the vCenter client + node↔VM map.Why
A passthrough-GPU VM can't be vSphere-HA-restarted elsewhere during a host crash, so the node stays down and its RWO volume stays attached to the dead node — k8s won't auto-detach an unreachable node's volume (split-brain protection). A rescheduled stateful pod then hangs on
Multi-Attachindefinitely. Thenode.kubernetes.io/out-of-servicetaint fixes this (force-detach + force-delete), but must only ever hit a confirmed-dead node.What it does
Two-gate fence, sustained for
fence.graceSeconds:NotReadyruntime.connectionState∈ {disconnected, inaccessible, orphaned}A clean (maintenance) power-off leaves the VM
connected; only a real host loss makes itdisconnected— so this is disjoint from the maintenance controller and the two never collide. Un-fences on recovery (connected + Ready).Taint/un-taint only — power-on stays with vSphere HA (it restarts passthrough VMs on the original host once it reconnects); eviction is handled by
tolerationSeconds+ the taint.Isolation
Separate Deployment, own ServiceAccount, least-privilege ClusterRole (
nodesget/list/watch/patch only — no pods/eviction), independent kill switch (fence.enabled) andfence.dryRun. Off by default — a mis-fire is destructive.Validation
test_fence.py— 8/8: two-gate guard (NotReady-only / disconnected-only → no fence), grace hold, fence-on-grace, no-double-apply, un-fence on recovery, stay-fenced while VM still disconnected,notfound/partition guard.helm lint+ template render (gated; 0 fence resources when disabled).Rollout
v0.5.0(CI builds image incl.fence.py+ publishes chart).Varashi/k8s, setfence.enabled: true+fence.dryRun: true, watch its decisions.dryRun: falseonce trusted.Closes the controller design in Varashi/k8s#166.
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Chores
Tests
Misc