Merged
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -75,4 +75,4 @@ jobs:
context: .
platforms: linux/amd64,linux/arm64
push: false
tags: ci/gpu-node-vsphere-maintenance-controller:ci
tags: ci/vsphere-passthrough-node-controller:ci
6 changes: 3 additions & 3 deletions .github/workflows/release.yaml
@@ -148,7 +148,7 @@ jobs:
run: |
version="${{ steps.ver.outputs.version }}"
# helm push prints "Digest: sha256:..." to stderr; tee to capture.
helm push "gpu-node-vsphere-maintenance-controller-${version}.tgz" \
helm push "vsphere-passthrough-node-controller-${version}.tgz" \
"oci://${{ env.CHART_REPO }}" 2>&1 | tee push.log
digest=$(awk '/^Digest: /{print $2}' push.log)
if [ -z "$digest" ]; then
@@ -160,7 +160,7 @@ jobs:
- name: Cosign keyless sign (chart)
env:
DIGEST: ${{ steps.chart_push.outputs.digest }}
CHART_REF: ${{ env.CHART_REPO }}/gpu-node-vsphere-maintenance-controller
CHART_REF: ${{ env.CHART_REPO }}/vsphere-passthrough-node-controller
run: cosign sign --yes "${CHART_REF}@${DIGEST}"

- name: Create GitHub Release
@@ -173,4 +173,4 @@ jobs:
prerelease: false
files: |
sbom.spdx.json
gpu-node-vsphere-maintenance-controller-${{ steps.ver.outputs.version }}.tgz
vsphere-passthrough-node-controller-${{ steps.ver.outputs.version }}.tgz
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
__pycache__/
51 changes: 38 additions & 13 deletions CHANGELOG.md
@@ -7,6 +7,29 @@ this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

## [Unreleased]

## [0.5.0] — 2026-06-03

### Added
- **Crash-fence controller** (`fence.py`) — a second, optional Deployment
(`fence.enabled`, **off by default**) that shares this image and reuses the
vCenter client + node↔VM mapping. It automates non-graceful node shutdown for
passthrough-GPU workers that vSphere HA can't restart elsewhere during a host
crash: it applies the `node.kubernetes.io/out-of-service` taint to a node
confirmed dead by **both** gates — k8s `NotReady` **and** vCenter VM
`runtime.connectionState` in `{disconnected, inaccessible, orphaned}` —
sustained for `fence.graceSeconds`, so RWO volumes force-detach and stateful
pods reschedule. The taint is removed on recovery (VM `connected` + node
`Ready`).
- **Disjoint from the maintenance controller**: a clean (maintenance)
power-off leaves the VM `connected`; only a real host loss makes it
`disconnected`. The two controllers trigger on different vCenter facts and
never collide — no coordination contract needed.
- **Taint/un-taint only.** Power-on is owned by vSphere HA (it restarts
passthrough VMs on the original host once it returns); eviction is handled
by `tolerationSeconds` + the taint.
- Own ServiceAccount + least-privilege ClusterRole (`nodes` get/list/watch/
patch only) + kill switch (`fence.enabled`) + independent `fence.dryRun`.
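
The two-gate rule described above can be illustrated with a minimal sketch. This is not the code in `fence.py`: `NodeObservation`, `decide`, and their parameters are hypothetical names chosen for illustration, and the sketch assumes the caller polls periodically and tracks when each node was first seen dead by both gates.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Taint key and VM states are taken from the changelog entry above.
OUT_OF_SERVICE_TAINT = "node.kubernetes.io/out-of-service"
DEAD_VM_STATES = frozenset({"disconnected", "inaccessible", "orphaned"})


@dataclass
class NodeObservation:
    """One poll result for a node (hypothetical structure)."""
    node_ready: bool          # Kubernetes Ready condition is True
    vm_connection_state: str  # vCenter runtime.connectionState
    tainted: bool             # out-of-service taint currently present


def decide(
    obs: NodeObservation,
    dead_since: Optional[float],
    now: float,
    grace_seconds: float,
) -> Tuple[str, Optional[float]]:
    """Return (action, new_dead_since); action is "taint", "untaint" or "none"."""
    both_dead = (not obs.node_ready) and obs.vm_connection_state in DEAD_VM_STATES
    if both_dead:
        # Start the grace timer on the first poll where both gates agree.
        started = now if dead_since is None else dead_since
        if not obs.tainted and now - started >= grace_seconds:
            return "taint", started
        return "none", started
    # Either gate disagreeing resets the timer.
    recovered = obs.node_ready and obs.vm_connection_state == "connected"
    if obs.tainted and recovered:
        return "untaint", None
    return "none", None
```

In this sketch a clean maintenance power-off (node `NotReady`, VM still `connected`) never starts the timer, which mirrors why the fence and maintenance controllers are described as disjoint.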

## [0.4.4] — 2026-05-01

### Fixed
@@ -84,7 +107,7 @@ No controller code change. Supply-chain and CI polish only.
now consults the map instead of making a per-node `get_vm_host` round-trip
to vCenter on every poll.
- Minimal Helm chart under `chart/`, published as OCI to
`ghcr.io/varashi/charts/gpu-node-vsphere-maintenance-controller`.
`ghcr.io/varashi/charts/vsphere-passthrough-node-controller`.
- GitHub Actions: `ci.yaml` (ruff, hadolint, helm lint, buildx smoke build)
on pull requests; `release.yaml` on `v*.*.*` tag push builds multi-arch
images (amd64, arm64), cosign-signs keyless via OIDC, attaches SBOM and
@@ -163,15 +186,17 @@ No controller code change. Supply-chain and CI polish only.
- Initial release: drain → power-off → wait-for-exit → power-on →
uncordon, driven by edge-triggered `HostSystem.recentTask` polling.

[Unreleased]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.4.3...HEAD
[0.4.3]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.4.2...v0.4.3
[0.4.2]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.4.1...v0.4.2
[0.4.1]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.4.0...v0.4.1
[0.4.0]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.3.0...v0.4.0
[0.3.0]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.2.3...v0.3.0
[0.2.3]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.2.2...v0.2.3
[0.2.2]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.2.1...v0.2.2
[0.2.1]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.2.0...v0.2.1
[0.2.0]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.1.1...v0.2.0
[0.1.1]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.1.0...v0.1.1
[0.1.0]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/releases/tag/v0.1.0
[Unreleased]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.5.0...HEAD
[0.5.0]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.4...v0.5.0
[0.4.4]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.4.4
[0.4.3]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.2...v0.4.3
[0.4.2]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.1...v0.4.2
[0.4.1]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.0...v0.4.1
[0.4.0]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.3.0...v0.4.0
[0.3.0]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.2.3...v0.3.0
[0.2.3]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.2.2...v0.2.3
[0.2.2]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.2.1...v0.2.2
[0.2.1]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.2.0...v0.2.1
[0.2.0]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.1.1...v0.2.0
[0.1.1]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.1.0...v0.1.1
[0.1.0]: https://github.com/Varashi/vsphere-passthrough-node-controller/releases/tag/v0.1.0
10 changes: 6 additions & 4 deletions Dockerfile
@@ -1,16 +1,18 @@
FROM python:3.13-slim

LABEL org.opencontainers.image.title="gpu-node-vsphere-maintenance-controller"
LABEL org.opencontainers.image.title="vsphere-passthrough-node-controller"
LABEL org.opencontainers.image.description="Kubernetes controller that automates ESXi maintenance mode for worker nodes with PCI passthrough (GPU or otherwise)."
LABEL org.opencontainers.image.source="https://github.com/Varashi/gpu-node-vsphere-maintenance-controller"
LABEL org.opencontainers.image.documentation="https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/blob/main/README.md"
LABEL org.opencontainers.image.source="https://github.com/Varashi/vsphere-passthrough-node-controller"
LABEL org.opencontainers.image.documentation="https://github.com/Varashi/vsphere-passthrough-node-controller/blob/main/README.md"
LABEL org.opencontainers.image.licenses="MIT"

WORKDIR /app

RUN pip install --no-cache-dir --disable-pip-version-check \
pyVmomi==8.0.3.0.1 kubernetes==31.0.0

COPY controller.py .
COPY controller.py fence.py ./

# Default entrypoint = maintenance controller. The fence controller (fence.py)
# is the same image with the command overridden to `python -u fence.py`.
CMD ["python", "-u", "controller.py"]