Skip to content

[rel-v0.62] Fix bug where manageMachinePreservation uncordons nodes of Running Machines unconditionally#1108

Draft
thiyyakat wants to merge 4 commits into
gardener:rel-v0.62from
thiyyakat:bug/uncordon-nodes-hotfix
Draft

[rel-v0.62] Fix bug where manageMachinePreservation uncordons nodes of Running Machines unconditionally#1108
thiyyakat wants to merge 4 commits into
gardener:rel-v0.62from
thiyyakat:bug/uncordon-nodes-hotfix

Conversation

@thiyyakat

Copy link
Copy Markdown
Member

What this PR does / why we need it:

The previous implementation called uncordonNodeIfCordoned whenever IsMachineActive returned true and a node name was present. This meant any Running machine with a cordoned backing node (cordoned for reasons unrelated to preservation, e.g: CA scale-down) would be unconditionally uncordoned on every reconcile pass.

The PR fixes this by:

  1. Introducing a check which ensures non-preservation-bound machines are not candidates for preservation-related operations
  2. Performs uncordonIfCordoned only on machines for which preservation is being stopped, and the machine is in Running phase.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

The following tests were added:

  • Preserved when-failed machine recovers to Running → node uncordoned, PreserveExpiryTime cleared.
  • Preserved when-failed machine expires while Running → node uncordoned, annotation removed.
  • Running machine with no preservation annotations and externally-cordoned node → node untouched.
  • Machine still in Failed phase with active preservation → node stays cordoned.
  • Actively-preserved machines (now, when-failed via node or machine annotation) with a cordoned node → node stays cordoned.
  • Non-preservation-bound machine → zero writes to machine status/annotations/labels/spec and to the backing node.

Release note:

Fixed a bug where a cordoned node backing a non-preserved Running machine would be unconditionally uncordoned on every reconcile pass through the preservation flow.

When stopping preservation on a machine with a backing node, uncordon
the node only when the machine phase is MachineRunning. This step is
ordered before clearMachinePreserveExpiryTime so that a failed uncordon
retries on the next reconcile with PreserveExpiryTime still set.
…d uncordon

- Add an early-exit in manageMachinePreservation when the machine is
  neither preserved (PreserveExpiryTime == nil) nor annotated. This
  avoids unnecessary work on machines unrelated to preservation.
- Remove the uncordon call at the end of the function. Uncordoning is
  now done inside stopPreservationIfActive, gated on MachineRunning.
- Extend manageMachinePreservation test matrix with nodeUnschedulable
  in setup/expect to verify node cordon state after preservation actions.
- Add cases: preserved machine recovers to Running (uncordoned), preservation
  expires while Running (uncordoned + annotation removed), externally-cordoned
  Running machine with no preservation (node left cordoned), preserved Failed
  machine (node left cordoned).
- Fix existing preservation-start entries to expect nodeUnschedulable=true
  since drain cordon the node as part of the preservation flow.
- Add tests to verify that a cordoned node is left untouched when the
  machine has no preservation annotations, and that actively-preserved
  machines keep their node cordoned regardless of which annotation drives preservation.
- It block verifies that manageMachinePreservation performs no writes at
  all (status, annotations, labels, spec, node) for a machine with no
  preservation state.
@thiyyakat thiyyakat requested a review from a team as a code owner June 19, 2026 11:49
@gardener-prow

gardener-prow Bot commented Jun 19, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign takoverflow for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Jun 19, 2026
@thiyyakat thiyyakat marked this pull request as draft June 19, 2026 11:57
@gardener-prow gardener-prow Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant