Skip to content

Fix bug where manageMachinePreservation uncordons nodes of Running Machines unconditionally#1107

Draft
thiyyakat wants to merge 5 commits into
gardener:masterfrom
thiyyakat:bug/uncordon-nodes
Draft

Fix bug where manageMachinePreservation uncordons nodes of Running Machines unconditionally#1107
thiyyakat wants to merge 5 commits into
gardener:masterfrom
thiyyakat:bug/uncordon-nodes

Conversation

@thiyyakat

@thiyyakat thiyyakat commented Jun 19, 2026

Copy link
Copy Markdown
Member

What this PR does / why we need it:
This PR does the following:

  • Fixes a bug where MCM unconditionally uncordoned the backing node of any active machine on every preservation reconcile pass, even when the node had been cordoned externally (e.g. by CA or an operator) for reasons unrelated to preservation.
  • Moves the uncordon into stopPreservationWithNode, gated on phase == MachineRunning, so it only fires when preservation is actively being stopped and the machine has recovered.
  • Orders the uncordon before clearMachinePreserveExpiryTime, so a failed uncordon is retried on the next reconcile instead of being silently dropped.
  • Splits manageMachinePreservation into a gathering phase (reconcileMachinePreservation) and a transition phase (manageMachinePreservation).
  • Adds an early exit for machines with no preservation state.
  • Simplifies getEffectivePreservationAnnotations into a pure function.
  • Removes "" as a valid preservation annotation value.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

IT run with machine-controller-manager-provider-aws

The key behavioral change: uncordoning is gated on preservation being stopped in the current reconcile iteration AND phase == MachineRunning. Machines that are actively preserved, or unrelated to preservation, will never have their node uncordoned by this flow.

Release note:

Fixed a bug where a cordoned node backing a non-preserved Running machine would be unconditionally uncordoned on every reconcile pass through the preservation flow.

@thiyyakat thiyyakat requested a review from a team as a code owner June 19, 2026 04:20
@gardener-prow gardener-prow Bot added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Jun 19, 2026
@thiyyakat thiyyakat added kind/bug Bug and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Jun 19, 2026
@gardener-prow gardener-prow Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 19, 2026
@gardener-prow

gardener-prow Bot commented Jun 19, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign unmarshall for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added the cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. label Jun 19, 2026
@thiyyakat thiyyakat force-pushed the bug/uncordon-nodes branch from 3118c8e to 07404d0 Compare June 19, 2026 06:57
@gardener-prow gardener-prow Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. labels Jun 19, 2026
@thiyyakat thiyyakat marked this pull request as draft June 19, 2026 08:33
@gardener-prow gardener-prow Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2026
@thiyyakat thiyyakat force-pushed the bug/uncordon-nodes branch from 07404d0 to 4ab4e59 Compare June 19, 2026 10:00
… uncordoning of preserved nodes only when stopping preservation.

- preserveMachine and stopPreservationIfActive now return the updated *Machine
so callers can thread the latest object through the reconcile cycle without
re-fetching.
- stopPreservationIfActive is split into stopPreservationOnMachineOnly
and stopPreservationWithNode to make the two code paths explicit and easier to follow.
- The uncordon step is moved into stopPreservationWithNode, gated on
MachineRunning phase, and ordered before clearMachinePreserveExpiryTime so a failed uncordon retries with PreserveExpiryTime still set.
  - `reconcileMachinePreservation` handles annotation validation and node/machine
  state gathering, then delegates to `manageMachinePreservation` which contains
  the handling for the effective preservation value. This separates concerns and makes
  each function individually testable.
  - `getEffectivePreservationAnnotations` is simplified to return only the effective
  value (no longer mutates machine annotations as a side effect).
  - `updatePreserveAnnotationOnMachine` is extracted to handle the node annotation
  tracking sync explicitly, replacing the deferred annotation-diff logic.
  `shouldAnnotationsBeUpdatedOnMachine` encodes the conditions under which that
  sync is needed.
  - Early-exit is added when the machine is neither preserved nor annotated,
  avoiding unnecessary work on the majority of machines.
@thiyyakat thiyyakat force-pushed the bug/uncordon-nodes branch from 4ab4e59 to 0e64a5d Compare June 19, 2026 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Bug size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant