Skip to content

OCPEDGE-2474: TNF - Add fencing taint alert agent scripts #6092

Draft
vimauro wants to merge 6 commits into
openshift:mainfrom
vimauro:tnf-fencing-taint
Draft

OCPEDGE-2474: TNF - Add fencing taint alert agent scripts #6092
vimauro wants to merge 6 commits into
openshift:mainfrom
vimauro:tnf-fencing-taint

Conversation

@vimauro
Copy link
Copy Markdown

@vimauro vimauro commented May 27, 2026

- What I did
Added MachineConfig templates that deploy a Pacemaker alert agent, a systemd oneshot service, and a sudoers entry to two-node fencing clusters.
When a fencing event completes, the alert agent triggers the systemd service, which waits for the API server to become writable, then applies the node.kubernetes.io/out-of-service=nodeshutdown:NoExecute taint and the node.kubernetes.io/out-of-service-applied-by: pacemaker annotation on the fenced node via a single atomic patch. This enables fast volume detachment and pod rescheduling instead of waiting for the node controller's NotReady timeout.

- How to verify it

There will be an alert agent setup for TNF in CEO but in the meantime it can be verified by deploying a TNF cluster with those MCO changes, create a pacemaker alert agent manually, fence a node (e.g. pcs stonith fence master-0) and monitor that taint and annotation are set on the fenced node.

- Description for the changelog

Deploy a Pacemaker alert agent that automatically applies the out-of-service taint on fenced nodes in TNF clusters, enabling immediate workload rescheduling after a fencing event.

Summary by CodeRabbit

New Features

  • Automated node tainting during cluster fencing events with retry logic and configurable timeouts
  • Automated node untainting when nodes rejoin the cluster with status validation
  • Pacemaker alert integration to trigger node management during fencing and rejoin events
  • Enhanced cluster maintenance capabilities with improved node lifecycle handling

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 27, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented May 27, 2026

@vimauro: This pull request references OCPEDGE-2474 which is a valid jira issue.

Details

In response to this:

- What I did
Added MachineConfig templates that deploy a Pacemaker alert agent, a systemd oneshot service, and a sudoers entry to two-node fencing clusters.
When a fencing event completes, the alert agent triggers the systemd service, which waits for the API server to become writable, then applies the node.kubernetes.io/out-of-service=nodeshutdown:NoExecute taint and the node.kubernetes.io/out-of-service-applied-by: pacemaker annotation on the fenced node via a single atomic patch. This enables fast volume detachment and pod rescheduling instead of waiting for the node controller's NotReady timeout.

- How to verify it

There will be an alert agent setup for TNF in CEO but in the meantime it can be verified by deploying a TNF cluster with those MCO changes, create a pacemaker alert agent manually, fence a node (e.g. pcs stonith fence master-0) and monitor that taint and annotation are set on the fenced node.

- Description for the changelog

Deploy a Pacemaker alert agent that automatically applies the out-of-service taint on fenced nodes in TNF clusters, enabling immediate workload rescheduling after a fencing event.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR adds Pacemaker-driven Kubernetes node fencing capability to a two-node cluster template. When Pacemaker fences a node, an alert handler triggers a systemd service that taints the node; when the node rejoins, another handler untaints it. Sudoers permissions enable passwordless taint/untaint service execution. Both taint and untaint scripts implement retry logic and guard conditions to handle transient failures.

Changes

Node Fencing Workflow

Layer / File(s) Summary
Pacemaker alert handlers and sudoers permissions
templates/master/00-master/two-node-with-fencing/files/tnf-taint-alert.yaml, templates/master/00-master/two-node-with-fencing/files/tnf-untaint-alert.yaml, templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yaml
Pacemaker alert scripts detect fencing events (CRM_alert_kind=fencing with rc=0) and node rejoin events (CRM_alert_kind=node with member descriptor), logging events and starting taint/untaint systemd units via sudo. Sudoers drop-in grants hacluster user NOPASSWD execution of both taint-node@* and untaint-node@* with systemctl start --no-block.
Systemd unit templates for taint and untaint operations
templates/master/00-master/two-node-with-fencing/units/taint-node@.service.yaml, templates/master/00-master/two-node-with-fencing/units/untaint-node@.service.yaml
Two oneshot service templates parameterized by node instance (%i). Taint service runs /usr/local/bin/taint-fenced-node.sh %i after network.target with 600s timeout. Untaint service runs /usr/local/bin/untaint-fenced-node.sh %i with same timeout, ordered after both network.target and taint-node@%i.service.
Taint script with node validation and patching retry
templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
Script validates input node name against DNS-like regex, exits on invalid or empty name. Builds a strategic-merge patch payload to set node.kubernetes.io/out-of-service annotation and add NoExecute taint. Runs oc patch node in a retry loop with MAX_RETRIES=120 and RETRY_INTERVAL=5 seconds, logging each failed attempt and exiting 0 on success or 1 after all retries exhausted.
Untaint script with Pacemaker gating and annotation cleanup
templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml
Multi-stage script untaints rejoined nodes. Reads node name argument (exits on empty). Polls crm_mon for node Online status (180s timeout, 5s intervals), exits on timeout. Checks node.kubernetes.io/out-of-service-applied-by annotation via oc and exits successfully if not set to pacemaker. Polls node Ready condition up to 120s (logs warning if not Ready by deadline, continues). Removes taint and annotation with 12 retry attempts (5s intervals), re-verifying both are removed; treats verification oc get failures as retryable, exits successfully when cleared or errors after retries exhausted.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title references OCPEDGE-2474 and accurately describes the main change: adding fencing taint alert agent scripts for two-node fencing clusters.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR adds only YAML MachineConfig templates and bash scripts with no Ginkgo test files, so the test name stability check is not applicable.
Test Structure And Quality ✅ Passed PR contains only YAML templates and shell scripts, no Ginkgo test files. The check for test structure and quality is not applicable to this PR.
Microshift Test Compatibility ✅ Passed PR contains no new Ginkgo e2e tests. All changes are MachineConfig YAML templates with embedded bash scripts and systemd units—check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. The changes are infrastructure-as-code YAML templates and scripts for two-node fencing configuration, not test code.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds only host-level MachineConfig scripts/units with no Kubernetes manifests or scheduling constraints (no nodeSelector, affinity, tolerations, replicas, PDBs).
Ote Binary Stdout Contract ✅ Passed PR is in machine-config-operator repo with YAML templates and shell scripts for node deployment, not OTE binaries that require JSON stdout contracts.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds MachineConfig YAML templates and shell scripts only. No Ginkgo e2e tests (It(), Describe(), Context(), When()) are present, so IPv6/disconnected network compatibility check does not apply.
No-Weak-Crypto ✅ Passed PR adds YAML configuration and shell scripts for Kubernetes node tainting; no weak cryptographic algorithms, custom crypto, or insecure secret comparisons detected.
Container-Privileges ✅ Passed PR adds MachineConfig templates with no Kubernetes container specs. No privileged containers, host namespaces, SYS_ADMIN, or allowPrivilegeEscalation flags found.
No-Sensitive-Data-In-Logs ✅ Passed All logging uses only non-sensitive data: node names, Kubernetes taint keys, annotation values, status messages, and retry counters. No passwords, tokens, API keys, PII, or credentials are logged.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vimauro
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml (2)

55-70: ⚡ Quick win

Consider detecting node deletion during retry loop.

If the fenced node is deleted from the cluster during the retry loop, the current logic will consume all retry attempts. The || true on Line 57 suppresses the error, leaving NODE_READY empty, so the loop continues trying to patch a non-existent node.

🛡️ Proposed enhancement
 for attempt in $(seq 1 $MAX_RETRIES); do
     # Re-check NotReady on every attempt to close the TOCTOU window
-    NODE_READY=$($OC get node "$FENCED_NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null) || true
+    if ! NODE_READY=$($OC get node "$FENCED_NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null); then
+        logger -t taint-fenced-node "Node ${FENCED_NODE} no longer exists in cluster — exiting"
+        exit 0
+    fi
     if [ "$NODE_READY" = "True" ]; then
         logger -t taint-fenced-node "Node ${FENCED_NODE} became Ready before taint could be applied — aborting"
         exit 0
     fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml`
around lines 55 - 70, The retry loop may continue patching a node that has been
deleted because NODE_READY is set with "|| true" which silences failures from
the $OC get node call; update the loop in the MAX_RETRIES for-block to
explicitly detect a missing node before attempting patches by checking the
result/exit status of $OC get node for FENCED_NODE (or inspect its output for
empty) and if the node is not found, call logger -t taint-fenced-node with a
clear message and exit (non-zero or appropriate code) instead of continuing
retries; ensure this change is applied around the NODE_READY assignment and
before invoking $OC patch node so the patch/ retry logic only runs for existing
nodes.

25-27: ⚡ Quick win

Consider validating that hostname matches a cluster node name.

The write probe uses $(hostname) to annotate the local node, assuming the machine's hostname matches its Kubernetes node name. If they differ, the probe will fail even when the API server is writable.

🔍 Proposed enhancement
+# Resolve local node name (may differ from hostname)
+LOCAL_NODE=$($OC get nodes -o jsonpath='{.items[?(@.status.addresses[*].address=="'$(hostname -f)'")].metadata.name}' 2>/dev/null | head -n1)
+if [ -z "$LOCAL_NODE" ]; then
+    LOCAL_NODE=$(hostname)
+fi
+
 # Phase 1: wait for API server to become writable (etcd quorum recovery)
 # Read-only checks (oc get nodes) can pass from kube-apiserver watch cache
 # before etcd is fully writable. Use a write probe to confirm writes work.
-until $OC annotate node "$(hostname)" --overwrite __taint-probe-="" &>/dev/null; do
+until $OC annotate node "$LOCAL_NODE" --overwrite __taint-probe-="" &>/dev/null; do
     sleep 5
 done
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml`
around lines 25 - 27, The probe currently uses oc annotate node "$(hostname)"
(the __taint-probe- annotate call) without verifying that the local machine's
hostname matches a cluster node name; update the probe to first validate the
node exists by calling oc get node "$(hostname)" (or otherwise query oc get
nodes and match by IP/CRI identifier) and only attempt oc annotate if that check
succeeds, otherwise fail the probe early or resolve the correct node name before
calling oc annotate node "__taint-probe-...".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@templates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yaml`:
- Around line 10-13: The CRM_alert_node value is used directly in the systemctl
unit name (taint-node@${CRM_alert_node}.service) without validation; add a
defensive check that CRM_alert_node is non-empty and matches the expected
node-name pattern (e.g., allowed characters and length using a shell regex like
^[a-zA-Z0-9._-]+$), log an error via logger -t k8s-taint-alert and skip calling
sudo systemctl if the value is invalid or empty, and only construct/execute sudo
systemctl start --no-block "taint-node@${CRM_alert_node}.service" after the
validation passes.

---

Nitpick comments:
In
`@templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml`:
- Around line 55-70: The retry loop may continue patching a node that has been
deleted because NODE_READY is set with "|| true" which silences failures from
the $OC get node call; update the loop in the MAX_RETRIES for-block to
explicitly detect a missing node before attempting patches by checking the
result/exit status of $OC get node for FENCED_NODE (or inspect its output for
empty) and if the node is not found, call logger -t taint-fenced-node with a
clear message and exit (non-zero or appropriate code) instead of continuing
retries; ensure this change is applied around the NODE_READY assignment and
before invoking $OC patch node so the patch/ retry logic only runs for existing
nodes.
- Around line 25-27: The probe currently uses oc annotate node "$(hostname)"
(the __taint-probe- annotate call) without verifying that the local machine's
hostname matches a cluster node name; update the probe to first validate the
node exists by calling oc get node "$(hostname)" (or otherwise query oc get
nodes and match by IP/CRI identifier) and only attempt oc annotate if that check
succeeds, otherwise fail the probe early or resolve the correct node name before
calling oc annotate node "__taint-probe-...".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 456c708b-1166-469b-bc88-31a4debbe2f8

📥 Commits

Reviewing files that changed from the base of the PR and between 78e9e0e and e706f82.

📒 Files selected for processing (4)
  • templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yaml
  • templates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yaml
  • templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
  • templates/master/00-master/two-node-with-fencing/units/taint-node@.service.yaml

@vimauro
Copy link
Copy Markdown
Author

vimauro commented May 28, 2026

/label tide/merge-method-squash

@openshift-ci openshift-ci Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label May 28, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml`:
- Around line 42-50: node_is_etcd_voter currently returns non-zero for both “not
a voter” and operational errors, causing callers (the taint decision logic) to
treat transient failures as safe-to-taint; change node_is_etcd_voter so it
returns 0 when the node IS an etcd voter, 1 when the node is definitively NOT a
voter, and a distinct non-zero (e.g., 2) for operational errors
(podman/etcdctl/python failures); update the python/json check in
node_is_etcd_voter to distinguish parse/command errors (exit 2) from the
not-found case (exit 1), and update the taint-decision callers that currently
treat any non-zero as “safe to taint” to only proceed with tainting when
node_is_etcd_voter returns 1 (not a voter), and on error code 2 abort/skip taint
and log the failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 11da447f-2505-4adb-afd5-14ff4398d5b8

📥 Commits

Reviewing files that changed from the base of the PR and between e706f82 and 1f50cc1.

📒 Files selected for processing (3)
  • templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yaml
  • templates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yaml
  • templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
✅ Files skipped from review due to trivial changes (1)
  • templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • templates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yaml

Comment thread templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@templates/master/00-master/two-node-with-fencing/files/k8s-untaint-alert.yaml`:
- Around line 10-13: The node-name validation permits underscores and uppercase
which later causes untaint-fenced-node.sh to reject the name; update the
CRM_alert_node validation in the k8s-untaint-alert code to match the untaint
script contract by restricting the regex to lowercase letters, digits, dots and
hyphens only (e.g., use a pattern like ^[a-z0-9.-]+$) and keep the existing
logger -t k8s-untaint-alert error path/exit behavior unchanged.
- Around line 15-17: The script currently ignores failures from the sudo
systemctl call that starts the untaint unit and then always returns success via
"exit 0"; update the block that runs sudo systemctl start --no-block
"untaint-node@${CRM_alert_node}.service" to capture its exit status, log an
explicit error (to stderr/syslog or whatever logging mechanism the script uses)
if the start failed, and return a non-zero exit (or implement a bounded retry
with errors logged) instead of unconditionally exiting 0 so rejoin failures are
not silently ignored; key symbols to change are the sudo systemctl start
--no-block "untaint-node@${CRM_alert_node}.service" invocation and the final
exit 0.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d26bb04b-d518-4c5c-888a-bf53e86745cf

📥 Commits

Reviewing files that changed from the base of the PR and between 1f50cc1 and ceb1f70.

📒 Files selected for processing (5)
  • templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yaml
  • templates/master/00-master/two-node-with-fencing/files/k8s-untaint-alert.yaml
  • templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
  • templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml
  • templates/master/00-master/two-node-with-fencing/units/untaint-node@.service.yaml
💤 Files with no reviewable changes (1)
  • templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
  • templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yaml

Comment thread templates/master/00-master/two-node-with-fencing/files/k8s-untaint-alert.yaml Outdated
Comment thread templates/master/00-master/two-node-with-fencing/files/k8s-untaint-alert.yaml Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml`:
- Around line 83-89: The code currently swallows oc get errors for
REMAINING_TAINT and REMAINING_ANNOT which can make both variables empty on
failure and falsely signal success; change the two calls that populate
REMAINING_TAINT and REMAINING_ANNOT (the $OC get node ... jsonpath calls using
${TAINT_KEY}, ${ANNOTATION_KEY}, and ${REJOINED_NODE}) so they do not redirect
stderr to /dev/null or use "|| true", instead capture the command exit status
and treat a non-zero exit as a verification failure (e.g., set a sentinel value
or log and exit non-zero), and update the if condition to require successful oc
get exits plus empty values before declaring success.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4b196609-637a-4e08-a0a5-a6fdd67b6951

📥 Commits

Reviewing files that changed from the base of the PR and between ceb1f70 and a355aa8.

📒 Files selected for processing (5)
  • templates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yaml
  • templates/master/00-master/two-node-with-fencing/files/k8s-untaint-alert.yaml
  • templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
  • templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml
  • templates/master/00-master/two-node-with-fencing/units/untaint-node@.service.yaml
🚧 Files skipped from review as they are similar to previous changes (1)

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml (1)

73-79: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Transient oc get failure can cause false success exit

Similar to the fixed verification logic in lines 85-94, this re-check can falsely exit 0 if oc get fails. When the API call fails, APPLIED_BY becomes empty, "" != "pacemaker" evaluates true, and the script exits claiming the annotation was already removed—leaving the taint in place.

Proposed fix
     for attempt in $(seq 1 $MAX_RETRIES); do
         # Re-check annotation in case another process already cleaned up
-        APPLIED_BY=$($OC get node "$REJOINED_NODE" -o jsonpath="{.metadata.annotations['${ANNOTATION_KEY}']}" 2>/dev/null) || true
+        if ! APPLIED_BY=$($OC get node "$REJOINED_NODE" -o jsonpath="{.metadata.annotations['${ANNOTATION_KEY}']}"); then
+            logger -t untaint-fenced-node "Attempt ${attempt}/${MAX_RETRIES}: oc get failed during annotation re-check - retrying in ${RETRY_INTERVAL}s"
+            sleep $RETRY_INTERVAL
+            continue
+        fi
         if [ "$APPLIED_BY" != "$ANNOTATION_VALUE" ]; then
             logger -t untaint-fenced-node "Annotation already removed from ${REJOINED_NODE} - done"
             exit 0
         fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml`
around lines 73 - 79, The re-check using APPLIED_BY=$($OC get node
"$REJOINED_NODE" -o jsonpath="{.metadata.annotations['${ANNOTATION_KEY}']}"
2>/dev/null) treats an oc failure as an empty value and incorrectly exits;
change the logic to capture the exit status of the $OC get command (or test
whether APPLIED_BY assignment succeeded) and only perform the comparison to
ANNOTATION_VALUE when the get succeeded; if the get failed, log a transient
error via logger -t untaint-fenced-node and continue/retry (sleep and loop)
instead of exiting 0, keeping the existing MAX_RETRIES loop and using
REJOINED_NODE, ANNOTATION_KEY, ANNOTATION_VALUE, and APPLIED_BY identifiers to
locate the fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml`:
- Around line 73-79: The re-check using APPLIED_BY=$($OC get node
"$REJOINED_NODE" -o jsonpath="{.metadata.annotations['${ANNOTATION_KEY}']}"
2>/dev/null) treats an oc failure as an empty value and incorrectly exits;
change the logic to capture the exit status of the $OC get command (or test
whether APPLIED_BY assignment succeeded) and only perform the comparison to
ANNOTATION_VALUE when the get succeeded; if the get failed, log a transient
error via logger -t untaint-fenced-node and continue/retry (sleep and loop)
instead of exiting 0, keeping the existing MAX_RETRIES loop and using
REJOINED_NODE, ANNOTATION_KEY, ANNOTATION_VALUE, and APPLIED_BY identifiers to
locate the fix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 42564afc-c436-4a18-80e3-ca1619ccec2a

📥 Commits

Reviewing files that changed from the base of the PR and between a355aa8 and dedcee4.

📒 Files selected for processing (4)
  • templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
  • templates/master/00-master/two-node-with-fencing/files/tnf-taint-alert.yaml
  • templates/master/00-master/two-node-with-fencing/files/tnf-untaint-alert.yaml
  • templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml
✅ Files skipped from review due to trivial changes (2)
  • templates/master/00-master/two-node-with-fencing/files/tnf-untaint-alert.yaml
  • templates/master/00-master/two-node-with-fencing/files/tnf-taint-alert.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants