OCPEDGE-2474: TNF - Add fencing taint alert agent scripts #6092
OCPEDGE-2474: TNF - Add fencing taint alert agent scripts #6092vimauro wants to merge 6 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@vimauro: This pull request references OCPEDGE-2474 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR adds Pacemaker-driven Kubernetes node fencing capability to a two-node cluster template. When Pacemaker fences a node, an alert handler triggers a systemd service that taints the node; when the node rejoins, another handler untaints it. Sudoers permissions enable passwordless taint/untaint service execution. Both taint and untaint scripts implement retry logic and guard conditions to handle transient failures. ChangesNode Fencing Workflow
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: vimauro The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml (2)
55-70: ⚡ Quick winConsider detecting node deletion during retry loop.
If the fenced node is deleted from the cluster during the retry loop, the current logic will consume all retry attempts. The
|| trueon Line 57 suppresses the error, leavingNODE_READYempty, so the loop continues trying to patch a non-existent node.🛡️ Proposed enhancement
for attempt in $(seq 1 $MAX_RETRIES); do # Re-check NotReady on every attempt to close the TOCTOU window - NODE_READY=$($OC get node "$FENCED_NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null) || true + if ! NODE_READY=$($OC get node "$FENCED_NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null); then + logger -t taint-fenced-node "Node ${FENCED_NODE} no longer exists in cluster — exiting" + exit 0 + fi if [ "$NODE_READY" = "True" ]; then logger -t taint-fenced-node "Node ${FENCED_NODE} became Ready before taint could be applied — aborting" exit 0 fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml` around lines 55 - 70, The retry loop may continue patching a node that has been deleted because NODE_READY is set with "|| true" which silences failures from the $OC get node call; update the loop in the MAX_RETRIES for-block to explicitly detect a missing node before attempting patches by checking the result/exit status of $OC get node for FENCED_NODE (or inspect its output for empty) and if the node is not found, call logger -t taint-fenced-node with a clear message and exit (non-zero or appropriate code) instead of continuing retries; ensure this change is applied around the NODE_READY assignment and before invoking $OC patch node so the patch/ retry logic only runs for existing nodes.
25-27: ⚡ Quick winConsider validating that
hostnamematches a cluster node name.The write probe uses
$(hostname)to annotate the local node, assuming the machine's hostname matches its Kubernetes node name. If they differ, the probe will fail even when the API server is writable.🔍 Proposed enhancement
+# Resolve local node name (may differ from hostname) +LOCAL_NODE=$($OC get nodes -o jsonpath='{.items[?(@.status.addresses[*].address=="'$(hostname -f)'")].metadata.name}' 2>/dev/null | head -n1) +if [ -z "$LOCAL_NODE" ]; then + LOCAL_NODE=$(hostname) +fi + # Phase 1: wait for API server to become writable (etcd quorum recovery) # Read-only checks (oc get nodes) can pass from kube-apiserver watch cache # before etcd is fully writable. Use a write probe to confirm writes work. -until $OC annotate node "$(hostname)" --overwrite __taint-probe-="" &>/dev/null; do +until $OC annotate node "$LOCAL_NODE" --overwrite __taint-probe-="" &>/dev/null; do sleep 5 done🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml` around lines 25 - 27, The probe currently uses oc annotate node "$(hostname)" (the __taint-probe- annotate call) without verifying that the local machine's hostname matches a cluster node name; update the probe to first validate the node exists by calling oc get node "$(hostname)" (or otherwise query oc get nodes and match by IP/CRI identifier) and only attempt oc annotate if that check succeeds, otherwise fail the probe early or resolve the correct node name before calling oc annotate node "__taint-probe-...".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@templates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yaml`:
- Around line 10-13: The CRM_alert_node value is used directly in the systemctl
unit name (taint-node@${CRM_alert_node}.service) without validation; add a
defensive check that CRM_alert_node is non-empty and matches the expected
node-name pattern (e.g., allowed characters and length using a shell regex like
^[a-zA-Z0-9._-]+$), log an error via logger -t k8s-taint-alert and skip calling
sudo systemctl if the value is invalid or empty, and only construct/execute sudo
systemctl start --no-block "taint-node@${CRM_alert_node}.service" after the
validation passes.
---
Nitpick comments:
In
`@templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml`:
- Around line 55-70: The retry loop may continue patching a node that has been
deleted because NODE_READY is set with "|| true" which silences failures from
the $OC get node call; update the loop in the MAX_RETRIES for-block to
explicitly detect a missing node before attempting patches by checking the
result/exit status of $OC get node for FENCED_NODE (or inspect its output for
empty) and if the node is not found, call logger -t taint-fenced-node with a
clear message and exit (non-zero or appropriate code) instead of continuing
retries; ensure this change is applied around the NODE_READY assignment and
before invoking $OC patch node so the patch/ retry logic only runs for existing
nodes.
- Around line 25-27: The probe currently uses oc annotate node "$(hostname)"
(the __taint-probe- annotate call) without verifying that the local machine's
hostname matches a cluster node name; update the probe to first validate the
node exists by calling oc get node "$(hostname)" (or otherwise query oc get
nodes and match by IP/CRI identifier) and only attempt oc annotate if that check
succeeds, otherwise fail the probe early or resolve the correct node name before
calling oc annotate node "__taint-probe-...".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 456c708b-1166-469b-bc88-31a4debbe2f8
📒 Files selected for processing (4)
templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yamltemplates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yamltemplates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yamltemplates/master/00-master/two-node-with-fencing/units/taint-node@.service.yaml
|
/label tide/merge-method-squash |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml`:
- Around line 42-50: node_is_etcd_voter currently returns non-zero for both “not
a voter” and operational errors, causing callers (the taint decision logic) to
treat transient failures as safe-to-taint; change node_is_etcd_voter so it
returns 0 when the node IS an etcd voter, 1 when the node is definitively NOT a
voter, and a distinct non-zero (e.g., 2) for operational errors
(podman/etcdctl/python failures); update the python/json check in
node_is_etcd_voter to distinguish parse/command errors (exit 2) from the
not-found case (exit 1), and update the taint-decision callers that currently
treat any non-zero as “safe to taint” to only proceed with tainting when
node_is_etcd_voter returns 1 (not a voter), and on error code 2 abort/skip taint
and log the failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 11da447f-2505-4adb-afd5-14ff4398d5b8
📒 Files selected for processing (3)
templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yamltemplates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yamltemplates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
✅ Files skipped from review due to trivial changes (1)
- templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- templates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yaml
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@templates/master/00-master/two-node-with-fencing/files/k8s-untaint-alert.yaml`:
- Around line 10-13: The node-name validation permits underscores and uppercase
which later causes untaint-fenced-node.sh to reject the name; update the
CRM_alert_node validation in the k8s-untaint-alert code to match the untaint
script contract by restricting the regex to lowercase letters, digits, dots and
hyphens only (e.g., use a pattern like ^[a-z0-9.-]+$) and keep the existing
logger -t k8s-untaint-alert error path/exit behavior unchanged.
- Around line 15-17: The script currently ignores failures from the sudo
systemctl call that starts the untaint unit and then always returns success via
"exit 0"; update the block that runs sudo systemctl start --no-block
"untaint-node@${CRM_alert_node}.service" to capture its exit status, log an
explicit error (to stderr/syslog or whatever logging mechanism the script uses)
if the start failed, and return a non-zero exit (or implement a bounded retry
with errors logged) instead of unconditionally exiting 0 so rejoin failures are
not silently ignored; key symbols to change are the sudo systemctl start
--no-block "untaint-node@${CRM_alert_node}.service" invocation and the final
exit 0.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d26bb04b-d518-4c5c-888a-bf53e86745cf
📒 Files selected for processing (5)
templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yamltemplates/master/00-master/two-node-with-fencing/files/k8s-untaint-alert.yamltemplates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yamltemplates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yamltemplates/master/00-master/two-node-with-fencing/units/untaint-node@.service.yaml
💤 Files with no reviewable changes (1)
- templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
✅ Files skipped from review due to trivial changes (1)
- templates/master/00-master/two-node-with-fencing/units/untaint-node@.service.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- templates/master/00-master/two-node-with-fencing/files/hacluster-taint-sudoers.yaml
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml`:
- Around line 83-89: The code currently swallows oc get errors for
REMAINING_TAINT and REMAINING_ANNOT which can make both variables empty on
failure and falsely signal success; change the two calls that populate
REMAINING_TAINT and REMAINING_ANNOT (the $OC get node ... jsonpath calls using
${TAINT_KEY}, ${ANNOTATION_KEY}, and ${REJOINED_NODE}) so they do not redirect
stderr to /dev/null or use "|| true", instead capture the command exit status
and treat a non-zero exit as a verification failure (e.g., set a sentinel value
or log and exit non-zero), and update the if condition to require successful oc
get exits plus empty values before declaring success.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4b196609-637a-4e08-a0a5-a6fdd67b6951
📒 Files selected for processing (5)
templates/master/00-master/two-node-with-fencing/files/k8s-taint-alert.yamltemplates/master/00-master/two-node-with-fencing/files/k8s-untaint-alert.yamltemplates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yamltemplates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yamltemplates/master/00-master/two-node-with-fencing/units/untaint-node@.service.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- templates/master/00-master/two-node-with-fencing/units/untaint-node@.service.yaml
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml (1)
73-79:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTransient
oc getfailure can cause false success exitSimilar to the fixed verification logic in lines 85-94, this re-check can falsely exit 0 if
oc getfails. When the API call fails,APPLIED_BYbecomes empty,"" != "pacemaker"evaluates true, and the script exits claiming the annotation was already removed—leaving the taint in place.Proposed fix
for attempt in $(seq 1 $MAX_RETRIES); do # Re-check annotation in case another process already cleaned up - APPLIED_BY=$($OC get node "$REJOINED_NODE" -o jsonpath="{.metadata.annotations['${ANNOTATION_KEY}']}" 2>/dev/null) || true + if ! APPLIED_BY=$($OC get node "$REJOINED_NODE" -o jsonpath="{.metadata.annotations['${ANNOTATION_KEY}']}"); then + logger -t untaint-fenced-node "Attempt ${attempt}/${MAX_RETRIES}: oc get failed during annotation re-check - retrying in ${RETRY_INTERVAL}s" + sleep $RETRY_INTERVAL + continue + fi if [ "$APPLIED_BY" != "$ANNOTATION_VALUE" ]; then logger -t untaint-fenced-node "Annotation already removed from ${REJOINED_NODE} - done" exit 0 fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml` around lines 73 - 79, The re-check using APPLIED_BY=$($OC get node "$REJOINED_NODE" -o jsonpath="{.metadata.annotations['${ANNOTATION_KEY}']}" 2>/dev/null) treats an oc failure as an empty value and incorrectly exits; change the logic to capture the exit status of the $OC get command (or test whether APPLIED_BY assignment succeeded) and only perform the comparison to ANNOTATION_VALUE when the get succeeded; if the get failed, log a transient error via logger -t untaint-fenced-node and continue/retry (sleep and loop) instead of exiting 0, keeping the existing MAX_RETRIES loop and using REJOINED_NODE, ANNOTATION_KEY, ANNOTATION_VALUE, and APPLIED_BY identifiers to locate the fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@templates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml`:
- Around line 73-79: The re-check using APPLIED_BY=$($OC get node
"$REJOINED_NODE" -o jsonpath="{.metadata.annotations['${ANNOTATION_KEY}']}"
2>/dev/null) treats an oc failure as an empty value and incorrectly exits;
change the logic to capture the exit status of the $OC get command (or test
whether APPLIED_BY assignment succeeded) and only perform the comparison to
ANNOTATION_VALUE when the get succeeded; if the get failed, log a transient
error via logger -t untaint-fenced-node and continue/retry (sleep and loop)
instead of exiting 0, keeping the existing MAX_RETRIES loop and using
REJOINED_NODE, ANNOTATION_KEY, ANNOTATION_VALUE, and APPLIED_BY identifiers to
locate the fix.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 42564afc-c436-4a18-80e3-ca1619ccec2a
📒 Files selected for processing (4)
templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yamltemplates/master/00-master/two-node-with-fencing/files/tnf-taint-alert.yamltemplates/master/00-master/two-node-with-fencing/files/tnf-untaint-alert.yamltemplates/master/00-master/two-node-with-fencing/files/untaint-fenced-node.yaml
✅ Files skipped from review due to trivial changes (2)
- templates/master/00-master/two-node-with-fencing/files/tnf-untaint-alert.yaml
- templates/master/00-master/two-node-with-fencing/files/tnf-taint-alert.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- templates/master/00-master/two-node-with-fencing/files/taint-fenced-node.yaml
- What I did
Added MachineConfig templates that deploy a Pacemaker alert agent, a systemd oneshot service, and a sudoers entry to two-node fencing clusters.
When a fencing event completes, the alert agent triggers the systemd service, which waits for the API server to become writable, then applies the
node.kubernetes.io/out-of-service=nodeshutdown:NoExecutetaint and thenode.kubernetes.io/out-of-service-applied-by: pacemakerannotation on the fenced node via a single atomic patch. This enables fast volume detachment and pod rescheduling instead of waiting for the node controller's NotReady timeout.- How to verify it
There will be an alert agent setup for TNF in CEO but in the meantime it can be verified by deploying a TNF cluster with those MCO changes, create a pacemaker alert agent manually, fence a node (e.g.
pcs stonith fence master-0) and monitor that taint and annotation are set on the fenced node.- Description for the changelog
Deploy a Pacemaker alert agent that automatically applies the out-of-service taint on fenced nodes in TNF clusters, enabling immediate workload rescheduling after a fencing event.
Summary by CodeRabbit
New Features