
OCPBUGS-86571: templates: disable IPv4 DAD to fix nodeip-configuration race#6098

Closed
mkowalski wants to merge 1 commit into
openshift:mainfrom
mkowalski:fix/disable-ipv4-dad

@mkowalski
Contributor

@mkowalski mkowalski commented May 28, 2026

Summary

RHEL 10 enables IPv4 DAD (Duplicate Address Detection, which NetworkManager implements as Address Conflict Detection, ACD) by default in NetworkManager. ACD probing takes ~3 seconds, during which the IPv4 address is in a tentative state and invisible to applications.

This causes a race condition in dual-stack baremetal clusters: nodeip-configuration.service starts before the IPv4 address is assigned, sees only IPv6, and configures kubelet as IPv6-only, even though the interface eventually gets both addresses.

Root Cause

RHEL 10 changed the default ipv4.dad-timeout from 0 (disabled) to a non-zero value, enabling ACD probing for all IPv4 addresses. IPv6 DAD completes faster (~2s), so nodeip-configuration.service runs in the window where IPv6 is ready but IPv4 is still probing:

T+0s  Interface up, IPv6 tentative, IPv4 ACD probing starts
T+2s  IPv6 DAD complete, IPv4 still probing
      nodeip-configuration.service runs → sees only IPv6 → writes IPv6-only config
T+4s  IPv4 ACD complete (too late)
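
The timeline above can be replayed as a small model. This is only an illustrative sketch: the `AddressEvent` type and `visible_families` helper are hypothetical names, and the timings are the approximate values quoted in the timeline, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressEvent:
    family: str       # "ipv4" or "ipv6"
    usable_at: float  # seconds after interface-up when the address becomes usable


def visible_families(events, check_time):
    """Return the address families that are usable at check_time."""
    return {e.family for e in events if e.usable_at <= check_time}


# Approximate timings from the timeline: IPv6 DAD done at T+2s, IPv4 ACD at T+4s.
events = [AddressEvent("ipv6", 2.0), AddressEvent("ipv4", 4.0)]

# A check at T+2s sees only IPv6; the same check at T+4s sees both families.
print(sorted(visible_families(events, 2.0)))  # ['ipv6']
print(sorted(visible_families(events, 4.0)))  # ['ipv4', 'ipv6']
```

Any check that runs once inside the T+2s to T+4s window reaches the IPv6-only conclusion, which is why the outcome depends on service start timing rather than on the final interface state.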

Fix

Add a global NetworkManager drop-in (/etc/NetworkManager/conf.d/01-no-dad.conf) that sets ipv4.dad-timeout=0, restoring the RHEL 9 behavior. This follows the same pattern as the existing 01-ipv6.conf drop-in.
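
As a rough illustration of what the drop-in would need to contain, the sketch below parses a candidate file and checks that it disables IPv4 DAD. The actual file contents are not shown in this PR description, so the `DROP_IN` text is an assumption: it places the key under the `[connection]` section, which is where NetworkManager.conf takes connection defaults. The helper name `ipv4_dad_disabled` is hypothetical.

```python
import configparser

# Assumed contents of /etc/NetworkManager/conf.d/01-no-dad.conf (not taken from the PR).
DROP_IN = """\
[connection]
ipv4.dad-timeout=0
"""


def ipv4_dad_disabled(conf_text):
    """Return True if the config text sets ipv4.dad-timeout=0 under [connection]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # fallback=None also covers a missing [connection] section.
    value = parser.get("connection", "ipv4.dad-timeout", fallback=None)
    return value is not None and value.strip() == "0"


print(ipv4_dad_disabled(DROP_IN))  # True
```

A check like this only validates the file on disk; it says nothing about per-connection profiles that set their own `ipv4.dad-timeout`, which would override a global default.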

Fixes: https://issues.redhat.com/browse/OCPBUGS-86571


🤖 This PR was created by OpenClaw on behalf of @mkowalski.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed race condition in dual-stack Kubernetes node IP configuration by adjusting IPv4 Duplicate Address Detection handling in NetworkManager.

RHEL 10 enables IPv4 Duplicate Address Detection (DAD / ACD) by default
in NetworkManager. The ACD probing takes ~3 seconds, during which the
IPv4 address remains in tentative state and is not visible to
applications querying interface addresses.

This introduces a race condition in dual-stack baremetal clusters where
nodeip-configuration.service starts before the IPv4 address is assigned.
The service only sees the IPv6 address (which completes DAD faster) and
configures kubelet with IPv6-only, despite the interface eventually
getting both addresses.

Timeline observed on affected nodes:
  T+0s  Interface up, IPv6 tentative, IPv4 ACD probing starts
  T+2s  IPv6 DAD complete, IPv4 still probing
        nodeip-configuration.service runs → sees only IPv6 → writes
        IPv6-only config
  T+4s  IPv4 ACD complete (too late)

Fix this by adding a global NetworkManager drop-in that sets
ipv4.dad-timeout=0, restoring the RHEL 9 behavior where IPv4 addresses
are assigned immediately without ACD probing.

Fixes: https://issues.redhat.com/browse/OCPBUGS-86571

Generated-by: OpenClaw OpenClaw 2026.5.12 (f066dd2)
AI-model: claude-opus-4.6
Signed-off-by: Mateusz Kowalski <mko@redhat.com>
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai

coderabbitai Bot commented May 28, 2026

Walkthrough

A NetworkManager configuration template file is populated to deploy /etc/NetworkManager/conf.d/01-no-dad.conf with content that disables IPv4 Duplicate Address Detection (DAD) by setting ipv4.dad-timeout=0. The configuration addresses a race condition affecting dual-stack Kubernetes node IP assignment on RHEL 10.

Changes

NetworkManager IPv4 DAD Configuration

| Layer / File(s) | Summary |
|---|---|
| IPv4 DAD timeout configuration: templates/common/_base/files/NetworkManager-no-dad.yaml | Template now contains file deployment directive with mode 0644 and inline NetworkManager configuration that disables IPv4 DAD by setting ipv4.dad-timeout=0, including comments explaining the RHEL 10 race condition impact. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR only modifies NetworkManager-no-dad.yaml (a configuration file), not test files. No Ginkgo test definitions added or modified. |
| Test Structure And Quality | ✅ Passed | PR adds well-structured Ginkgo tests with single responsibility, proper setup/cleanup, timeouts on cluster operations, meaningful assertion messages, and consistent codebase patterns. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. The change is a NetworkManager configuration file, not a test. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR modifies only NetworkManager configuration YAML file, not Ginkgo e2e tests. The SNO compatibility check applies only to new test additions, not infrastructure configuration changes. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Change adds a NetworkManager configuration template, not deployment manifests, operator code, or controllers. No pod scheduling constraints or Kubernetes scheduling constructs introduced. |
| Ote Binary Stdout Contract | ✅ Passed | PR only modifies a YAML NetworkManager configuration template; no Go code, test code, or process-level stdout writes are present in this change. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR adds NetworkManager configuration file to disable IPv4 DAD, not new Ginkgo e2e tests. Check is not applicable to non-test changes. |
| No-Weak-Crypto | ✅ Passed | PR adds only a NetworkManager configuration file with no weak crypto (MD5/SHA1/DES/RC4/3DES/Blowfish/ECB), no custom crypto implementations, and no secret comparisons. |
| Container-Privileges | ✅ Passed | PR adds NetworkManager config file only; no container/K8s manifests with privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, or allowPrivilegeEscalation found. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The PR adds a NetworkManager configuration file with no sensitive data exposure. No passwords, tokens, API keys, PII, session IDs, or customer data are present in the added content. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a NetworkManager configuration to disable IPv4 DAD to fix a nodeip-configuration race condition. |


@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mkowalski
Once this PR has been reviewed and has the lgtm label, please assign umohnani8 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mkowalski mkowalski changed the title templates: disable IPv4 DAD to fix nodeip-configuration race OCPBUGS-86571: templates: disable IPv4 DAD to fix nodeip-configuration race May 28, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 28, 2026
@openshift-ci-robot
Contributor

@mkowalski: This pull request references Jira Issue OCPBUGS-86571, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mkowalski
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 28, 2026
@openshift-ci-robot
Contributor

@mkowalski: This pull request references Jira Issue OCPBUGS-86571, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rbbratta


@openshift-ci openshift-ci Bot requested a review from rbbratta May 28, 2026 08:50
@mkowalski
Contributor Author

/jira backport release-4.22

@openshift-ci-robot
Contributor

@mkowalski: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.22


@openshift-cherrypick-robot

@openshift-ci-robot: once the present PR merges, I will cherry-pick it on top of release-4.22 in a new PR and assign it to you.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

@mkowalski: all tests passed!

Full PR test history. Your PR dashboard.


@mkowalski
Contributor Author

/hold

It's not necessarily what I want. Ref.: https://redhat-internal.slack.com/archives/C04M1SH1VNZ/p1779891220519739

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2026
@mkowalski
Contributor Author

/close

Rather not. Superseded by

  1. OCPBUGS-86571: node-ip: wait for both address families on dual-stack clusters baremetal-runtimecfg#391
  2. OCPBUGS-86571: templates: pass --dual-stack to nodeip-configuration #6105

@openshift-ci openshift-ci Bot closed this May 29, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2026

@mkowalski: Closed this PR.
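
The idea named in the first superseding change, waiting for both address families on dual-stack clusters, can be sketched as a polling loop. This is not the baremetal-runtimecfg implementation, which is not shown in this thread. The function names, the `list_addresses` callable, and the timeout values are all hypothetical, and the sketch does not filter out link-local IPv6 addresses as a real check likely would.

```python
import ipaddress
import time


def families(addresses):
    """Return the set of IP versions (4 and/or 6) present in the address strings."""
    return {ipaddress.ip_address(a).version for a in addresses}


def wait_for_dual_stack(list_addresses, timeout=30.0, interval=0.5,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll list_addresses() until both IPv4 and IPv6 appear, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        addrs = list_addresses()
        if families(addrs) >= {4, 6}:
            return addrs
        if clock() >= deadline:
            raise TimeoutError("timed out waiting for both IPv4 and IPv6 addresses")
        sleep(interval)


# Simulated snapshots: IPv6 appears first, IPv4 only after ACD finishes.
snapshots = iter([["fd00::10"], ["fd00::10"], ["fd00::10", "192.0.2.10"]])
print(wait_for_dual_stack(lambda: next(snapshots), interval=0, sleep=lambda s: None))
```

Compared with disabling DAD, an approach along these lines keeps conflict detection enabled and tolerates the delay instead of removing it.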

@openshift-ci-robot
Contributor

@mkowalski: This pull request references Jira Issue OCPBUGS-86571. The bug has been updated to no longer refer to the pull request using the external bug tracker.



Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
