OCPBUGS-86554: Wait for operators after removing master machine #6091

Open
sergiordlr wants to merge 1 commit into openshift:main from sergiordlr:wait_for_operators_when_removing_master_machine
Conversation

@sergiordlr
Contributor

@sergiordlr sergiordlr commented May 27, 2026

…ster machine

- What I did

Wait for all cluster operators to be stable after re-creating a master machine in test case [PolarionID:85467][OTP] ControlPlaneMachineSets. Bootimage upgrade stub ignition to spec 3

Because the test now waits for the operators to become idle, its duration has increased considerably, which makes the Prow job time out. Since we don't need this signal for GA, we have decided to move the test to the long-duration suite.

- How to verify it

When the test ends, all operators should be stable.

Check the intervals in the execution

In this execution the test ends while the operators are still updating. This should not happen: the test should not end until all operators are stable.

https://sippy.dptools.openshift.org/sippy-ng/job_runs/2058601931412082688/periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3/intervals?end=2026-05-24T22%3A10%3A34Z&filterText=&intervalFile=e2e-timelines_spyglass_20260524-183519.json&overrideDisplayFlag=0&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=E2EPassed&start=2026-05-24T21%3A01%3A30Z

Summary by CodeRabbit

  • Tests
    • Restructured test organization into clearer disruptive and long-duration contexts.
    • Improved boot-image upgrade test stability by waiting for cluster operators to stabilize after control-plane recreation, asserting success, and logging confirmation.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/valid-reference jira/invalid-bug labels May 27, 2026
@openshift-ci-robot
Contributor

@sergiordlr: This pull request references Jira Issue OCPBUGS-86554, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from djoshy and yuqi-zhang May 27, 2026 13:41
@coderabbitai

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 595d5730-e45d-4a65-a461-cbc05f8e40f9

📥 Commits

Reviewing files that changed from the base of the PR and between ef7e0be and 10cbf0c.

📒 Files selected for processing (1)
  • test/extended-priv/mco_controlplanemachineset.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended-priv/mco_controlplanemachineset.go

Walkthrough

The Ginkgo spec is reorganized into nested disruptive and longduration contexts; the longduration "Bootimage upgrade stub ignition to spec 3" test now waits for cluster operators to stabilize via WaitForStableCluster(oc.AsAdmin(), "3m", "50m") after control plane machine recreation and logs success.
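To see why moving the suite tag from the outer Describe to nested contexts changes which suite a test belongs to, here is a minimal sketch in plain Go (no Ginkgo dependency). It relies on the fact that a spec's full name is the text of every enclosing container joined with the spec's own text. The outer container text, the `longduration` tag string, and the first spec name are assumptions made for illustration; only the `disruptive` tag and the boot-image test name appear in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// container models a Describe/Context node: its text, nested containers,
// and the specs (It blocks) declared directly inside it.
type container struct {
	text     string
	children []container
	specs    []string
}

// fullNames returns each spec's full name, built by joining the text of
// every enclosing container with the spec text, separated by spaces.
func fullNames(c container, prefix string) []string {
	name := strings.TrimSpace(prefix + " " + c.text)
	var out []string
	for _, s := range c.specs {
		out = append(out, name+" "+s)
	}
	for _, child := range c.children {
		out = append(out, fullNames(child, name)...)
	}
	return out
}

// selectSuite keeps only the full names that carry the given suite tag.
func selectSuite(names []string, tag string) []string {
	var out []string
	for _, n := range names {
		if strings.Contains(n, tag) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	root := container{
		text: "ControlPlaneMachineSets", // hypothetical outer Describe text, no suite tag
		children: []container{
			{
				text:  "[Suite:openshift/machine-config-operator/disruptive]",
				specs: []string{"Partial mode behavior"}, // hypothetical spec name
			},
			{
				// Assumed tag for the long-duration suite.
				text:  "[Suite:openshift/machine-config-operator/longduration]",
				specs: []string{"Bootimage upgrade stub ignition to spec 3"},
			},
		},
	}
	names := fullNames(root, "")
	for _, n := range selectSuite(names, "longduration") {
		fmt.Println(n)
	}
}
```

With the tag on the outer container, every nested spec would inherit it; with the tag on each nested context, a suite filter picks up only the specs under that context.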

Changes

ControlPlaneMachineSet tests

Layer / File(s) | Summary
Top-level Describe label update (test/extended-priv/mco_controlplanemachineset.go) | Removed the [Suite:openshift/machine-config-operator/disruptive] tag from the outer g.Describe, moving suite tagging to nested contexts.
Disruptive context and regrouped tests (test/extended-priv/mco_controlplanemachineset.go) | Introduced a nested disruptive g.Context and moved related g.It cases (marketplace boot-image handling, Partial/None mode behavior, owner-ref checks, MachineConfiguration status tests) into it.
Longduration test: add cluster stability wait (test/extended-priv/mco_controlplanemachineset.go) | In the longduration g.Context, the "Bootimage upgrade stub ignition to spec 3" test now calls WaitForStableCluster(oc.AsAdmin(), "3m", "50m") after control plane machine recreation, asserts success, and logs OK! before continuing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm, verified

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately reflects the main change: moving the test that waits for operators after removing a master machine to the long-duration suite with the added wait logic.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | All test names in mco_controlplanemachineset.go are static with no dynamic content, timestamps, UUIDs, pod/node names, or variables; they are deterministic and appropriately descriptive.
Test Structure And Quality | ✅ Passed | Code meets all requirements: single responsibility per test, proper cleanup with defer, timeouts on cluster operations (50m), meaningful assertion messages, consistent codebase patterns.
Microshift Test Compatibility | ✅ Passed | No new Ginkgo tests added. All existing tests already have [apigroup:machineconfiguration.openshift.io] tag and SkipOnSingleNodeTopology() guard, protecting them from MicroShift.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Test makes multi-node assumptions (expects 3 control planes) but is protected by SkipOnSingleNodeTopology() in JustBeforeEach hook that applies to all nested tests.
Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only test code in test/extended-priv/mco_controlplanemachineset.go, not deployment manifests or operator code. Test properly skips on single-node topology.
Ote Binary Stdout Contract | ✅ Passed | No process-level stdout violations found. All code within Ginkgo test blocks; logging uses ginkgo.GinkgoWriter; WaitForStableCluster() call is within It() block.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The test file contains no IPv4-specific assumptions, hardcoded IPv4 addresses, or external connectivity requirements. Tests operate within the cluster and use cluster-internal APIs.
No-Weak-Crypto | ✅ Passed | The PR changes only a test file with no weak crypto algorithms, crypto imports, custom crypto implementations, or non-constant-time comparisons of secrets.
Container-Privileges | ✅ Passed | PR modifies only test code (Go test files), no Kubernetes manifests or container specs with privileged settings are added or modified.
No-Sensitive-Data-In-Logs | ✅ Passed | PR adds logging for test flow and Kubernetes API metadata only. No passwords, tokens, keys, PII, session IDs, hostnames, or sensitive customer data are logged.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Command failed



@sergiordlr
Contributor Author

/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

@sergiordlr: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2e7ec460-59d2-11f1-823e-6414ed7b54ca-0

@openshift-ci-robot
Contributor

@sergiordlr: An error was encountered searching for bug OCPBUGS-86554 on the Jira server at https://redhat.atlassian.net. No known errors were detected, please see the full error message for details.

Full error message. No response returned: Get "https://redhat.atlassian.net/rest/api/2/issue/OCPBUGS-86554": GET https://redhat.atlassian.net/rest/api/2/issue/OCPBUGS-86554 giving up after 5 attempt(s)

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

@openshift-ci-robot openshift-ci-robot removed jira/valid-reference jira/invalid-bug labels May 27, 2026
@openshift-ci-robot
Contributor

@sergiordlr: The referenced Jira(s) [OCPBUGS-86554] could not be located, all automatically applied jira labels will be removed.

@djoshy
Contributor

djoshy commented May 27, 2026

/retitle OCPBUGS-86554: Wait for operators after removing master machine

@openshift-ci openshift-ci Bot changed the title OCPBUGS-86554: in test OCP-85467 wait for operators after removing ma… OCPBUGS-86554: Wait for operators after removing master machine May 27, 2026
@sergiordlr sergiordlr force-pushed the wait_for_operators_when_removing_master_machine branch from 3814c82 to 398e0d3 Compare May 27, 2026 13:54
@sergiordlr
Contributor Author

/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

@sergiordlr: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a44edf80-59d3-11f1-9aab-20190993fc53-0

@openshift-ci-robot openshift-ci-robot added jira/valid-reference jira/invalid-bug labels May 27, 2026
@openshift-ci-robot
Contributor

@sergiordlr: This pull request references Jira Issue OCPBUGS-86554, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@djoshy
Contributor

djoshy commented May 27, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug and removed jira/invalid-bug labels May 27, 2026
@openshift-ci-robot
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-86554, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

@sergiordlr
Contributor Author

/retest

@sergiordlr
Contributor Author

/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

@sergiordlr: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0ce64f0-5a67-11f1-9ec2-5a31d598467f-0

@sergiordlr sergiordlr force-pushed the wait_for_operators_when_removing_master_machine branch from 398e0d3 to ef7e0be Compare May 28, 2026 14:52
@sergiordlr
Contributor Author

/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

@sergiordlr: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1faef760-5aa5-11f1-98eb-c0fb1841307d-0

@openshift-ci-robot
Contributor

@sergiordlr: This pull request references Jira Issue OCPBUGS-86554, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@djoshy
Contributor

djoshy commented May 28, 2026

/lgtm

Thank you for the fix - seems sane to me, let's make sure the suites are green before merging.

@openshift-ci openshift-ci Bot added the lgtm label May 28, 2026
@sergiordlr
Contributor Author

/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2026

@sergiordlr: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-disruptive-techpreview-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1fbd97c0-5b31-11f1-8b31-c7894e2a6a3e-0

@sergiordlr
Contributor Author

/retest

@sergiordlr sergiordlr force-pushed the wait_for_operators_when_removing_master_machine branch from ef7e0be to 10cbf0c Compare May 29, 2026 07:56
@openshift-ci openshift-ci Bot removed the lgtm label May 29, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2026

New changes are detected. LGTM label has been removed.

@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, sergiordlr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label May 29, 2026
@sergiordlr
Contributor Author

The test is now in periodic-ci-openshift-machine-config-operator-release-4.23-periodics-e2e-aws-mco-fips-proxy-longduration-1of2

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-machine-config-operator-release-4.23-periodics-e2e-aws-mco-fips-proxy-longduration-1of2/2060018867723309056

However, the etcd operator still appears to be updating when the test ends. A 30-second stability window is not enough: etcd started updating again after 40-50 seconds. We have increased the stability period to 3 minutes to make sure that the cluster is really stable.

Labels

approved, jira/valid-bug, jira/valid-reference

3 participants