Fwupd 4.20 backports #369

Open
jacob-anders wants to merge 15 commits into openshift:release-4.20 from jacob-anders:fwupd-4.20-backports
Conversation

@jacob-anders

No description provided.

jacob-anders and others added 15 commits February 5, 2026 19:29
Resolves a bug where firmware updates fail intermittently on some
hardware models due to invalid or unstable BMC responses immediately
after firmware update completion. The BMC may return inconsistent
responses for a period after firmware updates, causing the update
process to fail prematurely.

This change adds comprehensive BMC state validation that requires
multiple consecutive successful responses from System, Manager, and
NetworkAdapters resources before considering the firmware update
complete. This ensures the BMC has fully stabilized before proceeding.
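
A minimal sketch of the "multiple consecutive successful responses" idea described above. The function name, parameters, and defaults are hypothetical and are not taken from the Ironic change; each entry in `checks` stands in for a GET against a System, Manager, or NetworkAdapters resource.

```python
import time
from typing import Callable, Sequence


def wait_for_stable_bmc(
    checks: Sequence[Callable[[], object]],
    required_successes: int = 3,
    timeout: float = 300.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once all checks pass `required_successes` times in a row.

    Hypothetical helper: any failing check resets the streak to zero, so a
    single good response after a firmware update is not enough.
    """
    deadline = clock() + timeout
    streak = 0
    while clock() < deadline:
        try:
            for check in checks:
                check()
            streak += 1
        except Exception:
            # Assumption: real code would catch specific BMC errors only.
            streak = 0
        if streak >= required_successes:
            return True
        sleep(interval)
    return False
```

Injecting `clock` and `sleep` keeps the sketch testable without real waiting; the streak reset is the part that guards against a BMC that answers once and then becomes unstable again.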

Generated-By: Claude Code Sonnet 4
Change-Id: I5cb72f62d3fc62c3ad750c62924842cef59e79b8
Signed-off-by: Jacob Anders <janders@redhat.com>
(cherry picked from commit 85ec9d6)
Add required boot params in Redfish calls for AsRockRack

Related-Bug: #2073518
Change-Id: I0610d488eb4392bf335464e685aaadbf28d59529
Signed-off-by: Mohamed EL HADDAD <mohamed.el-haddad@ovhcloud.com>
(cherry picked from commit 54977a1)
There have been reports of firmware upgrades failing on Gen11 iLO
machines with GET NetworkAdapters returning 400 responses. This change
attempts to resolve this by catching the exception relevant to the fault.
Change-Id: I62095c2b61d14688d2dcbcdcfd29e9391af2c0ba
Signed-off-by: Jacob Anders <janders@redhat.com>
(cherry picked from commit bba3041)
Use extended timeout (by default 300 seconds) for BMC firmware
updates to handle BMC transitional states during firmware update process,
unless a different timeout is specified by the operator.

Assisted-By: Claude Code Sonnet 4
Change-Id: I2125ff4cdcbd07a89b364968dda4bb60e059121c
Signed-off-by: Jacob Anders <janders@redhat.com>
(cherry picked from commit fbe0e18)
Treat absent firmware package version as non-cacheable to avoid NOT NULL
database constraint violation.

Closes-Bug: #2130990
Change-Id: Ic2efaa0d53b6923908112c937957a60aa4f1ad9d
Signed-off-by: Afonne-CID <afonnepaulc@gmail.com>
(cherry picked from commit 5563e52)
On most hardware platforms, each firmware component that can be updated
has different reboot requirements. In addition, some platforms
are particularly sensitive to reboots happening at the expected time.
This change attempts to make the reboot behavior in the
_execute_firmware_update method depend on the component being updated,
so that it also works for multi-component scenarios.

Assisted-By: Claude Code Sonnet 4.5
Change-Id: Ie4fe72406e3aedb8af246703f13f41e31866f58c
Signed-off-by: Jacob Anders <janders@redhat.com>
Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com>
(cherry picked from commit d90824a)
Some BMCs (particularly HPE iLO) may return is_processing=False while
the firmware update task is still in RUNNING, STARTING, or PENDING state.
The previous code incorrectly treated this as task completion and entered
the completion handler, which only recognizes COMPLETED as success. This
resulted in firmware updates being marked as failed with blank error
messages when the BMC had no error messages to report for an ongoing task.
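
The distinction described above can be sketched as a small classifier. This is an illustration only: the function name and the returned labels are hypothetical, and the state names are the ones listed in this commit message.

```python
# States named in the commit message as "still in progress".
IN_PROGRESS_STATES = frozenset({"RUNNING", "STARTING", "PENDING"})


def classify_update_task(is_processing: bool, task_state: str) -> str:
    """Classify a firmware update task (hypothetical helper).

    A task counts as in progress when either the BMC says it is processing
    or the reported state is one of the in-progress states, so that
    is_processing=False alone never triggers the completion handler.
    """
    state = task_state.upper()
    if is_processing or state in IN_PROGRESS_STATES:
        return "in_progress"
    if state == "COMPLETED":
        return "succeeded"
    return "failed"
```

With this shape, an iLO response of `is_processing=False` with state `RUNNING` keeps the task in the polling loop instead of being reported as a failure with a blank error message.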

Assisted-By: Claude Code Sonnet 4.5
Closes-Bug: #2136089
Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com>
Change-Id: I8b61fea63b8af0cf4c3245758538eeb36a7a5b04
(cherry picked from commit 326c4e9)
(cherry picked from commit f6231ae)
Some BMCs (particularly HPE iLO) complete BIOS firmware update tasks
very quickly (within 20-30 seconds) by staging the firmware, but the
firmware is only applied on the next reboot. If the task completes
before Ironic's first poll (which happens ~60 seconds after task
creation), the _handle_bios_task_starting() method never runs and no
reboot is triggered.

This results in:
- BIOS firmware staged but not applied
- Code incorrectly logs "System was already rebooted"
- BIOS version remains unchanged

Observed behavior:
- HPE iLO: Task created at 20:56:23, completed at 20:56:46 (23s)
- First poll at 20:57:24 found task already completed
- No reboot was triggered, BIOS firmware remained at old version

Assisted-By: Claude Code Sonnet 4.5
Closes-Bug: #2136088
Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com>
Change-Id: Idb2cc5a6a1f9415f1ad5b5e36616abe44cb51861
(cherry picked from commit 2306d90)
(cherry picked from commit ba35f1a)
This lowers the log level used when NetworkAdapters are missing from a
Redfish BMC from warning to debug. This resolves an issue where noisy
warnings were logged on hardware without Redfish NetworkAdapters
support.

Generated-by: Claude-code 2.0
Closes-bug: #2133727
Signed-off-by: Jay Faulkner <jay@jvf.cc>
Change-Id: If48757c6ec4a1f7978bd973830020161c55922e4
(cherry picked from commit 18bedb6)
(cherry picked from commit 4cbcb93)
The _validate_resources_stability() function was only catching
sushy.exceptions.BadRequestError, but Dell iDRAC can return
sushy.exceptions.ServerSideError (HTTP 500) and
sushy.exceptions.ConnectionError when BMC resources are temporarily
unavailable during firmware updates.

When these uncaught exceptions occurred, they would crash the periodic
task and prevent notify_conductor_resume_service() from being called,
leaving the node stuck in servicing state even though the firmware
update had completed successfully.
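
A rough sketch of why broadening the caught exceptions matters for the periodic task. The exception classes below are local stand-ins for the sushy exceptions named above, and `poll_once` and its callbacks are hypothetical names, not Ironic functions.

```python
class BadRequestError(Exception):
    """Stand-in for sushy.exceptions.BadRequestError."""


class ServerSideError(Exception):
    """Stand-in for sushy.exceptions.ServerSideError (HTTP 500)."""


class BMCConnectionError(Exception):
    """Stand-in for sushy.exceptions.ConnectionError."""


# Errors treated as "BMC temporarily unavailable" rather than fatal.
TRANSIENT_BMC_ERRORS = (BadRequestError, ServerSideError, BMCConnectionError)


def poll_once(fetch_resources, resume_service) -> bool:
    """Run one periodic-task iteration; return True if servicing resumed."""
    try:
        fetch_resources()
    except TRANSIENT_BMC_ERRORS:
        # BMC not ready yet: skip this tick instead of crashing the task.
        return False
    resume_service()
    return True
```

Because a transient error only ends the current iteration, a later iteration can still reach the resume call, which is what keeps the node from being left in the servicing state.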

Closes-Bug: #2136087
Change-Id: I3b5806cc4fb055d4264bb6ae9008f57d8c1e0cc1
Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com>
(cherry picked from commit dce1907)
(cherry picked from commit 11b9722)
Service and clean steps that have requires_ramdisk=False operate
independently of the ramdisk agent (e.g., Redfish firmware updates).
These steps should not be subject to heartbeat timeouts since they
do not require the agent to be running.

Previously, the _check_servicewait_timeouts and _check_cleanwait_timeouts
periodic tasks would time out and fail these steps if no agent heartbeat
was received for service_callback_timeout (default 30 minutes), even
though the step was successfully executing via out-of-band mechanisms.

This caused firmware updates taking longer than 30 minutes to fail with:
"Timeout reached while servicing the node. Please check if the ramdisk
responsible for the servicing is running on the node."

The error message was particularly misleading because the step explicitly
declared requires_ramdisk=False, meaning no ramdisk was expected.
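
The rule described above can be sketched as follows. The `Step` dataclass and `should_time_out` function are hypothetical illustrations, not the conductor code; the 1800-second default mirrors the 30 minutes mentioned in this message.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical stand-in for a clean or service step."""
    name: str
    requires_ramdisk: bool = True


def should_time_out(step: Step, seconds_since_heartbeat: float,
                    callback_timeout: float = 1800.0) -> bool:
    """Decide whether a waiting step should fail on a heartbeat timeout."""
    if not step.requires_ramdisk:
        # Out-of-band steps never depend on the agent heartbeat.
        return False
    return seconds_since_heartbeat > callback_timeout
```

For example, a Redfish firmware update step with `requires_ramdisk=False` that runs for 45 minutes would not be failed by this check, while an agent-based step with no heartbeat for the same period would.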

Closes-Bug: #2136276
Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com>
Signed-off-by: Jacob Anders <janders@redhat.com>
Assisted-by: Claude (Anthropic) version 4.5
Change-Id: I1cade32e1dce57441e83cbc9f0b07d9ee5e0ec01
(cherry picked from commit 54ca5a3)
(cherry picked from commit c05c035)
(cherry picked from commit 6d3bec3)
After firmware updates complete and the system reboots, Ironic attempts
to tear down from service by resetting the boot device. However, HPE iLO
BMCs reject boot device modifications while the system is in POST
(Power-On Self-Test), returning an UnableToModifyDuringSystemPOST error.

This change adds retry functionality to assist the operators dealing
with such scenarios.
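
A minimal retry sketch for the scenario above, assuming a fixed number of attempts with a fixed delay. The exception class, function name, and default values are hypothetical; the actual change may use different retry settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SystemInPostError(Exception):
    """Hypothetical stand-in for the UnableToModifyDuringSystemPOST error."""


def set_boot_device_with_retry(
    set_boot_device: Callable[[], T],
    attempts: int = 6,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `set_boot_device`, retrying while the system is still in POST."""
    for attempt in range(1, attempts + 1):
        try:
            return set_boot_device()
        except SystemInPostError:
            if attempt == attempts:
                raise  # Out of attempts: surface the original error.
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

Only the POST-specific error is retried, so unrelated failures still surface immediately.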

Closes-Bug: #2136275
Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com>
Signed-off-by: Jacob Anders <janders@redhat.com>
Assisted-By: Claude Opus 4.5
Change-Id: I714030f433d6730a99f9f68cf60ce330e9d43c76
(cherry picked from commit 2d253a0)
(cherry picked from commit 9a4788b)
(cherry picked from commit 6809dd2)
Follow-up to https://review.opendev.org/c/openstack/ironic/+/966344

- Add constants for temporary field names to prevent typos
- Add _clean_temp_fields() helper for centralized cleanup
- Move BMC check start time from driver_internal_info to settings dict
- Change task completion log from INFO to DEBUG

Assisted-By: Claude Opus 4.5
Change-Id: I9394fc3ada08f14ad44dc59839f4f6b97f63a1f5
Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me>
(cherry picked from commit 07102b6)
This commit changes the default values for the following config options:
- [redfish]firmware_update_wait_unresponsive_bmc
- [redfish]firmware_update_resource_validation_timeout

Closes-Bug: #2139122
Change-Id: Ie964df68e3bf4e0421edf4fedab7748603baf4e0
Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com>
(cherry picked from commit cb7e0a5)
We saw cases in our testing where the newer information
of the node object wasn't being saved.
task.upgrade_lock reloads the node from the DB, but the
node variable would still point to the old node object.
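
The stale-reference problem can be illustrated with toy classes. `Node`, `Task`, and `record_result` below are simplified, hypothetical stand-ins; only the behavior that upgrading the lock replaces `task.node` with a freshly loaded object is taken from this message.

```python
class Node:
    """Toy node object with just enough state for the illustration."""

    def __init__(self, revision: int):
        self.revision = revision
        self.driver_internal_info = {}
        self.saved = False

    def save(self) -> None:
        self.saved = True


class Task:
    """Toy task whose upgrade_lock() swaps in a reloaded node."""

    def __init__(self, node: Node):
        self.node = node

    def upgrade_lock(self) -> None:
        # Simulates the reload from the DB: task.node is a new object.
        self.node = Node(self.node.revision + 1)


def record_result(task: Task, key: str, value: str) -> Node:
    """Write to the node only after re-reading it from the task."""
    task.upgrade_lock()
    node = task.node  # Re-read: an earlier `node` variable would be stale.
    node.driver_internal_info[key] = value
    node.save()
    return node
```

Any reference taken before `upgrade_lock()` keeps pointing at the old object, so changes written through it never reach the node that is actually saved.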

Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com>
Change-Id: Iebef487f7411845958fee19381e58eb448894d82
(cherry picked from commit 54bb9cc)
@jacob-anders
Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Feb 5, 2026
@openshift-ci openshift-ci bot requested review from dtantsur and elfosardo February 5, 2026 18:32
@openshift-ci

openshift-ci bot commented Feb 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jacob-anders
Once this PR has been reviewed and has the lgtm label, please assign dtantsur for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci bot commented Feb 5, 2026

@jacob-anders: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/pep8 | 1e002e3 | link | true | `/test pep8` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@iurygregory

Do we need efeabac?

5 participants