Resolves a bug where firmware updates fail intermittently on some hardware models due to invalid or unstable BMC responses immediately after firmware update completion. The BMC may return inconsistent responses for a period after firmware updates, causing the update process to fail prematurely. This change adds comprehensive BMC state validation that requires multiple consecutive successful responses from System, Manager, and NetworkAdapters resources before considering the firmware update complete. This ensures the BMC has fully stabilized before proceeding. Generated-By: Claude Code Sonnet 4 Change-Id: I5cb72f62d3fc62c3ad750c62924842cef59e79b8 Signed-off-by: Jacob Anders <janders@redhat.com> (cherry picked from commit 85ec9d6)
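The "multiple consecutive successful responses" requirement described above can be sketched roughly as follows. This is a minimal illustration, not the code from the change: `wait_for_stable_bmc`, its parameters, and the probe callables (which stand in for GETs against the System, Manager, and NetworkAdapters resources) are all hypothetical names.

```python
import time
from typing import Callable, Sequence


def wait_for_stable_bmc(
    probes: Sequence[Callable[[], object]],
    required_successes: int = 3,
    max_attempts: int = 20,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once all probes succeed `required_successes` rounds in a row.

    Each probe is a hypothetical callable that raises on a bad BMC response.
    Any failure resets the streak, so an unstable BMC never counts as ready.
    """
    consecutive = 0
    for _ in range(max_attempts):
        try:
            for probe in probes:
                probe()
        except Exception:
            # One bad response invalidates everything seen so far.
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= required_successes:
                return True
        sleep(interval)
    return False
```

Resetting the counter on any failure is the point of the design: a BMC that answers correctly once and then returns an error again is still treated as unstable.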
Add required boot params in Redfish calls for AsRockRack Related-Bug: #2073518 Change-Id: I0610d488eb4392bf335464e685aaadbf28d59529 Signed-off-by: Mohamed EL HADDAD <mohamed.el-haddad@ovhcloud.com> (cherry picked from commit 54977a1)
There have been reports of firmware upgrades failing on Gen11 iLO machines with GET NetworkAdapters returning 400 responses. This change attempts to resolve this by catching the exception relevant to the fault. Change-Id: I62095c2b61d14688d2dcbcdcfd29e9391af2c0ba Signed-off-by: Jacob Anders <janders@redhat.com> (cherry picked from commit bba3041)
Use an extended timeout (300 seconds by default) for BMC firmware updates to handle BMC transitional states during the firmware update process, unless the operator specifies a different timeout. Assisted-By: Claude Code Sonnet 4 Change-Id: I2125ff4cdcbd07a89b364968dda4bb60e059121c Signed-off-by: Jacob Anders <janders@redhat.com> (cherry picked from commit fbe0e18)
Treat absent firmware package version as non-cacheable to avoid NOT NULL database constraint violation. Closes-Bug: #2130990 Change-Id: Ic2efaa0d53b6923908112c937957a60aa4f1ad9d Signed-off-by: Afonne-CID <afonnepaulc@gmail.com> (cherry picked from commit 5563e52)
On most hardware platforms, each firmware component that can be updated has different reboot requirements. In addition, some platforms are particularly sensitive to reboots happening at the expected time. This change attempts to make the reboot behavior dependent on the component being updated in the _execute_firmware_update method, so that it works for multi-component scenarios. Assisted-By: Claude Code Sonnet 4.5 Change-Id: Ie4fe72406e3aedb8af246703f13f41e31866f58c Signed-off-by: Jacob Anders <janders@redhat.com> Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com> (cherry picked from commit d90824a)
Some BMCs (particularly HPE iLO) may return is_processing=False while the firmware update task is still in RUNNING, STARTING, or PENDING state. The previous code incorrectly treated this as task completion and entered the completion handler, which only recognizes COMPLETED as success. This resulted in firmware updates being marked as failed with blank error messages when the BMC had no error messages to report for an ongoing task. Assisted-By: Claude Code Sonnet 4.5 Closes-Bug: #2136089 Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com> Change-Id: I8b61fea63b8af0cf4c3245758538eeb36a7a5b04 (cherry picked from commit 326c4e9) (cherry picked from commit f6231ae)
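The distinction this commit draws can be sketched as a small decision function. This is a hedged illustration only: the `TaskState` enum and `classify_task` are hypothetical names, and the real code works on the task objects returned by the Redfish library rather than on these stand-ins.

```python
from enum import Enum


class TaskState(Enum):
    """Hypothetical stand-in for the Redfish task states named above."""
    PENDING = "Pending"
    STARTING = "Starting"
    RUNNING = "Running"
    COMPLETED = "Completed"
    EXCEPTION = "Exception"


# States in which the task is still in flight, whatever is_processing says.
IN_PROGRESS = {TaskState.PENDING, TaskState.STARTING, TaskState.RUNNING}


def classify_task(is_processing: bool, state: TaskState) -> str:
    """Return 'wait', 'success', or 'failed' for one poll of the task."""
    if is_processing or state in IN_PROGRESS:
        # Keep polling: is_processing=False alone does not mean "done".
        return "wait"
    if state is TaskState.COMPLETED:
        return "success"
    return "failed"
```

Checking the task state before entering the completion path keeps a still-running task from being reported as a failure with an empty error message.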
Some BMCs (particularly HPE iLO) complete BIOS firmware update tasks very quickly (within 20-30 seconds) by staging the firmware, but the firmware is only applied on the next reboot. If the task completes before Ironic's first poll (which happens ~60 seconds after task creation), the _handle_bios_task_starting() method never runs and no reboot is triggered.
This results in:
- BIOS firmware staged but not applied
- Code incorrectly logs "System was already rebooted"
- BIOS version remains unchanged
Observed behavior:
- HPE iLO: Task created at 20:56:23, completed at 20:56:46 (23s)
- First poll at 20:57:24 found task already completed
- No reboot was triggered, BIOS firmware remained at old version
Assisted-By: Claude Code Sonnet 4.5 Closes-Bug: #2136088 Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com> Change-Id: Idb2cc5a6a1f9415f1ad5b5e36616abe44cb51861 (cherry picked from commit 2306d90) (cherry picked from commit ba35f1a)
This lowers the log level from warning to debug when NetworkAdapters are missing from a Redfish BMC. This resolves an issue where noisy warnings were logged for hardware without Redfish NetworkAdapters support. Generated-by: Claude-code 2.0 Closes-bug: #2133727 Signed-off-by: Jay Faulkner <jay@jvf.cc> Change-Id: If48757c6ec4a1f7978bd973830020161c55922e4 (cherry picked from commit 18bedb6) (cherry picked from commit 4cbcb93)
The _validate_resources_stability() function was only catching sushy.exceptions.BadRequestError, but Dell iDRAC can return sushy.exceptions.ServerSideError (HTTP 500) and sushy.exceptions.ConnectionError when BMC resources are temporarily unavailable during firmware updates. When these uncaught exceptions occurred, they would crash the periodic task and prevent notify_conductor_resume_service() from being called, leaving the node stuck in servicing state even though the firmware update had completed successfully. Closes-Bug: #2136087 Change-Id: I3b5806cc4fb055d4264bb6ae9008f57d8c1e0cc1 Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com> (cherry picked from commit dce1907) (cherry picked from commit 11b9722)
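A rough sketch of the broadened exception handling follows. To keep it self-contained, the three exception classes are local stand-ins named after the sushy exceptions mentioned above, and `resources_respond` is a hypothetical helper, not the body of `_validate_resources_stability()`.

```python
class BadRequestError(Exception):
    """Stand-in for sushy.exceptions.BadRequestError."""


class ServerSideError(Exception):
    """Stand-in for sushy.exceptions.ServerSideError (HTTP 500)."""


class BmcConnectionError(Exception):
    """Stand-in for sushy.exceptions.ConnectionError."""


# Errors treated as "BMC temporarily unavailable" rather than fatal.
TRANSIENT_ERRORS = (BadRequestError, ServerSideError, BmcConnectionError)


def resources_respond(fetchers) -> bool:
    """Return False instead of raising when a BMC resource is unavailable.

    `fetchers` is a hypothetical list of callables, one per Redfish resource.
    Swallowing the transient errors lets the periodic task carry on and try
    again on its next run instead of crashing.
    """
    try:
        for fetch in fetchers:
            fetch()
    except TRANSIENT_ERRORS:
        return False
    return True
```

Errors outside the tuple still propagate, so genuine programming errors are not hidden by the wider catch.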
Service and clean steps that have requires_ramdisk=False operate independently of the ramdisk agent (e.g., Redfish firmware updates). These steps should not be subject to heartbeat timeouts since they do not require the agent to be running. Previously, the _check_servicewait_timeouts and _check_cleanwait_timeouts periodic tasks would time out and fail these steps if no agent heartbeat was received for service_callback_timeout (default 30 minutes), even though the step was successfully executing via out-of-band mechanisms. This caused firmware updates taking longer than 30 minutes to fail with: "Timeout reached while servicing the node. Please check if the ramdisk responsible for the servicing is running on the node." The error message was particularly misleading because the step explicitly declared requires_ramdisk=False, meaning no ramdisk was expected. Closes-Bug: #2136276 Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com> Signed-off-by: Jacob Anders <janders@redhat.com> Assisted-by: Claude (Anthropic) version 4.5 Change-Id: I1cade32e1dce57441e83cbc9f0b07d9ee5e0ec01 (cherry picked from commit 54ca5a3) (cherry picked from commit c05c035) (cherry picked from commit 6d3bec3)
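The rule described here can be sketched as a single predicate. This is an illustration under assumptions: the step is modelled as a plain dict, `is_wait_timed_out` is a hypothetical name, and steps with no `requires_ramdisk` key are assumed to need the ramdisk.

```python
def is_wait_timed_out(
    step: dict,
    seconds_since_heartbeat: float,
    callback_timeout: float = 1800,  # 30 minutes, the default quoted above
) -> bool:
    """Decide whether a waiting step should be failed for a missing heartbeat.

    Steps that declare requires_ramdisk=False run out-of-band, so the absence
    of agent heartbeats says nothing about their progress and is ignored.
    """
    if not step.get("requires_ramdisk", True):
        return False
    return seconds_since_heartbeat > callback_timeout
```

For example, a Redfish firmware update step that has been running for 45 minutes with no heartbeat is left alone, while an agent-based step in the same situation is timed out.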
After firmware updates complete and the system reboots, Ironic attempts to tear down from service by resetting the boot device. However, HPE iLO BMCs reject boot device modifications while the system is in POST (Power-On Self-Test), returning an UnableToModifyDuringSystemPOST error. This change adds retry functionality to assist the operators dealing with such scenarios. Closes-Bug: #2136275 Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com> Signed-off-by: Jacob Anders <janders@redhat.com> Assisted-By: Claude Opus 4.5 Change-Id: I714030f433d6730a99f9f68cf60ce330e9d43c76 (cherry picked from commit 2d253a0) (cherry picked from commit 9a4788b) (cherry picked from commit 6809dd2)
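The retry behavior could look roughly like the sketch below. It makes several assumptions: `UnableToModifyDuringPost` is a local stand-in for the error the BMC returns, and `set_boot_device_with_retry`, its attempt count, and its delay are hypothetical and not taken from the change.

```python
import time
from typing import Callable


class UnableToModifyDuringPost(Exception):
    """Stand-in for the UnableToModifyDuringSystemPOST error from the BMC."""


def set_boot_device_with_retry(
    set_boot_device: Callable[[], object],
    attempts: int = 5,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call `set_boot_device`, retrying while the system is still in POST."""
    for attempt in range(1, attempts + 1):
        try:
            return set_boot_device()
        except UnableToModifyDuringPost:
            if attempt == attempts:
                # Out of retries: surface the original error to the caller.
                raise
            sleep(delay)
```

Only the POST-specific error is retried; any other failure is raised immediately so that unrelated problems are not masked by the retry loop.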
Follow-up to https://review.opendev.org/c/openstack/ironic/+/966344
- Add constants for temporary field names to prevent typos
- Add _clean_temp_fields() helper for centralized cleanup
- Move BMC check start time from driver_internal_info to settings dict
- Change task completion log from INFO to DEBUG
Assisted-By: Claude Opus 4.5 Change-Id: I9394fc3ada08f14ad44dc59839f4f6b97f63a1f5 Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me> (cherry picked from commit 07102b6)
This commit changes the default values for the following config options:
- [redfish]firmware_update_wait_unresponsive_bmc
- [redfish]firmware_update_resource_validation_timeout
Closes-Bug: #2139122 Change-Id: Ie964df68e3bf4e0421edf4fedab7748603baf4e0 Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com> (cherry picked from commit cb7e0a5)
We saw cases in our testing where the newer information of the node object wasn't being saved. The task.upgrade_lock call reloads the node from the DB, but the node variable would still point to the old node object. Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com> Change-Id: Iebef487f7411845958fee19381e58eb448894d82 (cherry picked from commit 54bb9cc)
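The stale-reference problem can be reproduced with a toy model. The `Node` and `Task` classes below are simplified, hypothetical stand-ins for Ironic's objects; the only behavior borrowed from the commit message is that `upgrade_lock()` replaces `task.node` with a freshly loaded object.

```python
class Node:
    """Toy node: a snapshot of a DB row taken at load time."""

    def __init__(self, row: dict):
        self.firmware_version = row["firmware_version"]


class Task:
    """Toy task whose upgrade_lock() reloads the node from the 'DB'."""

    def __init__(self, db_row: dict):
        self._db_row = db_row
        self.node = Node(db_row)

    def upgrade_lock(self) -> None:
        # Reloading creates a NEW Node object; old references are not updated.
        self.node = Node(self._db_row)


def demo():
    row = {"firmware_version": "1.0"}
    task = Task(row)
    node = task.node                  # local alias taken before the reload
    row["firmware_version"] = "2.0"   # the DB row changes in the meantime
    task.upgrade_lock()
    stale = node.firmware_version     # the alias still sees the old snapshot
    node = task.node                  # the fix: re-read the reference
    fresh = node.firmware_version
    return stale, fresh
```

Re-reading `task.node` after `upgrade_lock()` is the whole fix in this model: the local alias is just a reference to the pre-reload object.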
/hold
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jacob-anders
@jacob-anders: The following test failed, say
Do we need efeabac?
No description provided.