test(opcua): make secured A&C test deterministic under load#504
Open
mfaferek93 wants to merge 2 commits into
Open
test(opcua): make secured A&C test deterministic under load#504mfaferek93 wants to merge 2 commits into
mfaferek93 wants to merge 2 commits into
Conversation
The secured Alarms and Conditions integration test flaked on loaded runners because the plugin started its poller before the ReportFault service client had matched fault_manager over DDS. AlarmCondition notifications are one-shot and the report is a fire-and-forget async request, so one emitted before that match is dropped with no retry and the alarm never surfaces. Wait (bounded) for the fault sink to be discovered before starting the poller so the first report cannot be lost.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR makes the secured OPC-UA Alarms & Conditions integration path deterministic under load by eliminating a startup ordering race between the OPC-UA poller’s one-shot alarm reporting and DDS discovery of the fault_manager ReportFault service.
Changes:
- Add a bounded (10s) best-effort wait for
/fault_manager/report_faultdiscovery before starting the OPC-UA poller. - Warn when the fault sink is not discovered within the timeout so operators understand alarms may be dropped until it appears.
The wait_for_service timeout was duplicated as a literal 10s in the warning text, so changing it would leave the log stale. Extract it into a named constant and build the message from it, keeping value and text in sync. No behaviour change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
test_opcua_secured(secured Alarms and Conditions integration test) has beenflaking on loaded runners (Unit tests lyrical, build-and-test humble),
intermittently failing with "alarm did not surface as a CONFIRMED fault" or a
hard ctest timeout. Because it runs in the shared suite, the flake was failing
CI on unrelated open-core PRs this cycle. An earlier change bumped the
fault-surface deadline 40s to 120s and it still flaked, which is the tell that
the alarm report is genuinely lost, not merely slow.
Root cause (a startup ordering race, not a timeout)
The OPC-UA plugin creates its
ReportFaultservice client and then immediatelystarts the poller that detects AlarmCondition state and reports faults. The
alarm report is a one-shot, fire-and-forget async request:
transition; re-observations of the same active state are suppressed as NoOp.
report_faultis sent withasync_send_requestand never retried.So if the poller emits that report before the client has matched fault_manager
over DDS, the request is dropped with no matched service and the alarm never
surfaces. DDS discovery lag is worst exactly when the host is under load, which
is why the flake concentrated on loaded runners. Widening the deadline cannot
help once the single report has been dropped.
Fix
Wait (bounded, 10s) for the
ReportFaultservice to be discovered beforestarting the poller, so the first report cannot be emitted before the fault
sink is matched. This is a deterministic ordering fix, not a bigger timeout. It
is also a production robustness improvement: an alarm that is already active at
startup (replayed via ConditionRefresh) or one that fires immediately is no
longer silently lost. The wait is best-effort so a deployment without a
fault_manager still starts after the timeout.
Verification
Deterministic causal reproduction (the alarm is active before the gateway
connects and fault_manager is started 7s late, so the report is emitted during
startup replay, before discovery):
poller started~1ms after connect and
AlarmCondition CONFIRMED~0.5s after connect, i.e.the report is emitted well before fault_manager exists, then dropped.
poller startedmoves to right afterfault_manager is discovered (connect + ~6.9s), so the report lands.
Stress run: the real (unmodified)
test_opcua_securedwas run 15 consecutivetimes on a 2-CPU-constrained container under additional CPU load: 15/15 passed.
The opcua package unit tests and linters also pass unchanged (the wait is
skipped when the plugin has no ROS node, as in the unit tests).