test(opcua): make secured A&C test deterministic under load by mfaferek93 · Pull Request #504 · selfpatch/ros2_medkit

mfaferek93 · 2026-07-04T20:12:41Z

Problem

test_opcua_secured (secured Alarms and Conditions integration test) has been
flaking on loaded runners (Unit tests lyrical, build-and-test humble),
intermittently failing with "alarm did not surface as a CONFIRMED fault" or a
hard ctest timeout. Because it runs in the shared suite, the flake was failing
CI on unrelated open-core PRs this cycle. An earlier change bumped the
fault-surface deadline 40s to 120s and it still flaked, which is the tell that
the alarm report is genuinely lost, not merely slow.

Root cause (a startup ordering race, not a timeout)

The OPC-UA plugin creates its ReportFault service client and then immediately
starts the poller that detects AlarmCondition state and reports faults. The
alarm report is a one-shot, fire-and-forget async request:

An AlarmCondition surfaces a fault exactly once, on the inactive to active
transition; re-observations of the same active state are suppressed as NoOp.
report_fault is sent with async_send_request and never retried.

So if the poller emits that report before the client has matched fault_manager
over DDS, the request is dropped with no matched service and the alarm never
surfaces. DDS discovery lag is worst exactly when the host is under load, which
is why the flake concentrated on loaded runners. Widening the deadline cannot
help once the single report has been dropped.

Fix

Wait (bounded, 10s) for the ReportFault service to be discovered before
starting the poller, so the first report cannot be emitted before the fault
sink is matched. This is a deterministic ordering fix, not a bigger timeout. It
is also a production robustness improvement: an alarm that is already active at
startup (replayed via ConditionRefresh) or one that fires immediately is no
longer silently lost. The wait is best-effort so a deployment without a
fault_manager still starts after the timeout.

Verification

Deterministic causal reproduction (the alarm is active before the gateway
connects and fault_manager is started 7s late, so the report is emitted during
startup replay, before discovery):

Without the fix: the alarm is lost 4/4. The gateway logs poller started
~1ms after connect and AlarmCondition CONFIRMED ~0.5s after connect, i.e.
the report is emitted well before fault_manager exists, then dropped.
With the fix: the alarm surfaces 4/4. poller started moves to right after
fault_manager is discovered (connect + ~6.9s), so the report lands.

Stress run: the real (unmodified) test_opcua_secured was run 15 consecutive
times on a 2-CPU-constrained container under additional CPU load: 15/15 passed.
The opcua package unit tests and linters also pass unchanged (the wait is
skipped when the plugin has no ROS node, as in the unit tests).

The secured Alarms and Conditions integration test flaked on loaded runners because the plugin started its poller before the ReportFault service client had matched fault_manager over DDS. AlarmCondition notifications are one-shot and the report is a fire-and-forget async request, so one emitted before that match is dropped with no retry and the alarm never surfaces. Wait (bounded) for the fault sink to be discovered before starting the poller so the first report cannot be lost.

Copilot

Pull request overview

This PR makes the secured OPC-UA Alarms & Conditions integration path deterministic under load by eliminating a startup ordering race between the OPC-UA poller’s one-shot alarm reporting and DDS discovery of the fault_manager ReportFault service.

Changes:

Add a bounded (10s) best-effort wait for /fault_manager/report_fault discovery before starting the OPC-UA poller.
Warn when the fault sink is not discovered within the timeout so operators understand alarms may be dropped until it appears.

The wait_for_service timeout was duplicated as a literal 10s in the warning text, so changing it would leave the log stale. Extract it into a named constant and build the message from it, keeping value and text in sync. No behaviour change.

Copilot AI review requested due to automatic review settings July 4, 2026 20:12

Copilot started reviewing on behalf of mfaferek93 July 4, 2026 20:13 View session

mfaferek93 self-assigned this Jul 4, 2026

Copilot AI reviewed Jul 4, 2026

View reviewed changes

Comment thread src/ros2_medkit_plugins/ros2_medkit_opcua/src/opcua_plugin.cpp Outdated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(opcua): make secured A&C test deterministic under load#504

test(opcua): make secured A&C test deterministic under load#504
mfaferek93 wants to merge 2 commits into
mainfrom
fix/opcua-secured-test-determinism

mfaferek93 commented Jul 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mfaferek93 commented Jul 4, 2026

Problem

Root cause (a startup ordering race, not a timeout)

Fix

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants