Skip to content

test(opcua): make secured A&C test deterministic under load#504

Open
mfaferek93 wants to merge 2 commits into
mainfrom
fix/opcua-secured-test-determinism
Open

test(opcua): make secured A&C test deterministic under load#504
mfaferek93 wants to merge 2 commits into
mainfrom
fix/opcua-secured-test-determinism

Conversation

@mfaferek93

Copy link
Copy Markdown
Collaborator

Problem

test_opcua_secured (secured Alarms and Conditions integration test) has been
flaking on loaded runners (Unit tests lyrical, build-and-test humble),
intermittently failing with "alarm did not surface as a CONFIRMED fault" or a
hard ctest timeout. Because it runs in the shared suite, the flake was failing
CI on unrelated open-core PRs this cycle. An earlier change bumped the
fault-surface deadline 40s to 120s and it still flaked, which is the tell that
the alarm report is genuinely lost, not merely slow.

Root cause (a startup ordering race, not a timeout)

The OPC-UA plugin creates its ReportFault service client and then immediately
starts the poller that detects AlarmCondition state and reports faults. The
alarm report is a one-shot, fire-and-forget async request:

  • An AlarmCondition surfaces a fault exactly once, on the inactive to active
    transition; re-observations of the same active state are suppressed as NoOp.
  • report_fault is sent with async_send_request and never retried.

So if the poller emits that report before the client has matched fault_manager
over DDS, the request is dropped with no matched service and the alarm never
surfaces. DDS discovery lag is worst exactly when the host is under load, which
is why the flake concentrated on loaded runners. Widening the deadline cannot
help once the single report has been dropped.

Fix

Wait (bounded, 10s) for the ReportFault service to be discovered before
starting the poller, so the first report cannot be emitted before the fault
sink is matched. This is a deterministic ordering fix, not a bigger timeout. It
is also a production robustness improvement: an alarm that is already active at
startup (replayed via ConditionRefresh) or one that fires immediately is no
longer silently lost. The wait is best-effort so a deployment without a
fault_manager still starts after the timeout.

Verification

Deterministic causal reproduction (the alarm is active before the gateway
connects and fault_manager is started 7s late, so the report is emitted during
startup replay, before discovery):

  • Without the fix: the alarm is lost 4/4. The gateway logs poller started
    ~1ms after connect and AlarmCondition CONFIRMED ~0.5s after connect, i.e.
    the report is emitted well before fault_manager exists, then dropped.
  • With the fix: the alarm surfaces 4/4. poller started moves to right after
    fault_manager is discovered (connect + ~6.9s), so the report lands.

Stress run: the real (unmodified) test_opcua_secured was run 15 consecutive
times on a 2-CPU-constrained container under additional CPU load: 15/15 passed.
The opcua package unit tests and linters also pass unchanged (the wait is
skipped when the plugin has no ROS node, as in the unit tests).

The secured Alarms and Conditions integration test flaked on loaded runners
because the plugin started its poller before the ReportFault service client had
matched fault_manager over DDS. AlarmCondition notifications are one-shot and
the report is a fire-and-forget async request, so one emitted before that match
is dropped with no retry and the alarm never surfaces. Wait (bounded) for the
fault sink to be discovered before starting the poller so the first report
cannot be lost.
Copilot AI review requested due to automatic review settings July 4, 2026 20:12
@mfaferek93 mfaferek93 self-assigned this Jul 4, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes the secured OPC-UA Alarms & Conditions integration path deterministic under load by eliminating a startup ordering race between the OPC-UA poller’s one-shot alarm reporting and DDS discovery of the fault_manager ReportFault service.

Changes:

  • Add a bounded (10s) best-effort wait for /fault_manager/report_fault discovery before starting the OPC-UA poller.
  • Warn when the fault sink is not discovered within the timeout so operators understand alarms may be dropped until it appears.

Comment thread src/ros2_medkit_plugins/ros2_medkit_opcua/src/opcua_plugin.cpp Outdated
The wait_for_service timeout was duplicated as a literal 10s in the
warning text, so changing it would leave the log stale. Extract it into
a named constant and build the message from it, keeping value and text
in sync. No behaviour change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants