fix: widen ActorSelection SendConsistencySpec timeout to 60s #3090

Closed
He-Pin wants to merge 1 commit into main from fix/actor-selection-consistency-timeout

@He-Pin He-Pin commented Jun 19, 2026

Member

Motivation

The "must be able to send messages with actorSelection concurrently preserving order" test flakes on CI. Four sender actors each drive 1000 round trips via ActorSelection (4004 messages total); every message is wrapped in an ActorSelectionMessage and incurs path traversal overhead on the remote side. On CI with -Dpekko.test.timefactor=2, within(30.seconds) dilates to 60s, which is insufficient under load: 3 of 4 senders complete, but the 4th times out waiting for the final success2 message.

Investigation confirmed no recent code change caused this:

  • Artery does not use stage actors → LazyDispatch (ba4e950) has zero impact
  • Materializer wiring optimization (03ebaf5) affects materialization only, not runtime message flow
  • GraphInterpreter pendingFinalization (a505843) is a hot-path optimization, no behavioral change

Modification

Bump within(30.seconds) to within(60.seconds) at line 219 for the ActorSelection test only. The ActorRef variant (line 179) is unchanged as it passes within the original budget.

Result

Up to 120s of dilated wall-clock headroom on CI for the ActorSelection variant; still fails fast on genuine deadlocks.
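The dilation arithmetic behind these budgets can be sketched as follows. This is a minimal illustration using only the Scala standard library; `dilate` is a hypothetical helper standing in for the testkit's own dilation, and the factor of 2 is the CI value quoted above.

```scala
import scala.concurrent.duration._

object TimeFactorSketch {
  // Hypothetical stand-in for the testkit's dilation: scale a duration by
  // the configured time factor (pekko.test.timefactor in the PR text).
  def dilate(d: FiniteDuration, timeFactor: Double): FiniteDuration =
    Duration.fromNanos((d.toNanos * timeFactor).round)

  def main(args: Array[String]): Unit = {
    val ciFactor = 2.0 // assumed from -Dpekko.test.timefactor=2
    // Wall-clock budget in seconds before the change (within(30.seconds))
    println(dilate(30.seconds, ciFactor).toSeconds)
    // Wall-clock budget in seconds after the change (within(60.seconds))
    println(dilate(60.seconds, ciFactor).toSeconds)
  }
}
```

With a factor of 2, the old budget works out to 60s and the new one to 120s, which matches the figures in the description.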

Tests

  • sbt "remote / Test / compile" — passes
  • CI Check / Test will exercise the artery variants

References

Refs #3041 (previous timeout widening from 10s to 30s)

@He-Pin He-Pin requested a review from pjfanning June 19, 2026 09:39
@pjfanning

Member

The test is now failing on Scala 3.3.

He-Pin commented Jun 19, 2026

Member Author

Closing in favor of #3092 which fixes the root cause.

The timeout widening only masks the symptom. The actual issue is that selectQueue in Association.send uses the anchor's UID (root guardian, always 0) as the distribution key for outbound lanes. With outbound-lanes > 1, all ActorSelection messages concentrate on a single lane (math.abs(0 % N) = 0), creating a throughput bottleneck.
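A minimal sketch of why a constant key collapses onto one lane, assuming the lane index is computed as in the expression quoted above. The names `laneByUid` and `rootGuardianUid` are hypothetical; this is not the actual `selectQueue` code.

```scala
object UidLaneSketch {
  // Hypothetical simplification of the lane choice described above:
  // the key is the anchor's UID, reduced modulo the number of lanes.
  def laneByUid(anchorUid: Int, lanes: Int): Int =
    math.abs(anchorUid % lanes)

  def main(args: Array[String]): Unit = {
    val rootGuardianUid = 0 // per the comment above, always 0
    // Whatever the lane count, a key of 0 selects lane 0.
    (2 to 8).foreach { n =>
      println(s"lanes=$n -> lane ${laneByUid(rootGuardianUid, n)}")
    }
  }
}
```

Because the key never varies, adding lanes does not spread the ActorSelection traffic at all.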

#3092 distributes ActorSelection messages across lanes based on their target path hash instead, eliminating the structural bottleneck.
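The idea can be sketched as below, under stated assumptions: the key is the hash of the target path elements, and the sign bit is cleared with `& Int.MaxValue` as the referenced commit message describes. `laneFor` and `laneByPath` are hypothetical names and the use of `Seq[String].hashCode` is an assumption for illustration, not the code in #3092.

```scala
object PathHashLaneSketch {
  // Clear the sign bit before the modulo so the index is never negative.
  def laneFor(hash: Int, lanes: Int): Int =
    (hash & Int.MaxValue) % lanes

  // Assumption: the target path is available as a sequence of elements.
  def laneByPath(elements: Seq[String], lanes: Int): Int =
    laneFor(elements.hashCode, lanes)

  def main(args: Array[String]): Unit = {
    val lanes = 3
    // math.abs(Int.MinValue) overflows and stays negative, so an
    // abs-then-modulo scheme can yield a negative index.
    println(math.abs(Int.MinValue) % lanes)
    // The masked variant stays within 0 until lanes.
    println(laneFor(Int.MinValue, lanes))
    // The same path always maps to the same lane, which keeps per-path order.
    val path = Seq("user", "echo")
    println(laneByPath(path, lanes) == laneByPath(path, lanes))
  }
}
```

Masking rather than calling `math.abs` on the hash is the safer choice because `math.abs(Int.MinValue)` is itself negative on the JVM.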

@He-Pin He-Pin closed this Jun 19, 2026
He-Pin added a commit that referenced this pull request Jun 19, 2026
… by target path

Motivation:
With multi-lane artery config (outbound-lanes > 1), all ActorSelection
messages were routed to the same outbound queue because selectQueue used
the anchor's UID (root guardian, always 0) as the distribution key:
math.abs(0 % N) = 0 for any N. This concentrated all ActorSelection
traffic on a single lane, creating a throughput bottleneck.

Modification:
Handle ActorSelectionMessage in a dedicated case that distributes across
lanes based on the target path elements hash instead of the anchor's UID.
PriorityMessage ActorSelection (cluster heartbeats) continues to use the
control queue. Uses (hash & Int.MaxValue) to guard against
Integer.MIN_VALUE producing a negative queue index.

Result:
ActorSelection messages are distributed across all outbound lanes by
target path. Per-path message ordering is preserved (same path → same
lane). PriorityMessage routing and all other message types are unaffected.

Tests:
- sbt "remote / Test / compile" — passes
- sbt "remote / Test / testOnly *ActorSelectionQueueDistributionSpec" — 5/5
- CI will exercise the artery variants

References:
Refs #3041, supersedes #3090