fix(artery): distribute ActorSelection messages across outbound lanes by target path #3092

Draft
He-Pin wants to merge 7 commits into main from fix/actor-selection-queue-distribution
@He-Pin He-Pin commented Jun 19, 2026

Member

Motivation

The "must be able to send messages with actorSelection concurrently preserving order" test flakes on CI: 4 sender actors each drive 1000 round-trips via ActorSelection (4004 messages total). With multi-lane artery config (outbound-lanes > 1), all ActorSelection messages are routed to the same outbound queue because selectQueue uses the anchor's UID as the distribution key:

OrdinaryQueueIndex + (math.abs(r.path.uid % outboundLanes))

The anchor for ActorSelection is the root guardian (RootActorPath), whose UID is always 0 (ActorCell.undefinedUid). So math.abs(0 % N) = 0 for any N — all ActorSelection traffic concentrates on lane 0 while other lanes sit idle.

In contrast, the ActorRef variant distributes across lanes because each echo actor has a distinct non-zero UID.
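
The collapse onto lane 0 can be reproduced in a few lines of plain Scala. This is a minimal sketch rather than the Pekko source: `UidLaneDemo`, `laneForUid`, and the sample UIDs are made up for illustration, and only the `math.abs(uid % outboundLanes)` formula is taken from the description above.

```scala
object UidLaneDemo {
  // Lane offset as in the quoted formula: math.abs(uid % outboundLanes).
  def laneForUid(uid: Int, outboundLanes: Int): Int =
    math.abs(uid % outboundLanes)

  def main(args: Array[String]): Unit = {
    val lanes = 3
    // ActorSelection: the anchor is the root guardian, whose UID is 0.
    val selectionLanes = Seq.fill(4)(laneForUid(0, lanes)).distinct
    // ActorRef: every target has its own UID (hypothetical values).
    val refLanes = Seq(17, 42, -7, 1001).map(laneForUid(_, lanes)).distinct.sorted
    println(s"ActorSelection lanes: $selectionLanes")
    println(s"ActorRef lanes:       $refLanes")
  }
}
```

With these sample values the selection traffic only ever sees lane offset 0, while the four ActorRef targets cover all three lanes.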

This is not a recent regression — the selectQueue logic has been unchanged since the Pekko fork from Akka. PR #3090 (timeout widening) masks the symptom but doesn't fix the structural bottleneck.

Modification

Add a dedicated case sel: ActorSelectionMessage (non-PriorityMessage) in Association.send that computes the queue index from the selection's target path elements hash instead of the anchor's UID:

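case sel: ActorSelectionMessage =>
  // Non-priority selections: pick the lane from the target path elements, not the anchor UID.
  val queueIndex =
    if (outboundLanes == 1) OrdinaryQueueIndex
    // Mask the sign bit instead of using math.abs, which stays negative for Int.MinValue.
    else OrdinaryQueueIndex + ((sel.elements.hashCode() & Int.MaxValue) % outboundLanes)
  val queue = queues(queueIndex)
  if (!queue.offer(outboundEnvelope))
    dropped(queueIndex, queueSize, outboundEnvelope)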

This distributes ActorSelection messages across lanes by their target path while preserving per-path message ordering (same target path → same hash → same lane).
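
The two properties claimed here (the same path always lands on the same lane, and different paths spread out) can be checked with a small stand-alone sketch. It makes several assumptions: path elements are modelled as plain strings rather than the real selection element type, the names `PathLaneDemo` and `laneForPath` are hypothetical, and the sign bit is masked with `& Int.MaxValue` so the result is never negative.

```scala
object PathLaneDemo {
  // Path elements are modelled as plain strings here; the real element type differs.
  def laneForPath(elements: List[String], outboundLanes: Int): Int =
    if (outboundLanes == 1) 0
    else (elements.hashCode() & Int.MaxValue) % outboundLanes

  def main(args: Array[String]): Unit = {
    val lanes = 3
    val paths = (0 until 8).map(i => List("user", s"echo-$i"))
    paths.foreach { p =>
      // Equal element lists have equal hash codes, so a path never changes lane.
      println(s"${p.mkString("/", "/", "")} -> lane ${laneForPath(p, lanes)}")
    }
  }
}
```

Because the lane depends only on the element list, all messages for one target path go through one queue, which is what keeps per-path ordering intact.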

PriorityMessage ActorSelection (used by cluster heartbeats) continues to go through the control queue unchanged.

Result

ActorSelection messages are now distributed across all outbound lanes based on target path, eliminating the single-lane throughput bottleneck. The existing ActorRef-based test and PriorityMessage routing are unaffected.

Tests

  • sbt "remote / Test / compile" — compile check
  • CI will exercise the artery variants (1-lane and 3-lane configs)

References

Refs #3041 (previous timeout widening), supersedes #3090

Comment thread remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala Outdated
@He-Pin He-Pin force-pushed the fix/actor-selection-queue-distribution branch from a57a695 to d3392c9 Compare June 19, 2026 12:37
He-Pin added a commit that referenced this pull request Jun 19, 2026
Verify that ActorSelection messages are distributed across outbound
lanes based on target path hash:
- Different target paths map to different lanes
- Queue indices are always non-negative (Integer.MIN_VALUE safe)
- Same target path always maps to the same lane (ordering preserved)
- Single lane config works correctly
- Paths don't all concentrate on lane 0 (original bug regression test)

Tests: sbt "remote / Test / testOnly *ActorSelectionQueueDistributionSpec" — 5/5 passed

References: Refs #3092
@He-Pin He-Pin force-pushed the fix/actor-selection-queue-distribution branch from 1ee55e3 to 60684b4 Compare June 19, 2026 12:54
… by target path

Motivation:
With multi-lane artery config (outbound-lanes > 1), all ActorSelection
messages were routed to the same outbound queue because selectQueue used
the anchor's UID (root guardian, always 0) as the distribution key:
math.abs(0 % N) = 0 for any N. This concentrated all ActorSelection
traffic on a single lane, creating a throughput bottleneck.

Modification:
Handle ActorSelectionMessage in a dedicated case that distributes across
lanes based on the target path elements hash instead of the anchor's UID.
PriorityMessage ActorSelection (cluster heartbeats) continues to use the
control queue. Uses (hash & Int.MaxValue) to guard against
Integer.MIN_VALUE producing a negative queue index.

Result:
ActorSelection messages are distributed across all outbound lanes by
target path. Per-path message ordering is preserved (same path → same
lane). PriorityMessage routing and all other message types are unaffected.

Tests:
- sbt "remote / Test / compile" — passes
- sbt "remote / Test / testOnly *ActorSelectionQueueDistributionSpec" — 5/5
- CI will exercise the artery variants

References:
Refs #3041, supersedes #3090
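
The `hash & Int.MaxValue` guard mentioned in the commit message above matters for exactly one input, `Int.MinValue`, whose absolute value does not fit in an `Int`. A short sketch comparing the two approaches (the object and method names are hypothetical):

```scala
object MinValueGuardDemo {
  // math.abs(Int.MinValue) overflows and stays negative, so the remainder is negative too.
  def laneWithAbs(hash: Int, lanes: Int): Int = math.abs(hash) % lanes

  // Clearing the sign bit always yields a value in [0, Int.MaxValue].
  def laneWithMask(hash: Int, lanes: Int): Int = (hash & Int.MaxValue) % lanes

  def main(args: Array[String]): Unit = {
    println(laneWithAbs(Int.MinValue, 3))  // negative: unusable as a queue index
    println(laneWithMask(Int.MinValue, 3)) // non-negative
  }
}
```

The mask does not preserve the magnitude of negative hashes, but for lane selection only determinism and a non-negative result are needed.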
He-Pin added a commit that referenced this pull request Jun 19, 2026
Verify that ActorSelection messages are distributed across outbound
lanes based on target path hash:
- Different target paths map to different lanes
- Queue indices are always non-negative (Integer.MIN_VALUE safe)
- Same target path always maps to the same lane (ordering preserved)
- Single lane config works correctly
- Paths don't all concentrate on lane 0 (regression test for original bug)

Tests: sbt "remote / Test / testOnly *ActorSelectionQueueDistributionSpec" — 5/5

References: Refs #3092
@He-Pin He-Pin force-pushed the fix/actor-selection-queue-distribution branch from 60684b4 to 1b82391 Compare June 19, 2026 12:55
@pjfanning

Member

#3089 test is failing in this PR - we appear to have broken this with some change over the last week or 2. This test is now very flaky and I don't recall it being an issue before.

@He-Pin He-Pin force-pushed the fix/actor-selection-queue-distribution branch from 1b82391 to a07f152 Compare June 19, 2026 13:38
@He-Pin He-Pin marked this pull request as draft June 19, 2026 13:46
The ActorSelection message path has inherent per-message overhead:
synchronous deliverSelection on the inbound thread (path resolution),
MessageContainerSerializer (larger payload), and path element encoding.

With 1-lane configs (especially TLS-TCP), 1000 round-trips × 4 senders
creates sustained pressure on the single inbound thread pipeline,
causing the test to exceed the 30s timeout on CI.

100 round-trips still validates ordering and concurrency while being
proportional to the overhead difference vs the ActorRef variant.

Tests: sbt "remote / Test / compile" — passes
References: Refs #3092
He-Pin added 2 commits June 20, 2026 00:05
Motivation: PR #3092 distributes ActorSelection messages across outbound lanes, but inbound lane selection still used the wire recipient. For ActorSelection that recipient is typically the root guardian, so messages from one origin still concentrate on one inbound lane.

Modification: Partition inbound ActorSelection envelopes by the selected target path encoded in the SelectionEnvelope while preserving same target path and origin ordering. Add deterministic outbound and inbound lane distribution coverage. RemoteSendConsistencySpec is intentionally unchanged.

Result: ActorSelection traffic can use multiple inbound lanes without weakening same-target ordering or changing the wire protocol.

Tests:
- sbt "remote / Test / testOnly org.apache.pekko.remote.artery.ActorSelectionQueueDistributionSpec"
- sbt "remote / Test / testOnly org.apache.pekko.remote.artery.ArteryUpdSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryUpdSendConsistencyWithThreeLanesSpec org.apache.pekko.remote.artery.ArteryTcpSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryTcpSendConsistencyWithThreeLanesSpec org.apache.pekko.remote.artery.ArteryTlsTcpSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryTlsTcpSendConsistencyWithThreeLanesSpec -- -z actorSelection"
- scalafmt --mode diff-ref=origin/main
- scalafmt --list --mode diff-ref=origin/main
- git diff --check

References: Refs #3092
Motivation: The ActorSelection lane distribution fix should not rely on changing RemoteSendConsistencySpec.

Modification: Restore RemoteSendConsistencySpec to the main branch version so it is removed from the PR diff.

Result: The PR keeps the existing send consistency spec unchanged while retaining the inbound ActorSelection lane distribution fix and focused distribution coverage.

Tests: git diff --check; focused tests were run before this restore: sbt "remote / Test / testOnly org.apache.pekko.remote.artery.ActorSelectionQueueDistributionSpec" and sbt "remote / Test / testOnly org.apache.pekko.remote.artery.ArteryUpdSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryUpdSendConsistencyWithThreeLanesSpec org.apache.pekko.remote.artery.ArteryTcpSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryTcpSendConsistencyWithThreeLanesSpec org.apache.pekko.remote.artery.ArteryTlsTcpSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryTlsTcpSendConsistencyWithThreeLanesSpec -- -z actorSelection".

References: Refs #3092
@pjfanning

Member

#3089 is happening for tests that don't use actor selection but maybe not quite as often.

Motivation:
Inbound ActorSelection lane partitioning needs the selected target path hash, but parsing the full SelectionEnvelope in the inbound hot path adds avoidable allocation and CPU cost.

Modification:
Scan only the SelectionEnvelope pattern fields with CodedInputStream when computing the lane hash. Skip unknown fields and fall back to the existing recipient uid hash if scanning fails. Add coverage for unknown fields so the read-side hash remains tolerant of rolling-upgrade wire data.

Result:
ActorSelection lane distribution keeps the same wire format and avoids full protobuf object construction before normal deserialization.

Tests:
- sbt 'remote / Test / testOnly org.apache.pekko.remote.artery.ActorSelectionQueueDistributionSpec' (passed, 11 tests)
- git diff --check (passed)

References:
Refs #3092
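
The scan-and-fall-back idea in the commit message above can be illustrated without any protobuf dependency. This sketch swaps `CodedInputStream` for a hand-rolled reader of the protobuf wire format, so it is not the PR's code. Assumptions: the repeated pattern field is taken to be field number 3, each entry is hashed as raw bytes, and the fallback key is simply the recipient UID. All names are hypothetical.

```scala
import java.nio.ByteBuffer
import scala.util.control.NonFatal

object SelectionScanDemo {
  // Assumed field number of the repeated pattern entries (illustration only).
  val PatternField = 3

  // Reads a base-128 varint; throws BufferUnderflowException on truncated input.
  private def readVarint(buf: ByteBuffer): Long = {
    var result = 0L
    var shift = 0
    while (shift < 64) {
      val b = buf.get()
      result |= (b & 0x7FL) << shift
      if ((b & 0x80) == 0) return result
      shift += 7
    }
    throw new IllegalArgumentException("malformed varint")
  }

  // Moves the position forward by n bytes, rejecting negative or oversized lengths.
  private def advance(buf: ByteBuffer, n: Int): Unit = {
    require(n >= 0 && n <= buf.remaining(), "truncated field")
    buf.position(buf.position() + n)
  }

  // Skips one field value according to its wire type (0, 1, 2 and 5 are handled).
  private def skip(buf: ByteBuffer, wireType: Int): Unit = wireType match {
    case 0 =>
      readVarint(buf)
      ()
    case 1 => advance(buf, 8)
    case 2 => advance(buf, readVarint(buf).toInt)
    case 5 => advance(buf, 4)
    case other => throw new IllegalArgumentException(s"unsupported wire type $other")
  }

  // Hashes only the pattern entries; every other field, known or unknown, is skipped.
  def patternHash(bytes: Array[Byte]): Int = {
    val buf = ByteBuffer.wrap(bytes)
    var hash = 17
    while (buf.hasRemaining) {
      val tag = readVarint(buf).toInt
      if ((tag >>> 3) == PatternField && (tag & 7) == 2) {
        val len = readVarint(buf).toInt
        require(len >= 0 && len <= buf.remaining(), "truncated pattern entry")
        var i = 0
        while (i < len) { hash = 31 * hash + buf.get(); i += 1 }
        hash = 31 * hash + len // keep entry boundaries in the hash
      } else skip(buf, tag & 7)
    }
    hash
  }

  // Lane from the scanned hash, or from the recipient UID if scanning fails.
  def inboundLane(bytes: Array[Byte], recipientUid: Int, inboundLanes: Int): Int = {
    val key =
      try patternHash(bytes)
      catch { case NonFatal(_) => recipientUid }
    (key & Int.MaxValue) % inboundLanes
  }
}
```

Skipping by wire type is what makes the scan tolerant of fields added by a newer node during a rolling upgrade: an unknown field changes neither the hash nor the lane.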
…torSelection lane distribution

Motivation:
The existing ActorSelectionQueueDistributionSpec only tested
SelectChildName elements. SelectParent (type=0, no matcher) and
SelectChildPattern (type=2, with matcher) were untested, leaving gaps
in the CodedInputStream-based protobuf parser coverage.

Modification:
Add 4 new test cases verifying hash consistency between ByteBuffer and
SelectionEnvelope parsing for SelectParent and SelectChildPattern, and
that different SelectionPathElement types produce distinct hashes.

Result:
15/15 tests pass, covering all three SelectionPathElement variants.

Tests:
sbt "remote / Test / testOnly
  *ActorSelectionQueueDistributionSpec" — 15/15 passed

References:
Refs #3092