fix(artery): distribute ActorSelection messages across outbound lanes by target path #3092
Draft
He-Pin wants to merge 7 commits into
pjfanning
reviewed
Jun 19, 2026
He-Pin
commented
Jun 19, 2026
a57a695 to
d3392c9
Compare
He-Pin
added a commit
that referenced
this pull request
Jun 19, 2026
Verify that ActorSelection messages are distributed across outbound lanes based on target path hash:
- Different target paths map to different lanes
- Queue indices are always non-negative (Integer.MIN_VALUE safe)
- Same target path always maps to the same lane (ordering preserved)
- Single lane config works correctly
- Paths don't all concentrate on lane 0 (original bug regression test)

Tests: sbt "remote / Test / testOnly *ActorSelectionQueueDistributionSpec" (5/5 passed)

References: Refs #3092
1ee55e3 to
60684b4
Compare
… by target path

Motivation: With multi-lane artery config (outbound-lanes > 1), all ActorSelection messages were routed to the same outbound queue because selectQueue used the anchor's UID (root guardian, always 0) as the distribution key: math.abs(0 % N) = 0 for any N. This concentrated all ActorSelection traffic on a single lane, creating a throughput bottleneck.

Modification: Handle ActorSelectionMessage in a dedicated case that distributes across lanes based on the target path elements hash instead of the anchor's UID. PriorityMessage ActorSelection (cluster heartbeats) continues to use the control queue. Uses (hash & Int.MaxValue) to guard against Integer.MIN_VALUE producing a negative queue index.

Result: ActorSelection messages are distributed across all outbound lanes by target path. Per-path message ordering is preserved (same path → same lane). PriorityMessage routing and all other message types are unaffected.

Tests:
- sbt "remote / Test / compile" (passes)
- sbt "remote / Test / testOnly *ActorSelectionQueueDistributionSpec" (5/5)
- CI will exercise the artery variants

References: Refs #3041, supersedes #3090
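The sign-bit guard mentioned in this commit message can be illustrated with a minimal sketch. The object and method names here are hypothetical, and `absIndex` is only an assumed counter-example for comparison, not code taken from the PR.

```scala
object QueueIndexSketch {
  // Masking with Int.MaxValue clears the sign bit, so the remainder
  // always falls in the range [0, lanes).
  def maskedIndex(hash: Int, lanes: Int): Int = (hash & Int.MaxValue) % lanes

  // Comparison only (assumed alternative): math.abs(Int.MinValue) is still
  // Int.MinValue, so taking the remainder afterwards can be negative.
  def absIndex(hash: Int, lanes: Int): Int = math.abs(hash) % lanes
}
```

With three lanes, `QueueIndexSketch.maskedIndex(Int.MinValue, 3)` yields 0, while `QueueIndexSketch.absIndex(Int.MinValue, 3)` yields -2, which would not be a valid queue index.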
He-Pin
added a commit
that referenced
this pull request
Jun 19, 2026
Verify that ActorSelection messages are distributed across outbound lanes based on target path hash:
- Different target paths map to different lanes
- Queue indices are always non-negative (Integer.MIN_VALUE safe)
- Same target path always maps to the same lane (ordering preserved)
- Single lane config works correctly
- Paths don't all concentrate on lane 0 (regression test for original bug)

Tests: sbt "remote / Test / testOnly *ActorSelectionQueueDistributionSpec" (5/5)

References: Refs #3092
60684b4 to
1b82391
Compare
pjfanning
reviewed
Jun 19, 2026
Member
#3089 test is failing in this PR. We appear to have broken this with some change over the last week or two. This test is now very flaky, and I don't recall it being an issue before.
1b82391 to
a07f152
Compare
The ActorSelection message path has inherent per-message overhead: synchronous deliverSelection on the inbound thread (path resolution), MessageContainerSerializer (larger payload), and path element encoding. With 1-lane configs (especially TLS-TCP), 1000 round-trips × 4 senders creates sustained pressure on the single inbound thread pipeline, causing the test to exceed the 30s timeout on CI. 100 round-trips still validates ordering and concurrency while being proportional to the overhead difference vs the ActorRef variant.

Tests: sbt "remote / Test / compile" (passes)

References: Refs #3092
pjfanning
reviewed
Jun 19, 2026
Motivation: PR #3092 distributes ActorSelection messages across outbound lanes, but inbound lane selection still used the wire recipient. For ActorSelection that recipient is typically the root guardian, so messages from one origin still concentrate on one inbound lane.

Modification: Partition inbound ActorSelection envelopes by the selected target path encoded in the SelectionEnvelope while preserving same target path and origin ordering. Add deterministic outbound and inbound lane distribution coverage. RemoteSendConsistencySpec is intentionally unchanged.

Result: ActorSelection traffic can use multiple inbound lanes without weakening same-target ordering or changing the wire protocol.

Tests:
- sbt "remote / Test / testOnly org.apache.pekko.remote.artery.ActorSelectionQueueDistributionSpec"
- sbt "remote / Test / testOnly org.apache.pekko.remote.artery.ArteryUpdSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryUpdSendConsistencyWithThreeLanesSpec org.apache.pekko.remote.artery.ArteryTcpSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryTcpSendConsistencyWithThreeLanesSpec org.apache.pekko.remote.artery.ArteryTlsTcpSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryTlsTcpSendConsistencyWithThreeLanesSpec -- -z actorSelection"
- scalafmt --mode diff-ref=origin/main
- scalafmt --list --mode diff-ref=origin/main
- git diff --check

References: Refs #3092
Motivation: The ActorSelection lane distribution fix should not rely on changing RemoteSendConsistencySpec.

Modification: Restore RemoteSendConsistencySpec to the main branch version so it is removed from the PR diff.

Result: The PR keeps the existing send consistency spec unchanged while retaining the inbound ActorSelection lane distribution fix and focused distribution coverage.

Tests: git diff --check. Focused tests were run before this restore:
- sbt "remote / Test / testOnly org.apache.pekko.remote.artery.ActorSelectionQueueDistributionSpec"
- sbt "remote / Test / testOnly org.apache.pekko.remote.artery.ArteryUpdSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryUpdSendConsistencyWithThreeLanesSpec org.apache.pekko.remote.artery.ArteryTcpSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryTcpSendConsistencyWithThreeLanesSpec org.apache.pekko.remote.artery.ArteryTlsTcpSendConsistencyWithOneLaneSpec org.apache.pekko.remote.artery.ArteryTlsTcpSendConsistencyWithThreeLanesSpec -- -z actorSelection"

References: Refs #3092
Member
#3089 is happening for tests that don't use actor selection, but maybe not quite as often.
Motivation: Inbound ActorSelection lane partitioning needs the selected target path hash, but parsing the full SelectionEnvelope in the inbound hot path adds avoidable allocation and CPU cost.

Modification: Scan only the SelectionEnvelope pattern fields with CodedInputStream when computing the lane hash. Skip unknown fields and fall back to the existing recipient uid hash if scanning fails. Add coverage for unknown fields so the read-side hash remains tolerant of rolling-upgrade wire data.

Result: ActorSelection lane distribution keeps the same wire format and avoids full protobuf object construction before normal deserialization.

Tests:
- sbt 'remote / Test / testOnly org.apache.pekko.remote.artery.ActorSelectionQueueDistributionSpec' (passed, 11 tests)
- git diff --check (passed)

References: Refs #3092
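The scan-and-skip idea in this commit message can be sketched without a protobuf dependency. This is not the PR's code and does not use CodedInputStream: it is a hand-rolled reader over the protobuf wire format, and `PatternScanSketch`, `laneHash`, and the field number 3 for the pattern field are assumptions made for illustration only.

```scala
import scala.util.control.NonFatal

object PatternScanSketch {
  // Assumed field number of the repeated pattern field (hypothetical here).
  val PatternField = 3

  // Minimal cursor over protobuf wire-format bytes.
  private final class Reader(bytes: Array[Byte]) {
    private var pos = 0
    def atEnd: Boolean = pos >= bytes.length

    // Base-128 varint; truncated input fails on the array bounds check.
    def readVarint(): Long = {
      var result = 0L
      var shift = 0
      var more = true
      while (more) {
        val b = bytes(pos)
        pos += 1
        result |= (b & 0x7fL) << shift
        shift += 7
        more = (b & 0x80) != 0
      }
      result
    }

    private def take(len: Long): Array[Byte] = {
      if (len < 0 || len > bytes.length - pos)
        throw new IllegalArgumentException("truncated field")
      val out = java.util.Arrays.copyOfRange(bytes, pos, pos + len.toInt)
      pos += len.toInt
      out
    }

    def readLengthDelimited(): Array[Byte] = take(readVarint())

    // Skips a field the scan does not need, based on its wire type.
    def skip(wireType: Int): Unit = wireType match {
      case 0     => readVarint(); ()
      case 1     => take(8); ()
      case 2     => readLengthDelimited(); ()
      case 5     => take(4); ()
      case other => throw new IllegalArgumentException(s"unsupported wire type $other")
    }
  }

  // Hashes only the pattern fields; any scanning failure returns the fallback.
  def laneHash(envelope: Array[Byte], fallback: Int): Int =
    try {
      val r = new Reader(envelope)
      var hash = 17
      while (!r.atEnd) {
        val tag = r.readVarint().toInt
        val field = tag >>> 3
        val wireType = tag & 7
        if (field == PatternField && wireType == 2)
          hash = 31 * hash + java.util.Arrays.hashCode(r.readLengthDelimited())
        else r.skip(wireType)
      }
      hash
    } catch { case NonFatal(_) => fallback }
}
```

Because fields other than the pattern field are skipped by wire type rather than decoded, an envelope carrying an extra unknown field hashes the same as one without it, and malformed bytes produce the caller-supplied fallback instead of an exception.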
…torSelection lane distribution

Motivation: The existing ActorSelectionQueueDistributionSpec only tested SelectChildName elements. SelectParent (type=0, no matcher) and SelectChildPattern (type=2, with matcher) were untested, leaving gaps in the CodedInputStream-based protobuf parser coverage.

Modification: Add 4 new test cases verifying hash consistency between ByteBuffer and SelectionEnvelope parsing for SelectParent and SelectChildPattern, and that different SelectionPathElement types produce distinct hashes.

Result: 15/15 tests pass, covering all three SelectionPathElement variants.

Tests: sbt "remote / Test / testOnly *ActorSelectionQueueDistributionSpec" (15/15 passed)

References: Refs #3092
Motivation
The "must be able to send messages with actorSelection concurrently preserving order" test flakes on CI: 4 sender actors each drive 1000 round-trips via ActorSelection (4004 messages total). With multi-lane artery config (outbound-lanes > 1), all ActorSelection messages are routed to the same outbound queue because selectQueue uses the anchor's UID as the distribution key.

The anchor for ActorSelection is the root guardian (RootActorPath), whose UID is always 0 (ActorCell.undefinedUid). So math.abs(0 % N) = 0 for any N: all ActorSelection traffic concentrates on lane 0 while other lanes sit idle.

In contrast, the ActorRef variant distributes across lanes because each echo actor has a distinct non-zero UID.
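The arithmetic above can be checked with a small sketch. `AnchorUidSketch` and `laneByUid` are hypothetical names; only the expression math.abs(uid % N) comes from the description, and the sample UIDs are made up.

```scala
object AnchorUidSketch {
  // The distribution expression quoted in the description, keyed by UID.
  def laneByUid(uid: Int, lanes: Int): Int = math.abs(uid % lanes)

  // UID 0 (the root guardian anchor) lands on lane 0 for every lane count.
  val selectionLanes: Seq[Int] = (1 to 8).map(n => laneByUid(0, n))

  // Made-up non-zero UIDs, standing in for distinct echo actors, spread out.
  val refLanes: Seq[Int] = Seq(101, 102, 103).map(uid => laneByUid(uid, 3))
}
```

Here `selectionLanes` contains only zeros, while `refLanes` is `Seq(2, 0, 1)`, so the three sample UIDs cover all three lanes.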
This is not a recent regression: the selectQueue logic has been unchanged since the Pekko fork from Akka. PR #3090 (timeout widening) masks the symptom but doesn't fix the structural bottleneck.
Modification
Add a dedicated case sel: ActorSelectionMessage (non-PriorityMessage) in Association.send that computes the queue index from the selection's target path elements hash instead of the anchor's UID.

This distributes ActorSelection messages across lanes by their target path while preserving per-path message ordering (same target path → same hash → same lane).
PriorityMessage ActorSelection (used by cluster heartbeats) continues to go through the control queue unchanged.
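A minimal sketch of path-keyed lane selection, assuming the target path elements can be modelled as a `Seq[String]`. `PathLaneSketch`, `PathElements`, and `laneFor` are hypothetical names, and the PR's actual element type and hash function may differ.

```scala
object PathLaneSketch {
  // Hypothetical stand-in for a selection's target path elements.
  type PathElements = Seq[String]

  // Equal element sequences have equal hash codes, so a given target path
  // always resolves to the same lane, which keeps per-path ordering.
  def laneFor(elements: PathElements, lanes: Int): Int =
    (elements.hashCode & Int.MaxValue) % lanes
}
```

Calling `PathLaneSketch.laneFor(Seq("user", "echo-1"), 3)` repeatedly returns the same lane, while other paths are free to hash to other lanes.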
Result
ActorSelection messages are now distributed across all outbound lanes based on target path, eliminating the single-lane throughput bottleneck. The existing ActorRef-based test and PriorityMessage routing are unaffected.
Tests
sbt "remote / Test / compile": compile check
References
Refs #3041 (previous timeout widening), supersedes #3090