
PHOENIX-7870 :- Per-HA-group poller futures and url1/url2 alternation in GetClusterRoleRecordUtil#2490

Open
lokiore wants to merge 1 commit into apache:PHOENIX-7562-feature-new from lokiore:PHOENIX-7870-poller-bug-fixes


@lokiore lokiore commented May 28, 2026

What changes were proposed in this pull request?

Two correctness fixes in GetClusterRoleRecordUtil's non-active CRR poller infrastructure:

Bug 1: Per-HA-group future tracking. The previous implementation kept a single static volatile ScheduledFuture<?> pollerFuture field that every schedulePoller(...) invocation overwrote, regardless of haGroupName. When the active-CRR detection branch later cancelled pollerFuture, it cancelled whichever future had been scheduled most recently, which could belong to a different HA group than the one whose lambda was running. The field is replaced with a ConcurrentHashMap<String, ScheduledFuture<?>> futureMap keyed by haGroupName. The pre-existing schedulerMap is handled symmetrically: it is now final, and the HA group's entry is removed from it on the active-CRR cancel path.
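The per-group bookkeeping described for Bug 1 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name PerGroupPollerSketch, the cancelPoller and isPolling helpers, the simplified schedulePoller signature, the daemon-thread factory, and the choice to shut the scheduler down on cancel are all assumptions made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one future and one scheduler per HA group name.
final class PerGroupPollerSketch {
    private static final ConcurrentMap<String, ScheduledFuture<?>> futureMap =
        new ConcurrentHashMap<>();
    private static final ConcurrentMap<String, ScheduledExecutorService> schedulerMap =
        new ConcurrentHashMap<>();

    // Simplified signature (assumption): the real method also takes url1/url2.
    static void schedulePoller(String haGroupName, Runnable poll, long periodMs) {
        ScheduledExecutorService scheduler = schedulerMap.computeIfAbsent(
            haGroupName,
            name -> Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "crr-poller-" + name);
                t.setDaemon(true); // assumption: keep the sketch from blocking JVM exit
                return t;
            }));
        ScheduledFuture<?> future =
            scheduler.scheduleAtFixedRate(poll, 0L, periodMs, TimeUnit.MILLISECONDS);
        // Store under this group's key only; other groups' entries are untouched.
        ScheduledFuture<?> previous = futureMap.put(haGroupName, future);
        if (previous != null) {
            previous.cancel(false); // assumption: a re-schedule replaces the old future
        }
    }

    // Cancel path: remove and cancel only the entries that belong to this group.
    static void cancelPoller(String haGroupName) {
        ScheduledFuture<?> future = futureMap.remove(haGroupName);
        if (future != null) {
            future.cancel(false);
        }
        ScheduledExecutorService scheduler = schedulerMap.remove(haGroupName);
        if (scheduler != null) {
            scheduler.shutdown(); // assumption: the PR only states the entry is removed
        }
    }

    static boolean isPolling(String haGroupName) {
        ScheduledFuture<?> future = futureMap.get(haGroupName);
        return future != null && !future.isCancelled();
    }
}
```

With a single shared field, the cancel path has no way to know which group's future it holds; keying by haGroupName makes the lookup and the cancellation refer to the same group by construction.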

Bug 2: url1/url2 alternation each tick. The previous implementation pinned each scheduled poller to the single URL passed in at schedule time. If that cluster's RegionServer Endpoint became transiently unreachable, the poller could not observe the peer cluster's CRR for the duration of the outage, even after the peer became Active. The poller now alternates between url1 and url2 each tick (even ticks → url1, odd ticks → url2). A failed tick still increments the counter, so alternation continues uninterrupted on the next iteration.
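The even/odd alternation for Bug 2 might look like the sketch below. The name selectUrlForTick is taken from the test names in this PR, but its signature here is a guess; the class UrlAlternationSketch, the pollOnce method, and the Function-based fetch callback are hypothetical stand-ins for the real endpoint call.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical sketch of per-tick URL alternation.
final class UrlAlternationSketch {
    private final AtomicLong tickCounter = new AtomicLong();

    // Even ticks pick url1, odd ticks pick url2. Testing the low bit rather than
    // using % avoids sign surprises at extreme values such as Long.MAX_VALUE.
    static String selectUrlForTick(long tick, String url1, String url2) {
        return (tick & 1L) == 0L ? url1 : url2;
    }

    // The counter is advanced before the fetch is attempted, so a failed tick
    // still moves the next tick on to the other URL.
    String pollOnce(String url1, String url2, Function<String, String> fetch) {
        String url = selectUrlForTick(tickCounter.getAndIncrement(), url1, url2);
        try {
            return fetch.apply(url);
        } catch (RuntimeException e) {
            return null; // assumption: a failed tick yields no record and is retried later
        }
    }
}
```

Incrementing before the fetch, rather than after a successful one, is what keeps a persistently failing URL from being retried on every tick.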

Method signatures updated: fetchClusterRoleRecord(url1, url2, primaryUrl, haGroupName, ...) and schedulePoller(url1, url2, haGroupName, ...). Caller sites in HighAvailabilityGroup.getClusterRoleRecordFromEndpoint updated to pass both URLs explicitly while preserving existing per-call-site primary-URL ordering.

JIRA: https://issues.apache.org/jira/browse/PHOENIX-7870

Why are the changes needed?

Both bugs surface in deployments where multiple HA groups are configured against the same JVM, or where one of the two clusters' RegionServer Endpoints experiences a transient outage. Bug 1 can cancel an unrelated HA group's poller (silent failure of the cancelled group's recovery loop). Bug 2 can stall non-active CRR detection indefinitely if the polled URL's cluster is the one having issues, even when the peer cluster has already become Active.

Does this PR introduce any user-facing change?

No

The only signature changes are to the package-private internal method schedulePoller and to the public utility entry point fetchClusterRoleRecord, which is consumed only by HighAvailabilityGroup (within phoenix-core-client). External consumers do not call either method directly.

How was this patch tested?

New unit test class GetClusterRoleRecordUtilTest (4 tests, all PASS):

  • testSelectUrlForTickAlternates — verifies even/odd alternation across the first six ticks
  • testSelectUrlForTickHandlesLargeTickValues — guards against sign issues at large tick values including Long.MAX_VALUE
  • testFutureMapIsolatesEntriesPerHaGroup — verifies distinct HA groups produce distinct future-map entries (Bug 1 invariant)
  • testCancelOneHaGroupDoesNotCancelOthers — verifies cancelling one HA group's poller leaves peers untouched (Bug 1 behavioural invariant)

Local commands run on PHOENIX-7562-feature-new HEAD:

mvn install -DskipTests                                                # full repo install — BUILD SUCCESS
mvn -pl phoenix-core-client compile                                    # prod-only compile — BUILD SUCCESS
mvn -pl phoenix-core test -Dtest=GetClusterRoleRecordUtilTest          # Tests run: 4, Failures: 0, Errors: 0, Skipped: 0

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

… in GetClusterRoleRecordUtil

Bug 1 — Per-HA-group future tracking
  Replaces the single static volatile pollerFuture field (which was overwritten
  on every schedulePoller call regardless of haGroupName, so cancelling one HA
  group's poller would target whichever future was scheduled most recently —
  possibly belonging to a different HA group) with a ConcurrentHashMap<String,
  ScheduledFuture<?>> keyed by haGroupName. Symmetric handling for the existing
  schedulerMap (now also removed from the map on the active-CRR cancel path).

Bug 2 — url1/url2 alternation each tick
  Replaces the single-URL poller (which would stall progress if its target
  cluster's RegionServer Endpoint became transiently unreachable while the
  peer cluster held the Active role) with even/odd-tick alternation between
  url1 and url2. Method signatures updated: fetchClusterRoleRecord and
  schedulePoller now accept both URLs explicitly.

Generated-by: Claude Code (Opus 4.7)
@lokiore lokiore force-pushed the PHOENIX-7870-poller-bug-fixes branch from 1782804 to 13b6500 Compare May 28, 2026 22:25