PHOENIX-7870 :- Per-HA-group poller futures and url1/url2 alternation in GetClusterRoleRecordUtil #2490
Open
lokiore wants to merge 1 commit into
… in GetClusterRoleRecordUtil Bug 1 — Per-HA-group future tracking Replaces the single static volatile pollerFuture field (which was overwritten on every schedulePoller call regardless of haGroupName, so cancelling one HA group's poller would target whichever future was scheduled most recently — possibly belonging to a different HA group) with a ConcurrentHashMap<String, ScheduledFuture<?>> keyed by haGroupName. Symmetric handling for the existing schedulerMap (now also removed from the map on the active-CRR cancel path). Bug 2 — url1/url2 alternation each tick Replaces the single-URL poller (which would stall progress if its target cluster's RegionServer Endpoint became transiently unreachable while the peer cluster held the Active role) with even/odd-tick alternation between url1 and url2. Method signatures updated: fetchClusterRoleRecord and schedulePoller now accept both URLs explicitly. Generated-by: Claude Code (Opus 4.7)
What changes were proposed in this pull request?
Two correctness fixes in GetClusterRoleRecordUtil's non-active CRR poller infrastructure:
Bug 1: Per-HA-group future tracking. The previous implementation kept a single static volatile ScheduledFuture<?> pollerFuture field that was overwritten by every schedulePoller(...) invocation regardless of haGroupName. When the active-CRR detection branch later cancelled pollerFuture, it cancelled whichever future had been scheduled most recently, which could belong to a different HA group than the one whose lambda was running. Replaced with a ConcurrentHashMap<String, ScheduledFuture<?>> futureMap keyed by haGroupName. Symmetric handling for the pre-existing schedulerMap (now final, removed from the map on the active-CRR cancel path).
Bug 2: url1/url2 alternation each tick. The previous implementation pinned each scheduled poller to the single URL passed in at schedule time. If that cluster's RegionServer Endpoint became transiently unreachable, the poller could never observe the peer cluster's CRR, even after the peer became Active. The poller now alternates between url1 and url2 each tick (even ticks → url1, odd ticks → url2). A failed tick still increments the counter so alternation continues uninterrupted on the next iteration.
Method signatures updated: fetchClusterRoleRecord(url1, url2, primaryUrl, haGroupName, ...) and schedulePoller(url1, url2, haGroupName, ...). Caller sites in HighAvailabilityGroup.getClusterRoleRecordFromEndpoint updated to pass both URLs explicitly while preserving existing per-call-site primary-URL ordering.
JIRA: https://issues.apache.org/jira/browse/PHOENIX-7870
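For reviewers who want to see the Bug 2 behaviour in isolation, here is a minimal sketch of even/odd alternation with a counter that advances even when a tick fails. It is not the code from this PR: the class name TickUrlSelector, the signature of selectUrlForTick, and the fetch stand-in are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

class TickUrlSelector {
    // Even ticks -> url1, odd ticks -> url2. Testing the low bit avoids the
    // sign surprises that `tick % 2 == 1` would have for negative values.
    static String selectUrlForTick(long tick, String url1, String url2) {
        return (tick & 1L) == 0L ? url1 : url2;
    }

    // Hypothetical stand-in for the real endpoint call; it fails for url1 only.
    static String fetch(String url) {
        if (url.equals("url1")) {
            throw new IllegalStateException("unreachable: " + url);
        }
        return "CRR from " + url;
    }

    public static void main(String[] args) {
        AtomicLong tickCounter = new AtomicLong();
        for (int i = 0; i < 4; i++) {
            // Increment before the fetch so a failed tick still advances the counter.
            long tick = tickCounter.getAndIncrement();
            String url = selectUrlForTick(tick, "url1", "url2");
            try {
                System.out.println("tick " + tick + ": " + fetch(url));
            } catch (IllegalStateException e) {
                System.out.println("tick " + tick + ": " + e.getMessage());
            }
        }
    }
}
```

Because the counter is read and incremented before the fetch, an unreachable url1 costs only every other tick; the peer at url2 is still polled on the odd ticks.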
Why are the changes needed?
Both bugs surface in deployments where multiple HA groups are configured against the same JVM, or where one of the two clusters' RegionServer Endpoints experiences a transient outage. Bug 1 can cancel an unrelated HA group's poller (silent failure of the cancelled group's recovery loop). Bug 2 can stall non-active CRR detection indefinitely if the polled URL's cluster is the one having issues, even when the peer cluster has already become Active.
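The Bug 1 failure mode can be illustrated with a small sketch of per-group future tracking. This is a simplification under stated assumptions, not the PR's implementation: the class PerGroupPollers and its methods cancelPoller, isPolling, and shutdown are hypothetical, and a single shared scheduler is used here where the PR keeps a schedulerMap.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class PerGroupPollers {
    // One future per HA group, so a cancel can only ever hit its own group.
    private final ConcurrentHashMap<String, ScheduledFuture<?>> futureMap =
        new ConcurrentHashMap<>();
    // Assumption: one shared scheduler, to keep the sketch short.
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    void schedulePoller(String haGroupName, Runnable task, long periodMs) {
        ScheduledFuture<?> future = scheduler.scheduleWithFixedDelay(
            task, periodMs, periodMs, TimeUnit.MILLISECONDS);
        // Replace any earlier poller for the same group and cancel it.
        ScheduledFuture<?> previous = futureMap.put(haGroupName, future);
        if (previous != null) {
            previous.cancel(false);
        }
    }

    // Cancels only the named group's poller; returns false if none was registered.
    boolean cancelPoller(String haGroupName) {
        ScheduledFuture<?> future = futureMap.remove(haGroupName);
        return future != null && future.cancel(false);
    }

    boolean isPolling(String haGroupName) {
        ScheduledFuture<?> future = futureMap.get(haGroupName);
        return future != null && !future.isCancelled();
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

With a single shared field, scheduling groupB after groupA would leave the field pointing at groupB's future, so a cancel issued on behalf of groupA would stop groupB. Keying by haGroupName removes that coupling.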
Does this PR introduce any user-facing change?
No
The only signature changes are on package-private internal methods (schedulePoller) and on the public utility entry point fetchClusterRoleRecord, which is consumed only by HighAvailabilityGroup (within phoenix-core-client). External consumers do not call these directly.
How was this patch tested?
New unit test class GetClusterRoleRecordUtilTest (4 tests, all PASS):
- testSelectUrlForTickAlternates: verifies even/odd alternation across the first six ticks
- testSelectUrlForTickHandlesLargeTickValues: guards against sign issues at large tick values including Long.MAX_VALUE
- testFutureMapIsolatesEntriesPerHaGroup: verifies distinct HA groups produce distinct future-map entries (Bug 1 invariant)
- testCancelOneHaGroupDoesNotCancelOthers: verifies cancelling one HA group's poller leaves peers untouched (Bug 1 behavioural invariant)
Local commands run on PHOENIX-7562-feature-new HEAD:
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)