Skip to content

test(e2e): add WaitForReplication after first switchover/failover#390

Open
tsivaprasad wants to merge 1 commit into
mainfrom
PLAT-612-enhance-failover-switchover-e-2-e-test-to-validate-replication
Open

test(e2e): add WaitForReplication after first switchover/failover#390
tsivaprasad wants to merge 1 commit into
mainfrom
PLAT-612-enhance-failover-switchover-e-2-e-test-to-validate-replication

Conversation

@tsivaprasad
Copy link
Copy Markdown
Contributor

@tsivaprasad tsivaprasad commented May 21, 2026

Summary

This PR adds a db.WaitForReplication call after the first automatic switchover in TestSwitchoverScenarios to verify replication recovery after a topology change, and also adds the same check after the first automatic failover in TestFailoverScenarios.

Changes

  • Add db.WaitForReplication after the first automatic switchover in TestSwitchoverScenarios
  • Add db.WaitForReplication after the first automatic failover in TestFailoverScenarios

Testing

Verification:

make test-e2e E2E_FIXTURE=lima E2E_RUN=TestSwitchoverScenarios                      
/Users/sivat/go/bin/gotestsum  \
		--format-hide-empty-pkg \
		--format standard-verbose \
		--rerun-fails=0 \
		--rerun-fails-max-failures=4 \
		--packages='./e2e/...' \
		-- \
		-tags=e2e_test -count=1 -timeout=45m -parallel 4 -run TestSwitchoverScenarios -args -fixture lima   
2026/05/25 11:28:35 initializing cluster
2026/05/25 11:28:35 cluster initialized
=== RUN   TestSwitchoverScenarios
=== PAUSE TestSwitchoverScenarios
=== CONT  TestSwitchoverScenarios
=== RUN   TestSwitchoverScenarios/automatic_candidate_selection_(no_candidate_specified)
    switchover_test.go:130: [auto] original primary: df0c6351-b081-444a-85d8-54600366e28a-n1-689qacsi
    switchover_test.go:142: [auto] new primary: df0c6351-b081-444a-85d8-54600366e28a-n1-9ptayhma
    database_test.go:357: [TestSwitchoverScenarios/automatic_candidate_selection_(no_candidate_specified)] waiting for replication to catch up on all nodes
--- PASS: TestSwitchoverScenarios/automatic_candidate_selection_(no_candidate_specified) (26.03s)
=== RUN   TestSwitchoverScenarios/switchover_to_a_specific_candidate
    switchover_test.go:153: [specific] current primary: df0c6351-b081-444a-85d8-54600366e28a-n1-9ptayhma, candidate (ready): df0c6351-b081-444a-85d8-54600366e28a-n1-689qacsi
    switchover_test.go:165: [specific] new primary confirmed: df0c6351-b081-444a-85d8-54600366e28a-n1-689qacsi
--- PASS: TestSwitchoverScenarios/switchover_to_a_specific_candidate (26.03s)
=== RUN   TestSwitchoverScenarios/scheduled_switchover
    switchover_test.go:188: [scheduled] hostNow=2026-05-25T06:00:29Z serverNow=2026-05-25T05:57:38Z skew=-2m51s scheduled_at=2026-05-25T05:59:38Z (delta=-51s)
    switchover_test.go:211: [scheduled] new primary confirmed: df0c6351-b081-444a-85d8-54600366e28a-n1-9ptayhma
--- PASS: TestSwitchoverScenarios/scheduled_switchover (136.38s)
=== RUN   TestSwitchoverScenarios/invalid_candidate_instance
    switchover_test.go:222: [invalid] got expected error: failed to call switchover api: not_found: instance not found
--- PASS: TestSwitchoverScenarios/invalid_candidate_instance (0.00s)
=== RUN   TestSwitchoverScenarios/concurrent_switchover_requests
    switchover_test.go:247: [concurrent] second request returned expected error: failed to call switchover api: server_error: switchover already in progress for database df0c6351-b081-444a-85d8-54600366e28a node n1
--- PASS: TestSwitchoverScenarios/concurrent_switchover_requests (25.03s)
    fixture_test.go:200: cleaning up database df0c6351-b081-444a-85d8-54600366e28a
--- PASS: TestSwitchoverScenarios (289.91s)
PASS
ok  	github.com/pgEdge/control-plane/e2e	290.413s

DONE 6 tests in 290.413s
make test-e2e E2E_FIXTURE=lima E2E_RUN=TestFailoverScenarios
/Users/sivat/go/bin/gotestsum  \
		--format-hide-empty-pkg \
		--format standard-verbose \
		--rerun-fails=0 \
		--rerun-fails-max-failures=4 \
		--packages='./e2e/...' \
		-- \
		-tags=e2e_test -count=1 -timeout=45m -parallel 4 -run TestFailoverScenarios -args -fixture lima   
2026/05/25 11:36:21 initializing cluster
2026/05/25 11:36:21 cluster initialized
=== RUN   TestFailoverScenarios
=== PAUSE TestFailoverScenarios
=== CONT  TestFailoverScenarios
=== RUN   TestFailoverScenarios/automatic_candidate_selection_(no_candidate_specified)
    failover_test.go:125: [auto] original primary: 1e06bf4a-8222-4631-8607-8ba0c20d4ea9-n1-689qacsi
    failover_test.go:140: [auto] new primary: 1e06bf4a-8222-4631-8607-8ba0c20d4ea9-n1-9ptayhma
    database_test.go:357: [TestFailoverScenarios/automatic_candidate_selection_(no_candidate_specified)] waiting for replication to catch up on all nodes
--- PASS: TestFailoverScenarios/automatic_candidate_selection_(no_candidate_specified) (26.03s)
=== RUN   TestFailoverScenarios/failover_to_a_specific_candidate
    failover_test.go:150: [specific] current primary: 1e06bf4a-8222-4631-8607-8ba0c20d4ea9-n1-9ptayhma, candidate (ready): 1e06bf4a-8222-4631-8607-8ba0c20d4ea9-n1-689qacsi
    failover_test.go:165: [specific] new primary confirmed: 1e06bf4a-8222-4631-8607-8ba0c20d4ea9-n1-689qacsi
--- PASS: TestFailoverScenarios/failover_to_a_specific_candidate (25.03s)
=== RUN   TestFailoverScenarios/invalid_candidate_instance
    failover_test.go:176: [invalid] got expected error: failed to call failover api: not_found: instance not found
--- PASS: TestFailoverScenarios/invalid_candidate_instance (0.01s)
=== RUN   TestFailoverScenarios/concurrent_failover_requests
    failover_test.go:201: [concurrent] second request returned expected error: failed to call failover api: server_error: failover already in progress for database 1e06bf4a-8222-4631-8607-8ba0c20d4ea9 node n1
--- PASS: TestFailoverScenarios/concurrent_failover_requests (6.52s)
=== RUN   TestFailoverScenarios/skip_validation_behavior
    failover_test.go:211: [skip-validation] original primary: 1e06bf4a-8222-4631-8607-8ba0c20d4ea9-n1-689qacsi
    failover_test.go:230: [skip-validation] got expected rejection without skip_validation: task status is 'failed' instead of 'completed', error=cluster is healthy; refuse failover unless skip_validation is true
--- PASS: TestFailoverScenarios/skip_validation_behavior (32.04s)
    fixture_test.go:200: cleaning up database 1e06bf4a-8222-4631-8607-8ba0c20d4ea9
--- PASS: TestFailoverScenarios (165.12s)
PASS
ok  	github.com/pgEdge/control-plane/e2e	165.615s

DONE 6 tests in 165.615s

Checklist

  • Tests added or updated (unit and/or e2e, as needed)

PLAT-612

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

The PR adds replication health verification to two E2E test scenarios. Both the failover and switchover tests now wait for replication to become healthy after confirming a primary change in their automatic candidate selection flows, ensuring consistent state before subsequent assertions execute.

Changes

E2E Test Replication Health Verification

Layer / File(s) Summary
Replication health verification in automatic candidate selection tests
e2e/failover_test.go, e2e/switchover_test.go
Added db.WaitForReplication(...) calls after primary change confirmation in both "automatic candidate selection" scenarios to validate replication health before proceeding with test assertions.

Poem

🐰 A rabbit hops through tests with care,
Waiting for replication fair,
When candidates switch with grace,
Health checks ensure the right pace! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed PR description follows the template with all required sections (Summary, Changes, Testing, Checklist) completed with clear, specific details.
Title check ✅ Passed The title accurately describes the main change: adding WaitForReplication calls after switchover/failover events in e2e tests, which aligns with the PR's objective to validate replication health.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch PLAT-612-enhance-failover-switchover-e-2-e-test-to-validate-replication

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 21, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (12)
e2e/multi_node_replication_test.go (3)

19-27: 💤 Low value

25-minute deadline may be too tight for six sequential topology subtests.

The outer context is 25 minutes, and each topology subtest performs a switchover/failover followed by assertAllNodesReplicationHealthy (which can spend up to ~9 minutes per call given the per-helper 90s timeout for a two-node cluster) plus a WaitForReplication (now 120s). Under slow CI or post-topology stalls, the cumulative budget can exceed 25 minutes and the failures will surface as context-cancellation errors rather than the actual condition that hung. Consider bumping to 35–45 minutes or tightening the per-helper timeouts inside assertAllNodesReplicationHealthy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/multi_node_replication_test.go` around lines 19 - 27, The 25-minute outer
context in TestMultiNodeReplicationHealth is too short for six sequential
topology subtests (each calling assertAllNodesReplicationHealthy and
WaitForReplication); extend the context timeout created by context.WithTimeout
to 35–45 minutes (e.g., 40 minutes) or alternatively reduce the per-helper
timeouts used inside assertAllNodesReplicationHealthy and the WaitForReplication
call to keep total runtime under the existing 25-minute limit; update the
timeout value where context.WithTimeout is called in
TestMultiNodeReplicationHealth and/or adjust the helper timeout constants used
by assertAllNodesReplicationHealthy and WaitForReplication accordingly.

282-332: 💤 Low value

repl_integrity table is created but never dropped, and DDL replication is assumed.

Two concerns worth verifying:

  1. CREATE TABLE IF NOT EXISTS repl_integrity runs on the n1 primary, and the subsequent COUNT(*) runs on every node's primary including n2. This only works if your Spock setup auto-replicates DDL (e.g., spock.replicate_ddl_command is on or DDL is in a replication set). If a future configuration change disables that, this subtest will fail with a confusing "relation does not exist" instead of a clear contract violation — worth adding an explicit assertion or a comment.
  2. The table is left behind in the test database. Since the fixture is per-test it's harmless today, but if anyone later reuses the fixture across tests the ON CONFLICT upsert hides leaked state. A t.Cleanup that drops the table would make the subtest self-contained.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/multi_node_replication_test.go` around lines 282 - 332, The test creates
a repl_integrity table on n1 but never drops it and assumes DDL replication;
update the subtest to (1) register a t.Cleanup that drops the repl_integrity
table (use the same db.WithConnection + connOpts("n1") pattern and DROP TABLE IF
EXISTS repl_integrity) so the table is removed after the test, and (2) add an
explicit assertion after db.WaitForReplication that the relation exists on each
node (use db.WithConnection + connOpts(node.Name) and a simple query like SELECT
to confirm the table exists or fail with a clear message) to make the
DDL-replication contract explicit (references: repl_integrity, CREATE TABLE IF
NOT EXISTS repl_integrity, t.Cleanup, db.WithConnection, connOpts,
db.WaitForReplication).

63-71: ⚡ Quick win

Ignoring the waitFor result lets the test proceed against a still-creating/modifying cluster.

If the 90s budget elapses without all instances settling, waitFor returns false and the test silently continues into the baseline WaitForReplication and topology subtests. The downstream failure will then be obscure (timeout in some assertion deep inside a subtest) instead of a clear “cluster did not finish provisioning”. A require.Truef(t, waitFor(...), "cluster did not settle within 90s") makes the failure mode actionable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/multi_node_replication_test.go` around lines 63 - 71, The test currently
ignores the boolean result of waitFor that polls db.Refresh over db.Instances,
which can let the test proceed while instances remain "creating" or "modifying";
change the call so the test asserts the result (e.g., use require.Truef(t,
waitFor(...), "cluster did not settle within 90s")) so that if waitFor returns
false the test fails immediately with a clear message referencing the cluster
did not finish provisioning; keep the same waitFor predicate (calling
db.Refresh(ctx) and checking inst.State) and the 90*time.Second timeout but wrap
it with the require.Truef assertion using the test handle t.
e2e/database_test.go (1)

396-404: 💤 Low value

Extract the 120-second budget into a constant.

The 120 literal appears both in the wait_for_sync_event SQL argument and in the assertion message. They must stay in lockstep; a small const waitForSyncEventTimeoutSecs = 120 (or a time.Duration) used in both places would prevent silent drift if the budget is tuned again later.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/database_test.go` around lines 396 - 404, Extract the hardcoded 120 into
a named constant (e.g. waitForSyncEventTimeoutSecs or waitForSyncEventTimeout)
and use it in both the SQL call to spock.wait_for_sync_event and the assertion
message; locate the block that calls row := conn.QueryRow(ctx, "CALL
spock.wait_for_sync_event(true, $1, $2::pg_lsn, 120);", peerNode, lsn) and the
subsequent assert.Truef that prints "within 120s", replace the literal with the
constant (or a time.Duration converted to seconds for the message) so the
timeout value stays in sync across the QueryRow call and the assertion
formatting.
e2e/replication_health_test.go (3)

305-341: 💤 Low value

timeout is applied per-helper-per-node, so worst-case wall time scales with nodes × helpers × timeout.

assertAllNodesReplicationHealthy invokes three polling helpers per node, each with its own full timeout budget (currently 90s). For a 2-node cluster this is up to 2 × 3 × 90s = 9m per call, and the helper is called several times per scenario in the new multi-node test inside an outer 25m budget. Consider deriving per-helper timeouts from a shared deadline (e.g., context.WithTimeout derived from timeout) or shrinking individual budgets so a slow path can't silently consume most of the test budget before failing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/replication_health_test.go` around lines 305 - 341,
assertAllNodesReplicationHealthy currently gives each helper
(assertReplicationSlotsHealthy, assertSpockSubscriptionsHealthy,
assertNoStaleSlots, assertReplicationLagNearZero) the full timeout, causing
worst-case wall time to multiply by nodes×helpers; change it to derive a shared
deadline from the provided timeout (e.g., ctx, cancel :=
context.WithTimeout(ctx, timeout) or use context.WithDeadline) and then compute
per-node or per-helper sub-timeouts from the remaining time before the deadline,
or pass the derived ctx with a per-helper timeout based on remaining time
(remaining / remainingHelpers) to each helper so the total work cannot exceed
the overall timeout budget. Ensure you pass the derived contexts into the helper
calls in assertAllNodesReplicationHealthy rather than the original raw timeout
value.

182-226: ⚡ Quick win

Skipping on connect failure can mask a real regression.

assertNoStaleSlots is the only helper here that silently returns when the primary connection fails (the others retry under waitFor). After a failover the primary may indeed be transiently unreachable, but if it stays unreachable for the duration of the test you'll never know — the assertion logs and returns successfully. Consider giving this helper a short retry budget (similar to the other helpers) so that a persistent connect failure surfaces as a test failure instead of a silent pass.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/replication_health_test.go` around lines 182 - 226, assertNoStaleSlots
currently returns silently on ConnectToInstance errors which can mask persistent
failures; change it to retry the connection for a short budget (e.g., a few
seconds/attempts) similar to other helpers that use waitFor. Specifically, wrap
the db.ConnectToInstance call in a small retry loop or use the existing waitFor
helper to attempt ConnectToInstance for a short timeout before giving up and
calling t.Fatalf or require.NoError, so that persistent inability to reach the
primary surfaces as a test failure; keep the rest of the function (rows
query/scan/assertions) unchanged and reference assertNoStaleSlots and
ConnectToInstance when making the change.

87-89: ⚡ Quick win

Silent rows.Err() failures hurt diagnostics.

When rows.Err() is non-nil the helper returns false and retries with no log line, so a persistent driver-level error (e.g., connection reset mid-scan) will only surface as the generic “did not become healthy within …” timeout. Mirror the logging done for connect/query/scan failures so the actual error is visible in the test output.

Proposed diff
-		if rows.Err() != nil {
-			return false
-		}
+		if err := rows.Err(); err != nil {
+			t.Logf("[replication-slots][%s] rows iteration failed: %v", nodeName, err)
+			return false
+		}

(apply the equivalent change in assertSpockSubscriptionsHealthy and assertReplicationLagNearZero.)

Also applies to: 159-161, 292-294

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/replication_health_test.go` around lines 87 - 89, The checks that do "if
rows.Err() != nil { return false }" silently drop the driver error; update each
occurrence (including the ones inside assertSpockSubscriptionsHealthy and
assertReplicationLagNearZero) to log the actual rows.Err() before returning
false so test output shows the root cause. Locate the rows.Err() checks in the
helper(s) and change them to call the test logger (e.g., t.Logf or the file's
existing logger) with a clear message including rows.Err() (matching the style
used for connect/query/scan failures), then return false.
e2e/failover_test.go (1)

99-114: ⚡ Quick win

waitForPrimaryChange / waitForPrimaryIs duplicated in three test files.

These two helpers (plus waitForReadyReplica and getPrimaryInstanceID) are now nearly identical in failover_test.go, switchover_test.go, and multi_node_replication_test.go. Hoisting them to a shared helper (in replication_health_test.go or a new topology_helpers_test.go) would keep their logging conventions and edge-case behavior in one place — useful since the elapsed-time + periodic-log treatment was just added in three places at once.

Also applies to: 116-131

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/failover_test.go` around lines 99 - 114, Extract the duplicated helpers
waitForPrimaryChange, waitForPrimaryIs, waitForReadyReplica, and
getPrimaryInstanceID into a single shared test helper file (e.g.,
topology_helpers_test.go) and update the tests to call those shared functions;
make the helpers package-level (unexported) functions that accept the same
parameters they currently use (e.g., t *testing.T, ctx context.Context, db
*YourDBType or whatever the test passes) so existing call sites don’t need
signature changes, preserve the elapsed/periodic logging behavior and return
values, remove the duplicated implementations from failover_test.go,
switchover_test.go, and multi_node_replication_test.go, and ensure the new
helper file imports testing, time, and any other packages used by the helpers.
e2e/switchover_test.go (1)

252-291: 💤 Low value

Cancel-verification doesn't wait past scheduledAt, so it only proves the task was marked canceled, not that execution was prevented.

The subtest cancels well before scheduledAt (+5m) and checks immediately that the primary is unchanged. Since the scheduler wouldn't have fired yet anyway, this confirms the cancel API succeeded but doesn't actually exercise the “canceled task never executes” guarantee. If you want this subtest to also catch a regression where a canceled task still fires, either schedule much sooner (e.g., +30s) and wait past scheduledAt before re-checking the primary, or keep this as-is and add a brief comment that it's a cancel-API smoke test rather than an end-to-end suppression test.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/switchover_test.go` around lines 252 - 291, The test schedules a
switchover far in the future (+5m) then cancels and immediately asserts the
primary unchanged, which only verifies cancellation API success not that the
task never executed; either (A) make this an actual suppression test by changing
scheduledAt to something soon (e.g., srvNow.Add(30*time.Second) in the
t.Run("cancel scheduled switchover" block) and after calling
fixture.Client.CancelDatabaseTask wait until after scheduledAt (e.g., sleep or
loop until time.Now().After(scheduledAt.Add(10*time.Second))) before calling
db.Refresh() and asserting getPrimaryInstanceID() equals origPrimary, or (B) if
you want to keep the long schedule, add a short clarifying comment above the
assertions stating this is only a cancel-API smoke test and not an end-to-end
suppression verification; update the code around serverNowUTC, scheduledAt,
fixture.Client.SwitchoverDatabaseNode, CancelDatabaseTask, WaitForDatabaseTask,
and db.Refresh accordingly.
server/internal/database/subscription_resource.go (1)

111-114: 💤 Low value

Silently ignoring all errors may hide connection or query issues.

The comment says errors are ignored because slot name is informational, but this also swallows connection failures or SQL syntax errors that could indicate deeper problems. Consider logging at debug/trace level when an error occurs.

💡 Suggested improvement to log errors at debug level
 	// Errors ignored: slot name is informational; absence doesn't affect subscription ops.
-	if slotName, err := postgres.GetSubscriptionSlotName(s.ProviderNode, s.SubscriberNode).Scalar(ctx, conn); err == nil {
+	slotName, err := postgres.GetSubscriptionSlotName(s.ProviderNode, s.SubscriberNode).Scalar(ctx, conn)
+	if err == nil {
 		s.ActualSlotName = slotName
+	} else {
+		// Log at debug level; slot name is informational and absence doesn't block operations.
+		_ = err // intentionally ignored
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/database/subscription_resource.go` around lines 111 - 114,
The code currently swallows all errors from
postgres.GetSubscriptionSlotName(...).Scalar(ctx, conn) which can hide
connection or query issues; update the block in subscription_resource.go (the
call to postgres.GetSubscriptionSlotName with s.ProviderNode and
s.SubscriberNode) to keep the existing behavior of only setting s.ActualSlotName
on success but, on err != nil, emit a debug/trace log entry including the error
and contextual fields (e.g., provider/subscriber node and conn context) using
the existing logger in scope so failures are recorded for diagnostics without
changing control flow.
server/internal/workflows/failover.go (1)

172-221: ⚖️ Poor tradeoff

Consider extracting shared reconciliation logic.

The post-topology-change reconciliation block (lines 172-221) is nearly identical between Switchover and Failover. Extracting this into a shared helper method would reduce duplication and ensure consistent behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/workflows/failover.go` around lines 172 - 221, Extract the
duplicated post-topology-change reconciliation block from the Failover and
Switchover flows into a single helper method (e.g.,
reconcileAfterTopologyChange) that encapsulates the confirmation loop plus calls
to ExecuteRefreshCurrentState, ExecutePlanUpdate, persistPlans and applyPlans;
the helper should accept the workflow context, the input struct (Refresh/Plan
inputs: DatabaseID, TaskID, Variables, Spec/Current where needed), and
identifiers used for the candidate confirmation (candidateID and instanceToQuery
or a boolean flag to skip confirmation), return any error/summary and emit the
same logging; then replace the inlined block in both Switchover and Failover
with a call to this new helper to remove duplication and keep behavior
identical.
server/internal/workflows/switchover.go (1)

196-221: ⚡ Quick win

No logging when primary confirmation times out.

The polling loop exits silently when the 60s deadline passes without confirming the candidate as primary. The reconciliation then proceeds anyway, which could apply stale state. Consider logging a warning when the timeout is reached without confirmation.

💡 Suggested improvement to log timeout condition
 		for workflow.Now(ctx).Before(deadline) {
 			primaryOut, err := w.Activities.ExecuteGetPrimaryInstance(ctx, getPrimaryQueue, &activities.GetPrimaryInstanceInput{
 				DatabaseID: in.DatabaseID,
 				InstanceID: instanceToQuery,
 			}).Get(ctx)
 			if err == nil && strings.HasSuffix(candidateID, primaryOut.PrimaryInstanceID) {
 				logger.Info("candidate confirmed as primary; proceeding with reconciliation",
 					"candidate", candidateID)
 				break
 			}
 			if sleepErr := workflow.Sleep(ctx, confirmPoll); sleepErr != nil {
 				logger.Warn("context interrupted while waiting for primary confirmation")
 				break
 			}
 		}
+		if !workflow.Now(ctx).Before(deadline) {
+			logger.Warn("primary confirmation timed out; proceeding with reconciliation anyway",
+				"candidate", candidateID, "timeout", confirmTimeout)
+		}

 		logger.Info("running post-switchover reconciliation to update subscription DSNs")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/workflows/switchover.go` around lines 196 - 221, The polling
loop that confirms the candidate as primary should emit a warning when the 60s
confirmation deadline elapses; modify the polling code so that if the loop exits
due to timeout (no confirmed primary) you call logger.With("database_id",
in.DatabaseID, "task_id", in.TaskID).Warn("primary confirmation timeout;
proceeding with post-switchover reconciliation — subscription DSNs may be
stale") immediately before continuing to the existing post-switchover
reconciliation (i.e., before ExecuteRefreshCurrentState is invoked), and include
any timeout/error detail available from the poll so the subsequent calls
(ExecuteRefreshCurrentState, ExecutePlanUpdate, persistPlans, applyPlans) are
annotated with context.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/database/replication_slot_resource.go`:
- Around line 91-94: The code currently treats any error from
resource.FromContext[*SubscriptionResource](rc,
SubscriptionResourceIdentifier(...)) the same as an absent subscription; change
this to distinguish "not found" from real errors: call resource.FromContext as
before, then if err != nil inspect whether it's the expected not-found error
(the sentinel or type returned by resource package for missing resources) and
return nil only in that case, but for other errors return or propagate the error
so configuration/type issues surface; keep the existing empty ActualSlotName
check (sub.ActualSlotName == "") as a separate no-op condition.

In `@server/internal/postgres/create_db.go`:
- Around line 454-460: The FindSpockSlot function's SQL must be scoped to the
current database to avoid matching slots from other databases; update the Query
in FindSpockSlot to add a "database = current_database()" predicate to the WHERE
clause (keeping the existing plugin and RIGHT(slot_name...) conditions) so the
lookup is deterministic and tied to the current DB while leaving Args and
function signature unchanged.

In `@server/internal/workflows/switchover.go`:
- Around line 180-194: Replace the fuzzy match using strings.HasSuffix with an
exact equality check when confirming the candidate instance is primary: in the
switchover loop where you call w.Activities.ExecuteGetPrimaryInstance and
compare candidateID to primaryOut.PrimaryInstanceID, change the condition from
strings.HasSuffix(candidateID, primaryOut.PrimaryInstanceID) to candidateID ==
primaryOut.PrimaryInstanceID so the full instance IDs are compared exactly;
apply the same exact-equality change in the analogous comparison in
server/internal/workflows/failover.go as well.

---

Nitpick comments:
In `@e2e/database_test.go`:
- Around line 396-404: Extract the hardcoded 120 into a named constant (e.g.
waitForSyncEventTimeoutSecs or waitForSyncEventTimeout) and use it in both the
SQL call to spock.wait_for_sync_event and the assertion message; locate the
block that calls row := conn.QueryRow(ctx, "CALL spock.wait_for_sync_event(true,
$1, $2::pg_lsn, 120);", peerNode, lsn) and the subsequent assert.Truef that
prints "within 120s", replace the literal with the constant (or a time.Duration
converted to seconds for the message) so the timeout value stays in sync across
the QueryRow call and the assertion formatting.

In `@e2e/failover_test.go`:
- Around line 99-114: Extract the duplicated helpers waitForPrimaryChange,
waitForPrimaryIs, waitForReadyReplica, and getPrimaryInstanceID into a single
shared test helper file (e.g., topology_helpers_test.go) and update the tests to
call those shared functions; make the helpers package-level (unexported)
functions that accept the same parameters they currently use (e.g., t
*testing.T, ctx context.Context, db *YourDBType or whatever the test passes) so
existing call sites don’t need signature changes, preserve the elapsed/periodic
logging behavior and return values, remove the duplicated implementations from
failover_test.go, switchover_test.go, and multi_node_replication_test.go, and
ensure the new helper file imports testing, time, and any other packages used by
the helpers.

In `@e2e/multi_node_replication_test.go`:
- Around line 19-27: The 25-minute outer context in
TestMultiNodeReplicationHealth is too short for six sequential topology subtests
(each calling assertAllNodesReplicationHealthy and WaitForReplication); extend
the context timeout created by context.WithTimeout to 35–45 minutes (e.g., 40
minutes) or alternatively reduce the per-helper timeouts used inside
assertAllNodesReplicationHealthy and the WaitForReplication call to keep total
runtime under the existing 25-minute limit; update the timeout value where
context.WithTimeout is called in TestMultiNodeReplicationHealth and/or adjust
the helper timeout constants used by assertAllNodesReplicationHealthy and
WaitForReplication accordingly.
- Around line 282-332: The test creates a repl_integrity table on n1 but never
drops it and assumes DDL replication; update the subtest to (1) register a
t.Cleanup that drops the repl_integrity table (use the same db.WithConnection +
connOpts("n1") pattern and DROP TABLE IF EXISTS repl_integrity) so the table is
removed after the test, and (2) add an explicit assertion after
db.WaitForReplication that the relation exists on each node (use
db.WithConnection + connOpts(node.Name) and a simple query like SELECT to
confirm the table exists or fail with a clear message) to make the
DDL-replication contract explicit (references: repl_integrity, CREATE TABLE IF
NOT EXISTS repl_integrity, t.Cleanup, db.WithConnection, connOpts,
db.WaitForReplication).
- Around line 63-71: The test currently ignores the boolean result of waitFor
that polls db.Refresh over db.Instances, which can let the test proceed while
instances remain "creating" or "modifying"; change the call so the test asserts
the result (e.g., use require.Truef(t, waitFor(...), "cluster did not settle
within 90s")) so that if waitFor returns false the test fails immediately with a
clear message referencing the cluster did not finish provisioning; keep the same
waitFor predicate (calling db.Refresh(ctx) and checking inst.State) and the
90*time.Second timeout but wrap it with the require.Truef assertion using the
test handle t.

In `@e2e/replication_health_test.go`:
- Around line 305-341: assertAllNodesReplicationHealthy currently gives each
helper (assertReplicationSlotsHealthy, assertSpockSubscriptionsHealthy,
assertNoStaleSlots, assertReplicationLagNearZero) the full timeout, causing
worst-case wall time to multiply by nodes×helpers; change it to derive a shared
deadline from the provided timeout (e.g., ctx, cancel :=
context.WithTimeout(ctx, timeout) or use context.WithDeadline) and then compute
per-node or per-helper sub-timeouts from the remaining time before the deadline,
or pass the derived ctx with a per-helper timeout based on remaining time
(remaining / remainingHelpers) to each helper so the total work cannot exceed
the overall timeout budget. Ensure you pass the derived contexts into the helper
calls in assertAllNodesReplicationHealthy rather than the original raw timeout
value.
- Around line 182-226: assertNoStaleSlots currently returns silently on
ConnectToInstance errors which can mask persistent failures; change it to retry
the connection for a short budget (e.g., a few seconds/attempts) similar to
other helpers that use waitFor. Specifically, wrap the db.ConnectToInstance call
in a small retry loop or use the existing waitFor helper to attempt
ConnectToInstance for a short timeout before giving up and calling t.Fatalf or
require.NoError, so that persistent inability to reach the primary surfaces as a
test failure; keep the rest of the function (rows query/scan/assertions)
unchanged and reference assertNoStaleSlots and ConnectToInstance when making the
change.
- Around line 87-89: The checks that do "if rows.Err() != nil { return false }"
silently drop the driver error; update each occurrence (including the ones
inside assertSpockSubscriptionsHealthy and assertReplicationLagNearZero) to log
the actual rows.Err() before returning false so test output shows the root
cause. Locate the rows.Err() checks in the helper(s) and change them to call the
test logger (e.g., t.Logf or the file's existing logger) with a clear message
including rows.Err() (matching the style used for connect/query/scan failures),
then return false.

In `@e2e/switchover_test.go`:
- Around line 252-291: The test schedules a switchover far in the future (+5m)
then cancels and immediately asserts the primary unchanged, which only verifies
cancellation API success not that the task never executed; either (A) make this
an actual suppression test by changing scheduledAt to something soon (e.g.,
srvNow.Add(30*time.Second) in the t.Run("cancel scheduled switchover" block) and
after calling fixture.Client.CancelDatabaseTask wait until after scheduledAt
(e.g., sleep or loop until time.Now().After(scheduledAt.Add(10*time.Second)))
before calling db.Refresh() and asserting getPrimaryInstanceID() equals
origPrimary, or (B) if you want to keep the long schedule, add a short
clarifying comment above the assertions stating this is only a cancel-API smoke
test and not an end-to-end suppression verification; update the code around
serverNowUTC, scheduledAt, fixture.Client.SwitchoverDatabaseNode,
CancelDatabaseTask, WaitForDatabaseTask, and db.Refresh accordingly.

In `@server/internal/database/subscription_resource.go`:
- Around line 111-114: The code currently swallows all errors from
postgres.GetSubscriptionSlotName(...).Scalar(ctx, conn) which can hide
connection or query issues; update the block in subscription_resource.go (the
call to postgres.GetSubscriptionSlotName with s.ProviderNode and
s.SubscriberNode) to keep the existing behavior of only setting s.ActualSlotName
on success but, on err != nil, emit a debug/trace log entry including the error
and contextual fields (e.g., provider/subscriber node and conn context) using
the existing logger in scope so failures are recorded for diagnostics without
changing control flow.

In `@server/internal/workflows/failover.go`:
- Around line 172-221: Extract the duplicated post-topology-change
reconciliation block from the Failover and Switchover flows into a single helper
method (e.g., reconcileAfterTopologyChange) that encapsulates the confirmation
loop plus calls to ExecuteRefreshCurrentState, ExecutePlanUpdate, persistPlans
and applyPlans; the helper should accept the workflow context, the input struct
(Refresh/Plan inputs: DatabaseID, TaskID, Variables, Spec/Current where needed),
and identifiers used for the candidate confirmation (candidateID and
instanceToQuery or a boolean flag to skip confirmation), return any
error/summary and emit the same logging; then replace the inlined block in both
Switchover and Failover with a call to this new helper to remove duplication and
keep behavior identical.

In `@server/internal/workflows/switchover.go`:
- Around line 196-221: The polling loop that confirms the candidate as primary
should emit a warning when the 60s confirmation deadline elapses; modify the
polling code so that if the loop exits due to timeout (no confirmed primary) you
call logger.With("database_id", in.DatabaseID, "task_id",
in.TaskID).Warn("primary confirmation timeout; proceeding with post-switchover
reconciliation — subscription DSNs may be stale") immediately before continuing
to the existing post-switchover reconciliation (i.e., before
ExecuteRefreshCurrentState is invoked), and include any timeout/error detail
available from the poll so the subsequent calls (ExecuteRefreshCurrentState,
ExecutePlanUpdate, persistPlans, applyPlans) are annotated with context.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d599db3c-a0dd-4262-9ccc-f2b8b0e58097

📥 Commits

Reviewing files that changed from the base of the PR and between 00fd2d7 and e173c76.

📒 Files selected for processing (11)
  • e2e/database_test.go
  • e2e/failover_test.go
  • e2e/multi_node_replication_test.go
  • e2e/replication_health_test.go
  • e2e/switchover_test.go
  • server/internal/api/apiv1/post_init_handlers.go
  • server/internal/database/replication_slot_resource.go
  • server/internal/database/subscription_resource.go
  • server/internal/postgres/create_db.go
  • server/internal/workflows/failover.go
  • server/internal/workflows/switchover.go

Comment thread server/internal/database/replication_slot_resource.go Outdated
Comment thread server/internal/postgres/create_db.go Outdated
Comment thread server/internal/workflows/switchover.go Outdated
@tsivaprasad tsivaprasad force-pushed the PLAT-612-enhance-failover-switchover-e-2-e-test-to-validate-replication branch from 84e2c15 to 8abb644 Compare May 25, 2026 06:20
@tsivaprasad tsivaprasad changed the title fix: repair Spock slots/subscriptions post-switchover; add replication health e2e tests test(e2e): add WaitForReplication after first switchover/failover May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant