test(e2e): add WaitForReplication after first switchover/failover#390
test(e2e): add WaitForReplication after first switchover/failover#390tsivaprasad wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThe PR adds replication health verification to two E2E test scenarios. Both the failover and switchover tests now wait for replication to become healthy after confirming a primary change in their automatic candidate selection flows, ensuring consistent state before subsequent assertions execute. ChangesE2E Test Replication Health Verification
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (12)
e2e/multi_node_replication_test.go (3)
19-27: 💤 Low value25-minute deadline may be too tight for six sequential topology subtests.
The outer context is 25 minutes, and each topology subtest performs a switchover/failover followed by
assertAllNodesReplicationHealthy(which can spend up to ~9 minutes per call given the per-helper 90s timeout for a two-node cluster) plus aWaitForReplication(now 120s). Under slow CI or post-topology stalls, the cumulative budget can exceed 25 minutes and the failures will surface as context-cancellation errors rather than the actual condition that hung. Consider bumping to 35–45 minutes or tightening the per-helper timeouts insideassertAllNodesReplicationHealthy.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/multi_node_replication_test.go` around lines 19 - 27, The 25-minute outer context in TestMultiNodeReplicationHealth is too short for six sequential topology subtests (each calling assertAllNodesReplicationHealthy and WaitForReplication); extend the context timeout created by context.WithTimeout to 35–45 minutes (e.g., 40 minutes) or alternatively reduce the per-helper timeouts used inside assertAllNodesReplicationHealthy and the WaitForReplication call to keep total runtime under the existing 25-minute limit; update the timeout value where context.WithTimeout is called in TestMultiNodeReplicationHealth and/or adjust the helper timeout constants used by assertAllNodesReplicationHealthy and WaitForReplication accordingly.
282-332: 💤 Low value
repl_integritytable is created but never dropped, and DDL replication is assumed.Two concerns worth verifying:
CREATE TABLE IF NOT EXISTS repl_integrityruns on the n1 primary, and the subsequentCOUNT(*)runs on every node's primary including n2. This only works if your Spock setup auto-replicates DDL (e.g.,spock.replicate_ddl_commandis on or DDL is in a replication set). If a future configuration change disables that, this subtest will fail with a confusing "relation does not exist" instead of a clear contract violation — worth adding an explicit assertion or a comment.- The table is left behind in the test database. Since the fixture is per-test it's harmless today, but if anyone later reuses the fixture across tests the
ON CONFLICTupsert hides leaked state. At.Cleanupthat drops the table would make the subtest self-contained.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/multi_node_replication_test.go` around lines 282 - 332, The test creates a repl_integrity table on n1 but never drops it and assumes DDL replication; update the subtest to (1) register a t.Cleanup that drops the repl_integrity table (use the same db.WithConnection + connOpts("n1") pattern and DROP TABLE IF EXISTS repl_integrity) so the table is removed after the test, and (2) add an explicit assertion after db.WaitForReplication that the relation exists on each node (use db.WithConnection + connOpts(node.Name) and a simple query like SELECT to confirm the table exists or fail with a clear message) to make the DDL-replication contract explicit (references: repl_integrity, CREATE TABLE IF NOT EXISTS repl_integrity, t.Cleanup, db.WithConnection, connOpts, db.WaitForReplication).
63-71: ⚡ Quick winIgnoring the
waitForresult lets the test proceed against a still-creating/modifyingcluster.If the 90s budget elapses without all instances settling,
waitForreturnsfalseand the test silently continues into the baselineWaitForReplicationand topology subtests. The downstream failure will then be obscure (timeout in some assertion deep inside a subtest) instead of a clear “cluster did not finish provisioning”. Arequire.Truef(t, waitFor(...), "cluster did not settle within 90s")makes the failure mode actionable.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/multi_node_replication_test.go` around lines 63 - 71, The test currently ignores the boolean result of waitFor that polls db.Refresh over db.Instances, which can let the test proceed while instances remain "creating" or "modifying"; change the call so the test asserts the result (e.g., use require.Truef(t, waitFor(...), "cluster did not settle within 90s")) so that if waitFor returns false the test fails immediately with a clear message referencing the cluster did not finish provisioning; keep the same waitFor predicate (calling db.Refresh(ctx) and checking inst.State) and the 90*time.Second timeout but wrap it with the require.Truef assertion using the test handle t.e2e/database_test.go (1)
396-404: 💤 Low valueExtract the 120-second budget into a constant.
The
120literal appears both in thewait_for_sync_eventSQL argument and in the assertion message. They must stay in lockstep; a smallconst waitForSyncEventTimeoutSecs = 120(or atime.Duration) used in both places would prevent silent drift if the budget is tuned again later.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/database_test.go` around lines 396 - 404, Extract the hardcoded 120 into a named constant (e.g. waitForSyncEventTimeoutSecs or waitForSyncEventTimeout) and use it in both the SQL call to spock.wait_for_sync_event and the assertion message; locate the block that calls row := conn.QueryRow(ctx, "CALL spock.wait_for_sync_event(true, $1, $2::pg_lsn, 120);", peerNode, lsn) and the subsequent assert.Truef that prints "within 120s", replace the literal with the constant (or a time.Duration converted to seconds for the message) so the timeout value stays in sync across the QueryRow call and the assertion formatting.e2e/replication_health_test.go (3)
305-341: 💤 Low value
timeoutis applied per-helper-per-node, so worst-case wall time scales withnodes × helpers × timeout.
assertAllNodesReplicationHealthyinvokes three polling helpers per node, each with its own fulltimeoutbudget (currently 90s). For a 2-node cluster this is up to2 × 3 × 90s = 9mper call, and the helper is called several times per scenario in the new multi-node test inside an outer 25m budget. Consider deriving per-helper timeouts from a shared deadline (e.g.,context.WithTimeoutderived fromtimeout) or shrinking individual budgets so a slow path can't silently consume most of the test budget before failing.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/replication_health_test.go` around lines 305 - 341, assertAllNodesReplicationHealthy currently gives each helper (assertReplicationSlotsHealthy, assertSpockSubscriptionsHealthy, assertNoStaleSlots, assertReplicationLagNearZero) the full timeout, causing worst-case wall time to multiply by nodes×helpers; change it to derive a shared deadline from the provided timeout (e.g., ctx, cancel := context.WithTimeout(ctx, timeout) or use context.WithDeadline) and then compute per-node or per-helper sub-timeouts from the remaining time before the deadline, or pass the derived ctx with a per-helper timeout based on remaining time (remaining / remainingHelpers) to each helper so the total work cannot exceed the overall timeout budget. Ensure you pass the derived contexts into the helper calls in assertAllNodesReplicationHealthy rather than the original raw timeout value.
182-226: ⚡ Quick winSkipping on connect failure can mask a real regression.
assertNoStaleSlotsis the only helper here that silently returns when the primary connection fails (the others retry underwaitFor). After a failover the primary may indeed be transiently unreachable, but if it stays unreachable for the duration of the test you'll never know — the assertion logs and returns successfully. Consider giving this helper a short retry budget (similar to the other helpers) so that a persistent connect failure surfaces as a test failure instead of a silent pass.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/replication_health_test.go` around lines 182 - 226, assertNoStaleSlots currently returns silently on ConnectToInstance errors which can mask persistent failures; change it to retry the connection for a short budget (e.g., a few seconds/attempts) similar to other helpers that use waitFor. Specifically, wrap the db.ConnectToInstance call in a small retry loop or use the existing waitFor helper to attempt ConnectToInstance for a short timeout before giving up and calling t.Fatalf or require.NoError, so that persistent inability to reach the primary surfaces as a test failure; keep the rest of the function (rows query/scan/assertions) unchanged and reference assertNoStaleSlots and ConnectToInstance when making the change.
87-89: ⚡ Quick winSilent
rows.Err()failures hurt diagnostics.When
rows.Err()is non-nil the helper returnsfalseand retries with no log line, so a persistent driver-level error (e.g., connection reset mid-scan) will only surface as the generic “did not become healthy within …” timeout. Mirror the logging done for connect/query/scan failures so the actual error is visible in the test output.Proposed diff
- if rows.Err() != nil { - return false - } + if err := rows.Err(); err != nil { + t.Logf("[replication-slots][%s] rows iteration failed: %v", nodeName, err) + return false + }(apply the equivalent change in
assertSpockSubscriptionsHealthyandassertReplicationLagNearZero.)Also applies to: 159-161, 292-294
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/replication_health_test.go` around lines 87 - 89, The checks that do "if rows.Err() != nil { return false }" silently drop the driver error; update each occurrence (including the ones inside assertSpockSubscriptionsHealthy and assertReplicationLagNearZero) to log the actual rows.Err() before returning false so test output shows the root cause. Locate the rows.Err() checks in the helper(s) and change them to call the test logger (e.g., t.Logf or the file's existing logger) with a clear message including rows.Err() (matching the style used for connect/query/scan failures), then return false.e2e/failover_test.go (1)
99-114: ⚡ Quick win
waitForPrimaryChange/waitForPrimaryIsduplicated in three test files.These two helpers (plus
waitForReadyReplicaandgetPrimaryInstanceID) are now nearly identical infailover_test.go,switchover_test.go, andmulti_node_replication_test.go. Hoisting them to a shared helper (inreplication_health_test.goor a newtopology_helpers_test.go) would keep their logging conventions and edge-case behavior in one place — useful since the elapsed-time + periodic-log treatment was just added in three places at once.Also applies to: 116-131
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/failover_test.go` around lines 99 - 114, Extract the duplicated helpers waitForPrimaryChange, waitForPrimaryIs, waitForReadyReplica, and getPrimaryInstanceID into a single shared test helper file (e.g., topology_helpers_test.go) and update the tests to call those shared functions; make the helpers package-level (unexported) functions that accept the same parameters they currently use (e.g., t *testing.T, ctx context.Context, db *YourDBType or whatever the test passes) so existing call sites don’t need signature changes, preserve the elapsed/periodic logging behavior and return values, remove the duplicated implementations from failover_test.go, switchover_test.go, and multi_node_replication_test.go, and ensure the new helper file imports testing, time, and any other packages used by the helpers.e2e/switchover_test.go (1)
252-291: 💤 Low valueCancel-verification doesn't wait past
scheduledAt, so it only proves the task was marked canceled, not that execution was prevented.The subtest cancels well before
scheduledAt(+5m) and checks immediately that the primary is unchanged. Since the scheduler wouldn't have fired yet anyway, this confirms the cancel API succeeded but doesn't actually exercise the “canceled task never executes” guarantee. If you want this subtest to also catch a regression where a canceled task still fires, either schedule much sooner (e.g., +30s) and wait pastscheduledAtbefore re-checking the primary, or keep this as-is and add a brief comment that it's a cancel-API smoke test rather than an end-to-end suppression test.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/switchover_test.go` around lines 252 - 291, The test schedules a switchover far in the future (+5m) then cancels and immediately asserts the primary unchanged, which only verifies cancellation API success not that the task never executed; either (A) make this an actual suppression test by changing scheduledAt to something soon (e.g., srvNow.Add(30*time.Second) in the t.Run("cancel scheduled switchover" block) and after calling fixture.Client.CancelDatabaseTask wait until after scheduledAt (e.g., sleep or loop until time.Now().After(scheduledAt.Add(10*time.Second))) before calling db.Refresh() and asserting getPrimaryInstanceID() equals origPrimary, or (B) if you want to keep the long schedule, add a short clarifying comment above the assertions stating this is only a cancel-API smoke test and not an end-to-end suppression verification; update the code around serverNowUTC, scheduledAt, fixture.Client.SwitchoverDatabaseNode, CancelDatabaseTask, WaitForDatabaseTask, and db.Refresh accordingly.server/internal/database/subscription_resource.go (1)
111-114: 💤 Low valueSilently ignoring all errors may hide connection or query issues.
The comment says errors are ignored because slot name is informational, but this also swallows connection failures or SQL syntax errors that could indicate deeper problems. Consider logging at debug/trace level when an error occurs.
💡 Suggested improvement to log errors at debug level
// Errors ignored: slot name is informational; absence doesn't affect subscription ops. - if slotName, err := postgres.GetSubscriptionSlotName(s.ProviderNode, s.SubscriberNode).Scalar(ctx, conn); err == nil { + slotName, err := postgres.GetSubscriptionSlotName(s.ProviderNode, s.SubscriberNode).Scalar(ctx, conn) + if err == nil { s.ActualSlotName = slotName + } else { + // Log at debug level; slot name is informational and absence doesn't block operations. + _ = err // intentionally ignored }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/database/subscription_resource.go` around lines 111 - 114, The code currently swallows all errors from postgres.GetSubscriptionSlotName(...).Scalar(ctx, conn) which can hide connection or query issues; update the block in subscription_resource.go (the call to postgres.GetSubscriptionSlotName with s.ProviderNode and s.SubscriberNode) to keep the existing behavior of only setting s.ActualSlotName on success but, on err != nil, emit a debug/trace log entry including the error and contextual fields (e.g., provider/subscriber node and conn context) using the existing logger in scope so failures are recorded for diagnostics without changing control flow.server/internal/workflows/failover.go (1)
172-221: ⚖️ Poor tradeoffConsider extracting shared reconciliation logic.
The post-topology-change reconciliation block (lines 172-221) is nearly identical between
SwitchoverandFailover. Extracting this into a shared helper method would reduce duplication and ensure consistent behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/workflows/failover.go` around lines 172 - 221, Extract the duplicated post-topology-change reconciliation block from the Failover and Switchover flows into a single helper method (e.g., reconcileAfterTopologyChange) that encapsulates the confirmation loop plus calls to ExecuteRefreshCurrentState, ExecutePlanUpdate, persistPlans and applyPlans; the helper should accept the workflow context, the input struct (Refresh/Plan inputs: DatabaseID, TaskID, Variables, Spec/Current where needed), and identifiers used for the candidate confirmation (candidateID and instanceToQuery or a boolean flag to skip confirmation), return any error/summary and emit the same logging; then replace the inlined block in both Switchover and Failover with a call to this new helper to remove duplication and keep behavior identical.server/internal/workflows/switchover.go (1)
196-221: ⚡ Quick winNo logging when primary confirmation times out.
The polling loop exits silently when the 60s deadline passes without confirming the candidate as primary. The reconciliation then proceeds anyway, which could apply stale state. Consider logging a warning when the timeout is reached without confirmation.
💡 Suggested improvement to log timeout condition
for workflow.Now(ctx).Before(deadline) { primaryOut, err := w.Activities.ExecuteGetPrimaryInstance(ctx, getPrimaryQueue, &activities.GetPrimaryInstanceInput{ DatabaseID: in.DatabaseID, InstanceID: instanceToQuery, }).Get(ctx) if err == nil && strings.HasSuffix(candidateID, primaryOut.PrimaryInstanceID) { logger.Info("candidate confirmed as primary; proceeding with reconciliation", "candidate", candidateID) break } if sleepErr := workflow.Sleep(ctx, confirmPoll); sleepErr != nil { logger.Warn("context interrupted while waiting for primary confirmation") break } } + if !workflow.Now(ctx).Before(deadline) { + logger.Warn("primary confirmation timed out; proceeding with reconciliation anyway", + "candidate", candidateID, "timeout", confirmTimeout) + } logger.Info("running post-switchover reconciliation to update subscription DSNs")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/workflows/switchover.go` around lines 196 - 221, The polling loop that confirms the candidate as primary should emit a warning when the 60s confirmation deadline elapses; modify the polling code so that if the loop exits due to timeout (no confirmed primary) you call logger.With("database_id", in.DatabaseID, "task_id", in.TaskID).Warn("primary confirmation timeout; proceeding with post-switchover reconciliation — subscription DSNs may be stale") immediately before continuing to the existing post-switchover reconciliation (i.e., before ExecuteRefreshCurrentState is invoked), and include any timeout/error detail available from the poll so the subsequent calls (ExecuteRefreshCurrentState, ExecutePlanUpdate, persistPlans, applyPlans) are annotated with context.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/internal/database/replication_slot_resource.go`:
- Around line 91-94: The code currently treats any error from
resource.FromContext[*SubscriptionResource](rc,
SubscriptionResourceIdentifier(...)) the same as an absent subscription; change
this to distinguish "not found" from real errors: call resource.FromContext as
before, then if err != nil inspect whether it's the expected not-found error
(the sentinel or type returned by resource package for missing resources) and
return nil only in that case, but for other errors return or propagate the error
so configuration/type issues surface; keep the existing empty ActualSlotName
check (sub.ActualSlotName == "") as a separate no-op condition.
In `@server/internal/postgres/create_db.go`:
- Around line 454-460: The FindSpockSlot function's SQL must be scoped to the
current database to avoid matching slots from other databases; update the Query
in FindSpockSlot to add a "database = current_database()" predicate to the WHERE
clause (keeping the existing plugin and RIGHT(slot_name...) conditions) so the
lookup is deterministic and tied to the current DB while leaving Args and
function signature unchanged.
In `@server/internal/workflows/switchover.go`:
- Around line 180-194: Replace the fuzzy match using strings.HasSuffix with an
exact equality check when confirming the candidate instance is primary: in the
switchover loop where you call w.Activities.ExecuteGetPrimaryInstance and
compare candidateID to primaryOut.PrimaryInstanceID, change the condition from
strings.HasSuffix(candidateID, primaryOut.PrimaryInstanceID) to candidateID ==
primaryOut.PrimaryInstanceID so the full instance IDs are compared exactly;
apply the same exact-equality change in the analogous comparison in
server/internal/workflows/failover.go as well.
---
Nitpick comments:
In `@e2e/database_test.go`:
- Around line 396-404: Extract the hardcoded 120 into a named constant (e.g.
waitForSyncEventTimeoutSecs or waitForSyncEventTimeout) and use it in both the
SQL call to spock.wait_for_sync_event and the assertion message; locate the
block that calls row := conn.QueryRow(ctx, "CALL spock.wait_for_sync_event(true,
$1, $2::pg_lsn, 120);", peerNode, lsn) and the subsequent assert.Truef that
prints "within 120s", replace the literal with the constant (or a time.Duration
converted to seconds for the message) so the timeout value stays in sync across
the QueryRow call and the assertion formatting.
In `@e2e/failover_test.go`:
- Around line 99-114: Extract the duplicated helpers waitForPrimaryChange,
waitForPrimaryIs, waitForReadyReplica, and getPrimaryInstanceID into a single
shared test helper file (e.g., topology_helpers_test.go) and update the tests to
call those shared functions; make the helpers package-level (unexported)
functions that accept the same parameters they currently use (e.g., t
*testing.T, ctx context.Context, db *YourDBType or whatever the test passes) so
existing call sites don’t need signature changes, preserve the elapsed/periodic
logging behavior and return values, remove the duplicated implementations from
failover_test.go, switchover_test.go, and multi_node_replication_test.go, and
ensure the new helper file imports testing, time, and any other packages used by
the helpers.
In `@e2e/multi_node_replication_test.go`:
- Around line 19-27: The 25-minute outer context in
TestMultiNodeReplicationHealth is too short for six sequential topology subtests
(each calling assertAllNodesReplicationHealthy and WaitForReplication); extend
the context timeout created by context.WithTimeout to 35–45 minutes (e.g., 40
minutes) or alternatively reduce the per-helper timeouts used inside
assertAllNodesReplicationHealthy and the WaitForReplication call to keep total
runtime under the existing 25-minute limit; update the timeout value where
context.WithTimeout is called in TestMultiNodeReplicationHealth and/or adjust
the helper timeout constants used by assertAllNodesReplicationHealthy and
WaitForReplication accordingly.
- Around line 282-332: The test creates a repl_integrity table on n1 but never
drops it and assumes DDL replication; update the subtest to (1) register a
t.Cleanup that drops the repl_integrity table (use the same db.WithConnection +
connOpts("n1") pattern and DROP TABLE IF EXISTS repl_integrity) so the table is
removed after the test, and (2) add an explicit assertion after
db.WaitForReplication that the relation exists on each node (use
db.WithConnection + connOpts(node.Name) and a simple query like SELECT to
confirm the table exists or fail with a clear message) to make the
DDL-replication contract explicit (references: repl_integrity, CREATE TABLE IF
NOT EXISTS repl_integrity, t.Cleanup, db.WithConnection, connOpts,
db.WaitForReplication).
- Around line 63-71: The test currently ignores the boolean result of waitFor
that polls db.Refresh over db.Instances, which can let the test proceed while
instances remain "creating" or "modifying"; change the call so the test asserts
the result (e.g., use require.Truef(t, waitFor(...), "cluster did not settle
within 90s")) so that if waitFor returns false the test fails immediately with a
clear message referencing the cluster did not finish provisioning; keep the same
waitFor predicate (calling db.Refresh(ctx) and checking inst.State) and the
90*time.Second timeout but wrap it with the require.Truef assertion using the
test handle t.
In `@e2e/replication_health_test.go`:
- Around line 305-341: assertAllNodesReplicationHealthy currently gives each
helper (assertReplicationSlotsHealthy, assertSpockSubscriptionsHealthy,
assertNoStaleSlots, assertReplicationLagNearZero) the full timeout, causing
worst-case wall time to multiply by nodes×helpers; change it to derive a shared
deadline from the provided timeout (e.g., ctx, cancel :=
context.WithTimeout(ctx, timeout) or use context.WithDeadline) and then compute
per-node or per-helper sub-timeouts from the remaining time before the deadline,
or pass the derived ctx with a per-helper timeout based on remaining time
(remaining / remainingHelpers) to each helper so the total work cannot exceed
the overall timeout budget. Ensure you pass the derived contexts into the helper
calls in assertAllNodesReplicationHealthy rather than the original raw timeout
value.
- Around line 182-226: assertNoStaleSlots currently returns silently on
ConnectToInstance errors which can mask persistent failures; change it to retry
the connection for a short budget (e.g., a few seconds/attempts) similar to
other helpers that use waitFor. Specifically, wrap the db.ConnectToInstance call
in a small retry loop or use the existing waitFor helper to attempt
ConnectToInstance for a short timeout before giving up and calling t.Fatalf or
require.NoError, so that persistent inability to reach the primary surfaces as a
test failure; keep the rest of the function (rows query/scan/assertions)
unchanged and reference assertNoStaleSlots and ConnectToInstance when making the
change.
- Around line 87-89: The checks that do "if rows.Err() != nil { return false }"
silently drop the driver error; update each occurrence (including the ones
inside assertSpockSubscriptionsHealthy and assertReplicationLagNearZero) to log
the actual rows.Err() before returning false so test output shows the root
cause. Locate the rows.Err() checks in the helper(s) and change them to call the
test logger (e.g., t.Logf or the file's existing logger) with a clear message
including rows.Err() (matching the style used for connect/query/scan failures),
then return false.
In `@e2e/switchover_test.go`:
- Around line 252-291: The test schedules a switchover far in the future (+5m)
then cancels and immediately asserts the primary unchanged, which only verifies
cancellation API success not that the task never executed; either (A) make this
an actual suppression test by changing scheduledAt to something soon (e.g.,
srvNow.Add(30*time.Second) in the t.Run("cancel scheduled switchover" block) and
after calling fixture.Client.CancelDatabaseTask wait until after scheduledAt
(e.g., sleep or loop until time.Now().After(scheduledAt.Add(10*time.Second)))
before calling db.Refresh() and asserting getPrimaryInstanceID() equals
origPrimary, or (B) if you want to keep the long schedule, add a short
clarifying comment above the assertions stating this is only a cancel-API smoke
test and not an end-to-end suppression verification; update the code around
serverNowUTC, scheduledAt, fixture.Client.SwitchoverDatabaseNode,
CancelDatabaseTask, WaitForDatabaseTask, and db.Refresh accordingly.
In `@server/internal/database/subscription_resource.go`:
- Around line 111-114: The code currently swallows all errors from
postgres.GetSubscriptionSlotName(...).Scalar(ctx, conn) which can hide
connection or query issues; update the block in subscription_resource.go (the
call to postgres.GetSubscriptionSlotName with s.ProviderNode and
s.SubscriberNode) to keep the existing behavior of only setting s.ActualSlotName
on success but, on err != nil, emit a debug/trace log entry including the error
and contextual fields (e.g., provider/subscriber node and conn context) using
the existing logger in scope so failures are recorded for diagnostics without
changing control flow.
In `@server/internal/workflows/failover.go`:
- Around line 172-221: Extract the duplicated post-topology-change
reconciliation block from the Failover and Switchover flows into a single helper
method (e.g., reconcileAfterTopologyChange) that encapsulates the confirmation
loop plus calls to ExecuteRefreshCurrentState, ExecutePlanUpdate, persistPlans
and applyPlans; the helper should accept the workflow context, the input struct
(Refresh/Plan inputs: DatabaseID, TaskID, Variables, Spec/Current where needed),
and identifiers used for the candidate confirmation (candidateID and
instanceToQuery or a boolean flag to skip confirmation), return any
error/summary and emit the same logging; then replace the inlined block in both
Switchover and Failover with a call to this new helper to remove duplication and
keep behavior identical.
In `@server/internal/workflows/switchover.go`:
- Around line 196-221: The polling loop that confirms the candidate as primary
should emit a warning when the 60s confirmation deadline elapses; modify the
polling code so that if the loop exits due to timeout (no confirmed primary) you
call logger.With("database_id", in.DatabaseID, "task_id",
in.TaskID).Warn("primary confirmation timeout; proceeding with post-switchover
reconciliation — subscription DSNs may be stale") immediately before continuing
to the existing post-switchover reconciliation (i.e., before
ExecuteRefreshCurrentState is invoked), and include any timeout/error detail
available from the poll so the subsequent calls (ExecuteRefreshCurrentState,
ExecutePlanUpdate, persistPlans, applyPlans) are annotated with context.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d599db3c-a0dd-4262-9ccc-f2b8b0e58097
📒 Files selected for processing (11)
e2e/database_test.goe2e/failover_test.goe2e/multi_node_replication_test.goe2e/replication_health_test.goe2e/switchover_test.goserver/internal/api/apiv1/post_init_handlers.goserver/internal/database/replication_slot_resource.goserver/internal/database/subscription_resource.goserver/internal/postgres/create_db.goserver/internal/workflows/failover.goserver/internal/workflows/switchover.go
84e2c15 to
8abb644
Compare
Summary
This PR adds a
db.WaitForReplicationcall after the first automatic switchover inTestSwitchoverScenariosto verify replication recovery after a topology change, and also adds the same check after the first automatic failover inTestFailoverScenarios.Changes
db.WaitForReplicationafter the first automatic switchover inTestSwitchoverScenariosdb.WaitForReplicationafter the first automatic failover inTestFailoverScenariosTesting
Verification:
Checklist
PLAT-612