Skip to content

ACE-186: mtree init publication ordering + exclude metadata from spock#119

Open
danolivo wants to merge 8 commits into
mainfrom
ace-186
Open

ACE-186: mtree init publication ordering + exclude metadata from spock#119
danolivo wants to merge 8 commits into
mainfrom
ace-186

Conversation

@danolivo
Copy link
Copy Markdown
Contributor

@danolivo danolivo commented May 15, 2026

Summary

Fixes the SQLSTATE 42704 - publication "ace_mtree_pub" does not exist error during ace mtree table-diff, plus two BuildMtree bugs surfaced while testing.

Root causes

  1. Init ordering. MtreeInit was creating the replication slot before the publication was committed to WAL, so the slot's consistent_point preceded the publication. pgoutput's get_publication_oid then failed during change-callback replay. This is a pure pgoutput correctness issue that applies to any logical-replication setup, spock or vanilla.

  2. Spock DDL replication. ACE's CREATE TABLE / CREATE PUBLICATION / CREATE OR REPLACE FUNCTION statements on its private pgedge_ace schema were being DDL-replicated to peer nodes. That produced a cascade of cross-node interactions:

  • ace_cdc_metadata got auto-added to peer nodes' default repsets via spock's CREATE TABLE event trigger,
    leaking per-node LSNs cross-node and clobbering start_lsn.
  • CREATE OR REPLACE FUNCTION pgedge_ace.bytea_xor raced on pg_proc when n1's build replicated to n2 while
    n2's build ran the same statement (SQLSTATE XX000 "tuple concurrently updated").
  • n2's SetupPublication DROP+CREATE replicated back to n1 and landed in n1's pgoutput stream mid-replay,
    producing the original 42704.

Fix shape

Init-ordering split (commit 1): Split MtreeInit into three phases per node — Phase A commits schema + metadata table + publication and captures the publication's commit LSN; Phase B creates the slot (consistent point is now strictly after the publication); Phase C persists slot/start_lsn/pub_commit_lsn. A new pub_commit_lsn column (additive, ALTER TABLE ... ADD IF NOT EXISTS) lets processReplicationStream refuse to start a stream whose start_lsn predates it, surfacing an actionable error rather than the cryptic 42704. Empty pub_commit_lsn (legacy metadata rows) skips the check with a warning.

Spock DDL suppression (commit 5): Set spock.enable_ddl_replication = off and spock.include_ddl_repset = off as connection startup options on every ACE pool in applyRuntimeParams. ACE issues DDL exclusively against pgedge_ace and the ace_mtree_pub publication — never against user objects — so suppressing replication of its own DDL is correct unconditionally. Both GUCs are PGC_USERSET in spock so connection-level options override cluster defaults; both have dotted names so vanilla-PG instances without spock accept them as placeholder custom vars with no effect (dual-mode safe).

acquired conn to be released — but the tx's conn won't be released until the deferred tx.Rollback runs, and that defer cannot run until the function returns, which is blocked by pool.Close(). Classic defer-vs-explicit-close deadlock. Any failure inside BuildMtree's per-node body hung the process and swallowed the error. Wrapped the per-iteration body in a closure so defers are scoped per-iteration and fire in LIFO order (tx.Rollback first, then pool.Close).

BuildMtree redundant DDL (commit 4): createMtreeObjects was calling CreateSchema and CreateXORFunction at the start of every node's build, leftover from the original BuildMtree (July 2025, predating MtreeInit). MtreeInit's Phase A creates both, and BuildMtree implicitly requires init to have run. The redundancy was harmless on vanilla PG but racy on spock — the redundant CREATE OR REPLACE FUNCTION was the specific call hitting the XX000 race above. Removed both redundant calls. CREATE TABLE IF NOT EXISTS for ace_mtree_metadata stays — IF NOT EXISTS short-circuits without touching pg_class.

Regression coverage

tests/integration/cdc_init_ordering_test.go:

  • TestMtreeInitSlotIsAfterPublication — asserts the slot consistent point ≥ publication commit LSN on
    every node after init.
  • TestProcessReplicationStreamRejectsStaleStartLSN — corrupts start_lsn to a value below pub_commit_lsn
    and asserts UpdateFromCDC returns the new guard's actionable error.

Both run against the existing spock cluster fixture; no fixture changes required.

Upgrade notes

Existing broken installs are not auto-repaired. Users hitting the 42704 must run:

ace mtree teardown -d '<schema.table>'
ace mtree init -d
ace mtree build -d '<schema.table>'

If teardown itself fails, drop manually on every node:

DROP TABLE IF EXISTS pgedge_ace.ace_mtree__

CASCADE;
DROP TABLE IF EXISTS pgedge_ace.ace_cdc_metadata CASCADE;
DROP TABLE IF EXISTS pgedge_ace.ace_mtree_metadata CASCADE;
DROP PUBLICATION IF EXISTS ace_mtree_pub;
SELECT pg_drop_replication_slot('ace_mtree_slot') WHERE EXISTS (SELECT 1 FROM pg_replication_slots WHERE
slot_name = 'ace_mtree_slot');

--skip-cdc remains a valid interim workaround for pre-upgrade clusters.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds pub_commit_lsn to CDC metadata, captures publication commit LSN during node init, enforces a publication-commit ordering guard at replication stream startup, isolates metadata writes from Spock repsets via session-scoped repair mode and connection GUCs, and adds integration tests validating ordering and guards.

Changes

CDC Publication-Commit Ordering and Isolation

Layer / File(s) Summary
SQL Templates and Query Functions for Publication-Commit Tracking
db/queries/templates.go, db/queries/queries.go
New SQL templates and query functions add pub_commit_lsn support, InitCDCMetadata upsert with pub_commit_lsn, CurrentWalInsertLSN, and update getCDCMetadata and the metadata table schema to include pub_commit_lsn.
Node Initialization with Publication-Commit Capture
internal/consistency/mtree/merkle.go
MtreeInit delegates per-node setup to initOneNode, which captures pub_commit_lsn during publication creation, creates replication slots outside transactions, and persists metadata under session-scoped spock.repair_mode with cleanup on failures.
Publication-Commit Guard and Metadata-Update Behavior
internal/infra/cdc/listen.go
Stream processing now reads pub_commit_lsn, warns on legacy empty values, rejects stale start_lsn values older than pub_commit_lsn, and removes confirmed_flush_lsn-based backward-move logic.
Spock Helper Functions and Connection Params
internal/infra/cdc/setup.go, internal/infra/db/auth.go
Adds ExcludeMetadataFromSpockRepsets and SetSpockRepairMode helpers (no-ops if Spock absent) and injects Spock GUCs (spock.enable_ddl_replication=off, spock.include_ddl_repset=off) on ACE connections.
Test Updates and Integration Tests
tests/integration/cdc_busy_table_test.go, tests/integration/cdc_init_ordering_test.go
Adjusted call sites for expanded GetCDCMetadata signature and added integration tests asserting slot confirmed_flush_lsn >= pub_commit_lsn and that the replication stream rejects stale start_lsn relative to publication commit LSN.

Poem

Little rabbit hums beneath the nodes,
I mark the commit LSN where history grows.
Repair mode snug, the metadata stays near,
Repsets drop their tables, quiet and clear.
Tests hop in to prove the order here. 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'ACE-186: mtree init publication ordering + exclude metadata from spock' directly addresses the two root causes fixed: publication ordering in MtreeInit and excluding metadata from Spock replication, matching the primary changes across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description clearly explains the root causes (init ordering and Spock DDL replication), the fix approach across multiple commits, regression test coverage, and upgrade notes—all directly related to the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ace-186

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@danolivo danolivo requested a review from mason-sharp May 15, 2026 12:09
@danolivo danolivo changed the title fix: mtree init publication ordering + exclude metadata from spock ACE-186: mtree init publication ordering + exclude metadata from spock May 15, 2026
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 15, 2026

Up to standards ✅

🟢 Issues 2 medium

Results:
2 new issues

Category Results
Complexity 2 medium

View in Codacy

🟢 Metrics 8 complexity · 4 duplication

Metric Results
Complexity 8
Duplication 4

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
db/queries/templates.go (1)

376-387: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Split this multi-statement template into separate Exec calls.

The template emits CREATE TABLE ...; ALTER TABLE ... as a single SQL string. pgx v5 uses the extended protocol by default, which does not support multiple statements in a single Exec() call. This will fail at runtime. Either split into two separate templates with two Exec calls, or explicitly configure the connection to use QueryExecModeSimpleProtocol (not found in the codebase).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@db/queries/templates.go` around lines 376 - 387, The template
CreateCDCMetadataTable currently emits two SQL statements ("CREATE TABLE ...;"
and "ALTER TABLE ...;") in one string which fails under pgx v5 extended
protocol; split this into two separate templates and Exec calls: keep the CREATE
TABLE statement in CreateCDCMetadataTable (or a renamed
createAceCDCMetadataTable) and move the ALTER TABLE ... ADD COLUMN IF NOT EXISTS
pub_commit_lsn into a new template (e.g., AddPubCommitLsnToAceCDCMetadata) and
call Exec for each template separately, referencing {{aceSchema}} and
ace_cdc_metadata in both so they run sequentially at migration time.
🧹 Nitpick comments (1)
internal/infra/cdc/setup.go (1)

90-99: 🏗️ Heavy lift

Consider constraining the API signature to enforce session pinning.

All current call sites correctly use pinned transactions (tx from pool.Begin()), but the function signature accepts a generic queries.DBQuerier. While the wrapper at db/queries/queries.go:3007 documents that Set(true) and Set(false) must pair on the same session, the generic API allows future callers to pass a pool handle and inadvertently separate the toggles across sessions, poisoning one connection.

Consider adding compile-time enforcement (e.g., a type wrapper or interface constraint) to make session-pinning explicit and prevent misuse.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/infra/cdc/setup.go` around lines 90 - 99, SetSpockRepairMode
currently accepts a generic queries.DBQuerier which lets callers pass a pool
handle and accidentally toggle repair mode on different sessions; change the API
to require a session-bound type (e.g., accept the transaction/session interface
used by pool.Begin() such as a queries.DBTx or a new interface like
queries.SessionQuerier) so SetSpockRepairMode(ctx context.Context, session
queries.SessionQuerier, on bool) enforces compile-time session pinning; update
all callers to pass the tx returned from Begin() and call
queries.CheckSpockInstalled and queries.SetSpockRepairMode on that session so
Set(true)/Set(false) always occur on the same connection.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/consistency/mtree/merkle.go`:
- Around line 866-870: The call to queries.CurrentWalInsertLSN that sets
pubCommitLSN is happening while Phase A's transaction (tx) is still open; move
the WAL LSN capture so it occurs after Phase A commits and before Phase B starts
so the read reflects the commit LSN. Concretely: finish and commit the Phase A
transaction first, then call queries.CurrentWalInsertLSN (using an appropriate
context/connection outside the committed tx or a new read-only tx) to set
pubCommitLSN, and only then proceed to Phase B logic (the code that relies on
pubCommitLSN and uses tx/slot-guard checks).

In `@internal/infra/cdc/listen.go`:
- Around line 467-489: After calling SetSpockRepairMode(processingCtx, tx, true)
you must immediately schedule cleanup so session-scoped repair_mode is always
disabled; add defer SetSpockRepairMode(processingCtx, tx, false) right after the
successful true call (before calling queries.UpdateCDCMetadata or committing tx)
so any early return or error path will run the disable. Update or remove the
later explicit SetSpockRepairMode(..., false) calls as needed to avoid
double-disabling or handle their errors gracefully; ensure the deferred call
runs regardless of conn.Close or tx.Commit failures and continues to return the
connection to the pool in a neutral state.

---

Outside diff comments:
In `@db/queries/templates.go`:
- Around line 376-387: The template CreateCDCMetadataTable currently emits two
SQL statements ("CREATE TABLE ...;" and "ALTER TABLE ...;") in one string which
fails under pgx v5 extended protocol; split this into two separate templates and
Exec calls: keep the CREATE TABLE statement in CreateCDCMetadataTable (or a
renamed createAceCDCMetadataTable) and move the ALTER TABLE ... ADD COLUMN IF
NOT EXISTS pub_commit_lsn into a new template (e.g.,
AddPubCommitLsnToAceCDCMetadata) and call Exec for each template separately,
referencing {{aceSchema}} and ace_cdc_metadata in both so they run sequentially
at migration time.

---

Nitpick comments:
In `@internal/infra/cdc/setup.go`:
- Around line 90-99: SetSpockRepairMode currently accepts a generic
queries.DBQuerier which lets callers pass a pool handle and accidentally toggle
repair mode on different sessions; change the API to require a session-bound
type (e.g., accept the transaction/session interface used by pool.Begin() such
as a queries.DBTx or a new interface like queries.SessionQuerier) so
SetSpockRepairMode(ctx context.Context, session queries.SessionQuerier, on bool)
enforces compile-time session pinning; update all callers to pass the tx
returned from Begin() and call queries.CheckSpockInstalled and
queries.SetSpockRepairMode on that session so Set(true)/Set(false) always occur
on the same connection.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5811b76d-c669-4f27-abcf-7fa388daaab6

📥 Commits

Reviewing files that changed from the base of the PR and between a1b3388 and d0e483c.

📒 Files selected for processing (7)
  • db/queries/queries.go
  • db/queries/templates.go
  • internal/consistency/mtree/merkle.go
  • internal/infra/cdc/listen.go
  • internal/infra/cdc/setup.go
  • tests/integration/cdc_busy_table_test.go
  • tests/integration/cdc_init_ordering_test.go

Comment thread internal/consistency/mtree/merkle.go
Comment thread internal/infra/cdc/listen.go Outdated
@ibrarahmad ibrarahmad self-requested a review May 16, 2026 02:20
Copy link
Copy Markdown
Contributor

@ibrarahmad ibrarahmad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diagnosis and ordering invariant look sound; tests cover the right invariants. Notes inline. PR-level nit: the body says "two commits (fix-first, then test)" but the branch has four; either update the description or rebase back to two for a bisect-clean history.

Comment thread internal/infra/cdc/listen.go Outdated
Comment thread internal/consistency/mtree/merkle.go
Comment thread internal/infra/cdc/listen.go
Comment thread internal/infra/cdc/setup.go Outdated
Comment thread tests/integration/cdc_init_ordering_test.go Outdated
Comment thread db/queries/templates.go Outdated
danolivo pushed a commit that referenced this pull request May 19, 2026
…ming)

Round of fixes for the human review comments on PR #119:

- flushMetadata (listen.go:542): same pool-poisoning hole as Site 1.
  Port to the borrowed-*pgxpool.Conn pattern so Set(false) is deferred
  on the live session before Release. This site fires every 10 s on
  the periodic-flush path AND on shutdown, so it was the highest-risk
  remaining leak.

- MtreeInit Phase C (merkle.go:884): same hole. Borrow the conn,
  defer Set(false) on it, run the InitCDCMetadata write through a
  tx opened on the same conn.

- Guard scope comment (listen.go:103): clarify that the pub_commit_lsn
  guard is a tripwire for partial-state-on-upgrade and for cross-node
  leaks whose LSN lands below local pub_commit_lsn — NOT a general
  leak detector. The actual defenses against cross-node leak are
  repset exclusion + repair_mode wrapping every metadata write.
  Reader-of-the-future shouldn't mistake this for an end-to-end check.

- Test timing (cdc_init_ordering_test.go:186): bump the
  sentinel-propagation window from 3 s to 10 s. 3 s was tight under
  load on slow CI; 10 s gives Spock apply latency real headroom
  without making the test materially slower in the common case.

Two non-blocking comments deliberately deferred to follow-ups:
  - CheckSpockInstalled cache on the pool (~12 RTTs/min savings)
  - Hoist 'ace_cdc_metadata' relname into a shared constant

CodeRabbit's pubCommitLSN-timing finding remains skipped: the invariant
chain captured_mid_tx < commit_lsn(Phase A) <= slot.consistent_point
holds for freshly-created slots, which is the only state Phase B
produces. The hypothetical "broken slot whose consistent_point sits
between captured and commit" cannot arise from this code path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/infra/cdc/listen.go`:
- Around line 568-575: The deferred reset of spock.repair_mode in flushMetadata
currently uses the caller's cancelable ctx which can be canceled before defer
runs; change the defer to use a non-cancelable context (like the pattern in
processReplicationStream) so the reset always runs: keep the initial call
SetSpockRepairMode(ctx, c, true) as-is, then create a non-cancelable context
(e.g. ctxNoCancel := context.WithoutCancel(ctx)) and call
SetSpockRepairMode(ctxNoCancel, c, false) inside the defer; reference
SetSpockRepairMode and flushMetadata to locate and update the defer.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 582e9095-53ab-40a8-a71b-ea48d3c87de7

📥 Commits

Reviewing files that changed from the base of the PR and between d0e483c and 5862fa7.

📒 Files selected for processing (3)
  • internal/consistency/mtree/merkle.go
  • internal/infra/cdc/listen.go
  • tests/integration/cdc_init_ordering_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/cdc_init_ordering_test.go
  • internal/consistency/mtree/merkle.go

Comment thread internal/infra/cdc/listen.go Outdated
@danolivo danolivo force-pushed the ace-186 branch 3 times, most recently from dbd3724 to dcc6d9c Compare May 19, 2026 09:42
@danolivo danolivo requested a review from ibrarahmad May 19, 2026 10:06
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/cdc_init_ordering_test.go (1)

184-186: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale comment: mentions "3 s window" but code uses 10 seconds.

The comment references the original 3-second timeout, but the deadline was bumped to 10 seconds per prior review feedback.

📝 Suggested fix
 // assertSentinelDoesNotPropagate writes a sentinel start_lsn on n1 and
-// verifies it never appears on n2 within a 3 s window. Under bug 2 the
+// verifies it never appears on n2 within a 10 s window. Under bug 2 the
 // sentinel would propagate via Spock apply within seconds.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/cdc_init_ordering_test.go` around lines 184 - 186, Update
the stale comment above assertSentinelDoesNotPropagate to reflect the current 10
second timeout (instead of "3 s window"); locate the comment that mentions "3 s
window" in the assertSentinelDoesNotPropagate test and change it to state "10 s
window" or "10-second window" so it matches the test's actual deadline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tests/integration/cdc_init_ordering_test.go`:
- Around line 184-186: Update the stale comment above
assertSentinelDoesNotPropagate to reflect the current 10 second timeout (instead
of "3 s window"); locate the comment that mentions "3 s window" in the
assertSentinelDoesNotPropagate test and change it to state "10 s window" or
"10-second window" so it matches the test's actual deadline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c2d4b8b7-a44a-434f-9397-23eb7b69ae3e

📥 Commits

Reviewing files that changed from the base of the PR and between 5862fa7 and 5f50b9a.

📒 Files selected for processing (6)
  • db/queries/queries.go
  • db/queries/templates.go
  • internal/consistency/mtree/merkle.go
  • internal/infra/cdc/listen.go
  • internal/infra/cdc/setup.go
  • tests/integration/cdc_init_ordering_test.go

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@db/queries/queries.go`:
- Around line 2919-2928: GetCDCMetadata currently fails when the
ace_cdc_metadata table lacks pub_commit_lsn; modify the function
(GetCDCMetadata) to detect Postgres "undefined_column" (SQLSTATE 42703) from the
db.QueryRow(...).Scan(...) error and treat it as a non-fatal condition: if the
error is a Postgres error with Code == "42703" (use errors.As to check for the
driver-specific error type such as *pgconn.PgError or *pq.Error depending on the
codebase), return slotName, startLSN, tables and an empty pubCommitLSN with nil
error so callers can run the legacy fallback; otherwise return the original
error as before. Ensure the detection targets the error code "42703"
specifically.

In `@internal/infra/cdc/listen.go`:
- Around line 113-126: Update the recovery guidance text in the ace_cdc_metadata
LSN checks so it includes "ace mtree build" alongside "ace mtree teardown" and
"ace mtree init"; specifically change the message passed to logger.Warn (when
pubCommitLSNStr == "") and the error returned in the startLSN < pubCommitLSN
branch where the fmt.Errorf is created (these are the strings used in
logger.Warn, logger.Error and the fmt.Errorf in listen.go) to list "ace mtree
teardown", "ace mtree init", and "ace mtree build" in that order so users are
directed to run build after init.

In `@internal/infra/cdc/setup.go`:
- Around line 68-71: The schema-qualified table name constructed in the variable
qualified (currently fmt.Sprintf("%s.ace_cdc_metadata", aceSchema)) must use
proper SQL identifier quoting so Spock sees the exact relation; update the code
that builds qualified before calling queries.RepsetRemoveTable(ctx, db, set,
qualified) to quote aceSchema (and the ace_cdc_metadata identifier) with the
project’s identifier-quoting helper (or a safe QuoteIdentifier function) and
then pass the quoted qualified name (e.g., quotedSchema + "." + quotedTable) to
RepsetRemoveTable so identifiers with capitals or special chars resolve
correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 81b5d968-23d1-46f0-bf8b-f18b6c962f24

📥 Commits

Reviewing files that changed from the base of the PR and between 5f50b9a and 51b5c1c.

📒 Files selected for processing (7)
  • db/queries/queries.go
  • db/queries/templates.go
  • internal/consistency/mtree/merkle.go
  • internal/infra/cdc/listen.go
  • internal/infra/cdc/setup.go
  • tests/integration/cdc_busy_table_test.go
  • tests/integration/cdc_init_ordering_test.go

Comment thread db/queries/queries.go
Comment thread internal/infra/cdc/listen.go Outdated
Comment thread internal/infra/cdc/setup.go Outdated
Andrei Lepikhov and others added 5 commits May 24, 2026 19:04
Fixes 'publication "ace_mtree_pub" does not exist' (SQLSTATE 42704)
during 'ace mtree table-diff'.

Init order: the replication slot was being created before the
publication was committed to WAL. The slot's consistent point
therefore preceded the publication, and pgoutput's
get_publication_oid failed during change-callback replay.
MtreeInit is now split into three phases per node: Phase A commits
schema + metadata table + publication and captures the publication
commit LSN; Phase B creates the slot (its consistent point is now
strictly after the publication); Phase C persists slot/start_lsn/
pub_commit_lsn.

A new pub_commit_lsn column is added to ace_cdc_metadata (additive,
ALTER TABLE IF NOT EXISTS). processReplicationStream refuses to open
a stream whose start_lsn is older than pub_commit_lsn, returning an
actionable error instead of silently rewinding to a pre-publication
LSN. Empty pub_commit_lsn (legacy metadata rows) skips the check
with a warning.

Existing broken installs are not auto-repaired; users must run
'ace mtree teardown' + 'ace mtree init' + 'ace mtree build' after
upgrade. '--skip-cdc' remains a valid interim workaround for
pre-upgrade clusters.
Two integration tests against the spock cluster fixture:

- TestMtreeInitSlotIsAfterPublication: after MtreeInit, every node's
  slot consistent_point must be >= the publication's commit LSN
  recorded in ace_cdc_metadata.pub_commit_lsn. This is the ordering
  invariant the 3-phase init split restores.

- TestProcessReplicationStreamRejectsStaleStartLSN: corrupts n1's
  start_lsn to a value below pub_commit_lsn and asserts UpdateFromCDC
  returns the new guard's actionable error rather than the cryptic
  SQLSTATE 42704 that pgoutput would otherwise emit.
BuildMtree's per-node body opens a tx with defer tx.Rollback in the
outer function scope, then on any error explicitly calls pool.Close()
and returns the error. pool.Close() waits on puddle's WaitGroup for
every acquired conn to be released — but the conn the tx holds will
not be released until the deferred tx.Rollback runs, and that defer
cannot run until the function returns, which is being blocked by
pool.Close(). Classic Go defer-vs-explicit-close deadlock: the
process hangs and the underlying error never surfaces.

The shape repeats at every error site inside the per-iteration body
(AlterPublicationAddTable, GetCDCMetadata, UpdateCDCMetadata,
createMtreeObjects, insertBlockRanges, computeLeafHashes,
buildParentNodes, tx.Commit). All eight are reachable; any one being
hit hangs the build.

Wrap the per-iteration body in a closure so defers are scoped to one
iteration. defer pool.Close lands above defer tx.Rollback in
registration order, so LIFO fires tx.Rollback first (releases the
conn) then pool.Close (no acquired conns to wait for). The explicit
pool.Close calls inside each error branch go away — defers handle it
uniformly on success and failure.

Confirmed via SIGQUIT goroutine dump showing main blocked in
puddle.Pool.Close.deferwrap1 → WaitGroup.Wait, called from
BuildMtree's createMtreeObjects error branch.

The same defer-stacking pattern exists in the loop further down
BuildMtree (around the UpdateMtree block) and in a few other
*.Close sites elsewhere in this file. They don't deadlock today
because either no tx is held or the order happens to be safe, but
the closure pattern would make them uniformly correct. Left for
a follow-up so this commit stays focused on the reproduced bug.
createMtreeObjects has called CreateSchema and CreateXORFunction at
the start of every node's build iteration since the original
BuildMtree commit (a05445e, July 2025) — back when build was the
entry point and had to set up the schema and XOR function itself.
When CDC-based MtreeInit landed (fd02c30, August 2025), those calls
became redundant: init's Phase A creates both, and BuildMtree
implicitly requires init to have run (it consumes the CDC
publication, slot, and metadata that init sets up). The build-side
calls were never cleaned up.

The redundancy was harmless on vanilla PG. On Spock-replicated
clusters it surfaces as SQLSTATE XX000 "tuple concurrently updated":
CREATE OR REPLACE FUNCTION unconditionally writes pg_proc, and
Spock's DDL apply of n1's BuildMtree replicating to n2 races with
n2's BuildMtree running the same statement. The race was masked
before the previous commit by the BuildMtree pool.Close deadlock
that swallowed the error and hung the process.

CREATE TABLE IF NOT EXISTS for ace_mtree_metadata stays — IF NOT
EXISTS short-circuits without touching pg_class when the relation
already exists, so the same race doesn't apply.
ACE issues DDL exclusively against its own pgedge_ace schema and the
ace_mtree_pub publication — never against user objects — and every
piece of that DDL is intentionally per-node. Letting Spock replicate
it produces a chain of cross-node interactions:

- ace_cdc_metadata gets auto-added to peer nodes' default repsets by
  Spock's CREATE TABLE event trigger, leaking per-node LSNs cross-node.
- CREATE OR REPLACE FUNCTION on pgedge_ace.bytea_xor races on pg_proc
  when n1's build replicates to n2 while n2's build runs the same
  statement (SQLSTATE XX000).
- n2's SetupPublication DROP+CREATE replicates back to n1, landing
  in n1's pgoutput stream mid-replay and producing the original
  SQLSTATE 42704 "publication does not exist" the PR was meant to fix.

Set spock.enable_ddl_replication = off and spock.include_ddl_repset
= off as connection startup options on every ACE pool. Both are
PGC_USERSET in Spock, so connection-level options override cluster
defaults. Both are dotted custom GUCs, so PostgreSQL accepts them
on vanilla-PG instances without Spock loaded — the parameters
become placeholder custom variables with no effect, keeping the
dual-mode native-PG path intact.

Once this is verified in production, the per-DDL workarounds in
earlier commits (post-init repset cleanup sweep, SetSpockRepairMode
wrapping at every metadata-write site, ExcludeMetadataFromSpockRepsets
itself) all become removable. Leaving them in place for now as
defense-in-depth.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
internal/infra/cdc/listen.go (1)

113-126: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Include ace mtree build in the recovery guidance.

The supplied reviewer notes say broken installs need teardown, init, and build. Stopping the warning/error text at init still leaves users with an unbuilt tree after they follow the message.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/infra/cdc/listen.go` around lines 113 - 126, The user-facing
recovery guidance strings in the error and warning paths (logger.Warn call and
the fmt.Errorf returned when startLSN < pubCommitLSN in listen.go) currently
instruct to run 'ace mtree teardown' and 'ace mtree init' but omit the required
'ace mtree build' step; update both the logger.Warn message (the call to
logger.Warn referencing ace_cdc_metadata and publication) and the returned
fmt.Errorf message (the error constructed when startLSN < pubCommitLSN) to
append "and 'ace mtree build'" so the guidance reads "run 'ace mtree teardown',
'ace mtree init' and 'ace mtree build' to recover".
internal/consistency/mtree/merkle.go (1)

861-869: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't persist a pre-commit WAL position as pub_commit_lsn.

CurrentWalInsertLSN is still called while Phase A is open. The commit record lands later, so a stale start_lsn that falls between this captured value and the real commit LSN will bypass the new guard and can still reproduce SQLSTATE 42704. Capture the WAL position only after Phase A commits.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/consistency/mtree/merkle.go` around lines 861 - 869, The code
captures pubCommitLSN via CurrentWalInsertLSN while Phase A is still open,
risking a pre-commit WAL position; move the capture so it runs after the Phase A
transaction is committed: call tx.Commit(m.Ctx) first (or ensure the transaction
has committed) and only then call queries.CurrentWalInsertLSN(m.Ctx, tx) to set
pubCommitLSN, updating any error handling and the surrounding logic in merkle.go
(references: pubCommitLSN, CurrentWalInsertLSN, tx.Commit) so the persisted
pub_commit_lsn reflects the post-commit WAL position.
db/queries/queries.go (1)

2919-2928: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle pre-migration ace_cdc_metadata layouts before reading pub_commit_lsn.

This path still hard-fails on upgraded clusters whose metadata table predates the new column: the SELECT fails before callers can treat pubCommitLSN == "" as the legacy case. Catch undefined_column and retry the legacy 3-column read, or run the additive migration before every read.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@db/queries/queries.go` around lines 2919 - 2928, GetCDCMetadata currently
fails when the pub_commit_lsn column is missing on older schemas; modify
GetCDCMetadata to detect the undefined_column error (SQLSTATE 42703 or error
text containing "undefined_column") returned from db.QueryRow when using
SQLTemplates.GetCDCMetadata, and on that specific error re-run a legacy 3-column
query (e.g. render and QueryRow using a legacy template like
SQLTemplates.GetCDCMetadataLegacy or a SQL string selecting only slot_name,
start_lsn, tables), Scan into slotName, startLSN, tables, set pubCommitLSN = ""
and return nil; for any other error continue to return the original error.
Ensure you reference GetCDCMetadata, RenderSQL, SQLTemplates.GetCDCMetadata and
the DBQuerier.QueryRow/Scan calls when implementing the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/consistency/mtree/merkle.go`:
- Around line 1453-1535: During the per-node loop you only defer pool.Close()
for the node handled in the closure, leaving other entries in the pools map open
on early return; fix by removing each pool from the pools map once you take
ownership and also add an outer cleanup after the loop to close any remaining
pools. Specifically, inside the closure after obtaining pool :=
pools[nodeInfo["Name"].(string)] and deferring pool.Close(), call delete(pools,
nodeInfo["Name"].(string)) so the map no longer retains that pool; then after
the for _, nodeInfo := range m.ClusterNodes loop finishes (or on error return
path), iterate remaining entries in pools and call Close() on each to ensure
none are leaked. Refer to symbols: pools, nodeInfo, pool.Close(), the
per-iteration closure around the loop and m.ClusterNodes.

---

Duplicate comments:
In `@db/queries/queries.go`:
- Around line 2919-2928: GetCDCMetadata currently fails when the pub_commit_lsn
column is missing on older schemas; modify GetCDCMetadata to detect the
undefined_column error (SQLSTATE 42703 or error text containing
"undefined_column") returned from db.QueryRow when using
SQLTemplates.GetCDCMetadata, and on that specific error re-run a legacy 3-column
query (e.g. render and QueryRow using a legacy template like
SQLTemplates.GetCDCMetadataLegacy or a SQL string selecting only slot_name,
start_lsn, tables), Scan into slotName, startLSN, tables, set pubCommitLSN = ""
and return nil; for any other error continue to return the original error.
Ensure you reference GetCDCMetadata, RenderSQL, SQLTemplates.GetCDCMetadata and
the DBQuerier.QueryRow/Scan calls when implementing the change.

In `@internal/consistency/mtree/merkle.go`:
- Around line 861-869: The code captures pubCommitLSN via CurrentWalInsertLSN
while Phase A is still open, risking a pre-commit WAL position; move the capture
so it runs after the Phase A transaction is committed: call tx.Commit(m.Ctx)
first (or ensure the transaction has committed) and only then call
queries.CurrentWalInsertLSN(m.Ctx, tx) to set pubCommitLSN, updating any error
handling and the surrounding logic in merkle.go (references: pubCommitLSN,
CurrentWalInsertLSN, tx.Commit) so the persisted pub_commit_lsn reflects the
post-commit WAL position.

In `@internal/infra/cdc/listen.go`:
- Around line 113-126: The user-facing recovery guidance strings in the error
and warning paths (logger.Warn call and the fmt.Errorf returned when startLSN <
pubCommitLSN in listen.go) currently instruct to run 'ace mtree teardown' and
'ace mtree init' but omit the required 'ace mtree build' step; update both the
logger.Warn message (the call to logger.Warn referencing ace_cdc_metadata and
publication) and the returned fmt.Errorf message (the error constructed when
startLSN < pubCommitLSN) to append "and 'ace mtree build'" so the guidance reads
"run 'ace mtree teardown', 'ace mtree init' and 'ace mtree build' to recover".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1d519589-d288-4641-bff9-3afbd4610262

📥 Commits

Reviewing files that changed from the base of the PR and between 51b5c1c and 93511c1.

📒 Files selected for processing (8)
  • db/queries/queries.go
  • db/queries/templates.go
  • internal/consistency/mtree/merkle.go
  • internal/infra/cdc/listen.go
  • internal/infra/cdc/setup.go
  • internal/infra/db/auth.go
  • tests/integration/cdc_busy_table_test.go
  • tests/integration/cdc_init_ordering_test.go
✅ Files skipped from review due to trivial changes (1)
  • internal/infra/db/auth.go

Comment thread internal/consistency/mtree/merkle.go
BuildMtree pre-builds a pool per node up front in the row-estimate
loop, then hands the populated map to the per-node build loop which
closes each pool only as its own iteration completes. If a later
iteration errors out and exits, pools for nodes that loop never
reached stay open for the rest of the process lifetime — a real leak
in HTTP API server invocations, mostly invisible under CLI invocation
where the process exits anyway.

Add a defer at the top of the function that closes every entry in
the pools map on exit. pgxpool.Pool.Close uses sync.Once internally,
so the per-iteration closes still take effect promptly and this
defer is a no-op for pools already closed by them.

Pre-existing leak; the closure refactor for the pool.Close deadlock
didn't introduce or remove it.
The pub_commit_lsn column is added by MtreeInit's Phase A via the
additive ALTER TABLE in CreateCDCMetadataTable. On clusters upgraded
to this binary without re-running init — say, going straight to
mtree build or mtree table-diff — the table still has the old
3-column layout. The previous SELECT referenced pub_commit_lsn as a
column, so PostgreSQL raised SQLSTATE 42703 ("column does not
exist") before the listen.go fallback for an empty pub_commit_lsn
could fire. CDC startup / build aborted instead of warning and
continuing.

Extract pub_commit_lsn via to_jsonb(row) ->> 'pub_commit_lsn'
instead of a direct column reference. The SQL parses against any
schema version; on legacy rows the JSON object has no key for the
column, ->> returns NULL, COALESCE produces the empty string the
listen.go guard already treats as "invariant uncheckable — warn and
skip". The next MtreeInit still backfills the real column via the
additive ALTER TABLE in CreateCDCMetadataTable.
The two recovery messages in processReplicationStream told users to
run 'ace mtree teardown' and 'ace mtree init' but stopped there.
After init, the merkle tree itself is empty; a subsequent
table-diff would either fail or produce incorrect results.
'ace mtree build' is the final step that repopulates the tree
and makes table-diff usable again. Updated both messages to spell
out the full three-step sequence.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants