Skip to content

[slopfix] perf(universaldb): fold postgres drain_batch into claim+nextval and single apply CTE#5353

Open
MasterPtato wants to merge 1 commit into
stack/slop-claude-opus-4-8-high-perf-universaldb-batch-leader-apply-and-fold-follower-commit-round-trips-vtrtqpypfrom
stack/slopfix-perf-universaldb-fold-postgres-drain_batch-into-claim-nextval-and-single-apply-cte-pnzolyqx
Open

[slopfix] perf(universaldb): fold postgres drain_batch into claim+nextval and single apply CTE#5353
MasterPtato wants to merge 1 commit into
stack/slop-claude-opus-4-8-high-perf-universaldb-batch-leader-apply-and-fold-follower-commit-round-trips-vtrtqpypfrom
stack/slopfix-perf-universaldb-fold-postgres-drain_batch-into-claim-nextval-and-single-apply-cte-pnzolyqx

Conversation

@MasterPtato

Copy link
Copy Markdown
Contributor

No description provided.

@MasterPtato

MasterPtato commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review: [slopfix] perf(universaldb): fold postgres drain_batch into claim+nextval and single apply CTE

This PR eliminates a round-trip in the Postgres drain path by inlining nextval into the claim SELECT, collapses the watermark advance + kv writes + status stamp into a single apply CTE, converts the conflict tracker from a Vec scanned linearly to a BTreeMap<commit_version, …> that can be range-skipped, and adds sub-phase timing instrumentation. The remove() bug fix (rocksdb path was passing start_version to a BTreeMap keyed by commit_version) is a real correctness fix.


Findings

engine/packages/universaldb/src/driver/postgres/resolver/mod.rs:382batch_ms metric semantics shifted, making before/after comparisons misleading

In the base branch, batch_start was placed after the claim SELECT but before the separate nextval allocation query. So old batch_ms captured: nextval_RTT + decode + conflict + fold + apply + stamp + watermark + commit + notify. The new code folds claim+nextval into one query and places batch_start after it, so new batch_ms only covers: decode + conflict + fold + apply(CTE) + commit + notify. Anyone watching batch_ms dashboards after deploy will see an apparent drop equal to the old nextval RTT, but that latency was eliminated by consolidation—it's not captured in any sub-phase metric either. The actual gain (one fewer Postgres RTT) is invisible to both old and new batch_ms. Consider moving batch_start to just before the claim query (or adding a claim_us timer for the combined query) so the metric's scope is unambiguous.

engine/packages/universaldb/src/driver/postgres/resolver/mod.rs:581tracker_len is sampled after notify_after_commit, possibly inflating the count

tracker_len = tracker.len().await is called after notify_after_commit(&conn, new_durable, &replies).await, which can yield. Any concurrent batch that completes its check_and_insert calls during that window will have inserted entries into the shared tracker before the len() call returns. The logged tracker_len can overcount entries attributable to this batch. Moving the sample to just before notify_after_commit makes it accurate to the batch being logged.

engine/packages/universaldb/src/driver/postgres/resolver/mod.rs:569 — no tracker.remove() if txn.commit() fails (pre-existing)

In the RocksDB path, a txn.commit() failure correctly removes the winner's entry from the tracker (line 407 of transaction_task.rs). In the Postgres drain path there is no equivalent cleanup: if txn.commit() on line 569 errors, all winners inserted into the conflict tracker via check_and_insert linger for up to TXN_CONFLICT_TTL (10 s), causing spurious conflict responses for any subsequent transaction whose ranges overlap. This predates the PR but the PR doesn't address it; noting it since the tracker refactor changes the removal API.

engine/packages/universaldb/src/driver/postgres/resolver/mod.rs:522 — CTE disjointness relies on undocumented WriteSet invariant without a debug assertion

The comment at this line states "apply::WriteSet guarantees each key appears at most once across upserts and point_deletes." That invariant is what makes folding pdel and up into one CTE safe—if a key appeared in both, the two sub-statements would race on the same row under Postgres CTE same-snapshot semantics with unpredictable results. The invariant holds today, but it is not guarded by a debug_assert! at the dispatch site. A future regression in fold_winners that puts a key in both sets would be silently non-deterministic rather than an immediate panic.


What looks good

  • The Bound::Excluded(txn1_start_version) range bound is logically equivalent to the removed txn1_start_version < txn2.commit_version condition—no correctness regression there.
  • Column index 4 for the inlined nextval alias is correct (id=0, read_version=1, payload=2, reply_channel=3, cv=4).
  • Pruning by walking first_key_value / pop_first is sound under NLL and correct given that commit_version is monotone with insert_instant within a batch.
  • The range-delete-before-CTE split is necessary and correctly preserved: a range clear covering a re-upserted key must commit before the upsert to avoid same-snapshot ambiguity.
  • The rocksdb remove() fix (line 407 of transaction_task.rs) is correct: commit_version is in scope and matches the BTreeMap key.
  • Empty-array edge cases (ANY($1::bytea[]), multi-arg unnest) are all no-ops in Postgres.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant