Skip to content

[slopfix] refactor(universaldb,ups): single/multi-node udb over nats, remove ups postgres driver#5354

Open
MasterPtato wants to merge 1 commit into
stack/slopfix-perf-universaldb-fold-postgres-drain_batch-into-claim-nextval-and-single-apply-cte-pnzolyqxfrom
stack/slopfix-refactor-universaldb-ups-single-multi-node-udb-over-nats-remove-ups-postgres-driver-vuorlwlu
Open

[slopfix] refactor(universaldb,ups): single/multi-node udb over nats, remove ups postgres driver#5354
MasterPtato wants to merge 1 commit into
stack/slopfix-perf-universaldb-fold-postgres-drain_batch-into-claim-nextval-and-single-apply-cte-pnzolyqxfrom
stack/slopfix-refactor-universaldb-ups-single-multi-node-udb-over-nats-remove-ups-postgres-driver-vuorlwlu

Conversation

@MasterPtato

@MasterPtato MasterPtato commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

No description provided.

@MasterPtato

MasterPtato commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

This PR refactors UniversalDB's commit path from a Postgres-queue model to a NATS-based request/response model, and removes the UPS Postgres pub/sub driver entirely. It also migrates single-node vs. multi-node topology selection out of config and into a runtime startup gate. Net: −2515 / +1198 lines, removing a significant amount of Postgres-polling infrastructure in favour of NATS pub/sub.

Update: re-reviewed the latest revision (head fc555e62b). All 8 findings from the previous review are still present and unaddressed. Two related findings added below.


Findings

1. wait_for_leader polls every 50 ms despite a watch::Sender already being wired — CLAUDE.md violation
engine/packages/universaldb/src/driver/postgres/commit.rswait_for_leader (~line 172)

Still a loop { check; tokio::time::sleep(LEADER_POLL_INTERVAL).await } against shared.current_lease(). PostgresShared already holds a watch::Sender<Option<LeaseInfo>> (lease_tx) that set_lease writes to on every election result. The waiter could do lease_rx.changed().await inside a tokio::select! with a deadline arm for instant, zero-poll wakeup. As-is, during a leader election window a follower commit burns many 50 ms sleep iterations, multiplied by the outer retry loop in submit_nats.

CLAUDE.md rule: "Reserve tokio::time::sleep for … per-call timeouts via tokio::select!, … sleep_until(deadline) arms in an event-select loop. If it is inside a loop { check; sleep } body, it is polling and should be event-driven instead."


2. NATS watermark publish failure is logged at debug — should be warn
engine/packages/universaldb/src/driver/postgres/resolver/mod.rs — line 628

if let Err(err) = nats.client.publish(nats.subjects.watermark(), ...).await {
    tracing::debug!(?err, "failed to publish udb watermark");
}

When this publish fails, follower nodes don't learn the new commit watermark and fall back to polling the lease row, widening conflict ranges and causing spurious rejections on follower commits, with no operator-visible signal at any log level above debug. This should be tracing::warn! to surface NATS connectivity problems in production.

Related, same class: resolver/mod.rs handoff() (~line 225) logs the election-wake publish failure at debug too. A lost wake here silently falls back to the ELECTION_RETRY backstop with no operator-visible signal.


3. Partial (or fully absent) NATS credentials silently connect unauthenticated
engine/packages/universaldb/src/driver/postgres/nats.rsconnect() (~line 68)

let mut options = match (&config.username, &config.password) {
    (Some(username), Some(password)) => { ConnectOptions::with_user_and_password(...) }
    _ => async_nats::ConnectOptions::new(),   // no warning logged
};

Setting only nats.username (without nats.password) silently falls through to the unauthenticated path with no tracing::warn!. A warning log for the (Some, None) / (None, Some) case, or validating that credentials are all-or-nothing in config, would catch this misconfiguration early. Also worth checking: config.addresses isn't validated non-empty before being passed to the client, so an empty address list would fail opaquely rather than with an actionable config error.


4. Step-down while a NATS commit request is in-flight leaves the follower with no error reply
engine/packages/universaldb/src/driver/postgres/nats.rsrun_commit_subscriber (~line 127)

if jobs_tx.send(job).await.is_err() {
    break;   // channel closed on step-down; no NATS reply sent
}

When jobs_tx is closed because the node stepped down, the subscriber exits cleanly but the follower receives no response and must wait the full REQUEST_TIMEOUT before timing out and retrying. Extracting the reply subject before the send would allow an explicit NATS error reply, turning a silent timeout into an immediate retryable error.


5. TransactionConflictTracker::len() removed — conflict tracker growth is no longer observable
engine/packages/universaldb/src/conflict_tracker.rs

The len() method and the tracker_len structured log field on each batch were the only way to observe that the conflict tracker was growing unboundedly. A long-running transaction that never commits causes per-commit scan time to grow with tracker size. Without a metric or log field, this silently inflates batch latency with no diagnostic signal. Restoring a tracker_len field (even at tracing::debug!) in the batch completion log would preserve observability at zero production cost.


6. Hand-rolled fnv1a_64 with no tests
engine/packages/universaldb/src/driver/postgres/nats.rsfnv1a_64

A bespoke FNV-1a implementation derives the per-node NATS subject prefix from the node's Postgres row ID. If the constants are ever wrong or the function is "optimized", two engine versions would compute different subjects, and every multi-node commit would time out silently with no diagnostic beyond a REQUEST_TIMEOUT per attempt. Consider a unit test that pins the output for a known input, or using the fnv crate.


7. RUST_LOG=universaldb::driver::postgres=debug left in all dev compose files
self-host/dev/docker-compose.yml, self-host/dev-host/docker-compose.yml, self-host/dev-multidc/docker-compose.yml, self-host/dev-multidc-multinode/**, self-host/dev-multinode/** (generated from self-host/compose/template/src/docker-compose.ts)

A development debug override for the new module was added to the compose template and regenerated into all flavours (14 occurrences). Single-node and dev-multidc environments have no NATS and will produce zero output from this logger, but it's still noise in the config. In environments that do use NATS, DEBUG-level logs for every commit batch will flood the log stream. This line should be removed from the template (or scoped only to the multinode flavour) before merge.


8. wait_for_election_retry creates a new NATS subscription on every retry loop iteration
engine/packages/universaldb/src/driver/postgres/resolver/mod.rs (~line 194-210, called from the loop in run_multi_node ~line 166)

The function subscribes to the election subject, awaits one wakeup (or the ELECTION_RETRY timeout), and then the Subscriber drops, triggering an unsubscribe round-trip. On the next retry, a new subscription is created. Holding one subscription for the duration of the candidate phase would avoid the per-iteration subscribe/unsubscribe churn and the small race window between subscribe completing and a broadcast arriving.


Note on schema versioning

engine/sdks/schemas/universaldb-commit/v1.bare is extended in place (new clientNodeId/clientSeq fields on CommitRequest, new CommitReply/Watermark types) rather than adding a new version. I checked whether this violates the "never modify a published .bare schema version" rule: v1.bare was only introduced in an unmerged sibling stack PR with no history on main, so it has never shipped and this is not a violation.

@MasterPtato MasterPtato force-pushed the stack/slopfix-refactor-universaldb-ups-single-multi-node-udb-over-nats-remove-ups-postgres-driver-vuorlwlu branch from 077ad03 to fc555e6 Compare July 1, 2026 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant