datajoint · dimitri-yatsenko · Jun 10, 2026
diff --git a/mkdocs.yaml b/mkdocs.yaml
@@ -22,6 +22,8 @@ nav:
       - Storage:
           - Type System: explanation/type-system.md
           - Custom Codecs: explanation/custom-codecs.md
+      - Operations:
+          - PostgreSQL CDC and Replica Identity: explanation/postgresql-cdc-replication.md
   - Tutorials:
       - tutorials/index.md
       - Basics:
@@ -117,6 +119,8 @@ nav:
               - AutoPopulate: reference/specs/autopopulate.md
               - Job Metadata: reference/specs/job-metadata.md
               - Object Store Configuration: reference/specs/object-store-configuration.md
+          - Deployment:
+              - Deployment Operations: reference/specs/deploy-operations.md
       - Instance & Thread Safety:
           - Thread-Safe Mode: reference/specs/thread-safe-mode.md
       - Configuration: reference/configuration.md

diff --git a/src/explanation/postgresql-cdc-replication.md b/src/explanation/postgresql-cdc-replication.md
@@ -0,0 +1,104 @@
+# PostgreSQL CDC and Replica Identity
+
+This page explains how DataJoint integrates with **change-data-capture (CDC)** consumers on PostgreSQL — what `REPLICA IDENTITY` is, why some CDC tools require `FULL`, and how to configure a DataJoint schema for downstream replication.
+
+!!! version-added "New in DataJoint 2.3"
+    The `datajoint.deploy` module — including `set_replica_identity` — ships in DataJoint 2.3. PostgreSQL `REPLICA IDENTITY` configuration was not part of the framework in earlier releases; users on 2.2.x must apply the `ALTER TABLE` statements manually.
+
+For the normative API specification, see [Deployment Operations](../reference/specs/deploy-operations.md).
+
+## What is REPLICA IDENTITY?
+
+PostgreSQL records every data change in its write-ahead log (WAL). When a row is `UPDATE`d or `DELETE`d, the **old version** of the row must be representable in WAL so logical replication consumers can identify which row changed. `REPLICA IDENTITY` is a per-table setting that controls how much of the old row is logged:
+
+| Setting | Old row contents logged | WAL cost on UPDATE/DELETE |
+|---|---|---|
+| `DEFAULT` | Primary-key columns only | Minimal |
+| `FULL` | Entire old row, all columns | Higher (proportional to row width, including TOASTed columns) |
+| `NOTHING` | Nothing; `UPDATE` and `DELETE` cannot be replicated for the table (`INSERT` still can) | n/a |
+| `USING INDEX` | Columns of a specified unique index | Between `DEFAULT` and `FULL` |
+
+`INSERT` is unaffected — the full new row is always written to WAL regardless of the setting. Only the **pre-image** (old row state) for updates and deletes is governed by `REPLICA IDENTITY`.
+
+The `ALTER TABLE … REPLICA IDENTITY …` statement is metadata-only and instant: it changes how subsequent updates are logged but does not rewrite any data. Re-applying the same setting is a no-op at the storage layer.
+
+## Why CDC consumers care
+
+Logical replication subscribers — including most CDC pipelines — reconstruct downstream tables (or event streams) from WAL. For deletes and updates, they need enough information about the old row to identify it on the other side. With `REPLICA IDENTITY DEFAULT`, the subscriber gets only the primary key — enough to match the row but **not enough to reconstruct its prior column values**.
+
+Most modern CDC tools work fine with `DEFAULT` when tables have a primary key:
+
+| CDC Tool | `FULL` required? | Notes |
+|---|---|---|
+| Debezium | No | Recommends `DEFAULT` for performance |
+| Azure CDC | No | Recommends against `FULL` |
+| ClickHouse ClickPipes | No | `DEFAULT` is fine when PK exists |
+| **Databricks Lakehouse Sync** | **Yes** | Tables without `FULL` are silently skipped — the table is not replicated, with no error |
+
+Databricks Lakehouse Sync is the one consumer listed here that requires `FULL`. It needs the full pre-image to drive Slowly Changing Dimension Type 2 (SCD Type 2) history downstream, and tables that lack `REPLICA IDENTITY FULL` are dropped from the sync entirely. There is no error or warning at sync time; the table simply does not appear in the destination. This silent-failure mode is what motivates DataJoint's first-class deploy helper.
+
+## Cost considerations
+
+The `ALTER` itself is free. The cost lives in subsequent WAL volume on `UPDATE` and `DELETE`. Under `FULL`, every modified row writes its **entire prior contents** to WAL — including any TOASTed `bytea` columns that were unchanged by the operation.
+
+For DataJoint's workload, this is usually negligible:
+
+- **Inserts are unaffected.** DataJoint pipelines are insert-append dominated; this is the common case.
+- **Updates are rare and surgical.** `update1()` is intended for occasional metadata corrections, not bulk modification.
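+- **The notable scenario is bulk delete on tables with `<blob>` columns.** A delete of *N* rows, each carrying a *B*-byte blob, writes ≈ *N × B* bytes of WAL. For 100 rows with 10 MB blobs, that's a ~1 GB WAL burst. The burst is transient, but PostgreSQL recycles those WAL segments only after a checkpoint and only once no replication slot or archiver still needs them, so a lagging CDC consumer keeps the space pinned.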
+
+The cost is bounded by what you actually update or delete, not by table size at rest. For pipelines that rarely modify data after insert, `FULL` is effectively free.
+
+## Compliance considerations
+
+Under `DEFAULT`, the old-row image in WAL carries only primary-key values, so a `DELETE` or the pre-image of an `UPDATE` exposes no non-key columns. New row contents are still written to WAL on every `INSERT` and `UPDATE` regardless of the setting, so `DEFAULT` narrows the exposure rather than eliminating it.
+
+Under `FULL`, the entire old row also appears in WAL, including any column that may carry PHI, PII, or otherwise sensitive data. Whether this matters depends on **who can read WAL**:
+
+| Environment | WAL access | Practical risk of `FULL` |
+|---|---|---|
+| Self-hosted PostgreSQL | Filesystem access + logical-replication subscribers can read WAL | Real — treat as sensitive surface |
+| Managed PostgreSQL (RDS, Lakebase) with logical replication to a single trusted subscriber | WAL stays inside the managed environment | Bounded to the subscriber's security boundary |
+
+`FULL` should be applied **intentionally**, not by default, on tables that hold sensitive columns. DataJoint does not enable it automatically — the `set_replica_identity` helper is explicitly opt-in.
+
+## How DataJoint integrates this
+
+`REPLICA IDENTITY` is a **deployment-environment concern**: two installs of the same DataJoint pipeline can legitimately want different settings (one for CDC, one not). It is not part of the schema definition — declaring a table does not commit to a replica-identity mode.
+
+The integration is a single function in the `datajoint.deploy` module: `set_replica_identity(target, mode, dry_run)`. It applies the ALTER across a Schema or a single Table, supports a dry-run preview, and raises a clear error on non-PostgreSQL backends.
+
+A representative workflow for adding a Databricks Lakehouse Sync consumer to an existing pipeline:
+
+```python
+import datajoint as dj
+from datajoint import deploy
+
+schema = dj.Schema("acquisition")
+schema.activate()  # existing pipeline; tables already declared
+
+# Preview what would change.
+plan = deploy.set_replica_identity(schema, mode="full", dry_run=True)
+print(f"Would alter {plan['tables_analyzed']} tables:")
+for ddl in plan["ddl"]:
+    print(" ", ddl)
+
+# Apply.
+deploy.set_replica_identity(schema, mode="full", dry_run=False)
+```
+
+Because the operation is idempotent — re-applying the same mode is a no-op at the storage layer — a CI/CD pipeline can include it in the deploy hook for every release without accumulating side effects. New tables added in a future release pick up the setting on the next deploy.
+
+To revert (for example, to reduce WAL volume after a CDC consumer is decommissioned):
+
+```python
+deploy.set_replica_identity(schema, mode="default", dry_run=False)
+```
+
+`set_replica_identity` is PostgreSQL-only by design: MySQL has no per-table equivalent of `REPLICA IDENTITY`, because its binlog-based CDC follows different mechanics (the closest analog, `binlog_row_image`, is a server or session variable rather than a table setting). Calling it against a MySQL connection raises `DataJointError` rather than warning quietly.
+
+## Related
+
+- Specification: [Deployment Operations](../reference/specs/deploy-operations.md) — normative API
+- PostgreSQL: [Logical Replication — Replica Identity](https://www.postgresql.org/docs/current/logical-replication-publication.html)
+- Databricks: [Lakehouse Sync](https://docs.databricks.com/aws/en/oltp/projects/lakehouse-sync)
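
Review note on the "Cost considerations" section of the page above: the *N × B* estimate can be sanity-checked with a few lines of Python. This is only a rough sketch. `estimate_wal_burst_bytes` is a hypothetical helper, not part of DataJoint, and it assumes decimal megabytes while ignoring per-record WAL overhead and TOAST compression.

```python
def estimate_wal_burst_bytes(n_rows: int, blob_bytes: int) -> int:
    """Rough estimate of WAL written when deleting rows under REPLICA IDENTITY FULL.

    Assumption: each deleted row logs its full blob once; per-record overhead
    and compression are ignored, so treat the result as an order of magnitude.
    """
    if n_rows < 0 or blob_bytes < 0:
        raise ValueError("n_rows and blob_bytes must be non-negative")
    return n_rows * blob_bytes


MB = 10**6  # decimal megabyte, an assumption made for this estimate

# The example from the page: 100 deleted rows, each with a 10 MB blob.
burst = estimate_wal_burst_bytes(n_rows=100, blob_bytes=10 * MB)
print(f"{burst / 10**9:.1f} GB")  # prints "1.0 GB"
```

The estimate scales with the rows actually deleted, not with table size at rest, which matches the page's claim that insert-dominated pipelines pay almost nothing for `FULL`.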
diff --git a/src/reference/specs/deploy-operations.md b/src/reference/specs/deploy-operations.md
@@ -0,0 +1,107 @@
+# Deployment Operations Specification
+
+This document specifies the `datajoint.deploy` module — idempotent, re-runnable operations that configure an existing schema for its **deployment environment** (CDC tools, replication, role grants, performance tuning).
+
+For one-shot schema-evolution operations (column migrations, lineage repair, retroactive job-metadata columns), see `datajoint.migrate` (referenced in the [Data Manipulation Specification](data-manipulation.md)).
+
+!!! version-added "New in 2.3"
+    The `datajoint.deploy` module is introduced in DataJoint 2.3, beginning with `set_replica_identity` for PostgreSQL CDC integration.
+
+## Scope: migration vs. deployment
+
+DataJoint exposes two categories of operational helpers. The distinction matters: applying the wrong one at the wrong time produces inconsistent state.
+
+| | `datajoint.migrate` | `datajoint.deploy` |
+|---|---|---|
+| **Purpose** | Schema/state evolution, fixing legacy | Configure an environment for a consumer's requirements |
+| **Cadence** | One-shot transitions | Idempotent, re-runnable in deploy hooks |
+| **Trigger** | Schema definition changed, or repair needed | Environment changes (new CDC consumer, replication topology) |
+| **Examples** | `migrate_columns`, `add_job_metadata_columns`, `rebuild_lineage` | `set_replica_identity` |
+
+A deployment operation must be safe to call repeatedly without accumulating side effects: re-running it brings the environment to the same end state and is a no-op when already there.
+
+## `set_replica_identity`
+
+Apply `ALTER TABLE ... REPLICA IDENTITY DEFAULT|FULL` to every user table in a schema, or to a single table, on PostgreSQL.
+
+### Signature
+
+```python
+def set_replica_identity(
+    target: Schema | Table,
+    mode: Literal["default", "full"] = "full",
+    dry_run: bool = True,
+) -> dict
+```
+
+### Parameters
+
+| Name | Type | Default | Description |
+|---|---|---|---|
+| `target` | `Schema` or `Table` (class or instance) | — | Schema (all user tables) or a single table. |
+| `mode` | `Literal["default", "full"]` | `"full"` | `"default"` (PK only) or `"full"` (entire old row). |
+| `dry_run` | `bool` | `True` | If `True`, collect DDL but do not execute. |
+
+### Return value
+
+A dict:
+
+| Key | Type | Description |
+|---|---|---|
+| `tables_analyzed` | `int` | Number of tables considered. |
+| `tables_modified` | `int` | Tables on which the ALTER ran. `0` when `dry_run=True`. |
+| `ddl` | `list[str]` | DDL statements that were (or would be) executed. |
+
+### Errors
+
+| Condition | Behavior |
+|---|---|
+| Connection's adapter is not PostgreSQL | `DataJointError`: `"set_replica_identity is PostgreSQL-only; …"` |
+| `mode` is not `"default"` or `"full"` | `DataJointError`: `"mode must be 'default' or 'full'; …"` |
+| `target` is not a `Schema` or `Table` | `DataJointError`: `"target must be a Schema or Table class/instance; …"` |
+
+### Behavior
+
+For each user table in the target (excluding `~`-prefixed hidden tables), the function builds `ALTER TABLE "{schema}"."{table}" REPLICA IDENTITY {MODE}` via the PostgreSQL adapter's `replica_identity_ddl()` and either records it (dry-run) or executes it on the connection.
+
+Both `default` and `full` produce explicit `ALTER` statements. `default` is **not** treated as a no-op — it actively resets the table to PostgreSQL's default, which is the right semantics when reverting from `FULL`.
+
+The underlying ALTER is metadata-only, instant, and idempotent at the PostgreSQL layer (re-applying the same mode is a no-op at the storage layer).
+
+## Design rationale
+
+Three structural decisions distinguish `dj.deploy` from alternatives that were considered and rejected. Each is informed by the failure modes the alternative would have produced.
+
+### 1. One explicit operation, no auto-emit on `declare()`
+
+[Issue #1447](https://github.com/datajoint/datajoint-python/issues/1447) originally proposed two mechanisms: a `database.replica_identity` config flag applied automatically during `declare()`, plus a utility for existing tables. We collapsed to the utility alone. Two mechanisms would produce **mixed state**: a deployment with the config set, applied mid-cycle, would have new tables at `FULL` and old tables at `DEFAULT` until someone remembered to run the utility. One mechanism is the only path that converges.
+
+### 2. Not in `dj.migrate`
+
+`dj.migrate` covers one-shot schema-evolution operations: fix lineage, add job-metadata columns, transform external store layouts. `set_replica_identity` is not a one-shot transition — a fresh declare in a staging environment may need it re-applied; deploy hooks may run it on every release. The cadence and trigger differ, and conflating them in one module obscures the difference.
+
+### 3. New module for an emerging category
+
+`set_replica_identity` is the first of a category. Plausible siblings, as needs arise:
+
+- Publication membership for PostgreSQL logical replication (`CREATE PUBLICATION … FOR TABLE …`).
+- Maintenance: `vacuum_analyze`, `reindex`, table-level autovacuum parameters.
+- Role/grant management for shared environments.
+
+Creating `dj.deploy` now — with one inhabitant — gives those future helpers a clear home and keeps `dj.migrate` focused. The cost is one file; the alternative is an indefinite period of "where do I put this?" for every operational helper.
+
+## Idempotency and re-running
+
+Every function in `datajoint.deploy` must be safe to re-run. `set_replica_identity` satisfies this because:
+
+1. The DDL is generated freshly each call.
+2. The PostgreSQL ALTER is metadata-only and applying the same mode again is a no-op at the storage layer.
+3. The dry-run path produces a complete preview without executing.
+
+Deploy hooks may call `set_replica_identity(schema, mode="full", dry_run=False)` on every release without accumulating side effects.
+
+## Related
+
+- Explanation: [PostgreSQL CDC and Replica Identity](../../explanation/postgresql-cdc-replication.md)
+- Data Manipulation Specification: [Data Manipulation](data-manipulation.md) (insert / update / delete; not deployment-time)
+- PostgreSQL: [Logical Replication — Replica Identity](https://www.postgresql.org/docs/current/logical-replication-publication.html)
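
Review note on the "Behavior" and "Return value" sections of the spec above: the dry-run contract can be illustrated without a database. The sketch below is not the DataJoint implementation. `plan_replica_identity` and its `execute` callback are hypothetical stand-ins for the real function and connection, and counting only non-hidden tables in `tables_analyzed` is an assumption, since the spec says only "tables considered".

```python
from typing import Callable, Iterable, Optional

VALID_MODES = ("default", "full")


def plan_replica_identity(
    schema: str,
    tables: Iterable[str],
    mode: str = "full",
    dry_run: bool = True,
    execute: Optional[Callable[[str], None]] = None,
) -> dict:
    """Build (and optionally run) REPLICA IDENTITY DDL for a schema's user tables."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be 'default' or 'full'; got {mode!r}")
    if not dry_run and execute is None:
        raise ValueError("an execute callback is required when dry_run=False")

    # Hidden tables are those whose name starts with "~"; they are skipped.
    user_tables = [t for t in tables if not t.startswith("~")]

    # Both modes emit an explicit ALTER; "default" is not treated as a no-op.
    # Note: names are quoted but embedded double quotes are not escaped here.
    ddl = [
        f'ALTER TABLE "{schema}"."{t}" REPLICA IDENTITY {mode.upper()}'
        for t in user_tables
    ]

    modified = 0
    if not dry_run:
        for statement in ddl:
            execute(statement)  # hypothetical stand-in for running DDL on a connection
            modified += 1

    return {
        "tables_analyzed": len(user_tables),
        "tables_modified": modified,
        "ddl": ddl,
    }


# Dry run: DDL is collected but nothing is executed.
plan = plan_replica_identity("acquisition", ["session", "scan", "~log"])
for statement in plan["ddl"]:
    print(statement)
```

Calling it again with the same arguments yields the same plan, which is the property the "Idempotency and re-running" section relies on: the DDL is derived freshly from the current table list on every call.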