From 5e09fd7ac597bc858b0fd574a6736da3e03a7c29 Mon Sep 17 00:00:00 2001
From: Dimitri Yatsenko <dimitri.yatsenko@gmail.com>
Date: Wed, 10 Jun 2026 18:07:41 -0500
Subject: [PATCH] docs(provenance): spec for Diagram.trace + self.upstream +
 strict_provenance
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Detailed normative specification for the canonical provenance trinity
landing in DataJoint 2.3:

- Diagram.trace(table_expr) — upstream mirror of Diagram.cascade(),
  walks the FK graph from a restricted seed to every ancestor with OR
  convergence. Reuses the upward propagation rules (U1/U2/U3) defined
  in the Cascade Specification. Supports trace[TableClass] returning a
  pre-restricted QueryExpression, and trace[str] returning a FreeTable.
- self.upstream — property on AutoPopulate that the framework sets to
  Diagram.trace(self & key) before each make() invocation. Pre-restricted
  ancestor access for ergonomic, provenance-safe reads.
- dj.config["strict_provenance"] — runtime flag (default False). When
  True, gates make() so reads must go through self.upstream and writes
  must target self or self's Parts with key-consistent primary keys.

Spec structure:
- Why this exists — the convention/enforcement gap, why three pieces.
- Concepts — trace as cascade's mirror, OR convergence, the make()
  provenance boundary, why upward rules are shared with cascade.
- §1 Diagram.trace — signature, behavior, indexing (class + string),
  counts() / heading() / iter, worked examples.
- §2 self.upstream — construction lifecycle, allowed table set,
  per-key behavior, examples.
- §3 strict_provenance — config key, read enforcement table, write
  enforcement table, key-consistency rule, opt-in rationale, examples
  of compliant + violating make() bodies.
- Integration — end-to-end strict-mode example, properties guaranteed.
- Migration path — staged adoption from on-default-off to enforcement.
- Out of scope — static analysis, default flip, downstream provenance
  metadata persistence.

Examples use core DataJoint types (int32, float64, longblob) per
project convention. Cross-links to cascade.md (shared upward rules),
autopopulate.md, diagram.md, entity-integrity.md.

Nav entry under Reference > Specifications > Data Operations between
AutoPopulate and Job Metadata.

Slated for DataJoint 2.3. T2.2.a (#1423), T2.2.b (#1424), T2.2.c
(#1425) implementations land against this spec.
---
 mkdocs.yaml                       |   1 +
 src/reference/specs/provenance.md | 407 ++++++++++++++++++++++++++++++
 2 files changed, 408 insertions(+)
 create mode 100644 src/reference/specs/provenance.md

diff --git a/mkdocs.yaml b/mkdocs.yaml
index 7545366b..4623dfa4 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -115,6 +115,7 @@ nav:
           - Data Operations:
               - Data Manipulation: reference/specs/data-manipulation.md
               - AutoPopulate: reference/specs/autopopulate.md
+              - Provenance Trace & Strict Provenance: reference/specs/provenance.md
               - Job Metadata: reference/specs/job-metadata.md
               - Object Store Configuration: reference/specs/object-store-configuration.md
       - Instance & Thread Safety:
diff --git a/src/reference/specs/provenance.md b/src/reference/specs/provenance.md
new file mode 100644
index 00000000..afe3a4c0
--- /dev/null
+++ b/src/reference/specs/provenance.md
@@ -0,0 +1,407 @@
+# Provenance: Trace, `self.upstream`, and Strict-Provenance `make()` — Specification
+
+This document specifies three interlocking features that together close DataJoint's provenance loop:
+
+1. **`Diagram.trace(table_expr)`** — the upstream mirror of `Diagram.cascade()`; walks the FK graph from a restricted table expression to every ancestor.
+2. **`self.upstream`** — a property on `AutoPopulate` tables that exposes the trace for the current `key` during `make()` execution.
+3. **`dj.config["strict_provenance"]`** — a runtime flag that enforces the upstream-only convention inside `make()` (read only from `self.upstream`, write only to `self` and its Parts).
+
+The three are designed as a unit. `trace` is the underlying graph operation; `self.upstream` is the ergonomic surface for `make()`; `strict_provenance` is the enforcement layer that turns the convention into a runtime guarantee.
+
+!!! version-added "New in DataJoint 2.3"
+    Introduced together. Existing code that does not use these features is unaffected; `strict_provenance` defaults to `False`.
+
+For the related downstream operation, see [Cascade Specification](cascade.md). For `make()` itself, see [AutoPopulate Specification](autopopulate.md).
+
+## Why this exists
+
+DataJoint's provenance guarantee — that every computed row can be traced back to the exact upstream rows it was derived from — rests on the convention that `make(self, key)` only reads from declared upstream dependencies. Today the framework **defines** this convention but does not **enforce** it: a `make()` can `fetch()` from any table, making the undeclared dependency invisible to the FK graph and breaking the provenance claim downstream.
+
+Closing the loop requires three pieces working together:
+
+1. A way to **construct** the upstream view as a query object: `Diagram.trace()`.
+2. A way to **expose** that view ergonomically inside `make()` so users naturally do the right thing: `self.upstream`.
+3. A way to **enforce** that nothing else is read or written: `strict_provenance`.
+
+Without (1), downstream tools that need row-level provenance (data-lineage viewers, CDC publishers, audit logs) have to reimplement the FK walk themselves — repeatedly, with subtly different bugs. Without (2), even users who follow the convention write awkward boilerplate (`(Upstream & key).fetch1(...)` everywhere). Without (3), the convention can be silently violated and downstream provenance claims become best-effort rather than guarantees.
+
+## Concepts
+
+### Trace as the upstream mirror of cascade
+
+`Diagram.cascade()` walks **downstream** from a restricted seed and answers *"what is affected if these rows are deleted?"* `Diagram.trace()` walks **upstream** and answers *"what contributed to these rows?"*. The two share the same dependency graph, the same alias-node mechanism, and most of the same propagation machinery — only the direction differs.
+
+| Method | Direction | Convergence | Question answered |
+|---|---|---|---|
+| `Diagram.cascade(expr)` | downstream | OR — any FK path taints | What's affected if these rows are deleted? |
+| `Diagram.restrict(expr)` | downstream | AND — must satisfy all FK paths | What satisfies all of these conditions? |
+| `Diagram.trace(expr)` | **upstream** | **OR** — any FK path contributes | What contributed to these rows? |
+
+`trace` uses OR convergence because an ancestor entity contributes to a child row if it appears via *any* FK path. (The AND-flavored upstream analog — "ancestors that contributed via *every* path" — is not a useful query and is not provided.)
+
+### Reusing the propagation primitives
+
+`trace` applies the **upward propagation rules** (`U1`, `U2`, `U3`) defined in the [Cascade Specification](cascade.md#upward-propagation-child-parent), which are the symmetric inverses of `cascade`'s forward rules. Renamed FKs (`.proj()`) are reversed via U2; Part-of-Part chains are walked through naturally; alias nodes in the graph are transparent.
+
+This is why `trace` cannot ship before the upward primitives exist in the codebase. As of DataJoint 2.3, the primitives are in place (added with [#1429's cascade fix](https://github.com/datajoint/datajoint-python/pull/1468)), and `trace` is a direct consumer.
+
+### The `make()` provenance boundary
+
+DataJoint's `make(self, key)` convention has always been:
+
+```
+Read from self's declared upstream dependencies; write to self (and its Parts).
+```
+
+The convention defines the **provenance boundary**: data flows in from declared ancestors and out to the target table and its Parts. Anything that crosses this boundary in another direction breaks the provenance claim — a `fetch` from an undeclared table makes that dependency invisible; an `insert` into another table sneaks rows past the FK graph.
+
+The trinity makes this boundary explicit and (optionally) enforced:
+
+- **`self.upstream`** is the only blessed read channel. Reading from `self.upstream[Ancestor]` is provenance-safe by construction.
+- **`self`** (and its Parts) is the only blessed write target.
+- **`strict_provenance=True`** enforces both.
+
+## 1. `Diagram.trace(table_expr)`
+
+### Signature
+
+```python
+@classmethod
+def trace(cls, table_expr: QueryExpression) -> Diagram:
+    ...
+```
+
+| Parameter | Type | Description |
+|---|---|---|
+| `table_expr` | `QueryExpression` | A (possibly restricted) table expression. The trace propagates this restriction upstream through the FK graph. |
+
+Returns a `Diagram` instance whose nodes are the seed and all of its ancestors (across all loaded schemas), each carrying a restriction derived from the seed.
+
+### Behavior
+
+`trace` mirrors `cascade`:
+
+1. Load the dependency graph via `connection.dependencies.load_all_upstream()` (the upstream analog of `load_all_downstream` — discovers all schemas reachable via reverse FK edges from the seed's schema).
+2. Take the seed's restriction and propagate it **upstream** along the FK graph. For each edge `parent → child`, when the child has a restriction, apply the upward rule (`U1`, `U2`, or `U3` per the cascade spec) to derive the parent's restriction.
+3. Trim the resulting graph to **seed + ancestors only**. Descendants of the seed and unrelated ancestors are not included.
+4. Convergence is **OR**: an ancestor entity is included if reachable through *any* FK path from the seed (consistent with how a child row "comes from" any of its FK parents).
+
+The graph traversal is dual to `cascade`'s. `cascade` walks `seed + descendants(seed)`; `trace` walks `seed + ancestors(seed)`. Both apply rules edge-by-edge; both terminate because the dependency graph is a DAG.
+
+### Indexing: `trace[T]`
+
+The primary user-facing operation. Returns the ancestor's table pre-restricted through the FK join path.
+
+```python
+trace = dj.Diagram.trace(MyChild & key)
+session = trace[Session]              # QueryExpression: Session restricted to ancestors of MyChild & key
+session.fetch1("session_date")
+trace[ExtractTraces].to_arrays("trace")
+```
+
+The argument may be:
+
+- **A Table subclass** (e.g. `Session`) — returns the pre-restricted ancestor table as a `QueryExpression`.
+- **A string** giving a table's fully-qualified or class name (e.g. `"experiment.Session"` or `"Session"`) — returns the pre-restricted ancestor as a `FreeTable`.
+
+Requesting a table that is not an ancestor (or the seed itself) of `table_expr` raises `DataJointError`. This is the same guarantee that `cascade` provides for descendants.
+
+### Other methods
+
+| Method | Behavior |
+|---|---|
+| `trace.counts()` | Dict mapping each ancestor's full table name to the count of contributing rows under the seed's restriction. Mirror of `cascade.counts()`. |
+| `trace.heading()` | Aggregated heading across all ancestor tables, grouped by table. Useful for displaying the full provenance schema for a row. |
+| `iter(trace)` | Iterate over pre-restricted ancestor `FreeTable`s in topological order (deepest ancestors first). |
+
+All `QueryExpression` operators (`join`, `aggr`, `proj`, `restrict`, `&`, `*`) work on the values returned by `trace[T]`. The trace is a *view* into the graph; the returned `QueryExpression`s are first-class.
+
+### Examples
+
+```python
+import datajoint as dj
+
+schema = dj.Schema("imaging")
+
+@schema
+class Subject(dj.Manual):
+    definition = """
+    subject_id : int32
+    """
+
+@schema
+class Session(dj.Manual):
+    definition = """
+    -> Subject
+    session_id : int32
+    ---
+    session_date : date
+    """
+
+@schema
+class Scan(dj.Imported):
+    definition = """
+    -> Session
+    scan_id : int32
+    """
+
+@schema
+class ExtractTraces(dj.Computed):
+    definition = """
+    -> Scan
+    ---
+    trace : float64
+    """
+
+@schema
+class Summary(dj.Computed):
+    definition = """
+    -> ExtractTraces
+    ---
+    summary_stat : float64
+    """
+```
+
+Given a populated `Summary` row, the trace reaches every ancestor:
+
+```python
+key = {"subject_id": 1, "session_id": 5, "scan_id": 2}
+trace = dj.Diagram.trace(Summary & key)
+
+trace[Subject].fetch1("subject_id")           # 1
+trace[Session].fetch1("session_date")         # date of session 5
+trace[Scan].fetch1("scan_id")                 # 2
+trace[ExtractTraces].fetch1("trace")          # the trace that produced this Summary
+
+trace.counts()                                # {Subject: 1, Session: 1, Scan: 1, ExtractTraces: 1, Summary: 1}
+```
+
+For a renamed-FK case (paralleling [Cascade Spec §Worked Example 1](cascade.md#example-1-part-of-part-with-renamed-fk)), the upward rules reverse the rename so `trace[Ancestor]` returns the ancestor with its native column names regardless of how the seed's columns are named.
+
+## 2. `self.upstream` inside `make()`
+
+### Where it's set
+
+The framework constructs `self.upstream = Diagram.trace(self & key)` immediately before invoking the user-defined `make(self, key)`. The construction happens once per `make()` call; `self.upstream` is a regular attribute on the bound `self` instance for the duration of the call.
+
+The construction is **lazy** in the sense that the underlying SQL is not issued until the user accesses an ancestor: `self.upstream[Session]` builds a `QueryExpression`, and `.fetch1()` on that expression triggers a SELECT.
+
+### What it returns
+
+`self.upstream[T]` returns the same value `dj.Diagram.trace(self & key)[T]` would return — the ancestor table pre-restricted to the rows that contributed to the current `key`. All `QueryExpression` operators work; the standard fetch API (`fetch`, `fetch1`, `to_arrays`, `to_dicts`, `to_pandas`) is fully supported.
+
+### Allowed table set
+
+`self.upstream` exposes:
+
+- All declared ancestors of `self` (transitively, including renamed-FK chains).
+- The Parts of any ancestor included by the FK graph.
+
+Requesting any table outside this set raises `DataJointError` — including tables that exist in the schema but are not ancestors of `self`. This is the same guarantee `Diagram.trace(...)` provides; `self.upstream` is just the per-`make()` instance of it.
+
+### Examples
+
+Without `strict_provenance` (the default), `self.upstream` is a convenience — the same data is reachable via `(Upstream & key).fetch1(...)`. The win is ergonomic:
+
+```python
+def make(self, key):
+    # All upstream reads pre-restricted to the current key
+    session_date = self.upstream[Session].fetch1("session_date")
+    traces = self.upstream[ExtractTraces].to_arrays("trace")
+    summary = compute(traces, session_date)
+    self.insert1({**key, "summary_stat": summary})
+```
+
+Multi-ancestor reads remain natural:
+
+```python
+def make(self, key):
+    # Aggregate one ancestor restricted by another, all still pre-keyed
+    counts = self.upstream[Session].aggr(
+        self.upstream[Scan], n="count(scan_id)"
+    ).fetch("n", as_dict=True)
+    ...
+```
+
+Part tables of ancestors are accessible:
+
+```python
+def make(self, key):
+    # If an ancestor has Parts, they appear in the trace too
+    annotations = self.upstream[Session.Annotation].fetch(as_dict=True)
+    ...
+```
+
+`make_kwargs` and `key` are unaffected. Existing `make()` implementations that never reference `self.upstream` behave identically to today.
+
+## 3. `dj.config["strict_provenance"]`
+
+### Config key
+
+| Key | Type | Default | Description |
+|---|---|---|---|
+| `strict_provenance` | `bool` | `False` | When `True`, enforces upstream-only reads and target-only writes inside `make()`. |
+
+The flag is read at the start of each `make()` invocation. Changing it between invocations is supported; concurrent populates with different flag values are not (the value seen depends on thread-local context — see Implementation notes).
+
+### Read enforcement
+
+When `strict_provenance=True` and a `make()` is executing:
+
+| Read attempt | Outcome |
+|---|---|
+| `self.upstream[Ancestor].fetch(...)` (or any other fetch method) | **Allowed** |
+| `self.upstream[Ancestor] & condition` then fetch | **Allowed** |
+| `(SomeTable & key).fetch(...)` where `SomeTable` is an ancestor | **Blocked** — raises `DataJointError`. The user must go through `self.upstream`. |
+| `(SomeTable & ...).fetch(...)` where `SomeTable` is **not** an ancestor of `self` | **Blocked** |
+| Reading `self` or `self.PartName` | **Allowed** — the target's own state may be inspected mid-`make()`. |
+
+The check is at fetch time, not at table-expression construction time. Building a `QueryExpression` is free; executing it (calling `fetch`, `fetch1`, `to_arrays`, etc.) is gated.
+
+### Write enforcement
+
+When `strict_provenance=True` and a `make()` is executing:
+
+| Write attempt | Outcome |
+|---|---|
+| `self.insert1(row)` | **Allowed** (with key-consistency check, see below) |
+| `self.PartName.insert(rows)` | **Allowed** (with key-consistency check) |
+| `SomeOtherTable.insert(...)` | **Blocked** — raises `DataJointError`. |
+| Reading from `self.upstream[Ancestor]` and writing to `Ancestor` | **Blocked** — `Ancestor` is not `self`. |
+
+### Key consistency
+
+Every row inserted into `self` or `self`'s Parts must have a primary key consistent with the current `key`:
+
+- For attributes that overlap with `key` (i.e. `self`'s primary key attributes that the AutoPopulate framework provided), the row's values must equal `key`'s values.
+- For Part-specific PK attributes (additional columns beyond `key`), any value is allowed.
+
+This prevents a `make()` from inserting rows for entities it wasn't called with — closing the last loophole where one entity's `make()` could silently produce data attributed to another.
+
+### Default behavior
+
+`strict_provenance=False` (the default) leaves all `make()` behavior unchanged. No deprecation warning, no soft enforcement, no compatibility shim. Existing pipelines run identically.
+
+### Why opt-in
+
+`strict_provenance` is an **operational** flag — a property of how a deployment runs, not of the schema definition. Teams may want it on in production (where provenance guarantees back downstream claims) and off during development (where ad-hoc `fetch`/`insert` from `make()` is a useful debugging affordance). Opting in is a deliberate deployment choice. Forcing it on would break existing pipelines without giving teams the tools to migrate incrementally.
+
+A future major release may flip the default; that decision is out of scope for 2.3.
+
+### Examples
+
+A `make()` that complies with strict provenance:
+
+```python
+def make(self, key):
+    # All reads through self.upstream
+    rate = self.upstream[Recording].fetch1("sampling_rate")
+    samples = self.upstream[Recording].to_arrays("signal")
+
+    # Compute
+    spectrum = compute_spectrum(samples, rate)
+
+    # Write to self only, with key
+    self.insert1({**key, "spectrum": spectrum})
+
+    # Part write — also scoped to current key
+    for bin_id, energy in enumerate(spectrum):
+        self.Bin.insert1({**key, "bin_id": bin_id, "energy": float(energy)})
+```
+
+A `make()` that **violates** strict provenance:
+
+```python
+def make(self, key):
+    # ERROR: reading a table not in self.upstream
+    rate = (Recording & key).fetch1("sampling_rate")
+    # → DataJointError: Recording is reachable but read outside self.upstream
+    #   (strict_provenance=True disallows direct reads from ancestor tables).
+
+    # ERROR: writing to a table that isn't self or self's Parts
+    AuditLog.insert1({"event": "populated_spectrum"})
+    # → DataJointError: AuditLog is not self or one of self's Parts.
+
+    # ERROR: writing a row whose key doesn't match the current key
+    self.insert1({"subject_id": 99, "spectrum": ...})
+    # → DataJointError: row primary key {'subject_id': 99} does not match
+    #   the current key {'subject_id': key['subject_id'], ...}.
+```
+
+## Integration: the full provenance loop
+
+The trinity composes into a single guarantee:
+
+1. `Diagram.trace()` defines the upstream view as a first-class query object.
+2. `self.upstream` exposes the per-`key` slice of that view inside `make()`.
+3. `strict_provenance` ensures `make()` reads only from that slice and writes only to `self`.
+
+End-to-end, with all three engaged:
+
+```python
+import datajoint as dj
+dj.config["strict_provenance"] = True
+
+@schema
+class Spectrum(dj.Computed):
+    definition = """
+    -> Recording
+    ---
+    spectrum : longblob
+    """
+
+    class Bin(dj.Part):
+        definition = """
+        -> master
+        bin_id : int32
+        ---
+        energy : float64
+        """
+
+    def make(self, key):
+        # Pure provenance-safe reads
+        rate = self.upstream[Recording].fetch1("sampling_rate")
+        samples = self.upstream[Recording].to_arrays("signal")
+
+        # Compute
+        spectrum = compute_spectrum(samples, rate)
+
+        # Scoped writes
+        self.insert1({**key, "spectrum": spectrum})
+        self.Bin.insert([
+            {**key, "bin_id": i, "energy": float(e)} for i, e in enumerate(spectrum)
+        ])
+```
+
+Properties guaranteed in strict mode:
+
+- Every column in `Spectrum.fetch1(key)` is derived from data accessible via `Diagram.trace(Spectrum & key)`. No undeclared dependencies.
+- Every row in `Spectrum` and `Spectrum.Bin` was inserted by a `make()` whose `key` matched. No misattributed rows.
+- Downstream provenance tooling — row-level lineage views, CDC publishers, audit logs — can rely on these properties statically rather than as best-effort.
+
+## Migration path
+
+Teams adopt the trinity incrementally:
+
+1. **Upgrade to 2.3.** The new APIs are available; `strict_provenance` defaults to off. Existing code is unaffected.
+2. **Start using `self.upstream` in new `make()` implementations.** Reads become ergonomic; nothing else changes.
+3. **Migrate existing `make()` implementations** as opportunity allows. Replace `(Upstream & key).fetch(...)` with `self.upstream[Upstream].fetch(...)`. No semantic difference at this stage.
+4. **Enable `strict_provenance=True` in staging.** Identify and fix any undeclared dependencies the runtime check surfaces.
+5. **Enable in production.** Provenance guarantees now hold by construction.
+
+Steps (3) and (4) are where most of the work lies — the runtime check is the mechanism that surfaces the actual undeclared dependencies in a pipeline. Teams with clean conventions will find few violations; teams with debugging fetches scattered through `make()` will find more. Either way, the cost is bounded by the size of the existing `make()` surface.
+
+## What is not in this specification
+
+- **Static analysis of `make()`**. The runtime check is the only enforcement mechanism in 2.3; analyzing `make()` bodies offline to flag potential violations is future work.
+- **Flipping the default to `True`**. Whether (and when) `strict_provenance` becomes the default is a separate decision for a future major release.
+- **Row-level provenance metadata in storage**. The trinity provides the *graph operation* and the *enforcement*; persisting per-row provenance projections (the dj-delta-style silver-layer features) is a downstream consumer concern, tracked separately.
+
+## References
+
+- Source (to land in 2.3, against this spec): `src/datajoint/diagram.py` (`Diagram.trace`), `src/datajoint/autopopulate.py` (`AutoPopulate.upstream`), `src/datajoint/settings.py` (config), `src/datajoint/expression.py` and `src/datajoint/table.py` (runtime gates).
+- Issues: [#1423](https://github.com/datajoint/datajoint-python/issues/1423) (Diagram.trace), [#1424](https://github.com/datajoint/datajoint-python/issues/1424) (self.upstream), [#1425](https://github.com/datajoint/datajoint-python/issues/1425) (strict_provenance).
+- [Cascade Specification](cascade.md) — propagation rules (F1/F2/F3 forward, U1/U2/U3 upward) shared with `trace`.
+- [AutoPopulate Specification](autopopulate.md) — `make()` execution model.
+- [Diagram Specification](diagram.md) — graph operations on the dependency graph.
+- [Entity Integrity](../../explanation/entity-integrity.md) — schema dimensions and FK semantics.