289 changes: 177 additions & 112 deletions src/explanation/relational-workflow-model.md
@@ -1,135 +1,187 @@
# The Relational Workflow Model

The relational data model has historically been interpreted through two
conceptual frameworks: Codd's mathematical foundation, which views tables as
logical predicates, and Chen's Entity-Relationship Model, which views tables
as entity types and relationships. The relational workflow model introduces a
third paradigm: **tables represent workflow steps, rows represent workflow
artifacts, and foreign key dependencies prescribe execution order.** This
adds an operational dimension absent from both predecessors—the schema
specifies not only what data exists but how it is derived.

The relational workflow model and its technical innovations are formally
defined in [Yatsenko & Nguyen, 2026](https://arxiv.org/abs/2602.16585).
DataJoint's schema definition language and query algebra were first
formalized in [Yatsenko et al., 2018](https://doi.org/10.48550/arXiv.1807.11104).

## Three Paradigms Compared
The relational model has historically admitted two interpretations. Codd's
mathematical foundation (1970) views tables as logical predicates and rows
as true propositions — rigorous but abstract. Chen's Entity-Relationship
Model (1976) views tables as entity types or relationships — intuitive for
domain modeling, but silent on how entities come into being. The
**Relational Workflow Model** introduces a third interpretation: tables
represent workflow steps, rows represent workflow artifacts, and foreign
keys prescribe execution order. The schema specifies not only *what* data
exists but *how* it is derived — a single formal system in which data
structure, computational dependencies, and integrity constraints are all
queryable, enforceable, and machine-readable.

This unification is what makes DataJoint a *computational substrate* rather
than a database in the conventional sense. Each surrounding category of
tools is good at part of the problem and silent on the rest. File-based
workflow systems (CWL, Snakemake, Nextflow) offer flexibility but fragment
provenance across the filesystem and configuration. Task-centric
orchestrators (Airflow, Argo, Prefect) manage execution but remain agnostic
to data structure. Data catalogs (DataHub, Atlan, Marquez) describe data
after it lands. Lakehouses (Delta, Iceberg, Hudi) optimize analytical
queries but treat computation as external. The Relational Workflow Model
is the deliberate trade-off: framework commitment in exchange for one
formal system that addresses all four concerns at once.

## Three interpretations of the relational model

| Aspect | Mathematical (Codd) | Entity-Relationship (Chen) | **Relational Workflow (DataJoint)** |
|--------|---------------------|----------------------------|-------------------------------------|
| **Core question** | What functional dependencies exist? | What entity types exist? | **When/how are entities created?** |
| **Core question** | What functional dependencies exist? | What entity types exist? | **When and how are entities created?** |
| **Table semantics** | Logical predicate | Entity or relationship | **Workflow step** |
| **Row semantics** | True proposition | Entity instance | **Workflow artifact** |
| **Foreign keys** | Referential integrity | Relationship | **Execution order** |
| **Computation** | Not addressed | Not addressed | **Declared in schema** |
| **Provenance** | Not addressed | Not addressed | **Structural** |
| **Implementation gap** | High | High | **None** |

### Codd's Mathematical Foundation
## Four shifts from the classical relational model

Codd's mathematical foundation views tables as logical predicates and rows as
true propositions—rigorous but abstract.
- **Tables represent workflow steps**, not merely categories of records.
- **Rows represent workflow artifacts**, each with provenance to its inputs.
- **Foreign keys prescribe execution order**, not only referential integrity — the dependency graph *is* the pipeline DAG, enforced by the database.
- **Computed and Imported tables carry their own `make()` methods**, declaring derivation logic in the schema itself, not in an external workflow file.

### Chen's Entity-Relationship Model
The schema is therefore *active*, not passive. A row exists in a Computed
table if and only if its upstream key exists, its `make()` has run, and its
result satisfies the declared constraints. The schema is the executable
specification of the work.
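
The existence condition above can be sketched in plain Python. This is a toy model under stated assumptions, not DataJoint's API: `upstream`, `computed`, `make`, and `populate` are hypothetical names, and the tables are stood in for by a set of keys and a dict. It illustrates that the pending work is the set of upstream keys with no result yet, so a second run has nothing left to do.

```python
# Toy model of a Computed table; none of these names are DataJoint's API.
upstream = {("mouse1", 1), ("mouse1", 2), ("mouse2", 1)}  # keys of the upstream table
computed = {}  # stand-in for the Computed table: key -> result


def make(key):
    """Derive one result from one upstream key (placeholder computation)."""
    mouse, session = key
    return f"{mouse}-s{session}-result"


def populate():
    """Run make() for every upstream key that has no result yet."""
    pending = sorted(upstream - set(computed))
    for key in pending:
        computed[key] = make(key)
    return len(pending)


first = populate()   # all three keys are pending
second = populate()  # nothing is pending, so nothing is recomputed
```

In this sketch a result can only be created for a key that already exists upstream, which is the structural guarantee the paragraph describes.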

Chen's Entity-Relationship Model shifted focus to domain modeling with
entities, attributes, and relationships—more intuitive, but lacking any
workflow or computational dimension.
## A worked example

## Core Concepts

### Workflow Steps and Artifacts
```mermaid
graph TD
Mouse["Mouse<br/><i>Manual</i>"]:::manual
Session["Session<br/><i>Manual</i>"]:::manual
Scan["Scan<br/><i>Manual</i>"]:::manual
SegParam["SegmentationParam<br/><i>Lookup</i>"]:::lookup
AvgFrame["AverageFrame<br/><i>Imported</i> &mdash; make()"]:::imported
Segmentation["Segmentation<br/><i>Computed</i> &mdash; make()"]:::computed
Fluorescence["Fluorescence<br/><i>Imported</i> &mdash; make()"]:::imported

Mouse --> Session --> Scan --> AvgFrame --> Segmentation --> Fluorescence
SegParam --> Segmentation

classDef manual fill:#c8e6c9,stroke:#2e7d32,color:#1b5e20;
classDef lookup fill:#e0e0e0,stroke:#616161,color:#212121;
classDef imported fill:#bbdefb,stroke:#1565c0,color:#0d47a1;
classDef computed fill:#ffcdd2,stroke:#c62828,color:#b71c1c;
```

Tables are classified into tiers by data entry mode:
`Mouse`, `Session`, and `Scan` are **Manual** tables entered by the
experimenter. `SegmentationParam` is a **Lookup** table holding reference
parameter sets. `AverageFrame` is **Imported** — its `make()` reads the
TIFF identified by `Scan` and stores the mean fluorescence frame.
`Segmentation` is **Computed** — its primary key fans in from both
`AverageFrame` and `SegmentationParam`, so every average frame is
segmented with every parameter set automatically. `Fluorescence` then
extracts per-ROI time-series traces from each segmentation. No external
scheduler is consulted: the foreign-key graph dictates what may run, what
must run first, and what already exists. The pipeline DAG and the database
schema are the same object.
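
As a rough illustration of how a foreign-key graph alone determines execution order, the diagram can be transcribed into a dependency map and sorted with the standard library. This is a sketch, not how DataJoint schedules work internally; the `depends_on` map and the sample key values are assumptions made for the example.

```python
from graphlib import TopologicalSorter
from itertools import product

# Each table maps to the set of tables its foreign keys reference,
# transcribed from the diagram above.
depends_on = {
    "Mouse": set(),
    "Session": {"Mouse"},
    "Scan": {"Session"},
    "SegmentationParam": set(),
    "AverageFrame": {"Scan"},
    "Segmentation": {"AverageFrame", "SegmentationParam"},
    "Fluorescence": {"Segmentation"},
}

# Any topological order of the foreign-key graph is a valid execution order.
order = list(TopologicalSorter(depends_on).static_order())

# Fan-in: Segmentation's keys pair every average frame with every parameter set.
# The key values below are made up for illustration.
average_frames = [("mouse1", 1, 1), ("mouse1", 1, 2)]  # (mouse, session, scan)
param_sets = [0, 1, 2]                                  # parameter-set ids
segmentation_keys = [
    frame + (param,) for frame, param in product(average_frames, param_sets)
]
```

Two average frames and three parameter sets yield six `Segmentation` keys, and every table appears in `order` only after the tables it references.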

## The deliberate trade-off

Decoupled architectures have legitimate advantages. File-based workflow
systems optimize for portability — any tool that reads files works.
Orchestrators evolve independently of the data model. Lakehouses give
analytics teams a layer that doesn't bind them to upstream pipeline
choices. These are the right trade-offs for many use cases.

DataJoint accepts tighter coupling deliberately. The cost is framework
commitment. The benefit is one system that knows the data structure, the
data, the computation that produced it, the dependencies between
computations, and the integrity constraints that govern all of it.
Everything an analyst, an engineer, or an AI agent might ask about the
work — *what is this, where did it come from, what depends on it, what
must hold for it to be valid, what would change if I touched the input* —
is answerable by query against a single formal model. For scientific
workflows where the data and the computation cannot be cleanly separated
without losing the science, this is the right trade-off.

## Substrate consequences

Because dependencies are declared before any computation runs, provenance
and lineage become **properties of the substrate**, not artifacts assembled
after the fact. Every row in `Segmentation` is reachable by foreign key
from the exact `AverageFrame` and `SegmentationParam` that produced it;
cascade deletes remove dependent results when their inputs become invalid.
Reproducibility is structural rather than retrofitted by audit: a computed
result cannot exist without its upstream entities, and the declared types
and constraints must hold. The model enforces what other systems merely
log. The lineage graph is already in the schema; mapping it to external
standards such as W3C PROV or OpenLineage is a translation, not a
reconstruction.
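
To make the lineage claim concrete, here is a minimal sketch of answering "what depends on this table?" by walking the foreign-key graph of the worked example. The `children` map and the `downstream` helper are hypothetical illustrations, not DataJoint functions; a cascade delete would need to visit the same set of tables.

```python
from collections import deque

# Table -> tables that reference it by foreign key (from the worked example).
children = {
    "Mouse": ["Session"],
    "Session": ["Scan"],
    "Scan": ["AverageFrame"],
    "SegmentationParam": ["Segmentation"],
    "AverageFrame": ["Segmentation"],
    "Segmentation": ["Fluorescence"],
    "Fluorescence": [],
}


def downstream(table):
    """Return every table reachable from `table` by following foreign keys."""
    seen, queue = [], deque(children[table])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.append(current)
            queue.extend(children[current])
    return seen


affected_by_scan = downstream("Scan")
affected_by_params = downstream("SegmentationParam")
```

Because the graph is declared up front, this traversal needs no logs or run history: the schema alone says which results become invalid when an input changes.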

The same property makes the schema a shared contract between humans and
the machines that increasingly collaborate with them. The schema is
**self-describing**: an agent can introspect table structure, dependencies,
and state programmatically. Operations are **safe by default**: invalid
joins, type mismatches, and referential violations fail cleanly rather
than corrupting data silently. The dependency graph is **explicit**:
agents can reason about execution order without relying on implicit knowledge. Core
operations are **idempotent**: retrying after a failure produces no additional side effects.
And all state — job status, computation progress, errors — is
**queryable**, so the work is observable as it happens. These are the
properties that let agents participate in scientific workflows with the
same transactional guarantees that protect human-initiated work.

## Beneath the model

The remaining sections detail the structural elements that make the model
work in practice.

### Workflow steps and table tiers

Tables are classified into tiers by data-entry mode:

| Tier | Role | `make()` |
|------|------|----------|
| **Manual** | Receive direct user entry | No |
| **Lookup** | Hold reference data | No |
| **Imported** | Reach out to data sources outside the DataJoint system (instruments, electronic lab notebooks, external databases) | Yes |
| **Imported** | Reach out to data sources outside DataJoint (instruments, ELNs, external databases) | Yes |
| **Computed** | Derive their contents entirely from upstream DataJoint tables | Yes |

Imported and Computed tables define computations via `make()` methods. The
`make()` method specifies how each entity is derived—this computation logic is
declared within the table definition, making it part of the schema itself
rather than an external workflow specification.

### Dependencies as Foreign Keys

Foreign keys define computational dependencies, not only referential integrity.
The dependency graph is explicit, queryable, and enforced by the database.
`make()` method specifies how each entity is derived — declared within the
table definition, not in an external workflow file.

```mermaid
graph LR
A[Session] --> B[Scan]
B --> C[Segmentation]
C --> D[Analysis]
```

### Master-Part Relationships
### Master-part relationships

Master-part relationships declare transactional grouping directly in the
schema: the master table represents the workflow step, while part tables hold
the individual items. Insertions and deletions cascade as a unit, enforcing
transactional semantics without application code.

### Directed Acyclic Graph

Dependencies between tables form a directed acyclic graph (DAG); aggregated
dependencies between schemas likewise form a DAG. Unlike task DAGs in
workflow managers, these are *relational schema* DAGs—they define data
structure and relationships, not just execution steps.

## Active Schemas

The key distinction from classical models: traditional schemas are
*passive*—containers for data produced by external processes. In the
relational workflow model, the schema is *active*—Computed tables declare how
their contents are derived, making the schema itself the workflow
specification. Schemas are defined as Python classes, and entire pipelines are
organized as self-contained code repositories—version-controlled, testable,
and deployable using standard software engineering practices.

A useful analogy: electronic spreadsheets unified data and computation—cells
with values alongside cells with formulas. Yet this integration never
penetrated relational databases in their 50+ years of history. The relational
workflow model brings to databases what spreadsheets brought to tabular
calculation: the recognition that data and the computations that produce it
belong together. The analogy has limits: spreadsheets' coupling is also the
source of their well-known fragility. DataJoint addresses this through formal
schema constraints and explicit dependency declaration rather than ad-hoc cell
references.

## Workflow Normalization

> **"Every table represents an entity type created at a specific workflow
> step, and all attributes describe that entity as it exists at that step."**

Database normalization decomposes data into tables to eliminate redundancy.
Classical normalization theory achieves this through normal forms based on
functional dependencies. Entity normalization asks whether each attribute
describes the entity identified by the primary key. Workflow normalization
extends these principles with a temporal dimension.

A Session table contains attributes known when the session is entered (date,
experimenter, subject). Analysis parameters determined later belong in
Computed tables that depend on Session. This discipline prevents tables that
accumulate attributes from different workflow stages, obscuring provenance and
schema. The master table represents the workflow step; part tables hold
the items produced together. Insertions and deletions cascade as a unit,
enforcing transactional semantics without application code.
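
The "as a unit" behaviour can be illustrated by analogy with SQLite from the Python standard library. This is not DataJoint's mechanism, which may work differently; the `segmentation` and `segmentation_roi` tables and the `insert_segmentation` helper are hypothetical. The sketch shows a master row and its part rows being committed together, rolled back together, and deleted together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE segmentation (scan_id INTEGER PRIMARY KEY)")
conn.execute(
    """CREATE TABLE segmentation_roi (
           scan_id INTEGER REFERENCES segmentation ON DELETE CASCADE,
           roi INTEGER,
           PRIMARY KEY (scan_id, roi))"""
)


def insert_segmentation(scan_id, rois):
    """Insert the master row and all of its part rows in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("INSERT INTO segmentation VALUES (?)", (scan_id,))
        conn.executemany(
            "INSERT INTO segmentation_roi VALUES (?, ?)",
            [(scan_id, roi) for roi in rois],
        )


insert_segmentation(1, [0, 1, 2])
try:
    insert_segmentation(2, [0, 0])  # duplicate part key violates the primary key
except sqlite3.IntegrityError:
    pass  # the master row for scan 2 is rolled back together with its parts

masters = conn.execute("SELECT COUNT(*) FROM segmentation").fetchone()[0]
with conn:
    conn.execute("DELETE FROM segmentation WHERE scan_id = 1")
parts_left = conn.execute("SELECT COUNT(*) FROM segmentation_roi").fetchone()[0]
```

The failed insert leaves no orphaned master row, and deleting the remaining master removes its parts, so no partially written step is ever visible.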

### Workflow normalization

> "Every table represents an entity type created at a specific workflow
> step, and all attributes describe that entity as it exists at that
> step."

Classical normalization theory decomposes tables to eliminate redundancy
through normal forms based on functional dependencies. Entity normalization
asks whether each attribute describes the entity identified by the primary
key. **Workflow normalization** extends these principles with a temporal
dimension: each table's attributes must describe its entity *as it exists
at the workflow step the table represents*. A `Session` table holds
attributes known when the session is entered (date, experimenter,
subject); analysis parameters determined later belong in Computed tables
that depend on `Session`. This discipline keeps tables from accumulating
attributes from different workflow stages, which would obscure provenance
and complicate updates.

## Entity Integrity
### Entity integrity

All data is represented as well-formed entity sets with primary keys
identifying each entity uniquely. This eliminates redundancy and ensures
consistent updates.
identifying each entity uniquely. When upstream data is deleted, dependent
results cascade-delete automatically — including associated objects in
external storage. To correct errors, you delete, reinsert, and recompute,
ensuring every result represents a consistent computation from valid
inputs.

When upstream data is deleted, dependent results cascade-delete
automatically—including associated objects in external storage. To correct
errors, you delete, reinsert, and recompute, ensuring every result represents
a consistent computation from valid inputs.

## Query Algebra
### Query algebra and algebraic closure

DataJoint provides a five-operator algebra:

@@ -143,14 +195,14 @@ DataJoint provides a five-operator algebra:

The algebra achieves *algebraic closure*: every operator produces a valid
entity set with a well-defined primary key, enabling unlimited composition.
This preservation of entity integrity (every query result is itself a proper
entity set with clear identity) distinguishes DataJoint's algebra from SQL,
where query results lack both a well-defined primary key and a clear entity
type.
This preservation of entity integrity (every query result is itself a
proper entity set with clear identity) distinguishes DataJoint's algebra
from SQL, where query results lack both a well-defined primary key and a
clear entity type.
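
Algebraic closure can be illustrated with a toy relation class in plain Python. This is a sketch, not DataJoint's implementation: `EntitySet`, `restrict`, `join`, and `has_valid_key` are hypothetical names, and the rule that a join's primary key is the union of its operands' keys is a simplifying assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class EntitySet:
    """A toy relation: rows as dicts plus the attributes forming the primary key."""
    primary_key: tuple
    rows: list

    def restrict(self, **condition):
        """Keep rows matching the condition; the primary key is unchanged."""
        kept = [r for r in self.rows
                if all(r[k] == v for k, v in condition.items())]
        return EntitySet(self.primary_key, kept)

    def join(self, other):
        """Natural join on shared attribute names; keys are combined (assumed rule)."""
        merged = [
            {**a, **b}
            for a in self.rows
            for b in other.rows
            if all(a[k] == b[k] for k in a.keys() & b.keys())
        ]
        extra = tuple(k for k in other.primary_key if k not in self.primary_key)
        return EntitySet(self.primary_key + extra, merged)

    def has_valid_key(self):
        """True if the primary key identifies every row uniquely."""
        keys = [tuple(r[k] for k in self.primary_key) for r in self.rows]
        return len(keys) == len(set(keys))


scans = EntitySet(("mouse", "scan"), [
    {"mouse": "m1", "scan": 1, "depth": 100},
    {"mouse": "m1", "scan": 2, "depth": 150},
])
params = EntitySet(("param_id",), [
    {"param_id": 0, "threshold": 0.5},
    {"param_id": 1, "threshold": 0.8},
])

# Each operator returns another EntitySet, so operators compose freely.
result = scans.join(params).restrict(param_id=1)
```

Every intermediate value here is an `EntitySet` with a declared primary key, which is what allows `join` and `restrict` to be chained in any order.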

## From Transactions to Transformations
## From transactions to transformations

| Traditional View | Workflow View |
| Traditional view | Workflow view |
|------------------|---------------|
| Tables store data | Tables represent workflow steps |
| Rows are records | Rows are workflow artifacts |
@@ -159,10 +211,23 @@ type.
| Schemas organize storage | Schemas specify pipelines |
| Queries retrieve data | Queries trace provenance |

## Summary

The relational workflow model offers a new way to understand relational
databases—not merely as storage systems but as computational substrates. By
interpreting tables as workflow steps and foreign keys as execution
dependencies, the schema becomes a complete specification of how data is
derived, not just what data exists.
## Further reading

The Relational Workflow Model and its technical innovations are formally
defined in [Yatsenko & Nguyen, 2026](https://arxiv.org/abs/2602.16585),
which also introduces the further substrate elements that build on it:
object-augmented schemas, semantic matching by attribute lineage, an
extensible type system, and distributed job coordination. DataJoint's
schema definition language and query algebra were first formalized in
[Yatsenko et al., 2018](https://doi.org/10.48550/arXiv.1807.11104).

### See also

- [Data Pipelines](data-pipelines.md) — table tiers, schema organization, and the DAG in practice
- [Computation Model](computation-model.md) — the `make()` contract, `populate()`, and the key source
- [Entity Integrity](entity-integrity.md) — primary keys and the three questions every table answers
- [Normalization](normalization.md) — entity normalization extended with a temporal dimension
- [Query Algebra](query-algebra.md) — the five-operator algebra with algebraic closure
- [Semantic Matching](semantic-matching.md) — lineage-based join resolution
- [Type System](type-system.md) — extensible types with pluggable codecs
- [Define Tables](../how-to/define-tables.md) and [Run Computations](../how-to/run-computations.md) — declaring steps and executing them