diff --git a/src/explanation/relational-workflow-model.md b/src/explanation/relational-workflow-model.md index e3315367..a7b94da7 100644 --- a/src/explanation/relational-workflow-model.md +++ b/src/explanation/relational-workflow-model.md @@ -1,24 +1,34 @@ # The Relational Workflow Model -The relational data model has historically been interpreted through two -conceptual frameworks: Codd's mathematical foundation, which views tables as -logical predicates, and Chen's Entity-Relationship Model, which views tables -as entity types and relationships. The relational workflow model introduces a -third paradigm: **tables represent workflow steps, rows represent workflow -artifacts, and foreign key dependencies prescribe execution order.** This -adds an operational dimension absent from both predecessors—the schema -specifies not only what data exists but how it is derived. - -The relational workflow model and its technical innovations are formally -defined in [Yatsenko & Nguyen, 2026](https://arxiv.org/abs/2602.16585). -DataJoint's schema definition language and query algebra were first -formalized in [Yatsenko et al., 2018](https://doi.org/10.48550/arXiv.1807.11104). - -## Three Paradigms Compared +The relational model has historically admitted two interpretations. Codd's +mathematical foundation (1970) views tables as logical predicates and rows +as true propositions — rigorous but abstract. Chen's Entity-Relationship +Model (1976) views tables as entity types or relationships — intuitive for +domain modeling, but silent on how entities come into being. The +**Relational Workflow Model** introduces a third interpretation: tables +represent workflow steps, rows represent workflow artifacts, and foreign +keys prescribe execution order. The schema specifies not only *what* data +exists but *how* it is derived — a single formal system in which data +structure, computational dependencies, and integrity constraints are all +queryable, enforceable, and machine-readable. 
+ +This unification is what makes DataJoint a *computational substrate* rather +than a database in the conventional sense. Each surrounding category of +tools is good at part of the problem and silent on the rest. File-based +workflow systems (CWL, Snakemake, Nextflow) offer flexibility but fragment +provenance across the filesystem and configuration. Task-centric +orchestrators (Airflow, Argo, Prefect) manage execution but remain agnostic +to data structure. Data catalogs (DataHub, Atlan, Marquez) describe data +after it lands. Lakehouses (Delta, Iceberg, Hudi) optimize analytical +queries but treat computation as external. The Relational Workflow Model +is the deliberate trade-off: framework commitment in exchange for one +formal system that addresses all four concerns at once. + +## Three interpretations of the relational model | Aspect | Mathematical (Codd) | Entity-Relationship (Chen) | **Relational Workflow (DataJoint)** | |--------|---------------------|----------------------------|-------------------------------------| -| **Core question** | What functional dependencies exist? | What entity types exist? | **When/how are entities created?** | +| **Core question** | What functional dependencies exist? | What entity types exist? | **When and how are entities created?** | | **Table semantics** | Logical predicate | Entity or relationship | **Workflow step** | | **Row semantics** | True proposition | Entity instance | **Workflow artifact** | | **Foreign keys** | Referential integrity | Relationship | **Execution order** | @@ -26,110 +36,152 @@ formalized in [Yatsenko et al., 2018](https://doi.org/10.48550/arXiv.1807.11104) | **Provenance** | Not addressed | Not addressed | **Structural** | | **Implementation gap** | High | High | **None** | -### Codd's Mathematical Foundation +## Four shifts from the classical relational model -Codd's mathematical foundation views tables as logical predicates and rows as -true propositions—rigorous but abstract. 
+- **Tables represent workflow steps**, not merely categories of records. +- **Rows represent workflow artifacts**, each with provenance to its inputs. +- **Foreign keys prescribe execution order**, not only referential integrity — the dependency graph *is* the pipeline DAG, enforced by the database. +- **Computed and Imported tables carry their own `make()` methods**, declaring derivation logic in the schema itself, not in an external workflow file. -### Chen's Entity-Relationship Model +The schema is therefore *active*, not passive. A row exists in a Computed +table if and only if its upstream key exists, its `make()` has run, and its +result satisfies the declared constraints. The schema is the executable +specification of the work. -Chen's Entity-Relationship Model shifted focus to domain modeling with -entities, attributes, and relationships—more intuitive, but lacking any -workflow or computational dimension. +## A worked example -## Core Concepts - -### Workflow Steps and Artifacts
+```mermaid
+graph TD
+    Mouse["Mouse<br/>Manual"]:::manual
+    Session["Session<br/>Manual"]:::manual
+    Scan["Scan<br/>Manual"]:::manual
+    SegParam["SegmentationParam<br/>Lookup"]:::lookup
+    AvgFrame["AverageFrame<br/>Imported — make()"]:::imported
+    Segmentation["Segmentation<br/>Computed — make()"]:::computed
+    Fluorescence["Fluorescence<br/>Imported — make()"]:::imported
+
+    Mouse --> Session --> Scan --> AvgFrame --> Segmentation --> Fluorescence
+    SegParam --> Segmentation
+
+    classDef manual fill:#c8e6c9,stroke:#2e7d32,color:#1b5e20;
+    classDef lookup fill:#e0e0e0,stroke:#616161,color:#212121;
+    classDef imported fill:#bbdefb,stroke:#1565c0,color:#0d47a1;
+    classDef computed fill:#ffcdd2,stroke:#c62828,color:#b71c1c;
+```
-Tables are classified into tiers by data entry mode: +`Mouse`, `Session`, and `Scan` are **Manual** tables entered by the +experimenter. `SegmentationParam` is a **Lookup** table holding reference +parameter sets. `AverageFrame` is **Imported** — its `make()` reads the +TIFF identified by `Scan` and stores the mean fluorescence frame. +`Segmentation` is **Computed** — its primary key fans in from both +`AverageFrame` and `SegmentationParam`, so every average frame is +segmented with every parameter set automatically. `Fluorescence` then +extracts per-ROI time-series traces from each segmentation. No external +scheduler is consulted: the foreign-key graph dictates what may run, what +must run first, and what already exists. The pipeline DAG and the database +schema are the same object. + +## The deliberate trade-off + +Decoupled architectures have legitimate advantages. File-based workflow +systems optimize for portability — any tool that reads files works. +Orchestrators evolve independently of the data model. Lakehouses give +analytics teams a layer that doesn't bind them to upstream pipeline +choices. These are the right trade-offs for many use cases. + +DataJoint accepts tighter coupling deliberately. The cost is framework +commitment. The benefit is one system that knows the data structure, the +data, the computation that produced it, the dependencies between +computations, and the integrity constraints that govern all of it.
+Everything an analyst, an engineer, or an AI agent might ask about the +work — *what is this, where did it come from, what depends on it, what +must hold for it to be valid, what would change if I touched the input* — +is answerable by query against a single formal model. For scientific +workflows where the data and the computation cannot be cleanly separated +without losing the science, this is the right trade-off. + +## Substrate consequences + +Because dependencies are declared before any computation runs, provenance +and lineage become **properties of the substrate**, not artifacts assembled +after the fact. Every row in `Segmentation` is reachable by foreign key +from the exact `AverageFrame` and `SegmentationParam` that produced it; +cascade deletes remove dependent results when their inputs become invalid. +Reproducibility is structural rather than retrofitted by audit: a computed +result cannot exist without its upstream entities, and the declared types +and constraints must hold. The model enforces what other systems merely +log. The lineage graph is already in the schema; mapping it to external +standards such as W3C PROV or OpenLineage is a translation, not a +reconstruction. + +The same property makes the schema a shared contract between humans and +the machines that increasingly collaborate with them. The schema is +**self-describing**: an agent can introspect table structure, dependencies, +and state programmatically. Operations are **safe by default**: invalid +joins, type mismatches, and referential violations fail cleanly rather +than corrupting data silently. The dependency graph is **explicit**: +agents reason about execution order without implicit knowledge. Core +operations are **idempotent**: retries on failure are without side effects. +And all state — job status, computation progress, errors — is +**queryable**, so the work is observable as it happens. 
These are the +properties that let agents participate in scientific workflows with the +same transactional guarantees that protect human-initiated work. + +## Beneath the model + +The remaining sections detail the structural elements that make the model +work in practice. + +### Workflow steps and table tiers + +Tables are classified into tiers by data-entry mode: | Tier | Role | `make()` | |------|------|----------| | **Manual** | Receive direct user entry | No | | **Lookup** | Hold reference data | No | -| **Imported** | Reach out to data sources outside the DataJoint system (instruments, electronic lab notebooks, external databases) | Yes | +| **Imported** | Reach out to data sources outside DataJoint (instruments, ELNs, external databases) | Yes | | **Computed** | Derive their contents entirely from upstream DataJoint tables | Yes | Imported and Computed tables define computations via `make()` methods. The -`make()` method specifies how each entity is derived—this computation logic is -declared within the table definition, making it part of the schema itself -rather than an external workflow specification. - -### Dependencies as Foreign Keys - -Foreign keys define computational dependencies, not only referential integrity. -The dependency graph is explicit, queryable, and enforced by the database. +`make()` method specifies how each entity is derived — declared within the +table definition, not in an external workflow file. -```mermaid -graph LR - A[Session] --> B[Scan] - B --> C[Segmentation] - C --> D[Analysis] -``` - -### Master-Part Relationships +### Master-part relationships Master-part relationships declare transactional grouping directly in the -schema: the master table represents the workflow step, while part tables hold -the individual items. Insertions and deletions cascade as a unit, enforcing -transactional semantics without application code. 
- -### Directed Acyclic Graph - -Dependencies between tables form a directed acyclic graph (DAG); aggregated -dependencies between schemas likewise form a DAG. Unlike task DAGs in -workflow managers, these are *relational schema* DAGs—they define data -structure and relationships, not just execution steps. - -## Active Schemas - -The key distinction from classical models: traditional schemas are -*passive*—containers for data produced by external processes. In the -relational workflow model, the schema is *active*—Computed tables declare how -their contents are derived, making the schema itself the workflow -specification. Schemas are defined as Python classes, and entire pipelines are -organized as self-contained code repositories—version-controlled, testable, -and deployable using standard software engineering practices. - -A useful analogy: electronic spreadsheets unified data and computation—cells -with values alongside cells with formulas. Yet this integration never -penetrated relational databases in their 50+ years of history. The relational -workflow model brings to databases what spreadsheets brought to tabular -calculation: the recognition that data and the computations that produce it -belong together. The analogy has limits: spreadsheets' coupling is also the -source of their well-known fragility. DataJoint addresses this through formal -schema constraints and explicit dependency declaration rather than ad-hoc cell -references. - -## Workflow Normalization - -> **"Every table represents an entity type created at a specific workflow -> step, and all attributes describe that entity as it exists at that step."** - -Database normalization decomposes data into tables to eliminate redundancy. -Classical normalization theory achieves this through normal forms based on -functional dependencies. Entity normalization asks whether each attribute -describes the entity identified by the primary key. 
Workflow normalization -extends these principles with a temporal dimension. - -A Session table contains attributes known when the session is entered (date, -experimenter, subject). Analysis parameters determined later belong in -Computed tables that depend on Session. This discipline prevents tables that -accumulate attributes from different workflow stages, obscuring provenance and +schema. The master table represents the workflow step; part tables hold +the items produced together. Insertions and deletions cascade as a unit, +enforcing transactional semantics without application code. + +### Workflow normalization + +> "Every table represents an entity type created at a specific workflow +> step, and all attributes describe that entity as it exists at that +> step." + +Classical normalization theory decomposes tables to eliminate redundancy +through normal forms based on functional dependencies. Entity normalization +asks whether each attribute describes the entity identified by the primary +key. **Workflow normalization** extends these principles with a temporal +dimension: each table's attributes must describe its entity *as it exists +at the workflow step the table represents*. A `Session` table holds +attributes known when the session is entered (date, experimenter, +subject); analysis parameters determined later belong in Computed tables +that depend on `Session`. The discipline prevents tables that accumulate +attributes from different workflow stages, obscuring provenance and complicating updates. -## Entity Integrity +### Entity integrity All data is represented as well-formed entity sets with primary keys -identifying each entity uniquely. This eliminates redundancy and ensures -consistent updates. +identifying each entity uniquely. When upstream data is deleted, dependent +results cascade-delete automatically — including associated objects in +external storage. 
To correct errors, you delete, reinsert, and recompute, +ensuring every result represents a consistent computation from valid +inputs. -When upstream data is deleted, dependent results cascade-delete -automatically—including associated objects in external storage. To correct -errors, you delete, reinsert, and recompute, ensuring every result represents -a consistent computation from valid inputs. - -## Query Algebra +### Query algebra and algebraic closure DataJoint provides a five-operator algebra: @@ -143,14 +195,14 @@ DataJoint provides a five-operator algebra: The algebra achieves *algebraic closure*: every operator produces a valid entity set with a well-defined primary key, enabling unlimited composition. -This preservation of entity integrity—every query result is itself a proper -entity set with clear identity—distinguishes DataJoint's algebra from SQL, -where query results lack both a well-defined primary key and a clear entity -type. +This preservation of entity integrity — every query result is itself a +proper entity set with clear identity — distinguishes DataJoint's algebra +from SQL, where query results lack both a well-defined primary key and a +clear entity type. -## From Transactions to Transformations +## From transactions to transformations -| Traditional View | Workflow View | +| Traditional view | Workflow view | |------------------|---------------| | Tables store data | Tables represent workflow steps | | Rows are records | Rows are workflow artifacts | @@ -159,10 +211,23 @@ type. | Schemas organize storage | Schemas specify pipelines | | Queries retrieve data | Queries trace provenance | -## Summary - -The relational workflow model offers a new way to understand relational -databases—not merely as storage systems but as computational substrates. By -interpreting tables as workflow steps and foreign keys as execution -dependencies, the schema becomes a complete specification of how data is -derived, not just what data exists. 
+## Further reading + +The Relational Workflow Model and its technical innovations are formally +defined in [Yatsenko & Nguyen, 2026](https://arxiv.org/abs/2602.16585), +which also introduces the further substrate elements that build on it: +object-augmented schemas, semantic matching by attribute lineage, an +extensible type system, and distributed job coordination. DataJoint's +schema definition language and query algebra were first formalized in +[Yatsenko et al., 2018](https://doi.org/10.48550/arXiv.1807.11104). + +### See also + +- [Data Pipelines](data-pipelines.md) — table tiers, schema organization, and the DAG in practice +- [Computation Model](computation-model.md) — the `make()` contract, `populate()`, and the key source +- [Entity Integrity](entity-integrity.md) — primary keys and the three questions every table answers +- [Normalization](normalization.md) — entity normalization extended with a temporal dimension +- [Query Algebra](query-algebra.md) — the five-operator algebra with algebraic closure +- [Semantic Matching](semantic-matching.md) — lineage-based join resolution +- [Type System](type-system.md) — extensible types with pluggable codecs +- [Define Tables](../how-to/define-tables.md) and [Run Computations](../how-to/run-computations.md) — declaring steps and executing them
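
The worked example's claim that the foreign-key graph alone dictates what may run and what must run first can be illustrated without DataJoint at all. The following is a minimal sketch, not DataJoint's implementation: it restates the example's dependencies as a plain dictionary (the names `FOREIGN_KEYS` and `execution_order` are hypothetical, introduced only for illustration) and derives one valid execution order with Python's standard-library `graphlib`.

```python
from graphlib import TopologicalSorter

# Each table maps to the upstream tables it references by foreign key,
# copied from the worked example's diagram. Illustrative only: this is
# not how DataJoint stores or resolves dependencies.
FOREIGN_KEYS = {
    "Mouse": set(),
    "Session": {"Mouse"},
    "Scan": {"Session"},
    "SegmentationParam": set(),
    "AverageFrame": {"Scan"},
    "Segmentation": {"AverageFrame", "SegmentationParam"},
    "Fluorescence": {"Segmentation"},
}


def execution_order(foreign_keys):
    """Return one order in which every table follows all of its parents."""
    return list(TopologicalSorter(foreign_keys).static_order())


order = execution_order(FOREIGN_KEYS)
position = {table: i for i, table in enumerate(order)}

# Every dependency is satisfied: parents always come before children.
for table, parents in FOREIGN_KEYS.items():
    for parent in parents:
        assert position[parent] < position[table]

print(order)
```

Several orders are valid (for example, `SegmentationParam` may appear anywhere before `Segmentation`), so the sketch checks the ordering constraint rather than one fixed sequence. A cyclic dependency would make `static_order()` raise `graphlib.CycleError`, which mirrors the requirement that the dependency graph be acyclic.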