datajoint · dimitri-yatsenko · Jun 13, 2026 · Jun 13, 2026
diff --git a/mkdocs.yaml b/mkdocs.yaml
@@ -16,6 +16,8 @@ nav:
           - Entity Integrity: explanation/entity-integrity.md
           - Normalization: explanation/normalization.md
           - Computation Model: explanation/computation-model.md
+          - Schema as a Workflow Specification: explanation/schema-as-workflow-specification.md
+          - Comparison to Workflow Languages: explanation/comparison-to-workflow-languages.md
       - Queries:
           - Query Algebra: explanation/query-algebra.md
           - Semantic Matching: explanation/semantic-matching.md

diff --git a/src/explanation/comparison-to-workflow-languages.md b/src/explanation/comparison-to-workflow-languages.md
@@ -0,0 +1,128 @@
+# Comparison to Workflow Languages
+
+DataJoint and workflow languages are often compared because both express
+pipelines as directed graphs of computational steps. The comparison is not
+"which is best" — these tools were designed for different problems, with
+different assumptions about where data structure lives. This page lays out
+where each category fits in the broader landscape and what DataJoint adds
+on top.
+
+## The landscape
+
+The systems usually grouped with DataJoint divide cleanly into two
+categories with distinct design centers, plus two adjacent categories that
+solve different problems entirely.
+
+| Category | Examples | Design center |
+|---|---|---|
+| **File-based workflow systems** | CWL, Snakemake, Nextflow | File-passing between steps; scheduler-agnostic; portability-first |
+| **Task orchestrators** | Airflow, Argo Workflows, Prefect, Dagster | DAG of tasks; execution-focused; data-agnostic |
+| Data catalogs | DataHub, Atlan, Marquez | Describe data after it lands |
+| Lakehouses | Delta, Iceberg, Hudi | Optimize analytical queries over stored tables |
+
+The two adjacent categories — catalogs and lakehouses — appear in the same
+conversations but address different concerns. Catalogs describe and tag
+data that already exists; lakehouses optimize analytical access to it.
+Neither specifies how the data was produced. They compose with DataJoint
+rather than competing with it.
+
+## Side-by-side comparison
+
+| Concern | File-based workflows | Task orchestrators | DataJoint |
+|---|---|---|---|
+| Data structure / schema | — (files are opaque) | — (tasks pass artifacts) | Declared in schema |
+| Type system | File-type tags | Python objects | Extensible, pluggable codecs |
+| Foreign-key integrity | — | — | Enforced |
+| Computation specification | Workflow file (CWL/SMK/NF) | Task functions in code | `make()` declared in schema |
+| Execution order | Step DAG in workflow file | Task DAG in code | Foreign-key DAG in schema |
+| Provenance recording | Reconstructed from run logs | Task-level run history | Structural (FK chain) |
+| Drift detection | Out of scope | Out of scope | Cascade on upstream change |
+| Query interface | Filesystem + ad hoc | Task metadata UI | Five-operator algebra |
+| Retry / idempotence | Step-level rerun | Task-level retry | Per-entity, key-driven |
+
+## What workflow languages offer
+
+The decoupled architectures embodied by CWL, Snakemake, Nextflow, Airflow,
+Argo, Prefect, and Dagster have real and lasting advantages. Portability
+across compute backends — any tool that reads files works — is a first-class
+property. Independent evolution of data and computation layers lets
+analysis code change without touching a data model, and lets the compute
+engine swap freely between Spark, Dask, GPU clusters, or HPC schedulers.
+Language-agnosticism keeps the workflow specification readable across
+teams. Decoupling aligns naturally with organizational boundaries: data
+engineers, scientists, and DevOps can evolve their layers independently.
+These are the right trade-offs when portability and decoupling are the
+top priorities.
+
+## What they omit
+
+What these systems share is what they decline to specify: a formal
+data-structure layer. There are no typed schemas across pipeline stages,
+no foreign keys binding intermediate results, no algebraic query surface
+over what the pipeline has produced. Provenance is reconstructed from run
+logs and filenames rather than enforced by structure. Entity-level lineage
+— which subject or sample or session produced a result — is implicit in
+directory conventions and scatter patterns rather than declared. Drift in
+upstream inputs is not detectable as a structural fact; it is something a
+human notices and chases down. These omissions are deliberate: keeping the
+data-structure layer out of scope is what makes the workflow language
+portable.
+
+## DataJoint's deliberate trade-off
+
+DataJoint accepts tighter coupling on purpose. The cost is framework
+commitment — the data model, the schema, and the execution semantics live
+in one system. The benefit is one formal model in which data structure,
+the data itself, the computation that produced it, the dependencies
+between computations, and the integrity constraints that govern all of it
+are jointly queryable and machine-readable. Every question an analyst,
+engineer, or AI agent might pose about the work — *what is this, where
+did it come from, what depends on it, what must hold for it to be valid,
+what would change if I touched the input* — is answerable by query against
+a single formal model. For scientific workflows where data and computation
+cannot be cleanly separated without losing the science, this is the
+trade-off worth taking. The argument is developed at length in
+[Yatsenko & Nguyen, 2026](https://arxiv.org/abs/2602.16585), Section 5.
+
+## Convertibility
+
+Workflow languages and DataJoint are not mutually exclusive at the structural level. Any
+CWL workflow can be mechanically converted to a DataJoint schema: each tool
+step becomes a Computed table, the step DAG becomes a foreign-key chain,
+and the scatter/gather patterns map onto primary keys. The conversion also
+runs in reverse: a DataJoint schema exports to CWL or Nextflow DSL2 with one
+table per process and channel wiring mirroring the FK chain. An internal
+conversion exercise on the GATK whole-genome-sequencing pipeline from the
+Arvados tutorial (20 CWL files, 13 tool steps after flattening)
+demonstrates this in practice.
+
+The conversion is not symmetric in information content. CWL→DataJoint
+adds the data-structure layer (entity names, typed primary keys, gather
+group keys) that the workflow language leaves implicit; a short
+annotation supplies these. DataJoint→CWL discards that layer, leaving the
+DAG and the per-step containers. In this sense the relational workflow
+model is a superset of what a workflow language specifies: the workflow
+language describes the DAG; DataJoint describes the DAG plus the data
+structure.
+
+## When to choose what
+
+- **Choose a workflow language** when portability across compute backends
+  is the top priority, the data structure is incidental to the work, and
+  the team is prepared to write its own catalog or lineage layer
+  separately.
+- **Choose DataJoint** when the data and the computation cannot be cleanly
+  separated, when provenance, lineage, and integrity must be structural
+  rather than reconstructed, and when agents need a single machine-readable
+  model of the pipeline.
+- **Use both.** DataJoint inside an Airflow, Argo, or Prefect orchestration
+  is a common production pattern: DataJoint owns the data and computation
+  model; the orchestrator owns scheduling, resource allocation, and retry
+  policy. The two layers do not compete; they compose.
+
+## See also
+
+- [Relational Workflow Model](relational-workflow-model.md) — the conceptual basis for treating the schema as the pipeline specification
+- [Schema as a Workflow Specification](schema-as-workflow-specification.md) — the formal language properties (grammar, semantics, algebra) that make the schema queryable as a pipeline spec
+- [Computation Model](computation-model.md) — the `make()` contract and `populate()`
+- [Semantic Matching](semantic-matching.md) — lineage-based join resolution that workflow languages cannot express
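The mapping described under Convertibility (each tool step becomes a Computed table, the step DAG becomes a foreign-key chain) can be sketched in plain Python. This is an illustrative sketch only, not the converter used in the internal exercise: the step names, the `Sample` entity, the `output_path` attribute, and the `to_definitions` helper are all assumptions, and the output merely imitates DataJoint's definition syntax (`-> Parent` for a foreign key, `---` separating the primary key from secondary attributes).

```python
from graphlib import TopologicalSorter

# Hypothetical step DAG: step name -> upstream steps it consumes.
# Names are illustrative and not taken from the GATK exercise.
STEPS = {
    "AlignReads": [],
    "MarkDuplicates": ["AlignReads"],
    "CallVariants": ["MarkDuplicates"],
}

# Hypothetical annotation supplying what the workflow file leaves implicit:
# the entity the pipeline scatters over and its typed primary key.
ENTITY = ("Sample", "sample_id", "varchar(32)")


def to_definitions(steps, entity):
    """Return {table_name: definition_text}, upstream tables first."""
    entity_table, key_attr, key_type = entity
    definitions = {entity_table: f"{key_attr} : {key_type}"}
    for name in TopologicalSorter(steps).static_order():
        # Root steps depend on the entity table; the others on their upstream steps.
        parents = steps[name] or [entity_table]
        lines = [f"-> {parent}" for parent in parents]  # foreign keys form the DAG
        lines.append("---")  # primary key above, secondary attributes below
        lines.append("output_path : varchar(255)")  # assumed result attribute
        definitions[name] = "\n".join(lines)
    return definitions


definitions = to_definitions(STEPS, ENTITY)
for table, text in definitions.items():
    print(f"# {table}\n{text}\n")
```

The `ENTITY` annotation plays the role of the short annotation mentioned in the Convertibility section: it supplies the entity name and typed primary key that the step DAG alone does not carry.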
diff --git a/src/explanation/data-pipelines.md b/src/explanation/data-pipelines.md
@@ -148,17 +148,9 @@ Throughout this process, the schema definition remains the single source of trut
 
 ## Comparing Approaches
 
-| Aspect | File-Based Approach | DataJoint Pipeline |
-|--------|--------------------|--------------------|
-| **Data Structure** | Implicit in filenames/folders | Explicit in schema definition |
-| **Dependencies** | Encoded in scripts | Declared through foreign keys |
-| **Provenance** | Manual tracking | Automatic through referential integrity |
-| **Reproducibility** | Requires careful discipline | Built into the model |
-| **Collaboration** | File sharing/conflicts | Concurrent database access |
-| **Queries** | Custom scripts per question | Composable query algebra |
-| **Scalability** | Limited by filesystem | Database + object-augmented storage |
-
-The pipeline approach requires upfront investment in schema design. This investment pays dividends through reduced errors, improved reproducibility, and efficient collaboration as projects scale.
+The pipeline approach requires upfront investment in schema design. Compared to a file-based approach where data structure is implicit in filenames, dependencies are encoded in scripts, and provenance must be tracked manually, a DataJoint pipeline makes all of those explicit in the schema — and pays the investment back in reproducibility, query power, and collaboration as projects scale.
+
+For a detailed structural comparison against file-based workflow systems (CWL, Snakemake, Nextflow) and task orchestrators (Airflow, Argo, Prefect, Dagster), and for guidance on when the two layers complement rather than substitute for each other, see [Comparison to Workflow Languages](comparison-to-workflow-languages.md).
 
 ## Summary
 

diff --git a/src/explanation/faq.md b/src/explanation/faq.md
@@ -124,36 +124,9 @@ DataJoint can be considered an **ORM specialized for scientific databases**—pu
 
 ## Is DataJoint a Workflow Management System?
 
-Not exactly. DataJoint and workflow management systems (Airflow, Prefect, Flyte, Nextflow, Snakemake) solve related but distinct problems:
+Not exactly — and the two compose rather than compete. DataJoint formalizes the data layer (schema, dependencies, computation, integrity) while workflow managers (Airflow, Argo, Prefect, Dagster, Nextflow, Snakemake, CWL) orchestrate task scheduling and resource allocation. A common production pattern is DataJoint inside an Airflow / Argo / Prefect orchestration: DataJoint owns the data and computation model; the orchestrator owns scheduling, retries, and resource policy.
 
-| Aspect | Workflow Managers | DataJoint |
-|--------|-------------------|-----------|
-| Core abstraction | Tasks and DAGs | Tables and dependencies |
-| State management | External (files, databases) | Integrated (relational database) |
-| Scheduling | Built-in schedulers | External (or manual `populate()`) |
-| Distributed execution | Built-in | Via external tools |
-| Data model | Unstructured (files, blobs) | Structured (relational schema) |
-| Query capability | Limited | Full relational algebra |
-
-**DataJoint excels at:**
-
-- Defining *what* needs to be computed based on data dependencies
-- Ensuring computations are never duplicated
-- Maintaining referential integrity across pipeline stages
-- Querying intermediate and final results
-
-**Workflow managers excel at:**
-
-- Scheduling and orchestrating job execution
-- Distributing work across clusters
-- Retry logic and failure handling
-- Resource management
-
-**They complement each other.** DataJoint formalizes data dependencies so that external schedulers can effectively manage computational tasks. A common pattern:
-
-1. DataJoint defines the pipeline structure and tracks what's computed
-2. A workflow manager (or simple cron/SLURM scripts) calls [`populate()`](computation-model.md) on a schedule
-3. DataJoint determines what work remains and executes it
+For the structural comparison — what each category offers, what each omits, the convertibility between them, and guidance on when to use which — see [Comparison to Workflow Languages](comparison-to-workflow-languages.md).
 
 ## Is DataJoint a Lakehouse?
 

diff --git a/src/explanation/index.md b/src/explanation/index.md
@@ -40,6 +40,16 @@ and scalable.
 
     AutoPopulate and Jobs 2.0. Automated, reproducible, distributed computation.
 
+-   :material-file-document-edit: **[Schema as a Workflow Specification](schema-as-workflow-specification.md)**
+
+    The schema as a formal language for expressing scientific workflows.
+    Grammar, semantics, algebra, and machine-readability.
+
+-   :material-compare-horizontal: **[Comparison to Workflow Languages](comparison-to-workflow-languages.md)**
+
+    How DataJoint relates to CWL, Snakemake, Nextflow, Airflow, and other
+    workflow tools. What each offers, what each omits, and when to use both.
+
 -   :material-puzzle: **[Custom Codecs](custom-codecs.md)**
 
     Extend DataJoint with domain-specific types. The codec extensibility system.
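
The per-entity, key-driven idempotence that lets an external orchestrator simply call `populate()` on a schedule can be illustrated with a plain-Python stand-in. This sketch does not use or reproduce DataJoint's implementation; `pending_keys`, the toy `populate` function, and the session keys are hypothetical names chosen for illustration.

```python
def pending_keys(upstream_keys, computed_keys):
    """Keys that exist upstream but have no computed result yet."""
    return sorted(set(upstream_keys) - set(computed_keys))


def populate(upstream_keys, results, make):
    """Compute only the missing entries; repeated calls are harmless."""
    todo = pending_keys(upstream_keys, results)
    for key in todo:
        results[key] = make(key)
    return todo


sessions = ["s1", "s2", "s3"]  # hypothetical upstream entity keys
results = {"s1": 2}            # s1 was already computed on an earlier run

first = populate(sessions, results, make=len)   # computes the two missing keys
second = populate(sessions, results, make=len)  # nothing left to compute
print(first, second)
```

Because the remaining work is derived from the keys rather than from a run log, the scheduler in this sketch needs no knowledge of what has already been done: it only decides when to call `populate` again.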