2 changes: 2 additions & 0 deletions mkdocs.yaml
@@ -16,6 +16,8 @@ nav:
- Entity Integrity: explanation/entity-integrity.md
- Normalization: explanation/normalization.md
- Computation Model: explanation/computation-model.md
- Schema as a Workflow Specification: explanation/schema-as-workflow-specification.md
- Comparison to Workflow Languages: explanation/comparison-to-workflow-languages.md
- Queries:
- Query Algebra: explanation/query-algebra.md
- Semantic Matching: explanation/semantic-matching.md
128 changes: 128 additions & 0 deletions src/explanation/comparison-to-workflow-languages.md
@@ -0,0 +1,128 @@
# Comparison to Workflow Languages

DataJoint and workflow languages are often compared because both express
pipelines as directed graphs of computational steps. The comparison is not
"which is best" — these tools were designed for different problems, with
different assumptions about where data structure lives. This page lays out
where each category fits in the broader landscape and what DataJoint adds
on top.

## The landscape

The systems usually grouped with DataJoint divide cleanly into two
categories with distinct design centers, plus two adjacent categories that
solve different problems entirely.

| Category | Examples | Design center |
|---|---|---|
| **File-based workflow systems** | CWL, Snakemake, Nextflow | File-passing between steps; scheduler-agnostic; portability-first |
| **Task orchestrators** | Airflow, Argo Workflows, Prefect, Dagster | DAG of tasks; execution-focused; data-agnostic |
| Data catalogs | DataHub, Atlan, Marquez | Describe data after it lands |
| Lakehouses | Delta, Iceberg, Hudi | Optimize analytical queries over stored tables |

The two adjacent categories — catalogs and lakehouses — appear in the same
conversations but address different concerns. Catalogs describe and tag
data that already exists; lakehouses optimize analytical access to it.
Neither specifies how the data was produced. They compose with DataJoint
rather than competing with it.

## Side-by-side comparison

| Concern | File-based workflows | Task orchestrators | DataJoint |
|---|---|---|---|
| Data structure / schema | — (files are opaque) | — (tasks pass artifacts) | Declared in schema |
| Type system | File-type tags | Python objects | Extensible, pluggable codecs |
| Foreign-key integrity | — | — | Enforced |
| Computation specification | Workflow file (CWL/SMK/NF) | Task functions in code | `make()` declared in schema |
| Execution order | Step DAG in workflow file | Task DAG in code | Foreign-key DAG in schema |
| Provenance recording | Reconstructed from run logs | Task-level run history | Structural (FK chain) |
| Drift detection | Out of scope | Out of scope | Cascade on upstream change |
| Query interface | Filesystem + ad hoc | Task metadata UI | Five-operator algebra |
| Retry / idempotence | Step-level rerun | Task-level retry | Per-entity, key-driven |
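
The "Execution order" row can be illustrated with a minimal sketch. It assumes a hypothetical schema in which each table lists the upstream tables it references by foreign key; the table names are invented for illustration, and the code uses only the Python standard library rather than DataJoint's own API.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the set of upstream tables
# it references by foreign key. Names are illustrative only.
foreign_keys = {
    "Subject": set(),
    "Session": {"Subject"},
    "Recording": {"Session"},
    "SpikeSorting": {"Recording"},
    "Summary": {"SpikeSorting", "Session"},
}


def execution_order(fks):
    """Return a valid population order: every table follows its upstream tables."""
    return list(TopologicalSorter(fks).static_order())


order = execution_order(foreign_keys)
print(order)
```

No separate workflow file is needed to obtain the order in this sketch: the foreign-key declarations already contain it.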

## What workflow languages offer

The decoupled architectures embodied by CWL, Snakemake, Nextflow, Airflow,
Argo, Prefect, and Dagster have real and lasting advantages. Portability
across compute backends — any tool that reads files works — is a first-class
property. Independent evolution of data and computation layers lets
analysis code change without touching a data model, and lets the compute
engine swap freely between Spark, Dask, GPU clusters, or HPC schedulers.
Language-agnosticism keeps the workflow specification readable across
teams. Decoupling aligns naturally with organizational boundaries: data
engineers, scientists, and DevOps can evolve their layers independently.
These are the right trade-offs when portability and decoupling are the
top priorities.

## What they omit

What these systems share is what they decline to specify: a formal
data-structure layer. There are no typed schemas across pipeline stages,
no foreign keys binding intermediate results, no algebraic query surface
over what the pipeline has produced. Provenance is reconstructed from run
logs and filenames rather than enforced by structure. Entity-level lineage
— which subject or sample or session produced a result — is implicit in
directory conventions and scatter patterns rather than declared. Drift in
upstream inputs is not detectable as a structural fact; it is something a
human notices and chases down. These omissions are deliberate: keeping the
data-structure layer out of scope is what makes the workflow language
portable.

## DataJoint's deliberate trade-off

DataJoint accepts tighter coupling on purpose. The cost is framework
commitment — the data model, the schema, and the execution semantics live
in one system. The benefit is one formal model in which data structure,
the data itself, the computation that produced it, the dependencies
between computations, and the integrity constraints that govern all of it
are jointly queryable and machine-readable. Every question an analyst,
engineer, or AI agent might pose about the work — *what is this, where
did it come from, what depends on it, what must hold for it to be valid,
what would change if I touched the input* — is answerable by query against
a single formal model. For scientific workflows where data and computation
cannot be cleanly separated without losing the science, this is the
trade-off worth taking. The argument is developed at length in
[Yatsenko & Nguyen, 2026](https://arxiv.org/abs/2602.16585), Section 5.
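
As a rough illustration of the "what depends on it" question, the sketch below walks a hypothetical foreign-key graph to find every table downstream of a given one. The table names and the `downstream` helper are assumptions made for this example, not part of DataJoint.

```python
from collections import deque

# Hypothetical schema: table -> upstream tables referenced by foreign key.
foreign_keys = {
    "Subject": set(),
    "Session": {"Subject"},
    "Recording": {"Session"},
    "SpikeSorting": {"Recording"},
    "Summary": {"SpikeSorting", "Session"},
}


def downstream(fks, table):
    """Return every table that depends, directly or transitively, on `table`."""
    # Invert the graph: parent -> set of children.
    children = {}
    for child, parents in fks.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)

    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


affected = downstream(foreign_keys, "Recording")
print(sorted(affected))
```

In this toy graph, a change to `Recording` affects `SpikeSorting` and `Summary`, which is the set a cascade would need to revisit.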

## Convertibility

The two categories are not mutually exclusive at the structural level. Any
CWL workflow can be mechanically converted to a DataJoint schema: each tool
step becomes a Computed table, the step DAG becomes a foreign-key chain,
and the scatter/gather patterns map onto primary keys. The conversion also
runs in the other direction: a DataJoint schema exports to CWL or Nextflow
DSL2 with one process per table and channel wiring mirroring the FK chain.
An internal conversion exercise on the GATK whole-genome-sequencing pipeline
from the Arvados tutorial (20 CWL files, 13 tool steps after flattening)
demonstrates this in practice.

The conversion is not symmetric in information content. CWL→DataJoint
adds the data-structure layer (entity names, typed primary keys, gather
group keys) that the workflow language leaves implicit; a short
annotation supplies these. DataJoint→CWL discards that layer, leaving the
DAG and the per-step containers. In this sense the relational workflow
model is a superset of what a workflow language specifies: the workflow
language describes the DAG; DataJoint describes the DAG plus the data
structure.
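
The asymmetry can be sketched with two toy record types. Everything here is hypothetical: `Step`, `Table`, the step names, and the annotation dictionary are stand-ins chosen for illustration and do not reflect an actual converter or the CWL or DataJoint APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """What a workflow language specifies: a node in the DAG."""
    name: str
    upstream: tuple


@dataclass(frozen=True)
class Table:
    """What a relational schema specifies: the DAG node plus its key."""
    name: str
    upstream: tuple
    primary_key: tuple


def to_tables(steps, annotation):
    """Workflow -> schema: needs an annotation supplying each primary key."""
    return [Table(s.name, s.upstream, annotation[s.name]) for s in steps]


def to_steps(tables):
    """Schema -> workflow: the primary keys are simply dropped."""
    return [Step(t.name, t.upstream) for t in tables]


# Hypothetical two-step workflow and the annotation it requires.
steps = [Step("align", ()), Step("call_variants", ("align",))]
annotation = {"align": ("sample_id",), "call_variants": ("sample_id", "interval")}

tables = to_tables(steps, annotation)
round_trip = to_steps(tables)
print(round_trip == steps)
```

Going from steps to tables and back recovers the original steps, but going from tables to steps loses the primary keys, so the reverse trip cannot be completed without supplying the annotation again.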

## When to choose what

- **Choose a workflow language** when portability across compute backends
is the top priority, the data structure is incidental to the work, and
the team is prepared to write its own catalog or lineage layer
separately.
- **Choose DataJoint** when the data and the computation cannot cleanly
separate, when provenance, lineage, and integrity must be structural
rather than reconstructed, and when agents need a single machine-readable
model of the pipeline.
- **Use both.** DataJoint inside an Airflow, Argo, or Prefect orchestration
is a common production pattern: DataJoint owns the data and computation
model; the orchestrator owns scheduling, resource allocation, and retry
policy. The two layers do not compete; they compose.
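
The division of labor in the "use both" pattern can be sketched as follows. `ComputedTable` is a deliberately simplified stand-in written for this example, not DataJoint's class, and `run_with_retries` stands in for whatever retry policy an orchestrator would apply; both names are assumptions.

```python
class ComputedTable:
    """Simplified stand-in for a computed table: one result per upstream key."""

    def __init__(self, upstream_keys, make):
        self.upstream_keys = list(upstream_keys)
        self.make = make
        self.rows = {}

    def populate(self):
        """Compute only the keys that do not have a result yet."""
        for key in self.upstream_keys:
            if key not in self.rows:
                self.rows[key] = self.make(key)


def run_with_retries(task, max_attempts=3):
    """Orchestrator-side policy: rerun the task until it succeeds."""
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise


calls = []
failed_once = set()


def flaky_make(key):
    """Hypothetical computation that fails once for key 2."""
    calls.append(key)
    if key == 2 and key not in failed_once:
        failed_once.add(key)
        raise RuntimeError("transient failure")
    return key * 10


table = ComputedTable([1, 2, 3], flaky_make)
attempts = run_with_retries(table.populate)
print(attempts, calls, table.rows)
```

The retry reruns the whole task, yet key 1 is not recomputed because its result already exists. The orchestrator decides when to try again; the table decides what work remains.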

## See also

- [Relational Workflow Model](relational-workflow-model.md) — the conceptual basis for treating the schema as the pipeline specification
- [Schema as a Workflow Specification](schema-as-workflow-specification.md) — the formal language properties (grammar, semantics, algebra) that make the schema queryable as a pipeline spec
- [Computation Model](computation-model.md) — the `make()` contract and `populate()`
- [Semantic Matching](semantic-matching.md) — lineage-based join resolution that workflow languages cannot express
14 changes: 3 additions & 11 deletions src/explanation/data-pipelines.md
@@ -148,17 +148,9 @@ Throughout this process, the schema definition remains the single source of trut

## Comparing Approaches

| Aspect | File-Based Approach | DataJoint Pipeline |
|--------|--------------------|--------------------|
| **Data Structure** | Implicit in filenames/folders | Explicit in schema definition |
| **Dependencies** | Encoded in scripts | Declared through foreign keys |
| **Provenance** | Manual tracking | Automatic through referential integrity |
| **Reproducibility** | Requires careful discipline | Built into the model |
| **Collaboration** | File sharing/conflicts | Concurrent database access |
| **Queries** | Custom scripts per question | Composable query algebra |
| **Scalability** | Limited by filesystem | Database + object-augmented storage |

The pipeline approach requires upfront investment in schema design. This investment pays dividends through reduced errors, improved reproducibility, and efficient collaboration as projects scale.
The pipeline approach requires upfront investment in schema design. Compared to a file-based approach where data structure is implicit in filenames, dependencies are encoded in scripts, and provenance must be tracked manually, a DataJoint pipeline makes all of those explicit in the schema — and pays the investment back in reproducibility, query power, and collaboration as projects scale.

For a detailed structural comparison against file-based workflow systems (CWL, Snakemake, Nextflow) and task orchestrators (Airflow, Argo, Prefect, Dagster), and for guidance on when the two layers complement rather than substitute for each other, see [Comparison to Workflow Languages](comparison-to-workflow-languages.md).

## Summary

31 changes: 2 additions & 29 deletions src/explanation/faq.md
Expand Up @@ -124,36 +124,9 @@ DataJoint can be considered an **ORM specialized for scientific databases**—pu

## Is DataJoint a Workflow Management System?

Not exactly. DataJoint and workflow management systems (Airflow, Prefect, Flyte, Nextflow, Snakemake) solve related but distinct problems:
Not exactly — and the two compose rather than compete. DataJoint formalizes the data layer (schema, dependencies, computation, integrity) while workflow managers (Airflow, Argo, Prefect, Dagster, Nextflow, Snakemake, CWL) orchestrate task scheduling and resource allocation. A common production pattern is DataJoint inside an Airflow / Argo / Prefect orchestration: DataJoint owns the data and computation model; the orchestrator owns scheduling, retries, and resource policy.

| Aspect | Workflow Managers | DataJoint |
|--------|-------------------|-----------|
| Core abstraction | Tasks and DAGs | Tables and dependencies |
| State management | External (files, databases) | Integrated (relational database) |
| Scheduling | Built-in schedulers | External (or manual `populate()`) |
| Distributed execution | Built-in | Via external tools |
| Data model | Unstructured (files, blobs) | Structured (relational schema) |
| Query capability | Limited | Full relational algebra |

**DataJoint excels at:**

- Defining *what* needs to be computed based on data dependencies
- Ensuring computations are never duplicated
- Maintaining referential integrity across pipeline stages
- Querying intermediate and final results

**Workflow managers excel at:**

- Scheduling and orchestrating job execution
- Distributing work across clusters
- Retry logic and failure handling
- Resource management

**They complement each other.** DataJoint formalizes data dependencies so that external schedulers can effectively manage computational tasks. A common pattern:

1. DataJoint defines the pipeline structure and tracks what's computed
2. A workflow manager (or simple cron/SLURM scripts) calls [`populate()`](computation-model.md) on a schedule
3. DataJoint determines what work remains and executes it
For the structural comparison — what each category offers, what each omits, the convertibility between them, and guidance on when to use which — see [Comparison to Workflow Languages](comparison-to-workflow-languages.md).

## Is DataJoint a Lakehouse?

10 changes: 10 additions & 0 deletions src/explanation/index.md
@@ -40,6 +40,16 @@ and scalable.

AutoPopulate and Jobs 2.0. Automated, reproducible, distributed computation.

- :material-file-document-edit: **[Schema as a Workflow Specification](schema-as-workflow-specification.md)**

The schema as a formal language for expressing scientific workflows.
Grammar, semantics, algebra, and machine-readability.

- :material-compare-horizontal: **[Comparison to Workflow Languages](comparison-to-workflow-languages.md)**

How DataJoint relates to CWL, Snakemake, Nextflow, Airflow, and other
workflow tools. What each offers, what each omits, and when to use both.

- :material-puzzle: **[Custom Codecs](custom-codecs.md)**

Extend DataJoint with domain-specific types. The codec extensibility system.