
Expand Relational Workflow Model concept page #184

Open
dimitri-yatsenko wants to merge 1 commit into main from expand/relational-workflow-model-intro


@dimitri-yatsenko


Context

The current Relational Workflow Model (RWM) concept page (src/explanation/relational-workflow-model.md) is understated relative to the model's significance. It reads as a brief positioning statement rather than as an entry point that lands the structural argument for a reader who already knows informatics (databases, FK graphs, ER modeling, workflow managers, lakehouses).

This PR expands the page to function as that entry point — the audience pictured is a knowledgeable peer (e.g., an infrastructure architect from pharma R&D evaluating where DataJoint sits in the landscape they already know).

Changes

  • Lead with the three-interpretations taxonomy (Codd / Chen / RWM) and the computational substrate framing from the DataJoint 2.0 preprint (Yatsenko & Nguyen, 2026, arXiv:2602.16585).
  • Name the surrounding tool categories explicitly and what each is silent on:
    • File-based workflow systems (CWL, Snakemake, Nextflow) — fragment provenance across the filesystem
    • Task orchestrators (Airflow, Argo, Prefect) — agnostic to data structure
    • Data catalogs (DataHub, Atlan, Marquez) — describe data after it lands
    • Lakehouses (Delta, Iceberg, Hudi) — treat computation as external
  • Add a worked-example pipeline (Mouse → Session → Scan → AverageFrame → Segmentation → Fluorescence, with SegmentationParam as Lookup) rendered as a mermaid diagram with tier-color classes.
  • Add a "deliberate trade-off" section that acknowledges the legitimate strengths of decoupled architectures and frames DataJoint's coupling as a chosen trade-off, drawn directly from Section 5 of the preprint.
  • Add a "substrate consequences" section that covers:
    • Provenance and lineage as structural properties of the substrate (mapping to W3C PROV / OpenLineage is translation, not reconstruction)
    • The five agent-substrate properties from the preprint: self-describing, safe by default, explicit dependencies, idempotent, observable
  • Preserve the existing detailed sections (table tiers, master-part, workflow normalization, entity integrity, query algebra with closure, transactions vs transformations) under a "Beneath the model" header for readers who want the structural detail.
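
For reviewers who want to sanity-check the worked-example pipeline before reading the diagram, here is a minimal sketch in plain Python. It is not DataJoint code and does not use the DataJoint API. The table names and the linear chain come from the list above; attaching `SegmentationParam` to `Segmentation` is an assumption, as are the helper names `execution_order` and `upstream`. It shows the sense in which explicit dependencies make execution order and lineage derivable from the structure itself.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it depends on (its parents).
# The chain Mouse -> ... -> Fluorescence is taken from the PR description.
# ASSUMPTION: SegmentationParam is a parent of Segmentation.
PIPELINE = {
    "Mouse": set(),
    "Session": {"Mouse"},
    "Scan": {"Session"},
    "AverageFrame": {"Scan"},
    "SegmentationParam": set(),
    "Segmentation": {"AverageFrame", "SegmentationParam"},
    "Fluorescence": {"Segmentation"},
}


def execution_order(pipeline):
    """Return one valid order in which the tables could be populated."""
    return list(TopologicalSorter(pipeline).static_order())


def upstream(pipeline, table):
    """Return every table that `table` depends on, directly or transitively."""
    seen = set()
    stack = list(pipeline[table])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(pipeline[parent])
    return seen


order = execution_order(PIPELINE)
lineage = upstream(PIPELINE, "Fluorescence")
```

In this toy version, `lineage` is read off the dependency declarations rather than reconstructed from logs, which is the point the "substrate consequences" section makes about provenance.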

Net change

+177 / -112 lines; one file.

Sources

  • Yatsenko & Nguyen, 2026: DataJoint 2.0 preprint, arXiv:2602.16585 (computational substrate, four innovations, substrate properties for agents, deliberate-trade-off discussion)
  • Yatsenko et al., 2018: original theoretical formalization (relational workflow model, query algebra)

Notes for reviewers

  • The mermaid diagram uses tier-color classDefs. If the docs site's mermaid theme overrides these, we may need to drop colors or adapt to the site theme.
  • All cross-references in the "See also" section resolve against the current src/explanation/ and src/how-to/ content.
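
To make the color fallback concrete, the sketch below generates the mermaid source for the pipeline with or without `classDef` lines. It is illustrative only and is not the diagram committed in this PR. The tier assigned to each table (other than `SegmentationParam` as Lookup), the hex colors, the edge from `SegmentationParam` to `Segmentation`, and the function name `to_mermaid` are all assumptions.

```python
# Edges of the worked-example pipeline.
# ASSUMPTION: SegmentationParam feeds Segmentation.
EDGES = [
    ("Mouse", "Session"),
    ("Session", "Scan"),
    ("Scan", "AverageFrame"),
    ("AverageFrame", "Segmentation"),
    ("SegmentationParam", "Segmentation"),
    ("Segmentation", "Fluorescence"),
]

# ASSUMPTION: only SegmentationParam's tier (lookup) is stated in the PR;
# the other tier assignments are hypothetical.
TIERS = {
    "Mouse": "manual",
    "Session": "manual",
    "Scan": "imported",
    "AverageFrame": "computed",
    "Segmentation": "computed",
    "Fluorescence": "computed",
    "SegmentationParam": "lookup",
}

# Hypothetical colors; the site theme may override or reject them.
TIER_STYLES = {
    "manual": "fill:#c8e6c9,stroke:#2e7d32",
    "lookup": "fill:#e0e0e0,stroke:#616161",
    "imported": "fill:#bbdefb,stroke:#1565c0",
    "computed": "fill:#ffcdd2,stroke:#c62828",
}


def to_mermaid(edges, tiers, styles=None):
    """Build mermaid flowchart text; pass styles=None to omit tier colors."""
    lines = ["flowchart TD"]
    lines += [f"    {parent} --> {child}" for parent, child in edges]
    if styles:
        for tier, style in styles.items():
            members = sorted(t for t, name in tiers.items() if name == tier)
            if members:
                lines.append(f"    classDef {tier} {style};")
                lines.append(f"    class {','.join(members)} {tier};")
    return "\n".join(lines)


colored = to_mermaid(EDGES, TIERS, TIER_STYLES)
plain = to_mermaid(EDGES, TIERS)
```

If the site theme overrides the colors, the `plain` variant carries the same dependency structure with no styling to conflict with.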

The previous intro understated the model's significance. The expansion
positions the RWM for an informatics-knowledgeable reader:

- Lead with the three-interpretations taxonomy (Codd / Chen / RWM) and the
  computational-substrate framing from the DataJoint 2.0 preprint.
- Name the surrounding tool categories explicitly (CWL/Snakemake/Nextflow,
  Airflow/Argo/Prefect, DataHub/Atlan/Marquez, Delta/Iceberg/Hudi) and what
  each is silent on.
- Add a worked example pipeline (Mouse > Session > Scan > AverageFrame >
  Segmentation > Fluorescence, with SegmentationParam as Lookup) rendered
  as a mermaid diagram with tier colors.
- Add a "deliberate trade-off" section addressing the legitimate strengths
  of decoupled architectures and why DataJoint accepts coupling.
- Add a substrate-consequences section: provenance and lineage as
  structural properties (mapping to W3C PROV / OpenLineage is translation,
  not reconstruction), and the five agent-substrate properties
  (self-describing, safe by default, explicit dependencies, idempotent,
  observable) from the preprint.
- Preserve the existing detailed sections (table tiers, master-part,
  normalization, entity integrity, query algebra, transactions vs
  transformations) under a "Beneath the model" header for readers who want
  the structural detail.