Add two deeper concept pages: Schema as a Workflow Specification + Comparison to Workflow Languages #185

Open
dimitri-yatsenko wants to merge 2 commits into main from add/schema-as-spec-and-comparison

Conversation

@dimitri-yatsenko

Member

Context

The Relational Workflow Model concept page (overview / paradigm) and the
component pages under Concepts > Data Model (Entity Integrity,
Normalization, Computation Model) leave two reader needs unmet:

  1. How is the schema a formal language? An informatics-knowledgeable
    reader asks for the grammar, the typed semantics, the algebra, and the
    machine-readable surface — Hal Stern's question on the June 12 call:
    "Python is not a formal spec — is there a grammar? Can it be published
    as YAML? Is there an API set for it?"
  2. How does DataJoint relate to the workflow-language landscape they
    already know? That reader needs a fair structural comparison against
    CWL, Snakemake, Nextflow, Airflow, Argo, Prefect, and Dagster, plus
    guidance on when each fits.

This PR adds two new pages that close those gaps and integrates them
with the existing concept set.

Changes

New pages

  • explanation/schema-as-workflow-specification.md (~1,150 words)

    • Names the Relational Workflow Model as DataJoint's major innovation
      and positions the schema as the formal language expressing it
    • Grammar — annotated DDL excerpt (Scan, AverageFrame,
      SegmentationParam, Segmentation) showing the --- separator, ->
      foreign keys, codec types, tier decoration
    • Semantics — three-condition existence rule for a Computed row,
      make() as a typed function, git-hash code provenance per row
    • The query algebra (brief + link)
    • Types (brief + link)
    • Self-healing operational semantics — populate() brings the world
      into compliance with the schema
    • Machine-readability and export — DOT/Mermaid, YAML/JSON, W3C PROV,
      OpenLineage, PROV-O, workflow-language conversion
    • The schema as control plane — declarative, queryable, enforceable,
      observable (parallel to network routing tables)
  • explanation/comparison-to-workflow-languages.md (~870 words)

    • Fair structural comparison against file-based workflow systems (CWL,
      Snakemake, Nextflow) and task orchestrators (Airflow, Argo, Prefect,
      Dagster), with adjacent categories (data catalogs, lakehouses) noted
      but separated
    • Side-by-side table across nine concerns (data structure, types, FK
      integrity, computation spec, execution order, provenance, drift
      detection, query interface, retry/idempotence)
    • What workflow languages offer, what they omit, DataJoint's deliberate
      trade-off (paraphrased from Yatsenko & Nguyen 2026 Section 5)
    • Convertibility — any CWL workflow translates mechanically to a
      DataJoint schema and back; DataJoint adds the data-structure layer
      that workflow languages omit; GATK WGS example referenced
    • When to choose what, including the "use both" production pattern
      (DataJoint inside an Airflow / Argo / Prefect orchestration)
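
For reviewers who want a concrete picture of the "self-healing" bullet above, here is a minimal sketch in plain Python. It does not use the DataJoint API: the sets and the dict are toy stand-ins for tables, and the key values and the `make` body are hypothetical. It only illustrates the idea that the upstream tables determine which rows should exist, and that `populate()` computes whatever is missing.

```python
from itertools import product

# Toy stand-ins for upstream tables: each is a set of primary-key tuples.
# The key values are hypothetical.
average_frame = {("scan1",), ("scan2",)}
segmentation_param = {("p1",), ("p2",)}

# Toy stand-in for the Computed table: primary key -> computed row.
segmentation = {}


def key_source():
    """Every key combination the upstream tables say should exist."""
    return {a + p for a, p in product(average_frame, segmentation_param)}


def make(key):
    """Hypothetical computation producing the row for one key."""
    scan_id, param_id = key
    return f"masks({scan_id},{param_id})"


def populate():
    """Compute every missing row and return the keys that were filled in."""
    missing = sorted(key_source() - set(segmentation))
    for key in missing:
        segmentation[key] = make(key)
    return missing


first = populate()                  # fills every upstream combination
second = populate()                 # nothing is missing, so nothing runs
del segmentation[("scan1", "p2")]   # simulate a lost or deleted row
healed = populate()                 # only the missing row is recomputed
```

Running `populate()` twice in a row is a no-op the second time, and after a row is removed the next call recomputes only that row. That is the sense in which the sketch "brings the world into compliance" with what the upstream tables declare.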

Integration with existing concept set

  • Nav (mkdocs.yaml): place the two new pages at the end of the
    Data Model group so the progression reads
    paradigm > components > synthesis > comparison:
    RWM > Entity Integrity > Normalization > Computation Model >
    Schema as a Workflow Specification > Comparison to Workflow Languages.
  • Concepts landing page (explanation/index.md): cards added for
    both new pages.
  • FAQ (faq.md): the "Is DataJoint a Workflow Management System?"
    answer overlapped substantively with the new Comparison page; trimmed
    it to a two-paragraph pointer.
  • Data Pipelines (data-pipelines.md): the "Comparing Approaches"
    table was a mini-version of the new Comparison page; trimmed to a
    short paragraph + pointer.

Merge order with PR #184

Both new pages cross-reference the expanded Relational Workflow Model
page from PR #184. Suggested merge order:

  1. PR #184, "Expand Relational Workflow Model concept page" (expand RWM intro), first
  2. This PR second

If merged in the opposite order, the new pages still resolve their links
correctly — the cross-references just read against the older, shorter
RWM page until #184 lands.

…mparison to Workflow Languages

Two new pages under Concepts > Data Model that follow from the
Relational Workflow Model overview and address the informed-reader
questions the overview page cannot answer in its scope:

1. Schema as a Workflow Specification
   - Names the Relational Workflow Model as DataJoint's major innovation
   - Describes the schema as a formal language: grammar (annotated DDL
     excerpt for the Scan / AverageFrame / SegmentationParam /
     Segmentation pipeline), typed semantics (three-condition existence
     rule for a Computed row), the make() contract recording the git
     hash of the producing code, the five-operator algebra with
     closure, the type system, populate() as the self-healing engine
     that brings the world into compliance with the schema, and
     machine-readability / export pathways (DOT, Mermaid, YAML, JSON,
     W3C PROV, OpenLineage, PROV-O, workflow-language conversion).
   - Closes with the schema-as-control-plane framing (parallel to
     routing tables in a network control plane).

2. Comparison to Workflow Languages
   - Fair, structural comparison against CWL, Snakemake, Nextflow
     (file-based workflows) and Airflow, Argo, Prefect, Dagster (task
     orchestrators). Adjacent categories (data catalogs, lakehouses)
     noted but flagged as solving different problems.
   - Side-by-side table across nine concerns (data structure, types,
     FK integrity, computation, execution order, provenance, drift
     detection, query interface, retry semantics).
   - What workflow languages offer, what they omit, DataJoint's
     deliberate trade-off (paraphrasing Section 5 of Yatsenko & Nguyen
     2026).
   - Convertibility: any CWL workflow translates mechanically to a
     DataJoint schema and back, with the data-structure layer the
     workflow language omits supplied on conversion. GATK WGS pipeline
     used as the empirical reference.
   - "When to choose what" guidance including the "use both" pattern
     (DataJoint inside an Airflow / Argo / Prefect orchestration).
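
As an illustration of the grammar and export points in item 1 above, the sketch below parses DataJoint-style definition strings in plain Python, derives an execution order from the foreign keys, and emits a Mermaid diagram. It is a toy, not the DataJoint parser. The attribute names and types are placeholders, and the dependency structure (Segmentation referencing AverageFrame and SegmentationParam) is an assumption made for the example, since the actual DDL excerpt lives in the new page rather than in this description.

```python
from graphlib import TopologicalSorter

# Hypothetical definitions: attribute names and types are placeholders.
definitions = {
    "Scan": """
        scan_id : int
        ---
        scan_path : varchar(255)
    """,
    "AverageFrame": """
        -> Scan
        ---
        mean_intensity : float
    """,
    "SegmentationParam": """
        param_id : int
        ---
        threshold : float
    """,
    "Segmentation": """
        -> AverageFrame
        -> SegmentationParam
        ---
        n_masks : int
    """,
}


def parse(definition):
    """Split a definition into primary attributes, secondary attributes, and parents."""
    primary, secondary, parents = [], [], []
    target = primary                      # lines above '---' belong to the primary key
    for raw in definition.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("---"):
            target = secondary            # lines below '---' are secondary attributes
        elif line.startswith("->"):
            parents.append(line[2:].strip())   # foreign-key reference to a parent table
        else:
            target.append(line.split(":")[0].strip())
    return {"primary": primary, "secondary": secondary, "parents": parents}


parsed = {name: parse(text) for name, text in definitions.items()}

# Execution order follows from the foreign keys: parents come before children.
graph = {name: set(info["parents"]) for name, info in parsed.items()}
order = list(TopologicalSorter(graph).static_order())

# One possible machine-readable export: a Mermaid flowchart of the dependencies.
edges = [f"    {parent} --> {name}"
         for name, info in parsed.items() for parent in info["parents"]]
mermaid = "\n".join(["graph TD"] + edges)
print(mermaid)
```

The same `parsed` dictionary could be dumped with `json.dumps` to get a JSON view of the schema. The point of the sketch is only that the structure is recoverable from the declarations alone, without running any pipeline code.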

Nav: both pages inserted under Concepts > Data Model after Relational
Workflow Model and before Entity Integrity, in mkdocs.yaml.
…ines

Cohesion pass after adding Schema as a Workflow Specification and
Comparison to Workflow Languages:

- Nav (mkdocs.yaml): move the two new pages to the end of the Data Model
  group so the progression reads paradigm > components > synthesis >
  comparison: Relational Workflow Model > Entity Integrity > Normalization
  > Computation Model > Schema as a Workflow Specification > Comparison
  to Workflow Languages.
- Concepts index (explanation/index.md): add cards for both new pages.
- FAQ (faq.md): the "Is DataJoint a Workflow Management System?" answer
  was duplicating the Comparison page; trim it to a two-paragraph
  pointer to the new page.
- Data Pipelines (data-pipelines.md): the "Comparing Approaches" table
  was a mini version of the new Comparison page; trim to a short
  paragraph + pointer.