69 changes: 69 additions & 0 deletions .agents/skills/pipeline-doc-authoring/SKILL.md
@@ -0,0 +1,69 @@
---
name: pipeline-doc-authoring
description: Add or update this repo's automated pipeline documentation for the US data build. Use when Codex needs to document a new pipeline step, refresh stale diagram metadata after code changes, add `@pipeline_node` decorators, update `pipeline_stages.yaml` edges or groups, regenerate `docs/pipeline-diagrams/app/pipeline.json`, or verify that the ReactFlow/ELK docs stay aligned with the underlying pipeline code.
---

# Pipeline Doc Authoring

Use this skill to extend or repair the repository's automated pipeline documentation. The
documentation system is metadata-driven: code decorators describe process nodes, YAML describes
stage structure and edges, the extractor merges them into generated JSON, and the docs app renders
that JSON.

## Workflow

1. Map the real pipeline flow before editing docs. Read
   [references/source-of-truth.md](references/source-of-truth.md) first. Then inspect the actual
   driver method or orchestration path that runs the behavior you want to document. Derive order
   from the code, not from old YAML or old rendered JSON.

1. Choose the right documentation shape.

   - Use a normal decorated node for a pipeline-visible step that meaningfully transforms data or
     produces an artifact.
   - Use a YAML `extra_node` for fixed inputs, outputs, utilities, or external systems.
   - Use a YAML `group` for orchestration wrappers whose important content is already expanded into
     substeps.
   - Do not create nodes for trivial helpers that are only implementation detail.

1. Update code metadata. Add or refresh `@pipeline_node(PipelineNode(...))` on the implementing
   function in its real source file. Keep the `id` stable and unique. Write `description` and
   `details` from the current behavior, not from historical intent.

1. Update stage structure in `pipeline_stages.yaml`. Add or update stage descriptions,
   `extra_nodes`, `groups`, and `edges`. Edges control both graph order and node membership in a
   stage. A decorated node will not render unless at least one stage edge references its `id`.

1. Regenerate and validate. Run:

   - `python scripts/extract_pipeline.py`
   - `python -m py_compile scripts/extract_pipeline.py <changed-python-files>`
   - `cd docs/pipeline-diagrams && npx tsc --noEmit`
   - `git diff --check`

   Run `cd docs/pipeline-diagrams && npm run lint` when you touch the renderer. For pure metadata
   changes, TypeScript plus extractor validation is usually enough.

1. Review the generated diff semantically. Confirm that the new node or group appears in the
   intended stage, with the intended neighbors, and that the stage description still matches the
   real `generate()` or orchestration order.
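
For the final review step, a small helper can list a node's neighbors in the generated output. This is a sketch only: it assumes the JSON has a `stages` list whose entries carry `id` and `edges` with `source` and `target` keys, which may not match the real schema in `policyengine_us_data/pipeline_schema.py`. The node IDs below are hypothetical.

```python
def neighbors(pipeline, stage_id, node_id):
    """Return (upstream, downstream) node IDs for node_id within one stage."""
    for stage in pipeline["stages"]:
        if stage["id"] == stage_id:
            upstream = [e["source"] for e in stage["edges"] if e["target"] == node_id]
            downstream = [e["target"] for e in stage["edges"] if e["source"] == node_id]
            return upstream, downstream
    raise KeyError(f"unknown stage: {stage_id}")


# Hypothetical generated output; in practice, load it with json.load() from
# docs/pipeline-diagrams/app/pipeline.json.
PIPELINE = {
    "stages": [
        {
            "id": "2",
            "edges": [
                {"source": "load_raw", "target": "my_new_step"},
                {"source": "my_new_step", "target": "write_output"},
            ],
        }
    ]
}

UPSTREAM, DOWNSTREAM = neighbors(PIPELINE, "2", "my_new_step")
```

Comparing `UPSTREAM` and `DOWNSTREAM` against the order in the driver code is a quick way to catch an edge wired to the wrong neighbor.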

## Strong Patterns

- Read [references/examples-and-pitfalls.md](references/examples-and-pitfalls.md) before adding a
new node type or stage pattern.
- Reuse a single decorated node ID across multiple stages only when the underlying function is
genuinely the same conceptual step in both places.
- Prefer wrapper groups over duplicate wrapper nodes when a function is just an orchestrator around
already-documented substeps.

## Pitfalls

- Do not edit `docs/pipeline-diagrams/app/pipeline.json` by hand. It is generated.
- Do not add a decorator without wiring the node into `pipeline_stages.yaml`; the extractor treats
unexpected unused decorators as errors.
- Do not fix stale docs by changing YAML alone when the code metadata is also wrong; update both.
- Do not add fake edges just to make a wrapper function visible. Use `groups` when the wrapper is a
visual boundary, not a data-flow step.
- Do not assume a clean textual rebase means the docs are aligned. Re-read the rebased code path
after merges to `main`.
4 changes: 4 additions & 0 deletions .agents/skills/pipeline-doc-authoring/agents/openai.yaml
@@ -0,0 +1,4 @@
interface:
  display_name: "Pipeline Doc Authoring"
  short_description: "Update repo pipeline docs safely"
  default_prompt: "Use $pipeline-doc-authoring to add or update automated pipeline documentation for this repository."
@@ -0,0 +1,105 @@
# Examples and Pitfalls

## Strong examples

### Stage 2 Extended CPS process nodes

- `policyengine_us_data/datasets/cps/extended_cps.py`
  - `clone_features`
  - `cps_only`
  - `qrf_pass2`
  - `formula_drop`

Why these are good examples:

- They document real pipeline-visible transformations.
- Their wording tracks the current code path rather than vague historical descriptions.
- Their stage order is enforced in `pipeline_stages.yaml`.

### Shared node reused across stages

- `policyengine_us_data/utils/mortgage_interest.py`
  - `mortgage_convert`

Why this is a good example:

- The same decorated function is reused in both Stage 1 and Stage 2 by referencing the same node ID
in multiple stage edge sets.

### Orchestration wrapper rendered as a group

- `pipeline_stages.yaml`
  - Stage `3b` group for `create_stratified_cps_dataset()`
  - Stage `5` and `6` groups for `run_calibration()`

Why this is a good example:

- The wrapper is visible without creating fake data-flow nodes.
- The real substeps remain the actual graph nodes.

### Local-area build orchestration

- `policyengine_us_data/calibration/publish_local_area.py`
  - `build_h5`
  - `phase1`
  - `phase2`
  - `phase3`

Why this is a good example:

- It documents a multi-phase orchestrator while still allowing `main` to evolve helper behavior
underneath it.

## Common pitfalls

### Decorator added, node still missing

Cause:

- The node ID is not referenced by any stage edge in `pipeline_stages.yaml`.

Fix:

- Add the node to the correct stage by wiring its ID into one or more edges.

### Wrapper function duplicated as a normal node

Cause:

- The function is only an orchestrator around already-expanded substeps.

Fix:

- Prefer a `group` unless the wrapper itself is a meaningful pipeline step.

### Description is technically true but still stale

Cause:

- `main` changed the behavior inside the decorated function after the docs were first written.

Fix:

- Re-read the rebased code path and update `description`, `details`, and stage text together.

### Visual renderer edited for a metadata problem

Cause:

- A missing node or wrong order is treated as a ReactFlow/ELK issue.

Fix:

- Fix decorators, YAML, or extractor output first. Touch renderer code only when the rendering model
itself is wrong.

### Shared node reused incorrectly

Cause:

- The same node ID is reused across stages for two behaviors that have drifted apart.

Fix:

- Split into separate node IDs when the semantics differ, even if the implementation used to be
shared.
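
To find candidates for this review, it can help to list every node ID that appears in more than one stage. The sketch below assumes stages are dicts with `id` and `edges` (`source`/`target`) keys. `mortgage_convert` is taken from the example above, but the surrounding node IDs and edges are hypothetical.

```python
from collections import defaultdict


def shared_node_ids(stages):
    """Map each node ID used in more than one stage to the stage IDs that use it."""
    seen = defaultdict(list)
    for stage in stages:
        ids = {nid for e in stage["edges"] for nid in (e["source"], e["target"])}
        for nid in sorted(ids):
            seen[nid].append(stage["id"])
    return {nid: stage_ids for nid, stage_ids in seen.items() if len(stage_ids) > 1}


# Hypothetical edges; only the reuse of mortgage_convert mirrors the example above.
STAGES = [
    {"id": "1", "edges": [{"source": "stage1_input", "target": "mortgage_convert"}]},
    {"id": "2", "edges": [{"source": "stage2_input", "target": "mortgage_convert"}]},
]

SHARED = shared_node_ids(STAGES)
```

Each entry in `SHARED` is a prompt to re-read the function and confirm it still means the same thing in every stage that references it.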
@@ -0,0 +1,50 @@
# Source of Truth

Use these files in this order when updating pipeline docs.

## Core contract

- `policyengine_us_data/pipeline_metadata.py`
  - Defines the `@pipeline_node` decorator used in source files.
- `policyengine_us_data/pipeline_schema.py`
  - Defines the JSON-facing schema for stages, nodes, edges, and groups.

## Extraction and validation

- `scripts/extract_pipeline.py`
  - Scans decorated source files.
  - Merges code nodes with `pipeline_stages.yaml`.
  - Fails on unexpected unused decorators.
  - Contains the allowlist for intentionally omitted wrapper nodes.

Important rule: stage edges determine which decorated nodes are included in a stage. If a node ID is
never referenced by an edge, it will not appear in the generated graph.
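
The rule and the allowlist can be illustrated with a short check. This is a sketch of the idea, not the code in `scripts/extract_pipeline.py`: the function name, the `source`/`target` edge keys, and all node IDs below are assumptions made for illustration.

```python
def unused_decorated_ids(decorated_ids, stages, allowlist=frozenset()):
    """Return decorated node IDs that no stage edge references and that are not allowlisted."""
    referenced = {
        nid
        for stage in stages
        for edge in stage["edges"]
        for nid in (edge["source"], edge["target"])
    }
    return sorted(set(decorated_ids) - referenced - set(allowlist))


# Hypothetical inputs: three decorated functions, one stage, one allowlisted wrapper.
DECORATED = ["step_a", "step_b", "wrapper_fn"]
STAGES = [{"id": "1", "edges": [{"source": "input_file", "target": "step_a"}]}]
ALLOWLIST = {"wrapper_fn"}

UNUSED = unused_decorated_ids(DECORATED, STAGES, ALLOWLIST)
if UNUSED:
    # A real extractor would likely stop here; this sketch only reports.
    print(f"unused decorated nodes: {UNUSED}")
```

Here `step_b` is reported because no edge mentions it, while `wrapper_fn` is skipped because it is allowlisted.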

## Stage structure

- `pipeline_stages.yaml`
  - Stage descriptions
  - Manual inputs/outputs/utilities
  - Data-flow and utility edges
  - Visual wrapper groups

## Generated artifact

- `docs/pipeline-diagrams/app/pipeline.json`
  - Generated output consumed by the docs app
  - Never edit by hand

## Docs app

- `docs/pipeline-diagrams/app/components/PipelineDiagram.tsx`
  - Renders flat nodes plus wrapper groups
  - Most metadata-only changes should not require edits here
- `docs/pipeline-diagrams/README.md`
  - Operator workflow for regeneration and checks

## Validation commands

- `python scripts/extract_pipeline.py`
- `python -m py_compile scripts/extract_pipeline.py <changed-python-files>`
- `cd docs/pipeline-diagrams && npx tsc --noEmit`
- `git diff --check`
28 changes: 28 additions & 0 deletions .github/commit-pipeline-diagram-json.sh
@@ -0,0 +1,28 @@
#!/usr/bin/env bash

set -euo pipefail

PIPELINE_JSON="docs/pipeline-diagrams/app/pipeline.json"

append_summary() {
  if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
    {
      echo "## Pipeline Diagram Docs"
      echo
      echo "$1"
    } >> "$GITHUB_STEP_SUMMARY"
  fi
}

if git diff --quiet -- "$PIPELINE_JSON"; then
  append_summary "No generated pipeline JSON changes detected."
  exit 0
fi

git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git add "$PIPELINE_JSON"
git commit -m "Auto-update pipeline JSON"
git push origin HEAD:main

append_summary "Updated \`$PIPELINE_JSON\` on \`main\`. Connected Vercel deployment will pick up that commit."
59 changes: 56 additions & 3 deletions .github/workflows/push.yaml
@@ -7,6 +7,7 @@ on:
 jobs:
   lint:
     runs-on: ubuntu-latest
+    if: github.event.head_commit.message != 'Auto-update pipeline JSON'
     steps:
       - uses: actions/checkout@v4
       - run: pip install ruff>=0.9.0
@@ -16,7 +17,7 @@ jobs:
   build-and-test:
     runs-on: ubuntu-latest
     needs: lint
-    if: github.event.head_commit.message != 'Update package version'
+    if: github.event.head_commit.message != 'Update package version' && github.event.head_commit.message != 'Auto-update pipeline JSON'
     env:
       MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
       MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
@@ -36,7 +37,7 @@ jobs:
   # ── Documentation ──────────────────────────────────────────
   docs:
     runs-on: ubuntu-latest
-    if: github.event.head_commit.message != 'Update package version'
+    if: github.event.head_commit.message != 'Update package version' && github.event.head_commit.message != 'Auto-update pipeline JSON'
     permissions:
       contents: write
     steps:
@@ -63,7 +64,7 @@ jobs:
   # ── Versioning (bump + changelog on non-version-bump pushes) ──
   versioning:
     runs-on: ubuntu-latest
-    if: github.event.head_commit.message != 'Update package version'
+    if: github.event.head_commit.message != 'Update package version' && github.event.head_commit.message != 'Auto-update pipeline JSON'
     steps:
       - name: Generate GitHub App token
         id: app-token
@@ -92,6 +93,58 @@ jobs:
           add: "."
           message: Update package version

+  # ── Pipeline diagram sync (after versioning) ───────────────
+  pipeline-diagram-sync:
+    runs-on: ubuntu-latest
+    needs: versioning
+    if: needs.versioning.result == 'success' && github.event.head_commit.message != 'Update package version' && github.event.head_commit.message != 'Auto-update pipeline JSON'
+    permissions:
+      contents: write
+    steps:
+      - name: Generate GitHub App token
+        id: app-token
+        uses: actions/create-github-app-token@v1
+        with:
+          app-id: ${{ secrets.APP_ID }}
+          private-key: ${{ secrets.APP_PRIVATE_KEY }}
+
+      - uses: actions/checkout@v4
+        with:
+          token: ${{ steps.app-token.outputs.token }}
+          fetch-depth: 0
+
+      - name: Refresh to latest main after versioning
+        run: |
+          git fetch origin main
+          git checkout --detach FETCH_HEAD
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install extractor dependencies
+        run: python -m pip install PyYAML
+
+      - uses: actions/setup-node@v4
+        with:
+          node-version: "24"
+          cache: "npm"
+          cache-dependency-path: docs/pipeline-diagrams/package-lock.json
+
+      - name: Rebuild pipeline JSON
+        run: python scripts/extract_pipeline.py
+
+      - name: Install diagram app dependencies
+        working-directory: docs/pipeline-diagrams
+        run: npm ci
+
+      - name: Build pipeline diagram docs
+        working-directory: docs/pipeline-diagrams
+        run: npm run build
+
+      - name: Commit pipeline diagram JSON if changed
+        run: bash .github/commit-pipeline-diagram-json.sh
+
   # ── PyPI publish (version bump commits only) ────────────────
   publish:
     runs-on: ubuntu-latest
8 changes: 7 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
-.PHONY: all format test test-unit test-integration install download upload docker documentation data validate-data calibrate calibrate-build publish-local-area upload-calibration upload-dataset push-to-modal build-data-modal build-matrices calibrate-modal calibrate-modal-national calibrate-both stage-h5s stage-national-h5 stage-all-h5s pipeline validate-staging validate-staging-full upload-validation check-staging check-sanity clean build paper clean-paper presentations database database-refresh promote-dataset promote build-h5s validate-local refresh-soi-targets push-pr-branch
+.PHONY: all format test test-unit test-integration install download upload docker documentation data validate-data calibrate calibrate-build publish-local-area upload-calibration upload-dataset upload-database push-to-modal build-data-modal build-matrices calibrate-modal calibrate-modal-national calibrate-both stage-h5s stage-national-h5 stage-all-h5s pipeline validate-staging validate-staging-full upload-validation check-staging check-sanity clean build paper clean-paper presentations database database-refresh promote-database promote-dataset promote build-h5s validate-local refresh-soi-targets push-pr-branch diagrams-install diagrams

 SOI_SOURCE_YEAR ?= 2021
 SOI_TARGET_YEAR ?= 2023
@@ -298,3 +298,9 @@ presentations/nta_2024_11/nta_2024_slides.pdf: presentations/nta_2024_11/main.te
 	cd presentations/nta_2024_11 && \
 	pdflatex -jobname=nta_2024_slides main && \
 	pdflatex -jobname=nta_2024_slides main
+
+diagrams-install:
+	cd docs/pipeline-diagrams && npm install
+
+diagrams:
+	cd docs/pipeline-diagrams && npx next dev
1 change: 1 addition & 0 deletions changelog.d/pipeline-diagrams.changed.md
@@ -0,0 +1 @@
Expanded the automated pipeline documentation diagrams to cover clone-feature rematching, structural mortgage conversion, and wrapper-group steps in the US data build.