From c144ac272d95ba56112c9d759f0e0246c0c5ca3a Mon Sep 17 00:00:00 2001 From: Manas Srivastava Date: Sat, 6 Jun 2026 01:24:58 +0530 Subject: [PATCH] ci(openapi): add oasdiff breaking-change gate (Wave 1 contract-drift) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Wave 1 contract-drift gate — docs/ci/01-CI-INTEGRATION-DESIGN.md. Adds .github/workflows/openapi-breaking.yml: on every PR touching openapi.snapshot.json, diff it against the base branch's committed snapshot with oasdiff (pinned v1.11.7) and FAIL the PR on a breaking change (removed/renamed/retyped response field, new required request field, removed endpoint/response code). Pure additions pass. Uses --fail-on WARN, not ERR: the api emits response fields as optional, and oasdiff classifies removing an optional response property as WARN — yet that removal is exactly the login-break class (UI reads me.tier, api drops it, UI gets undefined). ERR would miss it; WARN catches it while still passing additive changes. Verified locally against removed-field, rename, new-required-request-field, removed-endpoint (all red) and optional-field-add / new-endpoint / response-enum-removal (all green). The companion consumer-side gate lives in instanode-web (openapi-typescript codegen makes the same rename fail tsc in npm run gate). Together they move contract drift from a runtime prod break to a compile/PR-time failure. Docs: api/docs/OPENAPI-CONTRACT-GATES.md. Co-Authored-By: Claude Opus 4.8 (1M context) --- .github/workflows/openapi-breaking.yml | 155 +++++++++++++++++++++++++ docs/OPENAPI-CONTRACT-GATES.md | 74 ++++++++++++ 2 files changed, 229 insertions(+) create mode 100644 .github/workflows/openapi-breaking.yml create mode 100644 docs/OPENAPI-CONTRACT-GATES.md diff --git a/.github/workflows/openapi-breaking.yml b/.github/workflows/openapi-breaking.yml new file mode 100644 index 00000000..69e10a01 --- /dev/null +++ b/.github/workflows/openapi-breaking.yml @@ -0,0 +1,155 @@ +# OpenAPI breaking-change gate (Wave 1 — UI↔API contract-drift gate). +# Design ref: docs/ci/01-CI-INTEGRATION-DESIGN.md (Wave 1, HIGHEST ROI). +# +# Why this exists: the UI↔API contract lives as three unreconciled hand- +# maintained copies (api handlers → instanode-web/src/api/types.ts → +# Playwright mock fixtures). A backend field/enum/status rename passes the +# api's own unit tests AND the web's tsc+vitest (because the web mirrors the +# old shape by hand) and then breaks prod at runtime — this is the class that +# broke login for ~24h (AUTH-004). +# +# This gate catches the api half: it diffs the PR's openapi.snapshot.json +# against the BASE branch's committed snapshot using oasdiff and FAILS the PR +# on any BREAKING change: +# - a removed or renamed response field / schema property +# - a narrowed / changed property type +# - a removed enum value +# - a removed endpoint or response code +# - a newly-required request field / parameter +# Non-breaking ADDITIONS (a new optional field, a new endpoint, a new enum +# value on a response) pass — the web client tolerates those. +# +# The companion gate on the producer side is openapi-snapshot.yml, which +# guarantees openapi.snapshot.json is regenerated whenever openapi.go changes +# (so the snapshot diffed here is always faithful to the handlers). The +# companion gate on the CONSUMER side lives in instanode-web: openapi-typescript +# codegen makes the same rename fail `tsc` in `npm run gate`. Together they +# move contract drift from a runtime prod break to a compile/PR-time failure. +# +# How to ship an INTENTIONAL breaking change: this gate failing is the signal, +# not a wall. Land the api change AND update the consumer (instanode-web +# regenerates src/api/generated.ts from the new snapshot; its tsc will red at +# every UI site using the removed/renamed field until they're fixed). Per rule +# 22 (contract-surface checklist) a breaking contract change touches all +# surfaces in one change. If a break is genuinely intentional and the consumer +# is updated in lockstep, override is by repo-admin merge with the consumer PR +# linked — there is deliberately no in-workflow allowlist, so every override is +# a human decision recorded on the PR. +# +# Observability (rule 25): one structured log line per run so the GitHub +# Actions → NR forwarder can chart the rate at which breaking changes are +# caught at PR time (instanode-reliability tile: "Breaking contract changes +# caught at PR time, 30d"). + +name: openapi-breaking + +on: + pull_request: + branches: [master] + paths: + - 'openapi.snapshot.json' + - '.github/workflows/openapi-breaking.yml' + workflow_dispatch: + +concurrency: + group: openapi-breaking-${{ github.workflow }}-${{ github.ref }} + cancel-in-progress: true + +# oasdiff version is PINNED — a floating version could change the breaking- +# change ruleset under us and either silently let a break through (bad) or +# start failing on a previously-allowed shape (noise). Bump deliberately. +env: + OASDIFF_VERSION: v1.11.7 + +jobs: + breaking-diff: + runs-on: ubuntu-latest + steps: + - name: Checkout PR head + uses: actions/checkout@v6 + with: + fetch-depth: 0 + + - name: Fetch base branch snapshot + id: base + env: + # Quoted into an env var (not interpolated into the script body) per + # GitHub Actions injection guidance. base.ref is a branch name, but + # this keeps the run block free of ${{ }} interpolation. + BASE: ${{ github.event.pull_request.base.ref || 'master' }} + run: | + git fetch origin "${BASE}" --depth=1 + # Extract the base branch's committed snapshot to a temp file. If the + # base has no snapshot yet (first introduction of the file), there is + # nothing to break against — treat as a clean pass. + if git show "origin/${BASE}:openapi.snapshot.json" > /tmp/base.openapi.snapshot.json 2>/dev/null; then + echo "base_has_snapshot=true" >> "$GITHUB_OUTPUT" + else + echo "base_has_snapshot=false" >> "$GITHUB_OUTPUT" + echo "::notice::Base branch has no openapi.snapshot.json — nothing to diff against; passing." + fi + + - name: Install oasdiff (pinned) + if: steps.base.outputs.base_has_snapshot == 'true' + run: | + set -euo pipefail + curl -sSfL "https://github.com/oasdiff/oasdiff/releases/download/${OASDIFF_VERSION}/oasdiff_${OASDIFF_VERSION#v}_linux_amd64.tar.gz" \ + | tar -xz -C /tmp oasdiff + sudo install /tmp/oasdiff /usr/local/bin/oasdiff + oasdiff --version + + - name: Diff for breaking changes (human-readable) + if: steps.base.outputs.base_has_snapshot == 'true' + run: | + echo "### Breaking-change diff (base → PR)" + # Non-failing pass first so the full breaking-change report appears in + # the log even when the gate fails below. + oasdiff breaking /tmp/base.openapi.snapshot.json openapi.snapshot.json || true + + - name: Emit observability line (rule 25) + if: steps.base.outputs.base_has_snapshot == 'true' + id: detect + run: | + set +e + # --fail-on WARN (NOT ERR). This is deliberate and load-bearing: + # + # The api emits most response fields as OPTIONAL (no `required` array + # on response schemas — see openapi.go). oasdiff classifies REMOVING + # an optional RESPONSE property as WARN, not ERR. But removing an + # optional response field IS the exact class that broke login: the + # UI reads `me.tier` / `r.storage_bytes`, the api drops/renames it, + # the UI silently gets `undefined`. `--fail-on ERR` would let that + # slip through (verified locally: removing AuthMeResponse.tier is a + # WARN). `--fail-on WARN` catches it AND still PASSES pure additions: + # - adding an optional response/request field → INFO → pass + # - adding a new endpoint → INFO → pass + # - removing an enum value from a RESPONSE → INFO → pass + # (server sends fewer values; the consumer's union still accepts) + # while FAILING the genuinely-breaking shapes: + # - removing/renaming a response field → WARN → fail + # - adding a new REQUIRED request field → ERR → fail + # - removing an endpoint / response code → ERR → fail + oasdiff breaking /tmp/base.openapi.snapshot.json openapi.snapshot.json --fail-on WARN >/dev/null 2>&1 + BREAKING=$? + set -e + if [ "$BREAKING" -ne 0 ]; then DETECTED=true; else DETECTED=false; fi + echo "detected=${DETECTED}" >> "$GITHUB_OUTPUT" + printf '{"event":"openapi_breaking_change","detected":%s,"repo":"api","pr":"%s","sha":"%s"}\n' \ + "${DETECTED}" \ + "${{ github.event.pull_request.number || 'none' }}" \ + "${GITHUB_SHA:0:7}" + + - name: Fail on breaking change + if: steps.base.outputs.base_has_snapshot == 'true' && steps.detect.outputs.detected == 'true' + run: | + echo "::error::This PR makes a BREAKING change to the OpenAPI contract (removed/renamed/retyped field, removed enum value, removed endpoint, or newly-required request field)." + echo "::error::The UI (instanode-web) and SDKs mirror this contract. A breaking change here breaks prod at runtime unless the consumer is updated in lockstep (rule 22)." + echo "::error::If this break is INTENTIONAL: update instanode-web (regenerate src/api/generated.ts from the new snapshot; its tsc will red at every UI site using the removed field) in a linked PR, then merge via repo-admin with both PRs cross-referenced." + echo "" + echo "Breaking changes detected (WARN+ severity):" + oasdiff breaking /tmp/base.openapi.snapshot.json openapi.snapshot.json --severity-levels warn,error || true + exit 1 + + - name: No breaking changes + if: steps.base.outputs.base_has_snapshot == 'true' && steps.detect.outputs.detected == 'false' + run: echo "No breaking OpenAPI changes vs base branch — additive-only or no contract change. Consumers can regenerate clients safely." diff --git a/docs/OPENAPI-CONTRACT-GATES.md b/docs/OPENAPI-CONTRACT-GATES.md new file mode 100644 index 00000000..960650b1 --- /dev/null +++ b/docs/OPENAPI-CONTRACT-GATES.md @@ -0,0 +1,74 @@ +# OpenAPI contract gates + +The UI↔API contract is the single most expensive thing to get wrong here: it +lives as three hand-maintained copies (api handlers → `instanode-web`'s +`src/api/types.ts` → Playwright mock fixtures). A backend field/enum/status +rename passes the api's own unit tests AND the web's `tsc`+`vitest` (the web +mirrors the *old* shape by hand) and then breaks prod at runtime. That is the +class that broke login for ~24h (AUTH-004). + +Two CI gates make that drift a compile/PR-time failure instead of a prod break. + +## 1. `openapi-snapshot.yml` — producer freshness gate + +`openapi.snapshot.json` is the committed OpenAPI 3.1 spec, generated from +`internal/handlers/openapi.go` by `cmd/openapi-snapshot`. This gate regenerates +it on every PR that touches the handlers/snapshot and **fails if the committed +file is stale**. So the snapshot is always faithful to the live handlers. + +Regenerate locally: + +```sh +make openapi-snapshot # writes openapi.snapshot.json +``` + +## 2. `openapi-breaking.yml` — breaking-change gate (Wave 1) + +Diffs the PR's `openapi.snapshot.json` against the **base branch's** committed +snapshot using [`oasdiff`](https://github.com/oasdiff/oasdiff) (pinned version) +and **fails the PR on any breaking change**: + +| Change | oasdiff severity | Gate | +|---|---|---| +| Remove / rename a response field | WARN | **fails** | +| Narrow / change a property type | WARN/ERR | **fails** | +| Add a new **required** request field | ERR | **fails** | +| Remove an endpoint or response code | ERR | **fails** | +| Add an optional response/request field | INFO | passes | +| Add a new endpoint | INFO | passes | +| Add a new enum value to a response | INFO | passes | +| Remove an enum value from a **response** | INFO | passes (not breaking for a consumer) | + +The gate runs `oasdiff breaking --fail-on WARN`. + +### Why `--fail-on WARN`, not `--fail-on ERR` + +The api emits most response fields as **optional** (no `required` array on +response schemas). oasdiff classifies removing an optional *response* property +as **WARN**, not ERR — but that removal is exactly the login-break class (the UI +reads `me.tier`, the api drops it, the UI silently gets `undefined`). +`--fail-on ERR` would let it through. `--fail-on WARN` catches it while still +passing pure additions. Verified locally: + +```sh +# breaking: remove AuthMeResponse.tier → exit 1 +oasdiff breaking base.json removed-tier.json --fail-on WARN; echo $? # 1 +# additive: add an optional field → exit 0 +oasdiff breaking base.json added-field.json --fail-on WARN; echo $? # 0 +``` + +## Shipping an intentional breaking change + +The gate failing is the **signal**, not a wall. Per rule 22 (contract-surface +checklist) a breaking contract change touches all surfaces in one change: + +1. Land the api change (regenerate the snapshot with `make openapi-snapshot`). +2. In `instanode-web`, regenerate `src/api/generated.ts` from the new snapshot + (`npm run gen:api-types`). Its `tsc` will red at every UI site using the + removed/renamed field — fix them in the same PR. +3. Merge via repo-admin with the two PRs cross-referenced. There is + deliberately no in-workflow allowlist, so every breaking override is a + recorded human decision on the PR. + +The consumer-side gate (`instanode-web`'s `gen:api-types` + `tsc`) and this +producer-side gate together move contract drift from runtime to compile time.