Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
155 changes: 155 additions & 0 deletions .github/workflows/openapi-breaking.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,155 @@
# OpenAPI breaking-change gate (Wave 1 — UI↔API contract-drift gate).
# Design ref: docs/ci/01-CI-INTEGRATION-DESIGN.md (Wave 1, HIGHEST ROI).
#
# Why this exists: the UI↔API contract lives as three unreconciled hand-
# maintained copies (api handlers → instanode-web/src/api/types.ts →
# Playwright mock fixtures). A backend field/enum/status rename passes the
# api's own unit tests AND the web's tsc+vitest (because the web mirrors the
# old shape by hand) and then breaks prod at runtime — this is the class that
# broke login for ~24h (AUTH-004).
#
# This gate catches the api half: it diffs the PR's openapi.snapshot.json
# against the BASE branch's committed snapshot using oasdiff and FAILS the PR
# on any BREAKING change:
# - a removed or renamed response field / schema property
# - a narrowed / changed property type
# - a removed enum value
# - a removed endpoint or response code
# - a newly-required request field / parameter
# Non-breaking ADDITIONS (a new optional field, a new endpoint, a new enum
# value on a response) pass — the web client tolerates those.
#
# The companion gate on the producer side is openapi-snapshot.yml, which
# guarantees openapi.snapshot.json is regenerated whenever openapi.go changes
# (so the snapshot diffed here is always faithful to the handlers). The
# companion gate on the CONSUMER side lives in instanode-web: openapi-typescript
# codegen makes the same rename fail `tsc` in `npm run gate`. Together they
# move contract drift from a runtime prod break to a compile/PR-time failure.
#
# How to ship an INTENTIONAL breaking change: this gate failing is the signal,
# not a wall. Land the api change AND update the consumer (instanode-web
# regenerates src/api/generated.ts from the new snapshot; its tsc will red at
# every UI site using the removed/renamed field until they're fixed). Per rule
# 22 (contract-surface checklist) a breaking contract change touches all
# surfaces in one change. If a break is genuinely intentional and the consumer
# is updated in lockstep, override is by repo-admin merge with the consumer PR
# linked — there is deliberately no in-workflow allowlist, so every override is
# a human decision recorded on the PR.
#
# Observability (rule 25): one structured log line per run so the GitHub
# Actions → NR forwarder can chart the rate at which breaking changes are
# caught at PR time (instanode-reliability tile: "Breaking contract changes
# caught at PR time, 30d").

name: openapi-breaking

on:
pull_request:
branches: [master]
paths:
- 'openapi.snapshot.json'
- '.github/workflows/openapi-breaking.yml'
workflow_dispatch:

concurrency:
group: openapi-breaking-${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

# oasdiff version is PINNED — a floating version could change the breaking-
# change ruleset under us and either silently let a break through (bad) or
# start failing on a previously-allowed shape (noise). Bump deliberately.
env:
OASDIFF_VERSION: v1.11.7

jobs:
breaking-diff:
runs-on: ubuntu-latest
steps:
- name: Checkout PR head
uses: actions/checkout@v6
with:
fetch-depth: 0

- name: Fetch base branch snapshot
id: base
env:
# Quoted into an env var (not interpolated into the script body) per
# GitHub Actions injection guidance. base.ref is a branch name, but
# this keeps the run block free of ${{ }} interpolation.
BASE: ${{ github.event.pull_request.base.ref || 'master' }}
run: |
git fetch origin "${BASE}" --depth=1
# Extract the base branch's committed snapshot to a temp file. If the
# base has no snapshot yet (first introduction of the file), there is
# nothing to break against — treat as a clean pass.
if git show "origin/${BASE}:openapi.snapshot.json" > /tmp/base.openapi.snapshot.json 2>/dev/null; then
echo "base_has_snapshot=true" >> "$GITHUB_OUTPUT"
else
echo "base_has_snapshot=false" >> "$GITHUB_OUTPUT"
echo "::notice::Base branch has no openapi.snapshot.json — nothing to diff against; passing."
fi

- name: Install oasdiff (pinned)
if: steps.base.outputs.base_has_snapshot == 'true'
run: |
set -euo pipefail
curl -sSfL "https://github.com/oasdiff/oasdiff/releases/download/${OASDIFF_VERSION}/oasdiff_${OASDIFF_VERSION#v}_linux_amd64.tar.gz" \
| tar -xz -C /tmp oasdiff
sudo install /tmp/oasdiff /usr/local/bin/oasdiff
oasdiff --version

- name: Diff for breaking changes (human-readable)
if: steps.base.outputs.base_has_snapshot == 'true'
run: |
echo "### Breaking-change diff (base → PR)"
# Non-failing pass first so the full breaking-change report appears in
# the log even when the gate fails below.
oasdiff breaking /tmp/base.openapi.snapshot.json openapi.snapshot.json || true

- name: Emit observability line (rule 25)
if: steps.base.outputs.base_has_snapshot == 'true'
id: detect
run: |
set +e
# --fail-on WARN (NOT ERR). This is deliberate and load-bearing:
#
# The api emits most response fields as OPTIONAL (no `required` array
# on response schemas — see openapi.go). oasdiff classifies REMOVING
# an optional RESPONSE property as WARN, not ERR. But removing an
# optional response field IS the exact class that broke login: the
# UI reads `me.tier` / `r.storage_bytes`, the api drops/renames it,
# the UI silently gets `undefined`. `--fail-on ERR` would let that
# slip through (verified locally: removing AuthMeResponse.tier is a
# WARN). `--fail-on WARN` catches it AND still PASSES pure additions:
# - adding an optional response/request field → INFO → pass
# - adding a new endpoint → INFO → pass
# - removing an enum value from a RESPONSE → INFO → pass
# (server sends fewer values; the consumer's union still accepts)
# while FAILING the genuinely-breaking shapes:
# - removing/renaming a response field → WARN → fail
# - adding a new REQUIRED request field → ERR → fail
# - removing an endpoint / response code → ERR → fail
oasdiff breaking /tmp/base.openapi.snapshot.json openapi.snapshot.json --fail-on WARN >/dev/null 2>&1
BREAKING=$?
set -e
if [ "$BREAKING" -ne 0 ]; then DETECTED=true; else DETECTED=false; fi
echo "detected=${DETECTED}" >> "$GITHUB_OUTPUT"
printf '{"event":"openapi_breaking_change","detected":%s,"repo":"api","pr":"%s","sha":"%s"}\n' \
"${DETECTED}" \
"${{ github.event.pull_request.number || 'none' }}" \
"${GITHUB_SHA:0:7}"

- name: Fail on breaking change
if: steps.base.outputs.base_has_snapshot == 'true' && steps.detect.outputs.detected == 'true'
run: |
echo "::error::This PR makes a BREAKING change to the OpenAPI contract (removed/renamed/retyped field, removed enum value, removed endpoint, or newly-required request field)."
echo "::error::The UI (instanode-web) and SDKs mirror this contract. A breaking change here breaks prod at runtime unless the consumer is updated in lockstep (rule 22)."
echo "::error::If this break is INTENTIONAL: update instanode-web (regenerate src/api/generated.ts from the new snapshot; its tsc will red at every UI site using the removed field) in a linked PR, then merge via repo-admin with both PRs cross-referenced."
echo ""
echo "Breaking changes detected (WARN+ severity):"
oasdiff breaking /tmp/base.openapi.snapshot.json openapi.snapshot.json --severity-levels warn,error || true
exit 1

- name: No breaking changes
if: steps.base.outputs.base_has_snapshot == 'true' && steps.detect.outputs.detected == 'false'
run: echo "No breaking OpenAPI changes vs base branch — additive-only or no contract change. Consumers can regenerate clients safely."
74 changes: 74 additions & 0 deletions docs/OPENAPI-CONTRACT-GATES.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
# OpenAPI contract gates

The UI↔API contract is the single most expensive thing to get wrong here: it
lives as three hand-maintained copies (api handlers → `instanode-web`'s
`src/api/types.ts` → Playwright mock fixtures). A backend field/enum/status
rename passes the api's own unit tests AND the web's `tsc`+`vitest` (the web
mirrors the *old* shape by hand) and then breaks prod at runtime. That is the
class that broke login for ~24h (AUTH-004).

Two CI gates make that drift a compile/PR-time failure instead of a prod break.

## 1. `openapi-snapshot.yml` — producer freshness gate

`openapi.snapshot.json` is the committed OpenAPI 3.1 spec, generated from
`internal/handlers/openapi.go` by `cmd/openapi-snapshot`. This gate regenerates
it on every PR that touches the handlers/snapshot and **fails if the committed
file is stale**. So the snapshot is always faithful to the live handlers.

Regenerate locally:

```sh
make openapi-snapshot # writes openapi.snapshot.json
```

## 2. `openapi-breaking.yml` — breaking-change gate (Wave 1)

Diffs the PR's `openapi.snapshot.json` against the **base branch's** committed
snapshot using [`oasdiff`](https://github.com/oasdiff/oasdiff) (pinned version)
and **fails the PR on any breaking change**:

| Change | oasdiff severity | Gate |
|---|---|---|
| Remove / rename a response field | WARN | **fails** |
| Narrow / change a property type | WARN/ERR | **fails** |
| Add a new **required** request field | ERR | **fails** |
| Remove an endpoint or response code | ERR | **fails** |
| Add an optional response/request field | INFO | passes |
| Add a new endpoint | INFO | passes |
| Add a new enum value to a response | INFO | passes |
| Remove an enum value from a **response** | INFO | passes (not breaking for a consumer) |

The gate runs `oasdiff breaking <base> <pr> --fail-on WARN`.

### Why `--fail-on WARN`, not `--fail-on ERR`

The api emits most response fields as **optional** (no `required` array on
response schemas). oasdiff classifies removing an optional *response* property
as **WARN**, not ERR — but that removal is exactly the login-break class (the UI
reads `me.tier`, the api drops it, the UI silently gets `undefined`).
`--fail-on ERR` would let it through. `--fail-on WARN` catches it while still
passing pure additions. Verified locally:

```sh
# breaking: remove AuthMeResponse.tier → exit 1
oasdiff breaking base.json removed-tier.json --fail-on WARN; echo $? # 1
# additive: add an optional field → exit 0
oasdiff breaking base.json added-field.json --fail-on WARN; echo $? # 0
```

## Shipping an intentional breaking change

The gate failing is the **signal**, not a wall. Per rule 22 (contract-surface
checklist) a breaking contract change touches all surfaces in one change:

1. Land the api change (regenerate the snapshot with `make openapi-snapshot`).
2. In `instanode-web`, regenerate `src/api/generated.ts` from the new snapshot
(`npm run gen:api-types`). Its `tsc` will red at every UI site using the
removed/renamed field — fix them in the same PR.
3. Merge via repo-admin with the two PRs cross-referenced. There is
deliberately no in-workflow allowlist, so every breaking override is a
recorded human decision on the PR.

The consumer-side gate (`instanode-web`'s `gen:api-types` + `tsc`) and this
producer-side gate together move contract drift from runtime to compile time.
Loading