diff --git a/designs/golden_templates_and_semver_check.md b/designs/golden_templates_and_semver_check.md new file mode 100644 index 00000000000..2d9cbc3d920 --- /dev/null +++ b/designs/golden_templates_and_semver_check.md @@ -0,0 +1,548 @@ +Title: Golden Templates and Semver Check +======================================== + +What is the problem? +-------------------- + +SAM CLI processes user templates and writes the result to disk for +CloudFormation to consume. The output template is the contract with users +and their tooling — yet the existing test suite has no review surface for +that output's *shape*. Unit tests verify individual functions in +isolation; integration tests assert exit codes and a few narrow +properties (`Resources.X.Type == ...`); neither puts the rendered +template in front of a human reviewer. + +This shows up most painfully in feature PRs that introduce new +output-shaping behavior. The reviewer can read the code and run the +tests, but cannot easily see *what the feature produces for representative +inputs* — the diff in the PR is `samcli/lib/...` Python code, not the +YAML that customers will deploy. Subtle interactions between pipeline +stages — LE expansion vs. SAM transform vs. global-transform export +vs. artifact-path merge — are invisible at review time. + +PR #8637 (sam-cli 1.160.0, in-tree CFN Language Extensions support) is +exhibit A. It introduced large new output-shaping behavior across the +build and package paths. Within weeks of release four bugs surfaced, +each a behavior shift visible in the post-build or post-package +template that no automated test caught before release: + +- #9004 — `Fn::FindInMap` / `Join` / `Select` / `Base64` raised on + unresolved parameter `Ref`s in PARTIAL mode (fixed in #9010). +- #9005 — `PACKAGEABLE_RESOURCE_ARTIFACT_PROPERTIES` had drifted from + canonical resource lists; `AWS::Serverless::Application` and seven + sibling types failed packaging (fixed in #9009). +- #9027 — `AWS::Include` buried inside `Fn::ToJsonString` was invisible + to the global-transform walker because LE expansion ran first (fixed + in #9030). +- #9029 — LE-aware build merge overwrote `Code: { ZipFile: !Sub ... }` + with the LE-resolved value, baking pseudo-param defaults into the + built template (fixed in #9031). + +PR #9033 made local LE processing opt-in to give customers a clean +migration path. + +These bugs didn't slip past #8637 because the test suite was small — +the LE module shipped with comprehensive unit tests and integration +tests. They slipped past because nobody was asked, at PR-review time, +to look at the actual templates `sam build` and `sam package` produce +for templates exercising each new LE intrinsic. Each bug is a YAML-shape +diff that a reviewer would have caught on sight — if the diff had been +visible. + +The right moment for that review surface is the feature PR itself. +#8637 should have shipped with one pinned input/output pair per LE +intrinsic plus the cross-cutting interactions (LE × nested stacks, LE +× AWS::Include, LE × Globals). Each iteration on the PR would have +produced output diffs alongside code diffs; the four bugs would have +appeared as visible shape changes during development, before the PR +merged. That same pattern then carries forward: every future feature +PR pins what it produces, every patch release is auditable for +behavior changes. + +A pinned-output corpus solves only half the problem, though. Behavior +changes — intentional or not — must also be visible to *consumers* of +SAM CLI, not just to reviewers. The most impactful changes don't just +alter output shape: they cause `sam build` or `sam package` to fail +outright, breaking customer CI/CD pipelines on what was assumed to be +a routine upgrade. #9004 is the canonical example — `sam build` +started raising `InvalidTemplateException` on any template that used +`!FindInMap` over a parameter without a `Default`, blocking every CI +that consumed such a template until the customer either pinned to +1.159.x or upgraded past the fix. The 1.159 → 1.160 bump was minor +per semver, signalling "additive, backwards-compatible" — but local +LE processing turning on by default broke working pipelines, which +is the definition of a backwards-incompatible change. The version +number didn't tell consumers to audit their pipelines, so they +didn't. + +This design proposes: + +1. **A golden-template corpus** under `tests/golden/templates/`. + For each representative input, pin the post-build and post-package + YAML byte-for-byte. The pins live in the repo; PR diffs include + both the code change and the resulting output change. Reviewers see + what users will see. + +2. **A CI semver gate** that requires a major-version bump in + `samcli/__init__.py` whenever an existing pinned output is modified + or deleted. New pins (corpus growth) are ungated. The gate makes + user-visible behavior changes explicit at version-number granularity + so downstream tooling can react deliberately. + +The two compose: goldens make output changes *visible* to reviewers; +the semver check makes them *commitments* to consumers. + +What will be changed? +--------------------- + +Add a new `tests/golden/` corpus and harness (per-PR unit tier), a daily +subprocess-driven tier under `tests/integration/golden_templates/`, and a +tiny standalone GitHub workflow that enforces the semver rule. The corpus +covers both LE and non-LE templates across four axes: SAM resources, +packageable CFN resources, LE intrinsics, and cross-cutting features. + +The two tiers share `tests/golden/templates/` as their single corpus +source — the integration tier reads case directories straight from the +unit tier so authoring a new case lands in both places at once. + +Success criteria for the change +------------------------------- + +1. Every PR runs the corpus through an in-process build + package pipeline + and fails on any byte-exact diff against pinned `expected.{build,package}.yaml`. +2. The daily integration-tests workflow runs the same corpus through real + `sam` subprocess invocations and structurally compares the output. +3. Modifying or deleting an existing pin in a PR fails CI unless the same PR + bumps the major version in `samcli/__init__.py`. Adding new pins is + ungated. +4. A single regeneration script (`update_goldens.py`) is the canonical way + to re-pin; the regenerator and the verifier share the harness so they + cannot diverge. +5. The corpus is built incrementally as a series of independently-reviewable + PRs, each ≤1000 LOC. + +Out-of-Scope +------------ + +- No deploy step. The integration tier stops at packaged-template output + — no CloudFormation, no real Lambdas. +- No CFN compatibility check (e.g. `cfn-lint`, `ValidateTemplate`) at this + stage. The corpus pins what *SAM-CLI* produces; CFN-side parity is a + separate exercise. +- No coverage of `sam local` runtime invocation paths. The bugs we're + guarding against are template-shape bugs, not runtime bugs. +- No reorganization of existing + `tests/integration/testdata/buildcmd/language-extensions-*` / + `tests/integration/testdata/package/language-extensions-*` directories. + The new corpus lives separately. + +User Experience Walkthrough +--------------------------- + +A SAM-CLI contributor authoring a behavior-affecting change runs the existing +test suite and discovers one or more golden mismatches: + +``` +$ make test-all +... +tests/golden/test_corpus.py::test_build_output_matches_golden[language_extensions/foreach_static] FAILED + +Golden mismatch in language_extensions/foreach_static. +To inspect: python tests/golden/update_goldens.py --diff --filter 'language_extensions/foreach_static' +To re-pin: python tests/golden/update_goldens.py --filter 'language_extensions/foreach_static' +If intentional: + - new case (added expected.*.yaml) -> no version bump at PR time + - modified/deleted existing expected.* -> bump major in samcli/__init__.py +``` + +The contributor inspects the diff: + +``` +$ python tests/golden/update_goldens.py --diff --filter 'language_extensions/foreach_static' +--- a/language_extensions/foreach_static/expected.build.yaml ++++ b/language_extensions/foreach_static/expected.build.yaml +@@ -... +- Code: <> ++ Code: ++ ZipFile: ... +``` + +If the diff is intentional, the contributor re-pins, bumps the major version, +and ships. If unintentional, they fix their code and the test passes again. + +Authoring a new case follows the same flow with `--new`: + +``` +$ python tests/golden/update_goldens.py --new --filter 'sam_resources/' +``` + +Implementation +============== + +CLI Changes +----------- + +No `sam` CLI changes. New developer-facing CLIs: + +- `python tests/golden/update_goldens.py [--filter GLOB] [--check | --diff | --new]` +- `python tests/golden/check_semver_bump.py --base REF --head REF` + +### Breaking Change + +None. + +Design +------ + +### Repository layout + +``` +tests/golden/ # NEW +├── conftest.py # parametrize over corpus dir +├── harness.py # in-process build + package + diff +├── normalize.py # deterministic YAML rendering +├── update_goldens.py # regeneration CLI +├── check_semver_bump.py # semver gate +├── README.md # author workflow +└── templates/ + ├── sam_resources//... # SAM resource axis + ├── packageable_resources//... # CFN resource axis + ├── language_extensions//... # LE axis + └── cross_cutting//... # nested stacks, AWS::Include, etc. + +tests/integration/golden_templates/ # NEW +├── test_golden_subprocess.py # imports tests/golden/templates as data +└── structural_compare.py # looser comparator for real-CLI output +``` + +The `tests/golden/templates/` directory is the single source of truth for +the corpus. Both the unit tier (`tests/golden/`) and the integration tier +(`tests/integration/golden_templates/`) read from it; nothing is duplicated. + +The integration tier sits alongside the existing +`tests/integration/buildcmd/`, `tests/integration/package/`, and +`tests/integration/deploy/` suites. ``tests/regression/`` is reserved for +``sam`` vs ``aws cloudformation`` parity tests — a different semantic +that compares SAM CLI output against the AWS CLI's current output — +which is not what the golden corpus asserts. Goldens compare SAM CLI +against its own pinned past behavior, so they belong under +``tests/integration/``. + +### Per-case directory layout + +``` +templates/// +├── template.yaml # input +├── metadata.yaml # language_extensions, description, issue_refs +├── expected.build.yaml # pinned post-build template +├── expected.package.yaml # pinned post-package template +├── README.md # what this case covers and why +└── src/ # any source files referenced by CodeUri etc. +``` + +### The harness + +Three functions, used by both pytest and `update_goldens.py`: + +**`run_build_pipeline(template_path, language_extensions: bool) -> dict`** +mirrors `sam build`: + +1. Read `template.yaml`. +2. Call `expand_language_extensions(template, enabled=language_extensions)` + (the function PR #9033 made explicit). +3. Run the SAM transform. +4. Walk artifact properties using the JMESPath-aware machinery from PR #9009 + and replace any local artifact path with `<>`. +5. Apply the LE-aware merge with a stub artifact-set that mirrors what + `ApplicationBuilder` would have produced (the three sites fixed in #9031). + +**`run_package_pipeline(template_path, build_output) -> dict`** +mirrors `sam package`: + +1. Run `_export_global_artifacts_pass` on the original template (the pre-LE + AWS::Include pass added in PR #9030). +2. Walk packageable artifact properties and replace each local path with + `s3://golden-bucket/` — deterministic, content-addressed. +3. Re-run `_export_global_artifacts_pass` to catch AWS::Include nested in + the LE-expanded template. + +S3 is mocked via a thin `GoldenS3Uploader` class — no `moto`, no network, +no `boto3` stubbing. + +**`normalize(template) -> str`** renders deterministic YAML: + +- Sort `Resources` and `Mappings` keys alphabetically. +- Drop volatile `Metadata.SamTransformMetrics`. +- `yaml.safe_dump(..., sort_keys=True, default_flow_style=False)`. +- Single trailing newline. + +The harness invokes the same high-level functions the CLI invokes; it does +not re-implement the pipeline. The integration tier verifies that contract +holds. + +### Pytest collection + +A single parametrize collects every case directory: + +```python +def pytest_generate_tests(metafunc): + if "golden_case" in metafunc.fixturenames: + cases = sorted(p.parent for p in TEMPLATES_ROOT.rglob("template.yaml")) + metafunc.parametrize( + "golden_case", + cases, + ids=lambda p: str(p.relative_to(TEMPLATES_ROOT)), + ) +``` + +Each case becomes one parametrized test ID like +`language_extensions/foreach_static`, so `pytest -k foreach_static` works. + +Two tests per case, one for build output and one for package output. Both +diff against the pinned file and emit the actionable hint shown in the UX +walkthrough on mismatch. + +### Per-case `metadata.yaml` + +Optional, defaults applied if absent: + +```yaml +language_extensions: true +description: "Repro for #9027 — AWS::Include inside Fn::ToJsonString" +issue_refs: [9027] +``` + +Default for `language_extensions`: `true` for cases under +`templates/language_extensions/`, `false` elsewhere — matching the +post-#9033 opt-in default. + +### The regeneration script + +Single CLI, thin wrapper around the same harness functions the tests use: + +- *(default, no args)* — regenerate every case. +- `--filter ` — regenerate only matching cases. +- `--check` — dry-run; exit non-zero if anything would change. CI mode. +- `--diff` — like `--check`, also print unified diff of each would-be + change. +- `--new` — for cases with `template.yaml` but no `expected.*` yet, + generate both pinned outputs from scratch. + +By construction the regenerator and the verifier share the harness, so they +can never diverge. + +### The semver gate + +Tiny standalone GitHub workflow at +`.github/workflows/golden-semver-gate.yml`. Runs on every PR *and* on +every merge_group event; the script itself returns exit 0 when no +corpus pins changed, so it's cheap to run unconditionally and posts a +definitive status on every event (which lets branch protection mark +it as a required check without blocking unrelated PRs or merge-queue +runs): + +```yaml +on: + pull_request: + branches: [develop, "feat/*", "feat-*"] + merge_group: + types: [checks_requested] + branches: [develop, "feat/*", "feat-*"] +``` + +The workflow avoids two flavors of the "skipped but required check" +pitfall: + +1. **No `on.pull_request.paths:` filtering.** Path filtering at the + trigger level skips the workflow entirely on PRs that touch no + matching path — no status posts, and a required check that never + reports gets treated by branch protection as missing/pending. + Self-gating in the script (return 0 when there are no relevant + changes) is the simpler fix and keeps the check eligible to be + required. + +2. **Both `pull_request` and `merge_group` triggers.** The repo uses + GitHub's merge queue; when a PR enters the queue, GitHub + dispatches a `merge_group` event against an ephemeral ref, not + `pull_request`. A required check that only listens on + `pull_request` never reports a status on the `merge_group` event, + so the merge queue treats it as missing/pending — the same failure + mode under a different trigger. Mirror `build.yml`'s pattern by + listening on both events with the same branch filters. The + workflow then resolves the target branch from `github.base_ref` on + PR events and from `github.event.merge_group.base_ref` (stripping + the `refs/heads/` prefix) on merge-queue events. + +`check_semver_bump.py` rules: + +1. Diff merge base vs HEAD; collect changes to + `tests/golden/templates/**/expected.*.yaml`. +2. Classify each: + - **Added** — no version requirement at PR time. + - **Modified** or **Deleted** — requires major bump in this same PR. +3. Read `__version__` from `samcli/__init__.py` on base and head. +4. If any modification or deletion exists, require `head.major > base.major`. +5. On failure, print the changed goldens, current vs. required version, and + the suggested new version string. + +A rename (delete-old + add-new) is treated as one deletion + one addition → +requires a major bump. Conservative on purpose: a rename usually means +reorganization but could mask a content edit, so we make the human +acknowledge it. + +#### Why additions don't require a per-PR minor bump + +The project's existing release process bumps the version in a dedicated +release PR (e.g. #9051 `chore: bump version to 1.161.1`). Forcing a minor +bump on every corpus-addition PR would either churn that release process +or require funnelling the corpus rollout through a single feature branch. +Neither is worth the friction. The gate enforces the only invariant that +actually matters at PR time: *a behavior change to an existing pin must +ship as a major version*. New cases simply broaden coverage and have no +backwards-compatibility implication. + +### Per-PR CI wiring + +`make pr` already invokes `pytest tests/unit/...`. We extend the `test` and +`test-all` Makefile targets to include `tests/golden/` as a separate pytest +invocation (so it doesn't enter the 94% coverage calculation against +`samcli/`). The harness is in-process, deterministic, and offline — +estimated <30s for ~50 cases. The existing Windows `TEMP=D:\Temp` shim +already handles temp-dir contention. No new workflow file (the semver gate +above is its own workflow, but the corpus tests ride on `build.yml`). + +### Integration tier + +Daily run on the existing `integration-tests.yml` workflow's matrix +(one new entry: `golden-templates`). The suite sits alongside the +existing `tests/integration/buildcmd/`, `tests/integration/package/`, +and `tests/integration/deploy/` directories — all subprocess-driven +``sam`` invocations against real AWS resources. + +We deliberately do **not** put this under ``tests/regression/``. The +existing ``tests/regression/`` suites are ``sam`` vs +``aws cloudformation`` parity tests — they assert SAM CLI's output +matches AWS CLI's *current* output. Goldens assert the opposite: SAM +CLI's output matches its own *pinned past* output. Different semantics; +sharing a directory would invite future contributors to confuse the +two. + +Per case: + +1. Copy case dir to a temp working dir; copy `src/`. +2. `sam build [--language-extensions]` subprocess; assert exit 0; read + `.aws-sam/build/template.yaml`. +3. `sam package --s3-bucket --output-template-file packaged.yaml` + subprocess; assert exit 0; read `packaged.yaml`. +4. Structurally compare against the pinned `expected.{build,package}.yaml`: + - Same set of resource logical IDs. + - Same `Type` per ID. + - Same set of property keys at each path. + - Same `Outputs` and `Mappings` keys. + - For LE-relevant artifact properties (`Code`, `CodeUri`, `ContentUri`, + `DefinitionUri`, `Location`, `Command.ScriptLocation`, etc.) flag + dict-vs-string differences (the #9029 bug class). + +The comparator lives in +`tests/integration/golden_templates/structural_compare.py` and is reused +for both build and package outputs. + +#### Why structural, not byte-exact + +Real `sam build` writes timestamps, real S3 URIs, real artifact hashes. +Byte-exact pinning at this tier would either require the same normalization +the in-process tier uses (defeats the purpose — we want to see the real +output) or constant churn from environmental noise. Structural comparison +answers "did the user-visible output shape change?" — which is what catches +the four bugs we just fixed. + +If the unit tier passes but the integration tier fails for a case, the +in-process harness has drifted from the CLI surface — investigation, not +a re-pin. + +### Wave plan + +Each PR is independently reviewable; LOC estimates are upper-bound net diff. + +- **PR 0** (this design doc). +- **PR 1** — Harness skeleton, three sentinel cases, semver gate workflow. +- **PR 2** — Integration-tier subprocess runner under + `tests/integration/golden_templates/`. Adds a new `golden-templates` + entry to the existing `integration-tests.yml` matrix. +- **PR 3** — SAM resources axis: Function variants (~10 cases). +- **PR 4** — SAM resources axis: Api / HttpApi / StateMachine / others + (~10 cases). +- **PR 5** — Packageable CFN resources axis (~12 cases). One per entry in + `RESOURCES_WITH_LOCAL_PATHS` + `RESOURCES_WITH_IMAGE_COMPONENT` not + already covered. Catches #9009-class drift; locks every dotted artifact + path. +- **PR 6** — LE intrinsics: `Fn::ForEach` (~10 cases). +- **PR 7** — LE intrinsics: non-ForEach + bug repros (~10 cases). Each + metadata.yaml carries the issue ref. +- **PR 8** — Cross-cutting: nested stacks, Globals, Mappings, Conditions + (~10 cases). +- **PR 9** — Cross-cutting: AWS::Include, SAR Metadata, Unicode (~7 cases). + Final corpus README listing the full coverage matrix. + +#### Wave invariants + +- PRs 2–9 are additions only under `tests/golden/templates/`, so the + semver gate passes without any version bump. +- Every case carries `metadata.yaml` with `description` and `issue_refs`. +- No PR after PR 1 modifies harness behavior unless a wave reveals a + missing capability; if so, the harness change is its own follow-up PR. + +`samconfig.toml` Changes +---------------- + +None. + +Security Considerations +----------------------- + +The harness intentionally invokes only in-process SAM-CLI code; no +network calls, no credentials, no AWS API access. The integration tier +uses the existing integration-tests workflow's IAM role and S3 bucket; +no new permissions required. + +Documentation Changes +--------------------- + +`tests/golden/README.md` — authoring guide for contributors adding new +cases or re-pinning existing ones. + +Alternatives considered +======================= + +Subprocess-only harness (no in-process) +--------------------------------------- + +Every golden runs the real `sam` CLI as a subprocess. Rejected: process +startup × N templates is too slow for a per-PR gate; flakier on Windows; +deterministic package output without `moto` is hard. Subprocess invocation +is kept, but only at the daily integration tier where speed and +determinism matter less. + +Snapshot plugin (`syrupy`, `pytest-snapshot`) +--------------------------------------------- + +Less harness code to write. Rejected: opaque snapshot format hurts +code-review readability; update flow is plugin-specific; corpus growth +across many PRs is harder when snapshots are plugin-managed rather than +plain YAML files in the case directory. + +CFN-side validation (`cfn-lint` or `ValidateTemplate`) +------------------------------------------------------ + +Would catch SAM-CLI vs CFN divergence. Rejected for now: out of scope for +this design; can be added as a separate compatibility tier later. The +current bugs are SAM-CLI's own output-shape bugs, not CFN-divergence bugs. + +Per-PR minor bump on every corpus-addition PR +--------------------------------------------- + +Initially considered. Rejected: would either churn the existing +release-PR process (e.g. #9051) or require funnelling the corpus rollout +through a single feature branch. The gate's only PR-time invariant is +that *behavior changes* require major bumps; coverage growth is +backwards-compatible by definition.