Scenario contract enforcement: build-time guards + single-sourced CM name #341

@bdchatham

Problem

The seitask workstream has surfaced a recurring failure pattern: contract drift between the seitask binary's internal helpers (`WorkflowVarsName`, scheme registration, downward-API env contract) and the scenario YAML / RBAC layer that has to mirror them manually. Each of the last four PRs (#334, #337, #339, plus the in-flight #339 build-time tests) addressed a different facet of the same shape: an internal helper has a convention, the scenario author has to mirror it by hand in YAML, no test catches the drift, and the bug surfaces only at first cluster fire, ~10 minutes into the run.

Platform-engineer (cross-review on #339):

"The scenario YAML is the integration contract between three things (the runtime binary, the chaos-mesh CR shape, the wrapper's envsubst inputs) and none of them validate it. Each bug surfaced at first cluster fire."

Impact

  • Slow feedback loop. Each contract bug costs ~10–30 min of manual-fire + investigation + fix-PR + image rebuild + SCENARIO_REF bump + re-fire. We've done this loop four times in the last hour to get the harness past keygen.
  • Compounds with scenario count. Adding a second/third scenario will repeat the contract surface from scratch. Without enforcement, each new scenario brings its own #337-class bugs (#337: "fix(scenarios/release-test): align workflow-vars CM name with seitask convention").
  • Build-time enforcement is cheap. Two narrow tests added in #339 ("fix(seitask): register sei.io scheme + grant workflownodes RBAC") already catch the two highest-frequency classes (scheme registration + CM-name drift) at `go test`. There's more we could enforce; this issue tracks the broader pattern.

Proposed approach

Three reinforcing layers, ranked by effort, with the costlier ones deferred:

Layer 1 (already partially in #339): unit tests for internal contracts

  • ✅ Scheme round-trip test for every typed CR that provision-snd / keygen / upload-report construct
  • ✅ CM-name validation for scenario YAMLs that opt in
  • ⏳ RBAC vs kubebuilder-marker reconciliation — defer; un-defer when we hit a third RBAC-class bug

Layer 2: single-source the CM name across YAML + binary

Two candidate shapes (both reviewers raised independently):

(a) Wrapper exports `SEI_WORKFLOW_VARS_CM=workflow-vars-${WORKFLOW_NAME}` env var; scenarios reference `$SEI_WORKFLOW_VARS_CM` via envsubst allow-list. Single string-builder lives in the wrapper bash. No new templating dependency.

(b) Render-time template helper `{{ workflowVarsCM }}` exposed in a scenario template engine. Aligns with how the runner subcommand already templates SeiNodeTask CRs and how provision-snd templates SND specs. Requires a scenario rendering engine the wrapper invokes (vs current envsubst).
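A minimal sketch of shape (b), assuming the rendering engine is Go's `text/template`; the issue does not name an engine, so that choice is an assumption. `RenderScenario` is a hypothetical function, and `WorkflowVarsName` is again a stand-in with an assumed `workflow-vars-<name>` convention.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// WorkflowVarsName is a hypothetical stand-in for the seitask helper.
func WorkflowVarsName(workflowName string) string {
	return "workflow-vars-" + workflowName
}

// RenderScenario renders a scenario template and exposes workflowVarsCM as a
// zero-argument helper bound to the given workflow name.
func RenderScenario(src, workflowName string) (string, error) {
	funcs := template.FuncMap{
		"workflowVarsCM": func() string { return WorkflowVarsName(workflowName) },
	}
	tmpl, err := template.New("scenario").Funcs(funcs).Parse(src)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, nil); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// Hypothetical scenario fragment; the author never types the CM name.
	const src = "envFrom:\n  - configMapRef:\n      name: {{ workflowVarsCM }}\n"
	rendered, err := RenderScenario(src, "release-test")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(rendered)
}
```

Because the helper calls the same function the binary uses, the YAML cannot drift from the convention; the cost is that the wrapper has to invoke a renderer rather than plain envsubst.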

Platform-engineer recommends (a) as the MVP; kubernetes-specialist recommends (b) longer-term. Both eliminate the manual-mirror failure mode.

Layer 3 (longer-term): `seitask scenario validate` subcommand

A schema-validator subcommand that:

  • Parses scenario YAML
  • Checks every `configMapRef.name` matches `WorkflowVarsName(metadataName)`
  • Checks every `--var=KEY=...` flag matches a documented input on the target subcommand
  • Checks every `$(VAR)` reference has a producer step earlier in the Serial
  • Runs pre-commit (Husky/lefthook), in CI, and pre-apply in the wrapper

Catches more bugs than Layer 1 unit tests because it has access to the full scenario semantics (DAG ordering, var producer/consumer matching), not just the YAML structure.

Source: platform-engineer cross-review on #339.

Relevant experts

  • platform-engineer — owns the wrapper bash + envsubst contract; (a) lives entirely in their territory
  • kubernetes-specialist — owns the operator-pattern alignment for (b); also the rbac-marker reconciliation in Layer 1
  • product-engineer — should weigh in on which Layer 2 shape fits the longer-term scenario authoring DX

Acceptance criteria

This issue resolves when:

  • Layer 1 partially done (scheme + CM-name tests in #339, "fix(seitask): register sei.io scheme + grant workflownodes RBAC")
  • Layer 1 RBAC reconciliation test added (when justified)
  • Layer 2: pick (a) or (b) and migrate release-test + future scenarios to it
  • Layer 3: `seitask scenario validate` subcommand exists, runs in sei-k8s-controller CI on every PR that touches `scenarios/`, runs pre-apply in the wrapper

Out of scope

References

🤖 Generated with Claude Code
