Promote per-scenario wrapper bash to a seitask workflow-run subcommand (defer until N=3 scenarios)

## Problem

The CronJob wrapper bash introduced in sei-protocol/platform#627 contains ~80 lines that perform 7 operations every Workflow-based scenario will repeat verbatim:

1. Fetch scenario YAML from a sei-k8s-controller git ref
2. Allow-list-envsubst per-run scalars
3. `kubectl apply` the rendered Workflow CR
4. Poll `.status.phase` to terminal, with one-shot overrun warning at the 60m inner-deadline boundary
5. Capture terminal state (Workflow YAML + WorkflowNode tree + Task pod statuses) before deletion
6. Annotate-abort → bounded `delete --wait=true` → child-pod sweep
7. `aws s3 cp` orchestrator stdout to the validation bucket

None of this is release-test-specific. The question is whether the variance across scenarios fits a typed flag interface or genuinely needs per-scenario bespoke bash — and we won't know until N≥3 scenarios exist.

## Impact

Two failure modes if we get the timing wrong:

- **Promote on N=1 → wrong abstraction.** release-test's quirks become accidental contract; the second scenario fights the interface; the third hammers it into something unrecognizable. Expensive to undo because by then every scenario depends on the bad shape.
- **Never promote → drift + missed safety nets.** N copies of the same bash drift independently (different log formats, different cleanup ordering, different deadline math). More importantly: Cursor Bugbot already flagged two SHA-drift bugs on #627 (SEITASK_IMAGE vs controller-manager image; SEITASK_IMAGE vs vendored seitask-runner RBAC). A typed wrapper could enforce these as build-time invariants. Bash + human vigilance + Bugbot is the duct-tape version of that enforcement.

The phased pattern (build N=1..3 as duct-tape, **survey at N=3 before landing**, decide based on observed variance) is the right shape — but only if we actually pause at N=3.

## Relevant experts

- **platform-engineer** — image versioning lock-step, CronJob shape, scenario asset layout, build-time SHA enforcement
- **kubernetes-specialist** — Workflow CR lifecycle, ownerRefs cascade semantics, vendored RBAC drift
- **product-engineer** — defines what "extensible" means in the survey checkpoint (which knobs are scenario-intrinsic vs universal)
- **sre-engineer** — owns the terminal-state-capture + cleanup-race logic; should sign off that the typed wrapper preserves the failure-diagnosability the bash currently has

## Proposed approach

Phased, with an explicit survey checkpoint:

- **N=1 (release-test, sei-protocol/platform#627)** — landing now. Bash wrapper inline.
- **N=2 (next scenario, TBD)** — build the same bash-wrapper shape. **Resist abstracting preemptively.**
- **N=3 (third scenario, TBD)** — **survey checkpoint BEFORE landing**:
  - Inventory the variance across all 3 wrappers: which knobs differ, which are universal, which were YAGNI knobs we accidentally exposed.
  - Decide: extensible to N=10+, or does each instance need bespoke knobs after all?
- **Decision:**
  - **Extensible** → promote to `seitask workflow-run`. Sketch:
    ```yaml
    args:
      - workflow-run
      - --scenario=release-test
      - --scenario-repo=github.com/sei-protocol/sei-k8s-controller
      - --scenario-ref=$SCENARIO_REF
      - --bucket=harbor-validation-results
      - --var=SEID_IMAGE=$SEID_IMAGE
      - --var=SEITASK_IMAGE=$SEITASK_IMAGE
      - --var=RELEASE_TEST_IMAGE=$RELEASE_TEST_IMAGE
    ```
    Migrate all 3 scenarios in one PR. Add SHA-fingerprint validation that refuses to apply if image SHA doesn't match SCENARIO_REF's expected controller-manager + vendored-RBAC.
  - **Not extensible** → duct-tape was the right shape; keep per-scenario bash; document the variance.

## Acceptance criteria

This issue resolves when:

- [ ] At least 3 Workflow-based scenarios are running in production (release-test + 2 more).
- [ ] A survey-checkpoint write-up exists capturing observed cross-scenario variance.
- [ ] A decision is recorded on the issue: promote OR keep duct-tape, with reasoning.
- [ ] If promote: `seitask workflow-run` exists, all 3+ scenarios migrated, CronJob `args:` collapsed to typed flags, SHA-drift Bugbot rules superseded by build-time invariants.

## Out of scope

- The shared status-check template library — sibling issue at sei-protocol/sei-k8s-controller#330.
- Migrating the legacy `major-upgrade.yaml`'s half-bash-half-Workflow shape — that's its own retirement, Phase 2c+.
- Pageable alerting on CronJob failure — separate observability-platform-engineer concern.

## References

- sei-protocol/platform#627 — the N=1 wrapper
- sei-protocol/sei-k8s-controller#330 — sibling promotion-at-N=3 issue (status-check library)
- Memory: `feedback_prototype_first.md` survey-checkpoint pattern


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Promote per-scenario wrapper bash to a seitask workflow-run subcommand (defer until N=3 scenarios) #332

Problem

Impact

Relevant experts

Proposed approach

Acceptance criteria

Out of scope

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Promote per-scenario wrapper bash to a seitask workflow-run subcommand (defer until N=3 scenarios) #332

Description

Problem

Impact

Relevant experts

Proposed approach

Acceptance criteria

Out of scope

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions