Problem
The CronJob wrapper bash introduced in sei-protocol/platform#627 contains ~80 lines that perform 7 operations every Workflow-based scenario will repeat verbatim:
- Fetch scenario YAML from a sei-k8s-controller git ref
- Allow-list-envsubst per-run scalars
kubectl apply the rendered Workflow CR
- Poll
.status.phase to terminal, with one-shot overrun warning at the 60m inner-deadline boundary
- Capture terminal state (Workflow YAML + WorkflowNode tree + Task pod statuses) before deletion
- Annotate-abort → bounded
delete --wait=true → child-pod sweep
aws s3 cp orchestrator stdout to the validation bucket
None of this is release-test-specific. The question is whether the variance across scenarios fits a typed flag interface or genuinely needs per-scenario bespoke bash — and we won't know until N≥3 scenarios exist.
Impact
Two failure modes if we get the timing wrong:
- Promote on N=1 → wrong abstraction. release-test's quirks become accidental contract; the second scenario fights the interface; the third hammers it into something unrecognizable. Expensive to undo because by then every scenario depends on the bad shape.
- Never promote → drift + missed safety nets. N copies of the same bash drift independently (different log formats, different cleanup ordering, different deadline math). More importantly: Cursor Bugbot already flagged two SHA-drift bugs on #627 (SEITASK_IMAGE vs controller-manager image; SEITASK_IMAGE vs vendored seitask-runner RBAC). A typed wrapper could enforce these as build-time invariants. Bash + human vigilance + Bugbot is the duct-tape version of that enforcement.
The phased pattern (build N=1..3 as duct-tape, survey at N=3 before landing, decide based on observed variance) is the right shape — but only if we actually pause at N=3.
Relevant experts
- platform-engineer — image versioning lock-step, CronJob shape, scenario asset layout, build-time SHA enforcement
- kubernetes-specialist — Workflow CR lifecycle, ownerRefs cascade semantics, vendored RBAC drift
- product-engineer — defines what "extensible" means in the survey checkpoint (which knobs are scenario-intrinsic vs universal)
- sre-engineer — owns the terminal-state-capture + cleanup-race logic; should sign off that the typed wrapper preserves the failure-diagnosability the bash currently has
Proposed approach
Phased, with an explicit survey checkpoint:
- N=1 (release-test, sei-protocol/platform#627) — landing now. Bash wrapper inline.
- N=2 (next scenario, TBD) — build the same bash-wrapper shape. Resist abstracting preemptively.
- N=3 (third scenario, TBD) — survey checkpoint BEFORE landing:
- Inventory the variance across all 3 wrappers: which knobs differ, which are universal, which were YAGNI knobs we accidentally exposed.
- Decide: extensible to N=10+, or does each instance need bespoke knobs after all?
- Decision:
- Extensible → promote to
seitask workflow-run. Sketch:
args:
- workflow-run
- --scenario=release-test
- --scenario-repo=github.com/sei-protocol/sei-k8s-controller
- --scenario-ref=$SCENARIO_REF
- --bucket=harbor-validation-results
- --var=SEID_IMAGE=$SEID_IMAGE
- --var=SEITASK_IMAGE=$SEITASK_IMAGE
- --var=RELEASE_TEST_IMAGE=$RELEASE_TEST_IMAGE
Migrate all 3 scenarios in one PR. Add SHA-fingerprint validation that refuses to apply if image SHA doesn't match SCENARIO_REF's expected controller-manager + vendored-RBAC.
- Not extensible → duct-tape was the right shape; keep per-scenario bash; document the variance.
Acceptance criteria
This issue resolves when:
Out of scope
References
Problem
The CronJob wrapper bash introduced in sei-protocol/platform#627 contains ~80 lines that perform 7 operations every Workflow-based scenario will repeat verbatim:
kubectl applythe rendered Workflow CR.status.phaseto terminal, with one-shot overrun warning at the 60m inner-deadline boundarydelete --wait=true→ child-pod sweepaws s3 cporchestrator stdout to the validation bucketNone of this is release-test-specific. The question is whether the variance across scenarios fits a typed flag interface or genuinely needs per-scenario bespoke bash — and we won't know until N≥3 scenarios exist.
Impact
Two failure modes if we get the timing wrong:
The phased pattern (build N=1..3 as duct-tape, survey at N=3 before landing, decide based on observed variance) is the right shape — but only if we actually pause at N=3.
Relevant experts
Proposed approach
Phased, with an explicit survey checkpoint:
seitask workflow-run. Sketch:Acceptance criteria
This issue resolves when:
seitask workflow-runexists, all 3+ scenarios migrated, CronJobargs:collapsed to typed flags, SHA-drift Bugbot rules superseded by build-time invariants.Out of scope
major-upgrade.yaml's half-bash-half-Workflow shape — that's its own retirement, Phase 2c+.References
feedback_prototype_first.mdsurvey-checkpoint pattern