Promote per-scenario wrapper bash to a seitask workflow-run subcommand (defer until N=3 scenarios) #332

@bdchatham

Problem

The CronJob wrapper bash introduced in sei-protocol/platform#627 contains ~80 lines that perform 7 operations every Workflow-based scenario will repeat verbatim:

  1. Fetch scenario YAML from a sei-k8s-controller git ref
  2. Allow-list-envsubst per-run scalars
  3. kubectl apply the rendered Workflow CR
  4. Poll .status.phase to terminal, with one-shot overrun warning at the 60m inner-deadline boundary
  5. Capture terminal state (Workflow YAML + WorkflowNode tree + Task pod statuses) before deletion
  6. Annotate-abort → bounded delete --wait=true → child-pod sweep
  7. aws s3 cp orchestrator stdout to the validation bucket
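Step 4 is the piece most likely to drift between copies, so here is a minimal sketch of the poll loop with its one-shot overrun warning. It is an illustration in Python, not the bash from #627. The terminal phase names, the 15s interval, and the `get_phase` callback (standing in for a `kubectl get` on `.status.phase`) are all assumptions. The loop has no hard timeout of its own and assumes something outside it enforces the outer deadline.

```python
import time
from typing import Callable

# Assumption: the Workflow CR's terminal phase names are not given in this issue.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def poll_to_terminal(
    get_phase: Callable[[], str],
    inner_deadline_s: float = 60 * 60,  # the 60m inner-deadline boundary
    interval_s: float = 15,             # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    warn: Callable[[str], None] = print,
) -> str:
    """Poll until the phase is terminal; warn exactly once on overrun."""
    start = clock()
    warned = False
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        elapsed = clock() - start
        if not warned and elapsed >= inner_deadline_s:
            # One-shot: the flag keeps later iterations from repeating it.
            warn(f"workflow still {phase!r} after {elapsed:.0f}s (inner deadline overrun)")
            warned = True
        sleep(interval_s)
```

Injecting `clock`, `sleep`, and `warn` is only there so the loop can be exercised without waiting an hour.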

None of this is release-test-specific. The question is whether the variance across scenarios fits a typed flag interface or genuinely needs per-scenario bespoke bash — and we won't know until N≥3 scenarios exist.

Impact

Two failure modes if we get the timing wrong:

  • Promote on N=1 → wrong abstraction. release-test's quirks become accidental contract; the second scenario fights the interface; the third hammers it into something unrecognizable. Expensive to undo because by then every scenario depends on the bad shape.
  • Never promote → drift + missed safety nets. N copies of the same bash drift independently (different log formats, different cleanup ordering, different deadline math). More importantly: Cursor Bugbot already flagged two SHA-drift bugs on #627 (SEITASK_IMAGE vs controller-manager image; SEITASK_IMAGE vs vendored seitask-runner RBAC). A typed wrapper could enforce these as build-time invariants. Bash + human vigilance + Bugbot is the duct-tape version of that enforcement.
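To make "enforce these as build-time invariants" concrete, a rough sketch of the lock-step check follows. It assumes each of the three artifacts can be reduced to a single comparable SHA string. How those SHAs would be extracted (image tag, digest, a stamp in the vendored RBAC) is not decided here, and the `Fingerprints` type and `drift_errors` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprints:
    """Hypothetical container: one SHA string per artifact that must move in lock-step."""
    seitask_image: str
    controller_manager_image: str
    vendored_rbac: str


def drift_errors(fp: Fingerprints) -> list[str]:
    """Return one message per drifted pair; an empty list means lock-step holds."""
    errors = []
    if fp.seitask_image != fp.controller_manager_image:
        errors.append(
            f"SEITASK_IMAGE ({fp.seitask_image}) != controller-manager image "
            f"({fp.controller_manager_image})"
        )
    if fp.seitask_image != fp.vendored_rbac:
        errors.append(
            f"SEITASK_IMAGE ({fp.seitask_image}) != vendored seitask-runner RBAC "
            f"({fp.vendored_rbac})"
        )
    return errors
```

A build step or the wrapper itself could fail when the returned list is non-empty, which covers the two drift classes Bugbot caught by review.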

The phased pattern (build N=1..3 as duct-tape, survey at N=3 before landing, decide based on observed variance) is the right shape — but only if we actually pause at N=3.

Relevant experts

  • platform-engineer — image versioning lock-step, CronJob shape, scenario asset layout, build-time SHA enforcement
  • kubernetes-specialist — Workflow CR lifecycle, ownerRefs cascade semantics, vendored RBAC drift
  • product-engineer — defines what "extensible" means in the survey checkpoint (which knobs are scenario-intrinsic vs universal)
  • sre-engineer — owns the terminal-state-capture + cleanup-race logic; should sign off that the typed wrapper preserves the failure-diagnosability the bash currently has

Proposed approach

Phased, with an explicit survey checkpoint:

  • N=1 (release-test, sei-protocol/platform#627) — landing now. Bash wrapper inline.
  • N=2 (next scenario, TBD) — build the same bash-wrapper shape. Resist abstracting preemptively.
  • N=3 (third scenario, TBD) - survey checkpoint BEFORE landing:
    • Inventory the variance across all 3 wrappers: which knobs differ, which are universal, which were YAGNI knobs we accidentally exposed.
    • Decide: extensible to N=10+, or does each instance need bespoke knobs after all?
  • Decision:
    • Extensible → promote to seitask workflow-run. Sketch:
      args:
        - workflow-run
        - --scenario=release-test
        - --scenario-repo=github.com/sei-protocol/sei-k8s-controller
        - --scenario-ref=$SCENARIO_REF
        - --bucket=harbor-validation-results
        - --var=SEID_IMAGE=$SEID_IMAGE
        - --var=SEITASK_IMAGE=$SEITASK_IMAGE
        - --var=RELEASE_TEST_IMAGE=$RELEASE_TEST_IMAGE
      Migrate all 3 scenarios in one PR. Add SHA-fingerprint validation that refuses to apply if the SEITASK_IMAGE SHA doesn't match the controller-manager image and vendored RBAC expected at SCENARIO_REF.
    • Not extensible → duct-tape was the right shape; keep per-scenario bash; document the variance.
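For the "extensible" branch, the flag shape in the sketch above can be prototyped quickly to see how `--var` and the allow-list interact. This is Python with `argparse` purely for illustration and says nothing about how seitask would implement it. The allow-list contents are assumed from the three `--var` names in the sketch, and only the `${NAME}` placeholder form is handled (envsubst also accepts `$NAME`).

```python
import argparse
import re

# Assumption: the allow-list is exactly the per-run scalars named in the sketch.
ALLOWED_VARS = {"SEID_IMAGE", "SEITASK_IMAGE", "RELEASE_TEST_IMAGE"}
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="seitask")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("workflow-run")
    run.add_argument("--scenario", required=True)
    run.add_argument("--scenario-repo", required=True)
    run.add_argument("--scenario-ref", required=True)
    run.add_argument("--bucket", required=True)
    # Repeatable: each occurrence appends one KEY=VALUE string.
    run.add_argument("--var", action="append", default=[], metavar="KEY=VALUE")
    return parser


def parse_vars(pairs: list[str]) -> dict[str, str]:
    """Split KEY=VALUE pairs and reject any key outside the allow-list."""
    values = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"--var expects KEY=VALUE, got {pair!r}")
        if key not in ALLOWED_VARS:
            raise ValueError(f"--var {key} is not in the allow-list")
        values[key] = value
    return values


def render(template: str, values: dict[str, str]) -> str:
    """Substitute only the supplied names; any other ${...} is left untouched."""
    return _PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)
```

Leaving unknown placeholders untouched mirrors the allow-list envsubst behaviour in step 2: a stray `${HOME}` in scenario YAML is passed through rather than expanded from the environment.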

Acceptance criteria

This issue resolves when:

  • At least 3 Workflow-based scenarios are running in production (release-test + 2 more).
  • A survey-checkpoint write-up exists capturing observed cross-scenario variance.
  • A decision is recorded on the issue: promote OR keep duct-tape, with reasoning.
  • If promote: seitask workflow-run exists, all 3+ scenarios migrated, CronJob args: collapsed to typed flags, SHA-drift Bugbot rules superseded by build-time invariants.
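The survey write-up could start from a mechanical inventory before any judgment calls are made. The sketch below assumes each wrapper's knobs can be listed as a flat name-to-value mapping. The three buckets are one possible reading of "differ / universal / YAGNI", not a settled taxonomy, and `inventory` is a hypothetical helper.

```python
def inventory(knobs_by_scenario: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Bucket every knob by how it varies across scenarios.

    constant: present everywhere with one value (hardcode candidate, possibly YAGNI)
    varying:  present everywhere with differing values (typed-flag candidate)
    partial:  present in only some scenarios (possibly bespoke)
    """
    scenarios = list(knobs_by_scenario)
    all_knobs = set().union(*(knobs.keys() for knobs in knobs_by_scenario.values()))
    report = {"constant": [], "varying": [], "partial": []}
    for knob in sorted(all_knobs):
        present = [s for s in scenarios if knob in knobs_by_scenario[s]]
        if len(present) < len(scenarios):
            report["partial"].append(knob)
            continue
        distinct = {knobs_by_scenario[s][knob] for s in scenarios}
        report["constant" if len(distinct) == 1 else "varying"].append(knob)
    return report
```

A large "partial" bucket would be evidence for keeping per-scenario bash. A small one, with most knobs landing in "varying", would point toward the typed flag interface.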

Out of scope

References
