Build a shared status-check template library for scenario-shared liveness assertions #330

@bdchatham

Problem

Across Phase 2 scenarios we'll need the same handful of chain-state liveness assertions over and over: "wait until height >= N", "assert chain halted at height H", "validator set size == N", "proposal X reached PASSED", "RPC stays up while we do Y". Today these live as bespoke bash steps inside individual Workflow YAMLs (e.g. `major-upgrade.yaml`'s `compute-target-height`, `wait-for-proposal-to-pass`, `resolve-proposal-id`) — each scenario re-implements the same shape of curl-poll-loop with slightly different parsing.

This duplication compounds: 8–15 scenarios × ~4 cross-cutting liveness checks = roughly 32–60 near-duplicate bash poll loops we'd rather not maintain in parallel.

Why it matters now

The product-engineer's cross-review on the Phase 2 design (#326) flagged this as the higher-leverage next investment vs. continuing to refine the provision-snd interface:

The higher-leverage primitive you're underweighting is not templates — it's scenario-shared status checks. `await-nodes-at-height`, `await-condition`, "wait-for-N-blocks", "assert chain halted at height H", "assert validator set == N" are the cross-scenario verbs that every future scenario needs and that bash is currently doing badly. A `StatusCheck`-style template library (`await-condition.yaml.tmpl` and `await-nodes-at-height.yaml.tmpl` already exist — the seed exists) compounds across all 8–15 future scenarios regardless of how provision-snd ends up.

Existing seed

`runner/templates/` already has two of these:

  • `await-condition.yaml.tmpl` — SeiNodeTask AwaitCondition over a height predicate
  • `await-nodes-at-height.yaml.tmpl` — multi-node version of the above

These get us part of the way for height-based assertions on SeiNode-mediated chain state. What we don't have:

  • Anything for gov/proposal state (`resolve-proposal-id`, `wait-for-proposal-to-pass` are still bespoke bash)
  • Anything for validator-set assertions
  • Anything for "halted at height H" (currently inferred from the absence of progress)
  • Anything that uses Chaos Mesh's native `StatusCheck` template type (separate from our SeiNodeTask runner Tasks) for HTTP probes that abort the Workflow on failure
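As a rough illustration of what the missing predicates could look like once parsing is separated from the assertion, here is a sketch over a hypothetical `Snapshot` struct. None of these names exist in the repo today. The `PROPOSAL_STATUS_PASSED` string assumes the Cosmos SDK gov enum encoding and may not match what our RPC actually returns. `HaltedAt` makes the halt check explicit (N consecutive samples at exactly H) instead of inferring it from absence of progress, though the window size still has to be chosen per scenario.

```go
package main

import "fmt"

// Snapshot is a hypothetical, already-parsed view of chain state from one poll.
type Snapshot struct {
	Height         int64
	ValidatorCount int
	ProposalStatus map[uint64]string // proposal ID -> status string
}

// statusPassed assumes the Cosmos SDK gov enum string; adjust to the real RPC output.
const statusPassed = "PROPOSAL_STATUS_PASSED"

// HaltedAt reports whether the last `window` samples all sit at exactly height h.
// It requires at least two samples so a single observation never counts as a halt.
func HaltedAt(h int64, window int, samples []Snapshot) bool {
	if window < 2 || len(samples) < window {
		return false
	}
	for _, s := range samples[len(samples)-window:] {
		if s.Height != h {
			return false
		}
	}
	return true
}

// ValidatorSetSizeIs asserts "validator set size == n".
func ValidatorSetSizeIs(n int, s Snapshot) bool { return s.ValidatorCount == n }

// ProposalPassed asserts "proposal id reached PASSED"; unknown IDs are false.
func ProposalPassed(id uint64, s Snapshot) bool {
	return s.ProposalStatus[id] == statusPassed
}

func main() {
	samples := []Snapshot{
		{Height: 499, ValidatorCount: 4},
		{Height: 500, ValidatorCount: 4},
		{Height: 500, ValidatorCount: 4},
		{Height: 500, ValidatorCount: 4, ProposalStatus: map[uint64]string{7: statusPassed}},
	}
	last := samples[len(samples)-1]
	fmt.Println("halted at 500:", HaltedAt(500, 3, samples))
	fmt.Println("validators == 4:", ValidatorSetSizeIs(4, last))
	fmt.Println("proposal 7 passed:", ProposalPassed(7, last))
}
```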

Proposed direction (not a commitment)

  1. Inventory the bash poll-loops in `scenarios/major-upgrade.yaml` and `/Users/brandon/platform/clusters/harbor/nightly/release/configmap.yaml` (the legacy bash being retired) — extract the actual chain-state predicates being tested.
  2. Pick the 3–5 highest-frequency predicates that show up across 2+ scenarios.
  3. Decide the implementation surface per predicate:
    • SeiNodeTask AwaitCondition kind (Go runtime, runs via `seitask runner --template`) — fits height-based predicates the sidecar can already evaluate.
    • Chaos Mesh `StatusCheck` template with `AbortWithStatusCheck` — fits HTTP-probe predicates that should also gate Workflow execution (RPC liveness during a long upgrade wait).
    • Bash kubectl/curl Task with a shared template body — fallback for predicates that need ad-hoc parsing.
  4. Land them as `runner/templates/await-*.yaml.tmpl` + `scenarios//*.tmpl` references, with cross-scenario unit-test coverage via the `TestBundledTemplates_RenderClean` pattern (added in #326, "feat(seitask): parameterized provision-snd + release-test scenario").
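For step 4, a render-clean check could look roughly like the sketch below. This is not the actual `TestBundledTemplates_RenderClean` implementation. It assumes the `.yaml.tmpl` files are Go `text/template` bodies, and the template content and parameter names (`TargetHeight`, `Timeout`) are made up for illustration rather than taken from the real SeiNodeTask schema.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Hypothetical template body; the real await-*.yaml.tmpl schema may differ.
const awaitHeightTmpl = `kind: AwaitCondition
targetHeight: {{ .TargetHeight }}
timeout: {{ .Timeout }}
`

// renderClean parses body as a Go text/template and executes it with params.
// missingkey=error makes a reference to an absent map key fail the render
// instead of silently emitting "<no value>" into the YAML.
func renderClean(name, body string, params map[string]any) (string, error) {
	tmpl, err := template.New(name).Option("missingkey=error").Parse(body)
	if err != nil {
		return "", fmt.Errorf("parse %s: %w", name, err)
	}
	var out bytes.Buffer
	if err := tmpl.Execute(&out, params); err != nil {
		return "", fmt.Errorf("render %s: %w", name, err)
	}
	return out.String(), nil
}

func main() {
	// Complete parameters render cleanly.
	out, err := renderClean("await-height", awaitHeightTmpl,
		map[string]any{"TargetHeight": 100, "Timeout": "5m"})
	fmt.Printf("err=%v\n%s", err, out)

	// A scenario that forgets a parameter is caught at render time.
	_, err = renderClean("await-height", awaitHeightTmpl,
		map[string]any{"TargetHeight": 100})
	fmt.Println("missing Timeout rejected:", err != nil)
}
```

Running a check like this over every shared template with each scenario's parameter set would catch a scenario drifting out of sync with a template before the Workflow is ever submitted.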

Out of scope

  • Specific SLI/SLO design for the harness itself — that's an observability-platform-engineer + sre-engineer follow-up.
  • Wholesale rewrite of `major-upgrade.yaml`'s bash steps. Refactor incrementally as each predicate gets a shared primitive.

Relevant references

🤖 Generated with Claude Code
