Chaos Mesh Serial doesn't fail fast on child Task errors: need explicit abort mechanism #340

## Problem

Chaos Mesh v2.8.0's `Serial` template type marches through all children regardless of child outcome. Empirical evidence from the third manual fire of release-test (post sei-protocol/sei-k8s-controller#337):

- `keygen-admin` succeeded (exit 0).
- `provision-validator-chain` failed (exit 1, scheme registration bug).
- `provision-rpc-fleet` failed (exit 1, same bug).
- `run-release-test` ran *anyway*, errored on missing endpoint env (`$(RPC_TM_RPC)` unresolved because provision-snd never published).
- `upload-report` ran *anyway*, errored on missing workflownodes RBAC.

All four downstream WorkflowNodes showed `status.conditions[Accomplished]=True` and no `Failed` condition. Chaos Mesh marks each WorkflowNode "done" on pod termination, not on pod success.
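The mismatch described above can be made concrete with a small check. This is a minimal sketch, not the harness's actual code: the node dictionaries are a simplified stand-in for WorkflowNode objects (the `name` key and the helper names are assumptions for illustration), and the exit codes are the ones reported for the three Tasks whose exit codes are stated above.

```python
from typing import Dict, List


def condition_is_true(node: dict, cond_type: str) -> bool:
    """Return True if the node has a condition of cond_type with status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status") == "True"
    return False


def masked_failures(nodes: List[dict], exit_codes: Dict[str, int]) -> List[str]:
    """Names of nodes reported Accomplished (and not Failed) whose pod exited non-zero."""
    return [
        node["name"]
        for node in nodes
        if condition_is_true(node, "Accomplished")
        and not condition_is_true(node, "Failed")
        and exit_codes.get(node["name"], 0) != 0
    ]


def accomplished(name: str) -> dict:
    # Simplified, hypothetical shape; a real WorkflowNode carries many more fields.
    return {
        "name": name,
        "status": {"conditions": [{"type": "Accomplished", "status": "True"}]},
    }


# Exit codes as observed in the manual fire described above.
EXIT_CODES = {"keygen-admin": 0, "provision-validator-chain": 1, "provision-rpc-fleet": 1}
NODES = [accomplished(name) for name in EXIT_CODES]

print(masked_failures(NODES, EXIT_CODES))
```

Any name this returns is a Task whose failure the Workflow status alone would not reveal, which is why the pod exit code (or an equivalent signal) has to be consulted separately.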

## Impact

For our test harness this is dangerous. A failure at `provision-validator-chain` means the chain doesn't exist; running `run-release-test` against an absent chain wastes ~30 minutes of cluster time and produces useless artifacts. Worse, the `upload-report` step that's supposed to capture diagnostic state for the failure case runs in the wrong order (only after the steps that ran past the failure have corrupted that state).

The chaos-mesh-gaps memory hinted at this ("no `retry`/`retryStrategy`, deadline doesn't cascade"); the manual fire confirmed the inverse-of-expected behavior. Chaos Mesh's primary use case is fault injection, where "the fault ran" is the assertion, so marching through child failures is upstream design intent, but it's the wrong shape for sequential test scenarios.

## Relevant experts

- **kubernetes-specialist**: owns the Workflow CR semantics; identified the gap during the original chaos-mesh deep-dive.
- **sre-engineer**: owns failure-state capture; the current behavior means upload-report fires in the wrong order, defeating the diagnostic capture pattern.
- **platform-engineer**: owns the wrapper CronJob; one mitigation lives there (orchestrator polls EXIT_REASON, aborts Workflow externally).
- **product-engineer**: owns the "test harness" vs "chaos engine" distinction, which is the load-bearing product framing.

## Proposed approach

Three candidate mitigations, ranked by my read:

1. **Orchestrator-side EXIT_REASON polling + abort**: the wrapper CronJob already polls `.status.phase`. Extend it to also read the workflow-vars CM's `EXIT_REASON` key (set by each seitask subcommand via `taskruntime.WriteExitReason`). On the first non-empty `EXIT_REASON` other than "pass", set the `workflow.chaos-mesh.org/abort=true` annotation. The Workflow stops cleanly and upload-report fires inside the wrapper's trap. This reuses a mechanism we already have (the annotate-abort pattern in the trap) with an earlier trigger.
2. **`conditionalBranches` on `Task` templates**: Chaos Mesh feature we identified in the deep-dive but never used. Each step's continuation is gated on the prior step's exit code via an expression. Native to Chaos Mesh, but adds template complexity to every scenario.
3. **Bash-Task wrappers**: wrap each seitask invocation in bash that runs the binary, then explicitly POSTs the abort annotation on `exit 1`. Per-Task and repetitive, but needs no scenario-level template machinery.

Option 1 is the most surgical: it keeps scenario YAML clean, leverages the EXIT_REASON contract that already exists, and lives entirely in the wrapper. Worth prototyping first.
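The decision logic for the orchestrator-side option is small enough to sketch. The following is an illustrative outline only, assuming the wrapper can be handed three callables (read the `EXIT_REASON` key, check whether the Workflow has finished, apply the abort annotation); those callables, the function names, the poll interval, and the `kubectl` argument list are hypothetical and not taken from the existing wrapper. The annotation string is the one named above.

```python
import time
from typing import Callable, List, Optional

# Annotation taken from the issue text.
ABORT_ANNOTATION = "workflow.chaos-mesh.org/abort=true"


def should_abort(exit_reason: Optional[str]) -> bool:
    """Abort on the first non-empty EXIT_REASON other than "pass"."""
    reason = (exit_reason or "").strip()
    return reason != "" and reason != "pass"


def abort_command(workflow: str, namespace: str) -> List[str]:
    """Argument list for annotating the Workflow (hypothetical invocation shape)."""
    return ["kubectl", "annotate", "workflows.chaos-mesh.org", workflow,
            "-n", namespace, ABORT_ANNOTATION, "--overwrite"]


def poll(read_exit_reason: Callable[[], Optional[str]],
         is_finished: Callable[[], bool],
         abort: Callable[[], None],
         interval_s: float = 5.0,
         sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until a failing EXIT_REASON appears or the Workflow finishes."""
    while True:
        if should_abort(read_exit_reason()):
            abort()  # e.g. run abort_command(...) via subprocess
            return "aborted"
        if is_finished():
            return "finished"
        sleep(interval_s)


# Simulated run: the second Task writes a failing reason (hypothetical value).
reasons = iter(["", "pass", "provision-failed"])
print(poll(lambda: next(reasons), lambda: False, lambda: None, sleep=lambda s: None))
```

Keeping `should_abort` pure makes the contract easy to unit-test without a cluster. The poll interval bounds how long a failure can go unnoticed, so it needs to sit comfortably under the ~30s target in the acceptance criteria.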

## Acceptance criteria

- [ ] First failing Task in a Workflow stops subsequent Task execution within ~30s of the failure
- [ ] Upload-report (or equivalent terminal diagnostic capture) still fires for the failed run
- [ ] Mechanism documented in the chaos-mesh-gaps memory + scenario authoring guide
- [ ] Validated end-to-end against release-test: artificially fail provision-validator-chain, verify run-release-test does NOT execute
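The first and last criteria could be checked mechanically against a recorded run. The sketch below is one possible reading, not an agreed definition: it treats "stops within ~30s" as "no non-exempt Task that starts at or after the first failure is still running 30s after it", and it exempts `upload-report` as the diagnostic capture step. The `TaskRun` type and all timestamps are hypothetical illustrations.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TaskRun:
    name: str
    started_at: float   # seconds since Workflow start
    finished_at: float  # seconds since Workflow start
    failed: bool


def fail_fast_violations(runs: Sequence[TaskRun],
                         grace_s: float = 30.0,
                         exempt: Sequence[str] = ("upload-report",)) -> List[str]:
    """Tasks that began at/after the first failure and outlived the grace window."""
    failure_times = [r.finished_at for r in runs if r.failed]
    if not failure_times:
        return []
    first_failure = min(failure_times)
    cutoff = first_failure + grace_s
    return [
        r.name for r in runs
        if r.name not in exempt
        and r.started_at >= first_failure
        and r.finished_at > cutoff
    ]


# Hypothetical timeline shaped like the current behavior (test runs ~30 min anyway).
CURRENT = [
    TaskRun("keygen-admin", 0, 10, False),
    TaskRun("provision-validator-chain", 10, 40, True),
    TaskRun("provision-rpc-fleet", 40, 60, True),
    TaskRun("run-release-test", 60, 1860, True),
    TaskRun("upload-report", 1860, 1870, True),
]

# Hypothetical timeline after the fix: abort, then diagnostic capture only.
FIXED = [
    TaskRun("keygen-admin", 0, 10, False),
    TaskRun("provision-validator-chain", 10, 40, True),
    TaskRun("upload-report", 45, 55, False),
]

print(fail_fast_violations(CURRENT))
print(fail_fast_violations(FIXED))
```

An empty result for the artificially failed release-test run would satisfy the "run-release-test does NOT execute" check; whether upload-report fired still has to be verified separately.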

## Out of scope

- Switching workflow engines (sibling tracking issue at sei-protocol/sei-k8s-controller#332 covers this longer-term)
- Retry policies — first solve fail-fast; retry comes later if needed

## References

- sei-protocol/sei-k8s-controller#339: the fix for the underlying bugs that made this behavior obvious
- Memory `project_chaos_mesh_workflow_gaps.md`: gap inventory from the original deep-dive

🤖 Generated with [Claude Code](https://claude.com/claude-code)
