Problem
Chaos Mesh v2.8.0's `Serial` template type marches through all children regardless of child outcome. Empirical evidence from the third manual fire of release-test (post #337):
- `keygen-admin` succeeded (exit 0).
- `provision-validator-chain` failed (exit 1, scheme registration bug).
- `provision-rpc-fleet` failed (exit 1, same bug).
- `run-release-test` ran anyway, errored on missing endpoint env ($(RPC_TM_RPC) unresolved because provision-snd never published).
- `upload-report` ran anyway, errored on missing workflownodes RBAC.
All four downstream WorkflowNodes showed `status.conditions[Accomplished]=True`, no `Failed` condition. Chaos Mesh marks each WorkflowNode "done" on pod termination, not on pod success.
Impact
For our test harness this is dangerous. A failure at `provision-validator-chain` means the chain doesn't exist; running `run-release-test` against an absent chain wastes ~30 minutes of cluster time and produces useless artifacts. Worse, the `upload-report` step that's supposed to capture diagnostic state for the failure case runs in the wrong order (after corrupted state from later steps).
The chaos-mesh-gaps memory hinted at this ("no `retry`/`retryStrategy`, deadline doesn't cascade"); the manual fire confirmed the inverse-of-expected behavior. Chaos Mesh's primary use case is fault injection where "the fault ran" is the assertion, so marching through child failures is upstream design intent — but it's the wrong shape for sequential test scenarios.
Relevant experts
- kubernetes-specialist — owns the Workflow CR semantics; identified the gap during the original chaos-mesh deep-dive.
- sre-engineer — owns failure-state capture; the current behavior means upload-report fires in the wrong order, defeating the diagnostic capture pattern.
- platform-engineer — owns the wrapper CronJob; one mitigation lives there (orchestrator polls EXIT_REASON, aborts Workflow externally).
- product-engineer — "test harness" vs "chaos engine" distinction is the load-bearing product framing.
Proposed approach
Three candidate mitigations, ranked by my read:
- Orchestrator-side EXIT_REASON polling + abort — the wrapper CronJob already polls `.status.phase`. Extend to ALSO read the workflow-vars CM's `EXIT_REASON` key (set by each seitask subcommand via `taskruntime.WriteExitReason`). On first non-empty EXIT_REASON != "pass", set the `workflow.chaos-mesh.org/abort=true` annotation. Workflow stops cleanly; upload-report fires inside the wrapper's trap. Mechanism we already have (annotate-abort pattern in the trap); just earlier trigger.
- `ConditionalBranches` template type — Chaos Mesh template kind we identified in the deep-dive but never used. Each step's continuation is gated on the prior step's exit-code via expression. Native to chaos-mesh but adds template complexity to every scenario.
- Bash-Task wrappers — wrap each seitask invocation in bash that runs the binary then explicitly POSTs the abort annotation on `exit 1`. Per-Task, repetitive, but no scenario-level template machinery.
`(1)` is the most surgical — keeps scenario YAML clean, leverages the EXIT_REASON contract that already exists, lives entirely in the wrapper. Worth prototyping first.
Acceptance criteria
Out of scope
References
🤖 Generated with Claude Code
Problem
Chaos Mesh v2.8.0's `Serial` template type marches through all children regardless of child outcome. Empirical evidence from the third manual fire of release-test (post #337):
All four downstream WorkflowNodes showed `status.conditions[Accomplished]=True`, no `Failed` condition. Chaos Mesh marks each WorkflowNode "done" on pod termination, not on pod success.
Impact
For our test harness this is dangerous. A failure at `provision-validator-chain` means the chain doesn't exist; running `run-release-test` against an absent chain wastes ~30 minutes of cluster time and produces useless artifacts. Worse, the `upload-report` step that's supposed to capture diagnostic state for the failure case runs in the wrong order (after corrupted state from later steps).
The chaos-mesh-gaps memory hinted at this ("no `retry`/`retryStrategy`, deadline doesn't cascade"); the manual fire confirmed the inverse-of-expected behavior. Chaos Mesh's primary use case is fault injection where "the fault ran" is the assertion, so marching through child failures is upstream design intent — but it's the wrong shape for sequential test scenarios.
Relevant experts
Proposed approach
Three candidate mitigations, ranked by my read:
`(1)` is the most surgical — keeps scenario YAML clean, leverages the EXIT_REASON contract that already exists, lives entirely in the wrapper. Worth prototyping first.
Acceptance criteria
Out of scope
References
🤖 Generated with Claude Code