Chaos Mesh Serial doesn't fail-fast on child Task errors — need explicit abort mechanism #340

@bdchatham

Problem

Chaos Mesh v2.8.0's `Serial` template type marches through all children regardless of child outcome. Empirical evidence from the third manual fire of release-test (post #337):

  • `keygen-admin` succeeded (exit 0).
  • `provision-validator-chain` failed (exit 1, scheme registration bug).
  • `provision-rpc-fleet` failed (exit 1, same bug).
  • `run-release-test` ran anyway, errored on missing endpoint env (`$(RPC_TM_RPC)` unresolved because provision-snd never published).
  • `upload-report` ran anyway, errored on missing workflownodes RBAC.

All four downstream WorkflowNodes showed `status.conditions[Accomplished]=True`, no `Failed` condition. Chaos Mesh marks each WorkflowNode "done" on pod termination, not on pod success.
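The gap between "terminated" and "succeeded" can be made concrete with a small sketch. It assumes WorkflowNode conditions follow the usual Kubernetes shape (a list of `type`/`status` entries) and that the Task pod's exit code is fetched separately; `task_outcome` and its return labels are hypothetical names for illustration, not part of Chaos Mesh.

```python
from typing import Mapping, Optional, Sequence


def condition_is_true(conditions: Sequence[Mapping[str, str]], cond_type: str) -> bool:
    """Return True if a condition of the given type has status "True"."""
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in conditions
    )


def task_outcome(conditions: Sequence[Mapping[str, str]], exit_code: Optional[int]) -> str:
    """Classify a Task node using both its conditions and the pod exit code.

    Accomplished=True only says the pod terminated, so the exit code
    (obtained separately; an assumption of this sketch) decides success.
    """
    if not condition_is_true(conditions, "Accomplished"):
        return "running"
    if exit_code is None:
        return "unknown"
    return "succeeded" if exit_code == 0 else "failed"
```

Under these assumptions, the `provision-validator-chain` node from the run above (Accomplished=True, exit 1) would classify as "failed" even though its conditions alone look healthy.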

Impact

For our test harness this is dangerous. A failure at `provision-validator-chain` means the chain doesn't exist; running `run-release-test` against an absent chain wastes ~30 minutes of cluster time and produces useless artifacts. Worse, the `upload-report` step that is supposed to capture diagnostic state for the failure case runs too late: by the time it fires, later steps have already corrupted the state it was meant to capture.

The chaos-mesh-gaps memory hinted at this ("no `retry`/`retryStrategy`, deadline doesn't cascade"); the manual fire confirmed that the behavior is the inverse of what we expected. Chaos Mesh's primary use case is fault injection, where "the fault ran" is the assertion, so marching through child failures is upstream design intent. It is the wrong shape for sequential test scenarios, though.

Relevant experts

  • kubernetes-specialist — owns the Workflow CR semantics; identified the gap during the original chaos-mesh deep-dive.
  • sre-engineer — owns failure-state capture; the current behavior means upload-report fires in the wrong order, defeating the diagnostic capture pattern.
  • platform-engineer — owns the wrapper CronJob; one mitigation lives there (orchestrator polls EXIT_REASON, aborts Workflow externally).
  • product-engineer — "test harness" vs "chaos engine" distinction is the load-bearing product framing.

Proposed approach

Three candidate mitigations, ranked by my read:

  1. Orchestrator-side EXIT_REASON polling + abort: the wrapper CronJob already polls `.status.phase`. Extend it to also read the workflow-vars CM's `EXIT_REASON` key (set by each seitask subcommand via `taskruntime.WriteExitReason`). On the first non-empty `EXIT_REASON` other than "pass", set the `workflow.chaos-mesh.org/abort=true` annotation. The Workflow stops cleanly and upload-report fires inside the wrapper's trap. This reuses a mechanism we already have (the annotate-abort pattern in the trap), just with an earlier trigger.
  2. `ConditionalBranches` template type: a Chaos Mesh template kind we identified in the deep-dive but never used. Each step's continuation is gated on the prior step's exit code via an expression. Native to Chaos Mesh, but adds template complexity to every scenario.
  3. Bash-Task wrappers: wrap each seitask invocation in bash that runs the binary, then explicitly POSTs the abort annotation on `exit 1`. Per-Task and repetitive, but needs no scenario-level template machinery.

Option 1 is the most surgical: it keeps scenario YAML clean, leverages the EXIT_REASON contract that already exists, and lives entirely in the wrapper. Worth prototyping first.
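A minimal sketch of the polling-and-abort decision in the first mitigation, written in Python purely for illustration (the real wrapper may well be bash). The cluster I/O is injected as callables so the logic stays testable; `watch_workflow`, `read_phase`, `read_exit_reason`, `apply_patch`, and `is_terminal` are all hypothetical names. Only the `EXIT_REASON` key, the "pass" value, and the abort annotation come from the description above.

```python
import time
from typing import Callable, Optional

# Annotation named in the proposal above; everything else here is illustrative.
ABORT_ANNOTATION = "workflow.chaos-mesh.org/abort"


def should_abort(exit_reason: Optional[str]) -> bool:
    """Abort on the first non-empty EXIT_REASON that is not "pass"."""
    return bool(exit_reason) and exit_reason != "pass"


def abort_patch() -> dict:
    """Merge-patch body that sets the abort annotation on the Workflow."""
    return {"metadata": {"annotations": {ABORT_ANNOTATION: "true"}}}


def watch_workflow(
    read_phase: Callable[[], str],
    read_exit_reason: Callable[[], Optional[str]],
    apply_patch: Callable[[dict], None],
    is_terminal: Callable[[str], bool],
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the Workflow finishes or a Task reports a failing EXIT_REASON.

    Returns "aborted" if the abort annotation was applied, else "finished".
    """
    while True:
        if should_abort(read_exit_reason()):
            apply_patch(abort_patch())
            return "aborted"
        if is_terminal(read_phase()):
            return "finished"
        sleep(interval_s)
```

The polling interval bounds how quickly the abort lands, so a value well under the ~30s target leaves headroom for the annotation to take effect.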

Acceptance criteria

  • First failing Task in a Workflow stops subsequent Task execution within ~30s of the failure
  • Upload-report (or equivalent terminal diagnostic capture) still fires for the failed run
  • Mechanism documented in the chaos-mesh-gaps memory + scenario authoring guide
  • Validated end-to-end against release-test: artificially fail provision-validator-chain, verify run-release-test does NOT execute
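The first and last criteria could be checked mechanically against a recorded run. The sketch below assumes each Task's start time, finish time, and exit code have been collected into simple records; `TaskRun`, `late_starts`, and the default exemption of `upload-report` are hypothetical choices for illustration, not an existing harness API.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TaskRun:
    name: str
    started_at: float   # seconds since the Workflow started
    finished_at: float
    exit_code: int


def late_starts(
    runs: Sequence[TaskRun],
    grace_s: float = 30.0,
    exempt: Sequence[str] = ("upload-report",),
) -> List[str]:
    """Return names of Tasks that started more than grace_s after the first failure.

    Diagnostic-capture Tasks listed in `exempt` are allowed to run late.
    An empty result means the fail-fast criterion held for this run.
    """
    failures = [r for r in runs if r.exit_code != 0]
    if not failures:
        return []
    first_failure = min(r.finished_at for r in failures)
    return [
        r.name
        for r in runs
        if r.started_at > first_failure + grace_s and r.name not in exempt
    ]
```

For the end-to-end validation, an artificially failed `provision-validator-chain` followed by any `run-release-test` start outside the grace window would show up in the returned list.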

🤖 Generated with Claude Code
