
Planner does not re-fire apply-statefulset when StatefulSet is deleted post-bootstrap #284

@bdchatham

Problem

When a SeiNode reaches phase: Running, the controller's planner is finished. If the rendered StatefulSet is later deleted (manually for ops reasons, or as part of an aftercare sweep), the SeiNode reconciler neither detects the missing derived resource nor re-fires the apply-statefulset task. The SeiNode stays in phase: Running indefinitely with no live pod.

Live-reproduced today against the state-size-analyzer SND in pacific-1: the StatefulSet was deleted to force a sidecar-image re-render via the platform-default SEI_SIDECAR_IMAGE env, and the controller never recreated it. The workaround was to delete the SeiNode itself so that the SND's reconcileSeiNodes would recreate it fresh and the bootstrap plan would re-run from scratch. That also wipes the data PVC (SeiNode-owned), forcing a full state-sync redo.

Impact

Any ops procedure that deletes the StatefulSet becomes a one-way door: the SND won't bring seid back automatically. The most immediate consumer is the state-size-analysis CronJob (queued follow-up to platform state-size-analyzer.yaml), which is designed to scale to replicas: 0, run an analyzer Job against the released PVC, and scale back. That pattern's correctness depends on the controller re-creating the StatefulSet on the scale-up path. Today operators have to know the "delete the SeiNode and accept PVC loss" workaround, which is a footgun: silent state loss for anyone who doesn't realize the cascade.

Relevant experts

  • kubernetes-specialist — controller planner + reconcile logic

Proposed approach

In the SeiNode reconciler, after the bootstrap plan completes (phase: Running), continue to assert derived resources exist on each reconcile pass. If the StatefulSet matching SeiNode.Name is missing, fire a new plan containing only the post-bootstrap apply tasks (apply-statefulset, apply-service, apply-rbac-proxy-config if TLS is enabled). Do not re-fire discover-peers / configure-state-sync / config-validate — those already ran and the existing PVC carries their result. The new plan should be a targeted "rebuild-derived-resources" flow, not a full bootstrap.
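
A minimal sketch of the plan-selection step described above, written as a pure function so it can be reasoned about without a cluster. The task names come from this issue; `NodeState` and `rebuildPlan` are hypothetical names for illustration and are not taken from the controller source. It also assumes that a non-Running SeiNode is still owned by its existing plan, so no rebuild is scheduled for it.

```go
package main

// NodeState is a hypothetical summary of what the reconciler observed on one
// pass. It is not a type from the controller.
type NodeState struct {
	Phase             string // SeiNode phase, e.g. "Running"
	StatefulSetExists bool   // whether a StatefulSet named SeiNode.Name was found
	TLSEnabled        bool   // whether TLS is enabled for this node
}

// rebuildPlan returns the targeted "rebuild-derived-resources" task list for a
// Running SeiNode whose StatefulSet is missing, or nil when nothing needs to be
// rebuilt. Bootstrap-only tasks (discover-peers, configure-state-sync,
// config-validate) are deliberately never included.
func rebuildPlan(s NodeState) []string {
	// Assumption: only Running nodes are handled here; other phases are
	// still driven by their own plan.
	if s.Phase != "Running" || s.StatefulSetExists {
		return nil
	}
	tasks := []string{"apply-statefulset", "apply-service"}
	if s.TLSEnabled {
		tasks = append(tasks, "apply-rbac-proxy-config")
	}
	return tasks
}
```

Keeping the decision separate from the Kubernetes client calls is one possible design; it would let the chain-upgrade and peer re-discovery paths stay untouched while this check runs on every reconcile pass.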

Acceptance criteria

  • Deleting the rendered StatefulSet on a Running SeiNode causes the controller to recreate it within one reconcile cycle
  • The new StatefulSet is rendered from current SeiNode.Spec (so any sidecar-image change since initial bootstrap is picked up via the platform default)
  • The data PVC is preserved across the recreate — no state-sync redo
  • Existing reconcile paths (chain-upgrade plan, peer re-discovery) remain unaffected
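
The first three criteria could be exercised against an in-memory model before wiring up a real cluster test. The sketch below is only an illustration: `fakeCluster`, `SeiNodeSpec`, `StatefulSet`, and `reconcileRunning` are hypothetical stand-ins, not the controller's types or the Kubernetes client API, and the sidecar image is assumed to be already resolved from the platform default.

```go
package main

// SeiNodeSpec is a hypothetical, trimmed-down view of SeiNode.Spec.
type SeiNodeSpec struct {
	Name         string
	SidecarImage string // assumed already resolved from the platform default
}

// StatefulSet is a hypothetical stand-in for the rendered StatefulSet.
type StatefulSet struct {
	Name         string
	SidecarImage string
}

// fakeCluster models only the two resource kinds the criteria mention.
type fakeCluster struct {
	statefulSets map[string]StatefulSet
	pvcs         map[string]bool // PVC name -> exists
}

// reconcileRunning models one reconcile pass for a Running SeiNode. If the
// StatefulSet is missing it is rendered from the current spec; PVCs are never
// touched. It reports whether the StatefulSet was recreated.
func reconcileRunning(c *fakeCluster, spec SeiNodeSpec) bool {
	if _, ok := c.statefulSets[spec.Name]; ok {
		return false
	}
	c.statefulSets[spec.Name] = StatefulSet{
		Name:         spec.Name,
		SidecarImage: spec.SidecarImage,
	}
	return true
}
```

A check built on this model would delete the StatefulSet entry, run one pass, and then assert that the entry is back with the current sidecar image and that the PVC entry is unchanged.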

Out of scope

The related race condition in which ensure-data-pvc fails terminally when a stale orphaned PVC is present from a just-deleted SeiNode. That's a separate issue: that code path needs either auto-adoption logic or retry-with-backoff so K8s GC has time to catch up on the orphan. Filing separately if/when it comes up again.

References

  • Live reproduction today on the pacific-1/state-size-analyzer SND
  • Triggering context: cycling the pod to pick up the platform-default SEI_SIDECAR_IMAGE after platform PR #590 removed an inline sidecar-image pin from the SND spec
