Problem
When a SeiNode reaches phase: Running, the controller's planner is finished. If the rendered StatefulSet is later deleted (manually for ops reasons, or as part of an aftercare sweep), the SeiNode reconciler does not detect the missing derived resource and re-fire the apply-statefulset task. The SeiNode stays in phase: Running indefinitely with no live pod.
Live-reproduced today against the state-size-analyzer SND in pacific-1: deleted the StatefulSet to force a sidecar-image re-render via the platform-default SEI_SIDECAR_IMAGE env. Controller never recreated it. Workaround required deleting the SeiNode itself so the SND's reconcileSeiNodes would recreate it fresh and the bootstrap plan would re-run from scratch — which also wipes the data PVC (SeiNode-owned), forcing a full state-sync redo.
Impact
Any ops procedure that deletes the StatefulSet becomes a one-way door — the SND won't bring seid back automatically. The most immediate consumer is the state-size-analysis CronJob (queued follow-up to platform state-size-analyzer.yaml) which is designed to scale replicas: 0, run an analyzer Job against the released PVC, and scale back — that pattern's correctness depends on the controller re-creating the StatefulSet on the scale-up path. Today operators have to know the "delete the SeiNode and accept PVC loss" workaround, which is a footgun: silent state loss for anyone who doesn't realize the cascade.
Relevant experts
kubernetes-specialist — controller planner + reconcile logic
Proposed approach
In the SeiNode reconciler, after the bootstrap plan completes (phase: Running), continue to assert derived resources exist on each reconcile pass. If the StatefulSet matching SeiNode.Name is missing, fire a new plan containing only the post-bootstrap apply tasks (apply-statefulset, apply-service, apply-rbac-proxy-config if TLS is enabled). Do not re-fire discover-peers / configure-state-sync / config-validate — those already ran and the existing PVC carries their result. The new plan should be a targeted "rebuild-derived-resources" flow, not a full bootstrap.
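A minimal sketch of that decision, for discussion only. Everything here is an assumption rather than the controller's real API: the trimmed `SeiNode` struct, the `TLSEnabled` field, the `Cluster` interface, the `rebuildDerivedResourcesPlan` helper, and the node name used in `main` are all hypothetical. Only the task names and the `Running` phase come from this issue.

```go
package main

import "fmt"

// Phase mirrors the SeiNode status phase named in this issue.
type Phase string

const PhaseRunning Phase = "Running"

// SeiNode is a hypothetical, trimmed-down stand-in for the real CRD type.
type SeiNode struct {
	Name       string
	Phase      Phase
	TLSEnabled bool // assumption: how "TLS is enabled" is expressed
}

// Cluster is a hypothetical lookup interface; the real controller would
// query the API server (or its cache) instead.
type Cluster interface {
	StatefulSetExists(name string) bool
}

// Plan is an ordered list of task names.
type Plan []string

// rebuildDerivedResourcesPlan returns a targeted plan when a Running SeiNode
// has lost its StatefulSet, and nil when there is nothing to rebuild.
// It never includes discover-peers, configure-state-sync, or config-validate.
func rebuildDerivedResourcesPlan(node SeiNode, c Cluster) Plan {
	if node.Phase != PhaseRunning {
		return nil // bootstrap (or another plan) still owns this node
	}
	if c.StatefulSetExists(node.Name) {
		return nil // derived resource is present, nothing to do
	}
	plan := Plan{"apply-statefulset", "apply-service"}
	if node.TLSEnabled {
		plan = append(plan, "apply-rbac-proxy-config")
	}
	return plan
}

// fakeCluster is an in-memory Cluster used only for the demo below.
type fakeCluster map[string]bool

func (f fakeCluster) StatefulSetExists(name string) bool { return f[name] }

func main() {
	// Hypothetical node name, chosen only for illustration.
	node := SeiNode{Name: "example-node", Phase: PhaseRunning, TLSEnabled: true}

	// StatefulSet missing: expect the targeted rebuild plan.
	fmt.Println(rebuildDerivedResourcesPlan(node, fakeCluster{}))

	// StatefulSet present: expect no plan.
	fmt.Println(rebuildDerivedResourcesPlan(node, fakeCluster{node.Name: true}))
}
```

Keeping the check a pure function of the node and a lookup interface would make it straightforward to unit-test without a live cluster. Whether it belongs in the planner or directly in the reconcile loop is an open design question.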
Acceptance criteria
- Deleting the rendered StatefulSet on a Running SeiNode causes the controller to recreate it within one reconcile cycle
- The new StatefulSet is rendered from current SeiNode.Spec (so any sidecar-image change since initial bootstrap is picked up via the platform default)
- The data PVC is preserved across the recreate — no state-sync redo
- Existing reconcile paths (chain-upgrade plan, peer re-discovery) remain unaffected
Out of scope
The related "ensure-data-pvc fails terminally when a stale orphaned PVC is present from a just-deleted SeiNode" race condition. That's a separate issue — that code path needs either auto-adoption logic or retry-with-backoff so K8s GC has time to catch up on the orphan. Filing separately if/when it comes up again.
References
- Live reproduction today on the pacific-1/state-size-analyzer SND
- Triggering context: cycling the pod to pick up the platform-default SEI_SIDECAR_IMAGE after platform PR #590 removed an inline sidecar-image pin from the SND spec