fix: propagate Kafka broker exit code to pod phase#260
Conversation
The entrypoint script called `rm /var/run/wait/do-not-exit-yet` after kafka-server-start.sh, and since `rm` always exits 0, the broker pod would show phase Succeeded even after a crash (e.g. KafkaStorageException). Koperator would then recreate the pod with no backoff and no visible failure. Capture Kafka's exit code before the rm and exit with it, so a crashed broker produces pod phase Failed instead of Succeeded.
|
thanks for your contribution @DarkIsDude! |
|
@amuraru done. Sorry for missing that 😬 |
When koperator performs a rolling upgrade, Kubernetes sends SIGTERM to the Kafka container. If Kafka's shutdown hook does not complete within the grace period, SIGKILL is sent instead. Both result in exit codes ≥ 128 (128 + signal number), which our propagation logic was marking as pod failures, causing an extra koperator reconcile cycle and pushing the multi-disk-removal E2E test over its 3-minute readiness window. Signal-based exits (≥ 128) are controlled shutdowns orchestrated by koperator, not genuine Kafka failures. Only exit codes 1–127 indicate a real Kafka crash (e.g. KafkaStorageException) and should set the pod phase to Failed.
| # Signal-based exits are controlled shutdowns (e.g. SIGTERM during rolling upgrades), | ||
| # not genuine Kafka failures, so suppress them to avoid marking the pod as Failed. | ||
| if [ $KAFKA_EXIT -ge 128 ]; then | ||
| exit 0 |
There was a problem hiding this comment.
hmm - not sure about this.
e.g 137 can be an actual OOM/SIGKILL of the Java process, the wrapper exiting 0
makes the broker pod look Succeeded
Why no simply passing the exact value of $KAFKA_EXIT code?
There was a problem hiding this comment.
Tests was failing. I'm trying this one (not sure it was a flaky or not 😬). I can of course revert
There was a problem hiding this comment.
@amuraru tests are now green. It's a right concern so I changed my approach 🙏. Do you prefer to change tests instead ?
There was a problem hiding this comment.
Can we add a test for a graceful shutdown and see whether the handling of sigterm signal is actually correctly handled by the operator?
Only suppress exit code 143 (128+SIGTERM), which is sent by koperator during rolling upgrades. All other non-zero codes, including 137 (OOMKilled), are genuine failures and are propagated so the pod phase is set to Failed.
Problem
In
pkg/resources/kafka/wait-for-envoy-sidecar.sh, the entrypoint finishes with:rmalways exits0, so it overwrites Kafka's real exit code. When Kafka crashes(e.g.
KafkaStorageExceptionfrom a full disk), the broker pod reports phaseSucceededinstead ofFailed. Koperator then recreates the pod every ~2 s withno backoff and no visible failure signal — alerts never fire, and the root cause
is invisible in
kubectl get pods.Fix
Capture Kafka's exit code before
rmand propagate it:One crashed container is enough to set the pod phase to
Failed, so the failureis immediately visible and monitoring can alert on it.
Verification
Fill a broker's data disk to trigger a
KafkaStorageException:Before fix: pod phase →
Succeeded(crash hidden)After fix: pod phase →
Failed(crash visible)