fix: propagate Kafka broker exit code to pod phase by DarkIsDude · Pull Request #260 · adobe/koperator

DarkIsDude · 2026-06-03T13:18:26Z

Problem

In pkg/resources/kafka/wait-for-envoy-sidecar.sh, the entrypoint finishes with:

/opt/kafka/bin/kafka-server-start.sh /config/broker-config
rm /var/run/wait/do-not-exit-yet

rm always exits 0, so it overwrites Kafka's real exit code. When Kafka crashes
(e.g. KafkaStorageException from a full disk), the broker pod reports phase
Succeeded instead of Failed. Koperator then recreates the pod every ~2 s with
no backoff and no visible failure signal — alerts never fire, and the root cause
is invisible in kubectl get pods.

Fix

Capture Kafka's exit code before rm and propagate it:

/opt/kafka/bin/kafka-server-start.sh /config/broker-config
KAFKA_EXIT=$?
rm /var/run/wait/do-not-exit-yet
exit $KAFKA_EXIT

One crashed container is enough to set the pod phase to Failed, so the failure
is immediately visible and monitoring can alert on it.

Verification

Fill a broker's data disk to trigger a KafkaStorageException:

kubectl exec -n kafka <broker-pod> -- dd if=/dev/urandom of=/kafka-logs/fill-disk bs=1M count=99999 || true

Before fix: pod phase → Succeeded (crash hidden)
After fix: pod phase → Failed (crash visible)

The entrypoint script called `rm /var/run/wait/do-not-exit-yet` after kafka-server-start.sh, and since `rm` always exits 0, the broker pod would show phase Succeeded even after a crash (e.g. KafkaStorageException). Koperator would then recreate the pod with no backoff and no visible failure. Capture Kafka's exit code before the rm and exit with it, so a crashed broker produces pod phase Failed instead of Succeeded.

amuraru · 2026-06-05T13:27:09Z

thanks for your contribution @DarkIsDude!
Can you please sign the contribution CLA ?

DarkIsDude · 2026-06-05T13:34:38Z

@amuraru done. Sorry for missing that 😬

When koperator performs a rolling upgrade, Kubernetes sends SIGTERM to the Kafka container. If Kafka's shutdown hook does not complete within the grace period, SIGKILL is sent instead. Both result in exit codes ≥ 128 (128 + signal number), which our propagation logic was marking as pod failures, causing an extra koperator reconcile cycle and pushing the multi-disk-removal E2E test over its 3-minute readiness window. Signal-based exits (≥ 128) are controlled shutdowns orchestrated by koperator, not genuine Kafka failures. Only exit codes 1–127 indicate a real Kafka crash (e.g. KafkaStorageException) and should set the pod phase to Failed.

amuraru · 2026-06-05T15:53:56Z

+# Signal-based exits are controlled shutdowns (e.g. SIGTERM during rolling upgrades),
+# not genuine Kafka failures, so suppress them to avoid marking the pod as Failed.
+if [ $KAFKA_EXIT -ge 128 ]; then
+    exit 0


hmm - not sure about this.
e.g 137 can be an actual OOM/SIGKILL of the Java process, the wrapper exiting 0
makes the broker pod look Succeeded

Why no simply passing the exact value of $KAFKA_EXIT code?

Tests was failing. I'm trying this one (not sure it was a flaky or not 😬). I can of course revert

@amuraru tests are now green. It's a right concern so I changed my approach 🙏. Do you prefer to change tests instead ?

Can we add a test for a graceful shutdown and see whether the handling of sigterm signal is actually correctly handled by the operator?

Only suppress exit code 143 (128+SIGTERM), which is sent by koperator during rolling upgrades. All other non-zero codes, including 137 (OOMKilled), are genuine failures and are propagated so the pod phase is set to Failed.

DarkIsDude mentioned this pull request Jun 3, 2026

🐛(ZENKO-5288) wrap kafka-server-start.sh to propagate broker exit code scality/Zenko#2428

Open

DarkIsDude closed this Jun 5, 2026

DarkIsDude reopened this Jun 5, 2026

amuraru reviewed Jun 5, 2026

View reviewed changes

DarkIsDude requested a review from amuraru June 12, 2026 08:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: propagate Kafka broker exit code to pod phase#260

fix: propagate Kafka broker exit code to pod phase#260
DarkIsDude wants to merge 3 commits into
adobe:masterfrom
DarkIsDude:fix/propagate-kafka-broker-exit-code

DarkIsDude commented Jun 3, 2026

Uh oh!

amuraru commented Jun 5, 2026

Uh oh!

DarkIsDude commented Jun 5, 2026

Uh oh!

amuraru Jun 5, 2026

Uh oh!

DarkIsDude Jun 5, 2026

Uh oh!

DarkIsDude Jun 5, 2026

Uh oh!

amuraru Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

DarkIsDude commented Jun 3, 2026

Problem

Fix

Verification

Uh oh!

amuraru commented Jun 5, 2026

Uh oh!

DarkIsDude commented Jun 5, 2026

Uh oh!

amuraru Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

DarkIsDude Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

DarkIsDude Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

amuraru Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants