Skip to content

fix: propagate Kafka broker exit code to pod phase#260

Open
DarkIsDude wants to merge 3 commits into
adobe:masterfrom
DarkIsDude:fix/propagate-kafka-broker-exit-code
Open

fix: propagate Kafka broker exit code to pod phase#260
DarkIsDude wants to merge 3 commits into
adobe:masterfrom
DarkIsDude:fix/propagate-kafka-broker-exit-code

Conversation

@DarkIsDude

Copy link
Copy Markdown

Problem

In pkg/resources/kafka/wait-for-envoy-sidecar.sh, the entrypoint finishes with:

/opt/kafka/bin/kafka-server-start.sh /config/broker-config
rm /var/run/wait/do-not-exit-yet

rm always exits 0, so it overwrites Kafka's real exit code. When Kafka crashes
(e.g. KafkaStorageException from a full disk), the broker pod reports phase
Succeeded instead of Failed. Koperator then recreates the pod every ~2 s with
no backoff and no visible failure signal — alerts never fire, and the root cause
is invisible in kubectl get pods.

Fix

Capture Kafka's exit code before rm and propagate it:

/opt/kafka/bin/kafka-server-start.sh /config/broker-config
KAFKA_EXIT=$?
rm /var/run/wait/do-not-exit-yet
exit $KAFKA_EXIT

One crashed container is enough to set the pod phase to Failed, so the failure
is immediately visible and monitoring can alert on it.

Verification

Fill a broker's data disk to trigger a KafkaStorageException:

kubectl exec -n kafka <broker-pod> -- dd if=/dev/urandom of=/kafka-logs/fill-disk bs=1M count=99999 || true

Before fix: pod phase → Succeeded (crash hidden)
After fix: pod phase → Failed (crash visible)

The entrypoint script called `rm /var/run/wait/do-not-exit-yet` after
kafka-server-start.sh, and since `rm` always exits 0, the broker pod
would show phase Succeeded even after a crash (e.g. KafkaStorageException).
Koperator would then recreate the pod with no backoff and no visible failure.

Capture Kafka's exit code before the rm and exit with it, so a crashed
broker produces pod phase Failed instead of Succeeded.
@amuraru

amuraru commented Jun 5, 2026

Copy link
Copy Markdown

thanks for your contribution @DarkIsDude!
Can you please sign the contribution CLA ?

@DarkIsDude DarkIsDude closed this Jun 5, 2026
@DarkIsDude DarkIsDude reopened this Jun 5, 2026
@DarkIsDude

Copy link
Copy Markdown
Author

@amuraru done. Sorry for missing that 😬

When koperator performs a rolling upgrade, Kubernetes sends SIGTERM to
the Kafka container. If Kafka's shutdown hook does not complete within
the grace period, SIGKILL is sent instead. Both result in exit codes
≥ 128 (128 + signal number), which our propagation logic was marking
as pod failures, causing an extra koperator reconcile cycle and pushing
the multi-disk-removal E2E test over its 3-minute readiness window.

Signal-based exits (≥ 128) are controlled shutdowns orchestrated by
koperator, not genuine Kafka failures. Only exit codes 1–127 indicate
a real Kafka crash (e.g. KafkaStorageException) and should set the pod
phase to Failed.
# Signal-based exits are controlled shutdowns (e.g. SIGTERM during rolling upgrades),
# not genuine Kafka failures, so suppress them to avoid marking the pod as Failed.
if [ $KAFKA_EXIT -ge 128 ]; then
exit 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm - not sure about this.
e.g 137 can be an actual OOM/SIGKILL of the Java process, the wrapper exiting 0
makes the broker pod look Succeeded

Why no simply passing the exact value of $KAFKA_EXIT code?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests was failing. I'm trying this one (not sure it was a flaky or not 😬). I can of course revert

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amuraru tests are now green. It's a right concern so I changed my approach 🙏. Do you prefer to change tests instead ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for a graceful shutdown and see whether the handling of sigterm signal is actually correctly handled by the operator?

Only suppress exit code 143 (128+SIGTERM), which is sent by koperator
during rolling upgrades. All other non-zero codes, including 137
(OOMKilled), are genuine failures and are propagated so the pod phase
is set to Failed.
@DarkIsDude DarkIsDude requested a review from amuraru June 12, 2026 08:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants