Describe the issue
We found potential issues when using Spring Cloud Stream Kafka Binder with batch consumers + DLQ + manual acknowledgments.
According to the docs, in batch mode with DLQ enabled, all records from the previous poll are sent to DLQ. That matches current behavior.
However, in manual-ack scenarios we observed:
- A likely bug with partial commits using `acknowledge(index)` + DLQ.
- A general improvement opportunity: DLQ handling appears to ignore already acknowledged/committed progress and republishes records that were already processed successfully.
Repro project (public): https://github.com/ferblaca/demoSCSKafka/tree/batch-consumer-dlq
To Reproduce
- Clone and open the repro project:

```
git clone -b batch-consumer-dlq https://github.com/ferblaca/demoSCSKafka.git
cd demoSCSKafka
```

- Start Kafka/Zookeeper/Kafka UI using the provided compose file:

```
docker compose -f src/main/resources/compose/docker-compose.yml up -d
```

- In `src/main/resources/application.yml`, set one function definition at a time: `batchConsumerAckManual` (scenario A) or `batchConsumerNackManual` (scenario B).
- Run the application.
- Produce 100 records to `input-topic-batch` (the project includes the producer binding `foo-out-0` to that topic, per config).
- In the batch consumer logic, force an exception at index 20.
- Inspect topic/DLQ behavior in Kafka UI: http://localhost:9090
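For reference, the binder setup for this scenario looks roughly like the following sketch. The property names (`batch-mode`, `enableDlq`, `dlqName`, `ackMode`) are standard Spring Cloud Stream Kafka binder consumer properties; the binding name follows the `batchConsumerAckManual` function above, but the group and DLQ topic names are assumptions, not copied from the repro project:

```yaml
spring:
  cloud:
    function:
      definition: batchConsumerAckManual   # or batchConsumerNackManual for scenario B
    stream:
      bindings:
        batchConsumerAckManual-in-0:
          destination: input-topic-batch
          group: batch-dlq-demo            # assumed group name
          consumer:
            batch-mode: true               # deliver the whole poll as one List
      kafka:
        bindings:
          batchConsumerAckManual-in-0:
            consumer:
              enableDlq: true
              dlqName: input-topic-batch-dlq   # assumed DLQ topic name
              ackMode: MANUAL                  # hands an Acknowledgment to the listener
```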
Scenario A: acknowledge(index) (partial commit) + DLQ
Observed:
- After calling `acknowledgment.acknowledge(19)`, the internal partial state is set (partial=19).
- During DLQ handling for the failed batch, acknowledgments continue over the full polled batch.
- The failure starts when processing approximately record 80/100, because the effective index reaches 100 (80 + partial(19) + 1 = 100), which exceeds the batch bounds.
- It then throws `IllegalArgumentException: index (100) is out of range (0-99)`.
- The batch is retried multiple times.
- The final DLQ count becomes much larger than the input (observed: 810 DLQ records for 100 input records, i.e. 10 retries x 81 messages).
Consumer behavior:
- Partial ack every 20 records: `acknowledgment.acknowledge(i)`
- Forced exception at `i == 20`
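The consumer behavior above can be modeled with a self-contained sketch. The `Acknowledgment` interface here is a stub standing in for Spring Kafka's `Acknowledgment` (this is an assumption about shape, not the repro project's actual code); it only reproduces the index-based partial-ack pattern described in this issue:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchAckManualDemo {

    // Minimal stand-in for org.springframework.kafka.support.Acknowledgment
    interface Acknowledgment {
        void acknowledge(int index);
    }

    // Acks every 20th record (indices 19, 39, ...) and forces a failure at
    // i == 20, mirroring the repro description above.
    static void consume(List<String> batch, Acknowledgment ack) {
        for (int i = 0; i < batch.size(); i++) {
            if (i == 20) {
                throw new RuntimeException("forced failure at index " + i);
            }
            if ((i + 1) % 20 == 0) {
                ack.acknowledge(i); // partial commit up to and including index i
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> acked = new ArrayList<>();
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 100; i++) batch.add("record-" + i);
        try {
            consume(batch, acked::add);
        } catch (RuntimeException e) {
            // only acknowledge(19) ran before the forced failure
            System.out.println("acked=" + acked + ", failure: " + e.getMessage());
        }
    }
}
```

So by the time the DLQ flow takes over, exactly one partial commit (`acknowledge(19)`) has been recorded, which is the state the out-of-range failure below depends on.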
This looks like a bug in the interaction between partial batch ack state and DLQ processing:

```
Caused by: java.lang.IllegalArgumentException: index (100) is out of range (0-99)
```
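The arithmetic implied by the stack trace can be made explicit. This is a simplified model inferred from the numbers in this issue, not the binder's actual source: it assumes the DLQ flow offsets each remaining record's index by the last partial ack plus one, which overruns the batch exactly where the failure was observed:

```java
public class PartialAckIndexDemo {

    // Assumed effective-index computation: i-th remaining record shifted past
    // the partially acked prefix (partial + 1 records already committed).
    static int effectiveIndex(int relativeIndex, int partial) {
        return relativeIndex + partial + 1;
    }

    public static void main(String[] args) {
        int partial = 19;   // acknowledge(19) was called before the failure
        int lastIndex = 99; // 100-record batch, valid indices 0-99

        int idx = effectiveIndex(80, partial);
        System.out.println(idx);             // 100
        System.out.println(idx > lastIndex); // true: "index (100) is out of range (0-99)"
    }
}
```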
Scenario B: nack(index, Duration) + DLQ
Same batch/DLQ/manual setup, but calling `acknowledgment.nack(i, Duration.ofMillis(100))` on failure at index 20.

Observed:
- No index out-of-range exception in this path.
- But the DLQ still receives records in repeated waves as re-seek/retry progresses.
- The final DLQ count is also amplified (observed: 280 DLQ records for 100 input records).
Expected behavior
- Clarify/document whether manual ack (`acknowledge`/`nack`) is officially supported with DLQ in batch mode.
- If supported, fix the apparent bug for `acknowledge(index)` + DLQ (index out-of-range).
- Improve DLQ error handling so that records already acknowledged/committed are not resent to the DLQ (or provide a configurable strategy), reducing duplicate DLQ traffic and unnecessary reprocessing.
Additional context
- We understand Kafka processing is at-least-once and consumers must handle duplicates.
- Even so, current behavior appears to produce avoidable DLQ amplification in manual-ack batch use cases.
- We also ask for guidance on recommended patterns when using `acknowledge(index)`, `nack(index, Duration)`, and `BatchListenerFailedException(index)` together with DLQ in batch mode.
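For context on the third pattern: Spring Kafka's `BatchListenerFailedException(String, int)` lets a batch listener report exactly which record failed, so the error handler can commit everything before that index and retry/DLQ only from the failure point. The sketch below is a plain-Java stand-in for that pattern (the real class is `org.springframework.kafka.listener.BatchListenerFailedException`; the exception and listener here are simplified models, not Spring code):

```java
import java.util.List;

public class BatchFailureDemo {

    // Stand-in for BatchListenerFailedException: carries the failing index.
    static class BatchRecordFailedException extends RuntimeException {
        final int index;
        BatchRecordFailedException(String msg, int index) {
            super(msg);
            this.index = index;
        }
    }

    static void consume(List<String> batch) {
        for (int i = 0; i < batch.size(); i++) {
            if (batch.get(i).contains("poison")) {
                // report exactly which record in the polled batch failed
                throw new BatchRecordFailedException("bad record", i);
            }
        }
    }

    public static void main(String[] args) {
        try {
            consume(List.of("ok-0", "ok-1", "poison-2", "ok-3"));
        } catch (BatchRecordFailedException e) {
            // an error handler could commit offsets for indices [0, e.index)
            // and send only the records from e.index onward to the DLQ
            System.out.println("failed at index " + e.index); // failed at index 2
        }
    }
}
```

If this index-aware pattern is the intended route for batch + DLQ, documenting it as the alternative to `acknowledge(index)`/`nack(index, Duration)` would resolve much of the ambiguity described above.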
Version of the framework
- Spring Cloud: 2025.1.1
- Spring Boot: 4.0.2