Issue
KafkaAdmin::list_topics in
src/transport/kafka/admin.rs calls
BaseConsumer::fetch_metadata(None, 10s) on a freshly-created
BaseConsumer that has never been polled and has no subscription.
The whole "auto-discovering" -> "Resolved 0 topics" log cycle completes in
roughly 50-80ms. That is far too fast for a real cold bootstrap +
Metadata round-trip to a Kafka broker (admin Metadata is typically
100-500ms cold). The timing is consistent with librdkafka returning
the local empty metadata cache before the bootstrap connection
finishes.
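One way to test this hypothesis in the field is to time the metadata call and flag results that are both empty and implausibly fast. The sketch below is only an illustration: `timed_fetch`, `TimedFetch` and `MIN_PLAUSIBLE_COLD_FETCH` are hypothetical names, the closure stands in for whatever `KafkaAdmin::list_topics` actually does, and the 100ms threshold is simply the lower end of the cold round-trip range quoted above, not a librdkafka constant.

```rust
use std::time::{Duration, Instant};

/// Lower bound for a plausible cold metadata round-trip (assumption taken
/// from the 100-500ms range in this report, not from librdkafka).
pub const MIN_PLAUSIBLE_COLD_FETCH: Duration = Duration::from_millis(100);

/// Result of one metadata fetch plus how long it took.
#[derive(Debug)]
pub struct TimedFetch {
    pub topics: Vec<String>,
    pub elapsed: Duration,
}

impl TimedFetch {
    /// True when the result is empty AND came back faster than a cold
    /// bootstrap + Metadata round-trip plausibly could.
    pub fn is_suspicious(&self) -> bool {
        self.topics.is_empty() && self.elapsed < MIN_PLAUSIBLE_COLD_FETCH
    }
}

/// Run `fetch` once and record its wall-clock duration.
/// `fetch` is a stand-in for the real metadata call.
pub fn timed_fetch<F, E>(fetch: F) -> Result<TimedFetch, E>
where
    F: FnOnce() -> Result<Vec<String>, E>,
{
    let start = Instant::now();
    let topics = fetch()?;
    Ok(TimedFetch {
        topics,
        elapsed: start.elapsed(),
    })
}
```

Logging `is_suspicious()` next to the "Resolved 0 topics" line would show whether every empty result also falls under the threshold, which is what the cache hypothesis predicts.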
Result: the resolver declares "no matching topics" and the caller
treats it as fatal at startup, even when the topic visibly exists on
the broker (kafka-topics.sh --list from inside the same compose
network confirms the topic is present).
Reproduces deterministically against dfe-loader consuming
default_land from apache/kafka:4.2.0 -- both on a fresh
make ci AND minutes after the broker is fully healthy.
The original theory was a depends_on race
(kafka-init exits before broker metadata propagates). That theory
is wrong: the failure reproduces long after the broker is healthy and
the topic is visible to the kafka CLI from inside the compose network.
- Resolver wiring: src/transport/kafka/topic_resolver.rs
- Call site: src/transport/kafka/mod.rs:239
Caller-side symptom:
fatal: service error: Kafka error: Transport error: transport config error: Auto-discovery found no matching topics
Proposed solution
- Prime the admin consumer before the first fetch_metadata call:
  - Either issue a throwaway poll(Duration::from_secs(0)) in a
    loop until the client reports a connection
  - OR replace the admin path with rdkafka's AdminClient +
    describe_cluster / list_topics (the proper rdkafka admin API,
    which does its own bootstrap handshake)
- Independent of the priming fix, make the resolver retry with
exponential backoff for a configurable window
(e.g. kafka.topic_discovery_timeout, default 30s) before
declaring an empty result fatal.
- Treat an empty match-set as a recoverable state the consumer loop
polls on at runtime, not a startup-only fatal check. Topics can be
created after the consumer starts.
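The retry and "recoverable empty result" bullets could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the resolver's actual code: `discover_with_backoff` and `Discovery` are hypothetical names, the `list_topics` closure stands in for the real metadata call, and `window` plays the role of the proposed kafka.topic_discovery_timeout. Errors from the closure are propagated immediately here; a real implementation might prefer to retry transient transport errors as well.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Outcome of one discovery window (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Discovery {
    /// At least one topic matched.
    Resolved(Vec<String>),
    /// The window elapsed with no match. Not fatal: the caller can keep
    /// polling at runtime, since topics may be created later.
    Pending,
}

/// Call `list_topics` until it returns a non-empty list or `window`
/// elapses, doubling the delay between attempts up to `max_delay`.
pub fn discover_with_backoff<F, E>(
    mut list_topics: F,
    window: Duration,
    initial_delay: Duration,
    max_delay: Duration,
) -> Result<Discovery, E>
where
    F: FnMut() -> Result<Vec<String>, E>,
{
    let deadline = Instant::now() + window;
    let mut delay = initial_delay;
    loop {
        let topics = list_topics()?;
        if !topics.is_empty() {
            return Ok(Discovery::Resolved(topics));
        }
        let now = Instant::now();
        if now >= deadline {
            return Ok(Discovery::Pending);
        }
        // Never sleep past the deadline, so the final attempt happens
        // right as the window closes.
        sleep(delay.min(deadline - now));
        delay = (delay * 2).min(max_delay);
    }
}
```

With a 30s window, a 100ms initial delay and a 5s cap, the resolver would retry at growing intervals and return `Discovery::Pending` rather than a fatal error if nothing matches, leaving the consumer loop free to try again later.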
Additional info
Workaround used in dfe-docker: drop topic_regex in
config/loader/kafka.yaml and list topics explicitly under topics:.
The explicit-subscribe path skips the resolver entirely; librdkafka
resolves the topic lazily on subscribe and the race disappears.
# Workaround: explicit subscribe, no regex
kafka:
  topics:
    - default_land
  # topic_regex: '.*_land$'  # disabled to dodge the resolver race
Observed: rustlib 2.7.1 + dfe-loader.