Kafka topic auto-discovery returns empty when topics exist (race in admin metadata) #35

@kazmosahebi

Issue

KafkaAdmin::list_topics in
src/transport/kafka/admin.rs calls
BaseConsumer::fetch_metadata(None, 10s) on a freshly-created
BaseConsumer that has never been polled and has no subscription.

The whole cycle, from the auto-discovering log line to Resolved 0 topics,
completes in roughly 50-80ms. That is far too fast for a cold bootstrap
plus a Metadata round-trip to a Kafka broker (a cold admin Metadata
request typically takes 100-500ms). The timing is consistent with
librdkafka returning its empty local metadata cache before the bootstrap
connection finishes.
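
To confirm the timing argument in logs, the metadata call could be wrapped in a small timing guard. This is only a diagnostic sketch: `timed` and `looks_like_cache_hit` are hypothetical helpers, the closure stands in for the real `fetch_metadata` call, and the threshold is an assumption taken from the 100-500ms range quoted above.

```rust
use std::time::{Duration, Instant};

/// Runs `fetch` and returns its result together with the wall-clock time it took.
/// `fetch` is a stand-in for the real metadata call (hypothetical wiring).
pub fn timed<T, F: FnOnce() -> T>(fetch: F) -> (T, Duration) {
    let start = Instant::now();
    let out = fetch();
    (out, start.elapsed())
}

/// Heuristic only: an empty topic list that came back faster than a plausible
/// broker round-trip probably never left the local metadata cache.
pub fn looks_like_cache_hit(
    topic_count: usize,
    elapsed: Duration,
    min_round_trip: Duration, // assumption: ~100ms, the low end of a cold Metadata request
) -> bool {
    topic_count == 0 && elapsed < min_round_trip
}
```

With a 100ms threshold, the 50-80ms empty result described above would be flagged, while a non-empty result or a slow empty result would not. A flag here is a hint for logging, not proof of the race.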

Result: the resolver declares "no matching topics" and the caller
treats it as fatal at startup, even when the topic visibly exists on
the broker (kafka-topics.sh --list from inside the same compose
network confirms the topic is present).

The failure reproduces deterministically with dfe-loader consuming
default_land from apache/kafka:4.2.0, both on a fresh make ci and
minutes after the broker is fully healthy.

Original theory was a depends_on race
(kafka-init exits before broker metadata propagates). That theory
is wrong: the failure reproduces long after the broker is healthy and
the topic is visible to the kafka CLI from inside the compose network.

Resolver wiring:

  • src/transport/kafka/topic_resolver.rs
  • Call site: src/transport/kafka/mod.rs:239

Caller-side symptom:
fatal: service error: Kafka error: Transport error: transport config error: Auto-discovery found no matching topics

Proposed solution

  • Prime the admin consumer before the first fetch_metadata call:
    • Either issue a throwaway poll(Duration::from_secs(0)) in a
      loop until the client reports a connection
    • OR replace the admin path with rdkafka's AdminClient +
      describe_cluster / list_topics (the proper rdkafka admin API,
      which does its own bootstrap handshake)
  • Independent of the priming fix, make the resolver retry with
    exponential backoff for a configurable window
    (e.g. kafka.topic_discovery_timeout, default 30s) before
    declaring an empty result fatal.
  • Treat an empty match-set as a recoverable state the consumer loop
    polls on at runtime, not a startup-only fatal check. Topics can be
    created after the consumer starts.
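
The retry and recoverable-state bullets above could look roughly like the following. This is a minimal sketch, not the resolver's actual code: `Discovery`, `discover_with_backoff`, and its parameters are hypothetical names, the `list_topics` closure stands in for whatever the resolver calls today, and `window` plays the role of the proposed kafka.topic_discovery_timeout.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Result of one discovery window (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Discovery {
    /// At least one topic matched.
    Resolved(Vec<String>),
    /// Nothing matched within the window; the caller can keep polling at runtime
    /// instead of treating this as fatal.
    Pending,
}

/// Calls `list_topics` until it returns a non-empty list or `window` elapses,
/// doubling the delay between attempts up to `max_delay`.
/// Errors from `list_topics` are returned immediately (a simplification;
/// transient errors may deserve retries too).
pub fn discover_with_backoff<F, E>(
    mut list_topics: F,
    window: Duration,
    initial_delay: Duration,
    max_delay: Duration,
) -> Result<Discovery, E>
where
    F: FnMut() -> Result<Vec<String>, E>,
{
    let deadline = Instant::now() + window;
    let mut delay = initial_delay;
    loop {
        let topics = list_topics()?;
        if !topics.is_empty() {
            return Ok(Discovery::Resolved(topics));
        }
        let now = Instant::now();
        if now >= deadline {
            return Ok(Discovery::Pending);
        }
        // Never sleep past the deadline, so one last attempt happens at the end of the window.
        sleep(delay.min(deadline - now));
        delay = (delay * 2).min(max_delay);
    }
}
```

Returning `Pending` rather than an error leaves the decision to the caller: startup can log and continue, and the consumer loop can call the same function again later to pick up topics created after the process started.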

Additional info

Workaround used in dfe-docker: drop topic_regex in
config/loader/kafka.yaml and list topics explicitly under topics:.
The explicit-subscribe path skips the resolver entirely; librdkafka
resolves the topic lazily on subscribe, and the race disappears.

# Workaround: explicit subscribe, no regex
kafka:
  topics:
    - default_land
  # topic_regex: '.*_land$'   # disabled to dodge the resolver race

Observed: rustlib 2.7.1 + dfe-loader.
