A way to resolve an Avro reader schema based on the writer schema and the required Arrow schema #9575

@mzabaluev

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Applications that implement data schema evolution at a layer above Avro, such as Iceberg, need to resolve the Arrow schema required for the output record batches against the writer schema in the Avro files.

One example is datafusion-comet, where Spark passes down a SQL schema for reading data from Avro files. That schema is converted to an Arrow schema for the "native" reader written in Rust, which streams RecordBatch items with that schema.

Describe the solution you'd like

A utility function to resolve the reader schema from these arguments:

  1. The Avro writer schema, which can be read from e.g. an OCF file using the HeaderInfo API added in #9548 (feat(arrow-avro): HeaderInfo to expose OCF header),
  2. The required Arrow schema for the output batches.

The resulting Avro schema can be used to construct an Avro reader using the with_reader_schema builder method.

The recursive schema resolution should differ from the Avro resolution rules as follows:

  • Struct/record fields are matched by name. Reordering of fields is allowed and should be reported as a modification of the writer schema.
  • Fields in the Arrow schema not present in the writer schema are added to the reader schema with the conventional nullable union type and the default value of null. It is an error if an added field is not nullable.
  • As Arrow struct types are anonymous, the resulting record type in the reader schema receives the name and namespace attributes of the corresponding record type in the writer schema.
  • The name of an Arrow list item field is not represented in the corresponding Avro array schema. The record batch will use the "item" list field name, as currently hardcoded. A metadata-based approach for round-tripping could be proposed separately.
  • Likewise, the names of the key and value fields in the struct that backs an Arrow map are not represented in Avro.
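
The record-level rules above can be illustrated with a minimal sketch. It uses simplified stand-in types rather than the arrow-avro or Arrow types: `AvroField`, `AvroRecord`, `ArrowField`, `Resolved`, and `resolve_record` are all hypothetical names, Avro types are kept as JSON text, and the Arrow-to-Avro type mapping and recursion into nested types are assumed to happen elsewhere.

```rust
/// Simplified stand-in for an Avro record field (hypothetical, not an arrow-avro type).
#[derive(Debug, Clone, PartialEq)]
pub struct AvroField {
    pub name: String,
    /// Avro type as a JSON fragment, e.g. `"long"` or `["null","string"]`.
    pub ty: String,
    /// Whether the field declares `"default": null`.
    pub default_null: bool,
}

/// Simplified stand-in for an Avro record schema.
#[derive(Debug, Clone, PartialEq)]
pub struct AvroRecord {
    pub name: String,
    pub namespace: Option<String>,
    pub fields: Vec<AvroField>,
}

/// Simplified stand-in for one field of an Arrow struct type.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrowField {
    pub name: String,
    /// Avro type this Arrow field maps to; the mapping itself is out of scope here.
    pub avro_ty: String,
    pub nullable: bool,
}

/// Resolution result: the reader record plus a flag telling whether it
/// differs from the writer record.
#[derive(Debug, PartialEq)]
pub struct Resolved {
    pub reader: AvroRecord,
    pub modified: bool,
}

pub fn resolve_record(writer: &AvroRecord, required: &[ArrowField]) -> Result<Resolved, String> {
    let mut fields = Vec::with_capacity(required.len());
    // A different field count already means the reader schema is not identical.
    let mut modified = required.len() != writer.fields.len();
    for (idx, want) in required.iter().enumerate() {
        match writer.fields.iter().position(|f| f.name == want.name) {
            Some(pos) => {
                // Matched by name; a different position counts as a modification.
                if pos != idx {
                    modified = true;
                }
                // A full implementation would recurse into the field type here.
                fields.push(writer.fields[pos].clone());
            }
            None if want.nullable => {
                // Added field: nullable union with a null default.
                modified = true;
                fields.push(AvroField {
                    name: want.name.clone(),
                    ty: format!("[\"null\",{}]", want.avro_ty),
                    default_null: true,
                });
            }
            None => return Err(format!("added field `{}` must be nullable", want.name)),
        }
    }
    Ok(Resolved {
        // Arrow structs are anonymous, so the record identity comes from the writer.
        reader: AvroRecord {
            name: writer.name.clone(),
            namespace: writer.namespace.clone(),
            fields,
        },
        modified,
    })
}
```

In this sketch, writer fields that are absent from the required Arrow schema are simply dropped from the reader record, which also sets the `modified` flag.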

The resolution function should provide a way to determine whether the resulting schema differs from the writer schema, so that an exact match can be processed on a fast path with no schema resolution.
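
One possible way to surface this distinction in the API is an outcome enum, sketched below. The `ReaderSchemaResolution` type and its `reader_schema` method are hypothetical names used only for illustration, and the schema is kept as JSON text for brevity.

```rust
/// Hypothetical outcome of resolving the required Arrow schema against the writer schema.
#[derive(Debug, Clone, PartialEq)]
pub enum ReaderSchemaResolution {
    /// The required Arrow schema matches the writer schema exactly (fast path).
    SameAsWriter,
    /// A distinct reader schema (JSON text) that has to be applied to the reader.
    Modified(String),
}

impl ReaderSchemaResolution {
    /// Returns the reader schema to apply, or `None` when the fast path can be taken.
    pub fn reader_schema(&self) -> Option<&str> {
        match self {
            Self::SameAsWriter => None,
            Self::Modified(json) => Some(json.as_str()),
        }
    }
}
```

A caller would then configure a reader schema only when `reader_schema()` returns `Some`, and otherwise build the reader without one.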

Describe alternatives you've considered

The required Arrow schema could be passed to a builder method while constructing the reader, directly becoming the schema for the produced record batches. This would duplicate the role of the reader schema option, and it's not clear how the two options would interact if used together. It is much easier to reduce the problem to applying an Avro reader schema, so that the core of the schema resolution behavior can be coded in one place and adhere to the Avro specification.

Additional context

The wider topic of schema evolution and other adaptations, not specific to Avro, is discussed in #6735.

Labels: enhancement (Any new improvement worthy of an entry in the changelog)
