A way to resolve an Avro reader schema based on the writer schema and the required Arrow schema #9575

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Applications that implement schema evolution in a layer above Avro, such as Iceberg, need to resolve the Arrow schema required for the output record batches against the writer schema stored in the Avro files.

One example is datafusion-comet: Spark passes down a SQL schema for reading data from Avro files. That schema is converted to an Arrow schema for the "native" reader written in Rust, which streams `RecordBatch` items with that schema.

**Describe the solution you'd like**

A utility function to resolve the reader schema from these arguments:

1. The Avro writer schema, which can be read from e.g. an OCF file using the `HeaderInfo` API added in #9548,
2. The required Arrow schema for the output batches.

The resulting Avro schema can be used to construct an Avro reader using the `with_reader_schema` builder method.

The recursive schema resolution should behave as follows where it differs from the Avro resolution rules:

* Struct/record fields are matched by name. Reordering of fields is allowed and should be reported as a modification of the writer schema.
* Fields in the Arrow schema not present in the writer schema are added to the reader schema with the conventional nullable union type and the default value of `null`. It is an error if an added field is not nullable.
* As Arrow struct types are anonymous, the resulting record type in the reader schema receives the name and namespace attributes of the corresponding record type in the writer schema.
* The name of an Arrow list item field is not attested in the corresponding Avro array schema. The record batch will have the "item" list field as [currently hardcoded](https://github.com/apache/arrow-rs/blob/88422cbdcbfa8f4e2411d66578dd3582fafbf2a1/arrow-avro/src/reader/record.rs#L462). A metadata-based approach for round-tripping could be proposed separately.
* Likewise, the field names of the key-value struct that makes up an Arrow map are not attested in Avro.
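To make the record-level rules above concrete, here is a minimal sketch using simplified stand-in types. `ArrowField`, `AvroField`, `AvroRecord`, and `resolve_record` are all hypothetical names invented for illustration; they are not arrow-rs or arrow-avro APIs. The sketch also assumes the Arrow-to-Avro type mapping has already been done, and it does not recurse into nested types.

```rust
/// Hypothetical stand-in for an Arrow field. `avro_type` is the Avro type
/// name the Arrow data type is assumed to map to (mapping is out of scope).
#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    nullable: bool,
    avro_type: String,
}

/// Hypothetical stand-in for a field of an Avro record schema.
#[derive(Debug, Clone, PartialEq)]
struct AvroField {
    name: String,
    avro_type: String,
    /// True when the field declares `"default": null`.
    default_null: bool,
}

/// Hypothetical stand-in for an Avro record schema.
#[derive(Debug, Clone, PartialEq)]
struct AvroRecord {
    name: String,
    namespace: Option<String>,
    fields: Vec<AvroField>,
}

/// Builds a reader record from the writer record and the required Arrow fields.
fn resolve_record(writer: &AvroRecord, required: &[ArrowField]) -> Result<AvroRecord, String> {
    let mut fields = Vec::with_capacity(required.len());
    // Iterate in Arrow order, so the reader schema follows the requested
    // field order; writer fields that are not requested are left out.
    for f in required {
        match writer.fields.iter().find(|w| w.name == f.name) {
            // Matched by name: this sketch keeps the writer's field as-is.
            Some(w) => fields.push(w.clone()),
            // Added field: nullable union type with a default of null.
            None if f.nullable => fields.push(AvroField {
                name: f.name.clone(),
                avro_type: format!("[\"null\", \"{}\"]", f.avro_type),
                default_null: true,
            }),
            // An added field that is not nullable is an error.
            None => {
                return Err(format!(
                    "field `{}` is not in the writer schema and is not nullable",
                    f.name
                ))
            }
        }
    }
    // Arrow structs are anonymous, so name and namespace come from the writer.
    Ok(AvroRecord {
        name: writer.name.clone(),
        namespace: writer.namespace.clone(),
        fields,
    })
}
```

A real implementation would apply the same matching recursively to nested records, arrays, and maps, and would also check type compatibility of matched fields.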

The resolution function should provide a way to determine whether the resulting schema differs from the writer schema, so that an exact match can be processed in a fast path without schema resolution.
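One possible shape for reporting the difference is a return type that distinguishes the two cases. The sketch below is only an illustration: `Resolution`, `classify`, and `reader_schema` are hypothetical names, and the schema type is left generic so that any type with equality can stand in for an Avro schema.

```rust
/// Hypothetical result of resolution: either the writer schema can be used
/// unchanged, or a distinct reader schema has to be applied.
#[derive(Debug, PartialEq)]
enum Resolution<S> {
    /// The resolved schema equals the writer schema: fast path.
    Identical,
    /// The resolved schema differs (this includes field reordering).
    Modified(S),
}

impl<S> Resolution<S> {
    /// Returns the reader schema to apply, or `None` on the fast path.
    fn reader_schema(&self) -> Option<&S> {
        match self {
            Resolution::Identical => None,
            Resolution::Modified(schema) => Some(schema),
        }
    }
}

/// Compares the resolved schema against the writer schema.
fn classify<S: PartialEq>(writer: &S, resolved: S) -> Resolution<S> {
    if *writer == resolved {
        Resolution::Identical
    } else {
        Resolution::Modified(resolved)
    }
}
```

With this shape, a caller would only set a reader schema on the reader builder when `reader_schema()` returns `Some`. An implementation could also track modifications during resolution instead of comparing afterwards.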

**Describe alternatives you've considered**

The required Arrow schema could be passed to a builder method while constructing the reader, directly becoming the schema for the produced record batches. This would overlap with the reader schema option, and it's not clear how the two options would interact if used together. It is much easier to reduce the problem to applying an Avro reader schema, so that the core of the schema resolution behavior can be implemented in one place and adhere to the Avro specification.

**Additional context**

The wider topic of schema evolution and other adaptations, not specific to Avro, is discussed in #6735.
