You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In applications implementing evolution of data schemas abstracted above Avro, such as Iceberg, there is a need to resolve the Arrow schema that is required for the output record batches against the writer schema in the Avro files.
One example is datafusion-comet, where Spark passes down a SQL schema to read data from Avro files, which is converted to an Arrow schema for the "native" reader written in Rust, which streams RecordBatch items with that schema.
Describe the solution you'd like
A utility function to resolve the reader schema from these arguments:
The resulting Avro schema can be used to construct an Avro reader using the with_reader_schema builder method.
The behavior of the recursive schema resolution, where it differs from Avro resolution rules:
Struct/record fields are matched by name. Reordering of fields is allowed and should be reported as a modification of the writer schema.
Fields in the Arrow schema not present in the writer schema are added to the reader schema with the conventional nullable union type and the default value of null. It is an error if an added field is not nullable.
As Arrow struct types are anonymous, the resulting record type in the reader schema receives the name and namespace attributes of the corresponding record type in the writer schema.
The name of an Arrow list item field is not attested in the corresponding Avro array schema. The record batch will have the "item" list field as currently hardcoded. A metadata approach for round-tripping can be further proposed.
Likewise, the names in the KV struct field making a map are not attested in Avro.
The resolution function should provide a way to determine if the resulting schema differs from the writer schema, so that the exact match could be processed in a fast path with no schema resolution.
Describe alternatives you've considered
The required Arrow schema could be passed to a builder method while constructing the reader, directly becoming the schema for the produced record batches. This would create duality with the reader schema option, and it's not clear how the two options would interact if used together. It is much easier to reduce the problem to applying an Avro reader schema, so that the core of schema resolution behavior could be coded in one place and adhere to the Avro specification.
Additional context
The wider topic of schema evolution and other adaptations, not specific to Avro, is discussed in #6735.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In applications implementing evolution of data schemas abstracted above Avro, such as Iceberg, there is a need to resolve the Arrow schema that is required for the output record batches against the writer schema in the Avro files.
One example is datafusion-comet, where Spark passes down a SQL schema to read data from Avro files, which is converted to an Arrow schema for the "native" reader written in Rust, which streams RecordBatch items with that schema.
Describe the solution you'd like
A utility function to resolve the reader schema from these arguments:
HeaderInfoAPI added in feat(arrow-avro):HeaderInfoto expose OCF header #9548,The resulting Avro schema can be used to construct an Avro reader using the
with_reader_schemabuilder method.The behavior of the recursive schema resolution, where it differs from Avro resolution rules:
null. It is an error if an added field is not nullable.The resolution function should provide a way to determine if the resulting schema differs from the writer schema, so that the exact match could be processed in a fast path with no schema resolution.
Describe alternatives you've considered
The required Arrow schema could be passed to a builder method while constructing the reader, directly becoming the schema for the produced record batches. This would create duality with the reader schema option, and it's not clear how the two options would interact if used together. It is much easier to reduce the problem to applying an Avro reader schema, so that the core of schema resolution behavior could be coded in one place and adhere to the Avro specification.
Additional context
The wider topic of schema evolution and other adaptations, not specific to Avro, is discussed in #6735.