native_datafusion: no error thrown for schema mismatch when reading Parquet with incompatible types #3720

@andygrove

Description

When running Spark SQL tests with spark.comet.scan.impl=native_datafusion, several tests that expect an error on schema mismatch never see one: DataFusion reads the data without throwing an exception, whereas Spark would throw a SparkException.

This affects cases where the read schema is incompatible with the physical Parquet schema:

  • Reading binary data as timestamp
  • Reading int as long
  • Reading TimestampLTZ as TimestampNTZ
  • Reading decimals with incompatible precision/scale
  • Reading int as bigint (row group skipping overflow)
  • Schema mismatch on vectorized reader (e.g., string column read as int)
  • Reading timestamp_ntz as array<timestamp_ntz>

DataFusion's Parquet reader is more permissive than Spark's: it silently coerces or reads mismatched types instead of raising an error.

Affected Tests

From ParquetIOSuite:

  • "SPARK-35640: read binary as timestamp should throw schema incompatible error"
  • "SPARK-35640: int as long should throw schema incompatible error"

From ParquetQuerySuite (via ParquetV1QuerySuite/ParquetV2QuerySuite):

  • "SPARK-36182: can't read TimestampLTZ as TimestampNTZ"
  • "SPARK-34212 Parquet should read decimals correctly"
  • "row group skipping doesn't overflow when reading into larger type"

From ParquetSchemaSuite:

  • "schema mismatch failure error message for parquet vectorized reader"
  • "SPARK-45604: schema mismatch failure error on timestamp_ntz to array<timestamp_ntz>"

Expected Behavior

DataFusion should detect schema incompatibilities and throw appropriate errors, matching Spark's behavior of rejecting incompatible type reads.
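
To make the expected behavior concrete, here is a minimal sketch of a strict read check, written in plain Python rather than against Comet or DataFusion code. Everything in it is hypothetical and for illustration only: the `SchemaMismatchError` type (standing in for SparkException), the `ALLOWED_READS` allow-list, the type-name strings, and the `check_read` helper. The allow-list is a toy, not Spark's actual compatibility rules, and it leaves out decimals entirely.

```python
class SchemaMismatchError(Exception):
    """Hypothetical stand-in for the SparkException that Spark raises."""


# Hypothetical allow-list of (physical Parquet type, requested read type) pairs.
# Any pair not listed here is rejected. Illustrative only: this is NOT Spark's
# real rule set, and decimal precision/scale checks are omitted.
ALLOWED_READS = {
    ("int32", "int"),
    ("int64", "long"),
    ("binary", "binary"),
    ("binary", "string"),
    ("timestamp_ltz", "timestamp_ltz"),
    ("timestamp_ntz", "timestamp_ntz"),
}


def check_read(column, physical, requested):
    """Raise instead of silently coercing when the pair is not allow-listed."""
    if (physical, requested) not in ALLOWED_READS:
        raise SchemaMismatchError(
            f"column {column!r}: cannot read Parquet type "
            f"{physical} as {requested}"
        )


# Mismatches taken from the list in this issue; each should be rejected.
MISMATCHES = [
    ("binary", "timestamp_ltz"),
    ("int32", "long"),
    ("timestamp_ltz", "timestamp_ntz"),
    ("binary", "int"),
    ("timestamp_ntz", "array<timestamp_ntz>"),
]

for physical, requested in MISMATCHES:
    try:
        check_read("c", physical, requested)
    except SchemaMismatchError as err:
        print(err)
```

The point of the sketch is the default: a pair is rejected unless it is explicitly allowed, which is the opposite of the permissive behavior described above. Where such a check would live in the native_datafusion scan path is not something this issue specifies.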

Parent Issue

Split from #3311.
