[SPARK-55897][SQL][4.0] Handle UserDefinedType in ColumnarRow, ColumnarBatchRow, and ColumnarArray get() #55990

Open
james-willis wants to merge 1 commit into
apache:branch-4.0 from
james-willis:backport-SPARK-55897-4.0

@james-willis
Contributor

Backport of #54701 to branch-4.0.

What changes were proposed in this pull request?

ColumnarRow.get(), ColumnarBatchRow.get(), and ColumnarArray.get() throw SparkUnsupportedOperationException when called with a UserDefinedType because they have no branch to handle UDTs.

This PR adds UDT handling to all three methods:

  • ColumnarRow and ColumnarBatchRow: Add an instanceof UserDefinedType branch that recurses with udt.sqlType(), matching the pattern already used in SpecializedGettersReader.read().
  • ColumnarArray: Change the handleUserDefinedType flag from false to true in the existing call to SpecializedGettersReader.read().
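
The dispatch pattern described in the list above can be sketched in isolation. This is a minimal, self-contained illustration and not the actual Spark code: `DataType`, `IntType`, `StringType`, `UserDefinedType`, `CelsiusUdt`, and `Row` are hypothetical stand-ins defined here, and the toy row is backed by an `Object[]` rather than column vectors. The only part that mirrors the PR description is the `instanceof UserDefinedType` branch that recurses with `sqlType()`.

```java
// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
final class UdtGetSketch {
    interface DataType {}
    static final class IntType implements DataType {}
    static final class StringType implements DataType {}

    // A user-defined type is stored using an underlying SQL type.
    static abstract class UserDefinedType implements DataType {
        abstract DataType sqlType();
    }

    // Example UDT backed by an integer (an assumption made for this sketch).
    static final class CelsiusUdt extends UserDefinedType {
        @Override
        DataType sqlType() {
            return new IntType();
        }
    }

    // Toy row backed by an Object[]; a real columnar row reads from column vectors.
    static final class Row {
        private final Object[] values;

        Row(Object... values) {
            this.values = values;
        }

        Object get(int ordinal, DataType dataType) {
            if (dataType instanceof IntType) {
                return (Integer) values[ordinal];
            } else if (dataType instanceof StringType) {
                return (String) values[ordinal];
            } else if (dataType instanceof UserDefinedType) {
                // The added branch: unwrap the UDT and retry with its SQL type.
                return get(ordinal, ((UserDefinedType) dataType).sqlType());
            } else {
                throw new UnsupportedOperationException(
                    "Datatype not supported: " + dataType.getClass().getSimpleName());
            }
        }
    }

    public static void main(String[] args) {
        Row row = new Row(21, "a");
        // Reading column 0 through the UDT resolves to the IntType accessor.
        System.out.println(row.get(0, new CelsiusUdt())); // prints 21
    }
}
```

Recursing rather than special-casing each backing type means a UDT backed by a struct, array, or another UDT reuses whatever branch already handles that underlying type.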

Why are the changes needed?

The codegen path (CodeGenerator.getValue()) unwraps udt.sqlType() before generating accessor calls, so UDT columns work when whole-stage codegen is active. However, on the interpreted eval path (when codegen is disabled, falls back, or the number of fields exceeds spark.sql.codegen.maxFields), GetStructField.nullSafeEval calls ColumnarRow.get(ordinal, udtType) directly, which hits the unhandled branch and throws.
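
The difference between the two paths can be modeled with a toy accessor that, like the pre-fix `get()`, has no UDT branch. This is a rough sketch under stated assumptions: every name below (`EvalPathSketch`, `ToyIntUdt`, `getWithoutUdtBranch`, `codegenStyleRead`, `interpretedStyleRead`) is hypothetical, and the callers only imitate the behavior described above rather than reproducing Spark's codegen or `GetStructField`.

```java
// Hypothetical model of the two call paths; none of these names exist in Spark.
final class EvalPathSketch {
    interface DataType {}
    static final class IntType implements DataType {}

    static abstract class UserDefinedType implements DataType {
        abstract DataType sqlType();
    }

    // Toy UDT backed by an integer.
    static final class ToyIntUdt extends UserDefinedType {
        @Override
        DataType sqlType() {
            return new IntType();
        }
    }

    // Models the pre-fix accessor: there is no branch for user-defined types.
    static Object getWithoutUdtBranch(Object[] values, int ordinal, DataType dataType) {
        if (dataType instanceof IntType) {
            return values[ordinal];
        }
        throw new UnsupportedOperationException(
            "Datatype not supported: " + dataType.getClass().getSimpleName());
    }

    // Codegen-style caller: resolves the UDT to its SQL type before choosing an accessor.
    static Object codegenStyleRead(Object[] values, int ordinal, DataType dataType) {
        DataType resolved = dataType;
        while (resolved instanceof UserDefinedType) {
            resolved = ((UserDefinedType) resolved).sqlType();
        }
        return getWithoutUdtBranch(values, ordinal, resolved);
    }

    // Interpreted-style caller: passes the declared type straight through.
    static Object interpretedStyleRead(Object[] values, int ordinal, DataType dataType) {
        return getWithoutUdtBranch(values, ordinal, dataType);
    }

    public static void main(String[] args) {
        Object[] values = {7};
        DataType udt = new ToyIntUdt();
        System.out.println(codegenStyleRead(values, 0, udt)); // prints 7
        try {
            interpretedStyleRead(values, 0, udt);
        } catch (UnsupportedOperationException e) {
            System.out.println("interpreted path threw: " + e.getMessage());
        }
    }
}
```

Because the unwrapping lived only in the caller on one path, the same column could succeed or fail depending on which path evaluated it, which is why the PR moves the handling into `get()` itself.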

Does this PR introduce any user-facing change?

Yes. UDT columns in columnar data sources (e.g., Parquet) now work correctly on the interpreted evaluation path. Previously they would throw SparkUnsupportedOperationException.

How was this patch tested?

Added 6 new tests in ColumnarBatchSuite covering all 3 methods x 2 UDT backing types (primitive IntegerType and complex StructType). Each test creates columnar vectors with UDT data and verifies that get() returns the correct value. Two helper UDT classes (TestIntUDT, TestStructWrapperUDT) are defined for the tests.

Cherry-picked from 472735c on master. The cherry-pick had a trivial conflict in ColumnarBatchSuite.scala: the neighboring [SPARK-55552] Variant test exists on branch-4.1+ but not on branch-4.0, so its insertion point was contested. Resolved by keeping only the SPARK-55897 tests (the Variant test is unrelated).

Was this patch authored or co-authored using generative AI tooling?

Yes. Opus 4.6

…chRow, and ColumnarArray get()

Closes apache#54701 from james-willis/columnar-row-udt-test.

Authored-by: jameswillis <james@wherobots.com>
Signed-off-by: Huaxin Gao <huaxin.gao11@gmail.com>
(cherry picked from commit 472735c)
@james-willis
Contributor Author

@huaxingao here is the 4.0 port.

Contributor

@huaxingao huaxingao left a comment

LGTM

@huaxingao
Contributor

@james-willis Could you check why the CI failed?

@james-willis
Contributor Author

@huaxingao It was that flaky Protobuf breaking change action. A retry fixed it.
