[SPARK-56550][SQL][4.2] Support source with fewer columns/fields in INSERT INTO WITH SCHEMA EVOLUTION by szehon-ho · Pull Request #56002 · apache/spark

szehon-ho · 2026-05-20T00:26:53Z

Summary

Backport of #55427 to branch-4.2.

Adds support for INSERT INTO ... WITH SCHEMA EVOLUTION to fill missing nested struct fields with null (or column defaults) when the source has fewer fields than the target table, mirroring existing MERGE INTO behavior gated by spark.sql.mergeNestedTypeCoercion.enabled.

Key changes:

New config: spark.sql.insertNestedTypeCoercion.enabled (internal, default false)
Refactor TableOutputResolver.resolveOutputColumns to use DefaultValueFillMode (NONE, FILL, RECURSE)
Enable RECURSE mode for V2 inserts when schema evolution and the coercion flag are both enabled
17 new tests in InsertIntoSchemaEvolutionTests (via InsertIntoTests.scala)
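
The fill modes listed above can be modelled outside Spark. The following is a plain-Python sketch, not Spark code. The mode names NONE, FILL and RECURSE come from this PR, but the semantics shown are an assumption made for illustration: FILL fills only missing top-level columns, while RECURSE also descends into nested structs. `fill_missing` and the dict-based schema are hypothetical.

```python
from enum import Enum


class DefaultValueFillMode(Enum):
    # Names taken from the PR; the semantics below are assumed for illustration.
    NONE = "none"        # never fill; a missing column or field is an error
    FILL = "fill"        # fill missing top-level columns only
    RECURSE = "recurse"  # also fill missing fields inside nested structs


def fill_missing(target_schema, row, mode, top_level=True):
    """Align `row` (a dict) to `target_schema` by name.

    `target_schema` maps a field name to None (leaf) or to a nested dict
    (struct). Missing fields become None when `mode` permits it. Extra
    source fields are ignored in this sketch.
    """
    may_fill = (mode is DefaultValueFillMode.RECURSE
                or (mode is DefaultValueFillMode.FILL and top_level))
    out = {}
    for name, sub_schema in target_schema.items():
        if name not in row:
            if not may_fill:
                raise ValueError(f"cannot find data for {name!r}")
            out[name] = None
        elif isinstance(sub_schema, dict) and row[name] is not None:
            out[name] = fill_missing(sub_schema, row[name], mode, top_level=False)
        else:
            out[name] = row[name]
    return out


# Target has info.zip; the older source row omits it.
target = {"id": None, "info": {"city": None, "zip": None}}
older_source_row = {"id": 1, "info": {"city": "Oslo"}}

filled = fill_missing(target, older_source_row, DefaultValueFillMode.RECURSE)
print(filled)
```

Under these assumptions, RECURSE yields a row whose `info.zip` is null, while FILL raises on the same input because the missing field is nested rather than top-level.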

Why are the changes needed?

MERGE INTO already supports nested type coercion when the source has fewer struct fields than the target. INSERT INTO WITH SCHEMA EVOLUTION lacked this capability, causing errors for legitimate schema-evolution workflows where older sources omit newer nested fields.

Does this PR introduce any user-facing change?

Yes. When spark.sql.insertNestedTypeCoercion.enabled is set to true (default false), INSERT INTO ... WITH SCHEMA EVOLUTION fills missing nested struct fields with null (or column defaults) instead of failing.

How was this patch tested?

Cherry-picked cleanly (no conflicts) from #55427 onto current branch-4.2 (bd8872a0cc7).

Original PR test plan:

Added comprehensive positive/negative tests in InsertIntoTests.scala
All matched tests passed on master

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4)

…SERT INTO WITH SCHEMA EVOLUTION

Add support for INSERT INTO WITH SCHEMA EVOLUTION to fill missing nested struct fields with null (or column defaults) when the source has fewer fields than the target, mirroring existing MERGE INTO behavior.

Changes:
- Add spark.sql.insertNestedTypeCoercion.enabled config flag (default false)
- Refactor TableOutputResolver.resolveOutputColumns to accept DefaultValueFillMode enum directly instead of two overlapping boolean parameters
- Enable RECURSE mode for V2 inserts when both schema evolution and the config flag are active
- Add comprehensive tests for all scenarios

…umn tests

…on coercion

Propagate fillDefaultValue through resolveArrayType and resolveMapType by-position paths; use applyColumnMetadata for trailing default fills; clarify Analyzer and SQLConf docs; extend DefaultValueFillMode scaladoc; fix by-name negative test (with USE_NULLS_FOR_MISSING_DEFAULT_COLUMN_VALUES disabled) and add by-position array/map nested struct tests.

…s no default

When insert nested coercion + schema evolution fills by-position trailing columns/fields, getDefaultValueExprOrNullLit can return None (e.g. nullable column with useNullsForMissingDefaultColumnValues=false and no explicit DEFAULT). The previous flatMap silently dropped those targets; mirror the by-name path by throwing INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA. Add regression tests for top-level and nested struct by-position cases.
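
The difference between silently dropping a target and throwing can be seen in a small model. This is a plain-Python sketch under stated assumptions, not the Spark implementation: `default_or_null`, `fill_trailing_dropping`, `fill_trailing_strict` and `CannotFindData` are hypothetical names, and the column dicts are invented for illustration. Only the error name INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA is taken from the commit message.

```python
class CannotFindData(Exception):
    """Stand-in for INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA."""


def default_or_null(column, use_nulls_for_missing_defaults):
    """Return a 1-tuple holding the fill value, or None when there is none.

    Hypothetical model of a helper that may yield no fill expression: a
    column without an explicit default only gets NULL when the flag allows it.
    """
    if "default" in column:
        return (column["default"],)
    if use_nulls_for_missing_defaults and column["nullable"]:
        return (None,)
    return None


def fill_trailing_dropping(values, columns, use_nulls):
    # Old behaviour as described: targets with no fill value vanish silently.
    out = list(values)
    for column in columns[len(values):]:
        fill = default_or_null(column, use_nulls)
        if fill is not None:
            out.append(fill[0])
    return out


def fill_trailing_strict(values, columns, use_nulls):
    # Fixed behaviour as described: a target with no fill value is an error.
    out = list(values)
    for column in columns[len(values):]:
        fill = default_or_null(column, use_nulls)
        if fill is None:
            raise CannotFindData(column["name"])
        out.append(fill[0])
    return out


columns = [
    {"name": "id", "nullable": False},
    {"name": "score", "nullable": True, "default": 0},
    {"name": "note", "nullable": True},  # no explicit DEFAULT
]

# The source supplies only `id`; `score` and `note` are trailing targets.
dropped = fill_trailing_dropping([1], columns, use_nulls=False)
print(len(dropped), "values for", len(columns), "columns")
```

In this model the dropping variant returns two values for three target columns, which is the kind of mismatch the fix turns into an explicit error.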

…umn drops

After fixing trailing by-position default fill (throw when no default), add:
- Post-check on resolveColumnsByPosition: result length must match expected arity
- enforceFullOutput on reorderColumnsByName and nested struct/array/map resolvers: INSERT (resolveOutputColumns) throws on incomplete resolution; MERGE resolveUpdate keeps enforceFullOutput=false so getOrElse fallback semantics are unchanged.

Scalastyle: argcount off for the three nested resolver methods.
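
The two safeguards described in this commit can be sketched as follows. This is a plain-Python illustration, not Spark code: `resolve_by_name` and `check_arity` are hypothetical helpers, and only the `enforce_full_output` parameter mirrors a name from the commit message. Returning None stands in for the Option-based fallback that MERGE is said to keep.

```python
def resolve_by_name(source, target_names, enforce_full_output):
    """Resolve each target name from `source` (a dict).

    Returns the resolved values, or None when resolution is incomplete and
    `enforce_full_output` is False, so the caller can fall back.
    """
    resolved = [source[name] for name in target_names if name in source]
    if len(resolved) == len(target_names):
        return resolved
    if enforce_full_output:
        missing = [name for name in target_names if name not in source]
        raise ValueError(f"incomplete resolution, missing: {missing}")
    return None


def check_arity(resolved, expected_arity):
    # Post-check: the resolved output must cover every expected column.
    if len(resolved) != expected_arity:
        raise ValueError(
            f"resolved {len(resolved)} columns, expected {expected_arity}")
    return resolved


source = {"a": 1}
target_names = ["a", "b"]

# MERGE-style caller: incomplete resolution falls back instead of throwing.
merge_style = resolve_by_name(source, target_names, enforce_full_output=False)
merge_result = merge_style if merge_style is not None else "fallback"
print(merge_result)
```

An INSERT-style caller would pass `enforce_full_output=True` and get an exception for the same input, and `check_arity` would reject any result that is shorter than the target.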

Align MERGE assignment resolution with INSERT: resolveStructType/Array/Map now pass enforceFullOutput=true so incomplete nested resolution throws instead of falling back via Option. MergeInto schema evolution suites (824 tests across Group/Delta SQL+Scala) and MergeIntoDataFrameSuite nested struct tests pass.

Revert the 2c56ade change that made incomplete nested resolution throw for MERGE/UPDATE, restoring addError-based cast failure messages in AlignMergeAssignmentsSuite and AlignUpdateAssignmentsSuite. Align two INSERT negative tests with checkError for CANNOT_FIND_DATA.

szehon-ho · 2026-05-20T00:28:45Z

@huaxingao @cloud-fan is it ok for this to go into 4.2 as well? Thanks

huaxingao · 2026-05-20T01:30:48Z

@szehon-ho @cloud-fan

I plan to cut RC1 tomorrow. Can you evaluate the risk of merging this at the last minute? Would it be okay to wait for 4.2.1?

cloud-fan · 2026-05-20T13:06:20Z

From the dev list, the RC1 cut is delayed to the end of this week, and this PR is needed to fix the behavior of an unreleased feature (flag off by default), so there is nearly no risk.

cloud-fan · 2026-05-20T13:06:29Z

thanks, merging to 4.2!

…NSERT INTO WITH SCHEMA EVOLUTION

Closes #56002 from szehon-ho/insert-schema-evolution-missing-fields-4.2.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

szehon-ho added 10 commits May 19, 2026 17:26

Move DefaultValueFillMode import to top of file

b5174c0

Cleanup

8cc43e2

Address review comments: add clarifying comment and extra+missing column tests

4aff22f

Add insertNestedTypeCoercion.enabled to binding policy exceptions

77f74eb

szehon-ho changed the title ~~[SPARK-56550][SQL] Support source with fewer columns/fields in INSERT INTO WITH SCHEMA EVOLUTION (branch-4.2)~~ [SPARK-56550][SQL][4.2] Support source with fewer columns/fields in INSERT INTO WITH SCHEMA EVOLUTION May 20, 2026

szehon-ho mentioned this pull request May 20, 2026

[SPARK-56550][SQL] Support source with fewer columns/fields in INSERT INTO WITH SCHEMA EVOLUTION #55427

Closed

cloud-fan approved these changes May 20, 2026

cloud-fan closed this May 20, 2026

szehon-ho wants to merge 10 commits into apache:branch-4.2 from szehon-ho:insert-schema-evolution-missing-fields-4.2

3 participants
