[SPARK-56550][SQL][4.2] Support source with fewer columns/fields in INSERT INTO WITH SCHEMA EVOLUTION #56002

Closed
szehon-ho wants to merge 10 commits into apache:branch-4.2 from szehon-ho:insert-schema-evolution-missing-fields-4.2

@szehon-ho
Member

Summary

Backport of #55427 to branch-4.2.

Adds support for INSERT INTO ... WITH SCHEMA EVOLUTION to fill missing nested struct fields with null (or column defaults) when the source has fewer fields than the target table, mirroring existing MERGE INTO behavior gated by spark.sql.mergeNestedTypeCoercion.enabled.

Key changes:

  • New config: spark.sql.insertNestedTypeCoercion.enabled (internal, default false)
  • Refactor TableOutputResolver.resolveOutputColumns to use DefaultValueFillMode (NONE, FILL, RECURSE)
  • Enable RECURSE mode for V2 inserts when schema evolution and the coercion flag are both enabled
  • 17 new tests in InsertIntoSchemaEvolutionTests (via InsertIntoTests.scala)
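
The gating in the key changes above can be modeled roughly as follows. This is a minimal Python sketch written to run standalone, not Spark's Scala code. The three mode names come from the PR; their one-line meanings, the `select_fill_mode` helper, its parameters, and the `base_mode` fallback are assumptions made for illustration.

```python
from enum import Enum


class DefaultValueFillMode(Enum):
    """Illustrative mirror of the three modes named in the PR."""
    NONE = "none"        # assumed meaning: never fill missing columns
    FILL = "fill"        # assumed meaning: fill missing top-level columns only
    RECURSE = "recurse"  # assumed meaning: also fill missing nested struct fields


def select_fill_mode(
    is_v2_insert: bool,
    with_schema_evolution: bool,
    insert_nested_coercion_enabled: bool,
    base_mode: DefaultValueFillMode = DefaultValueFillMode.FILL,
) -> DefaultValueFillMode:
    """Hypothetical helper: upgrade to RECURSE only when every gate is open."""
    if is_v2_insert and with_schema_evolution and insert_nested_coercion_enabled:
        return DefaultValueFillMode.RECURSE
    # Otherwise keep whatever mode the caller would have used anyway.
    return base_mode
```

A single enum argument avoids the invalid combinations that two overlapping booleans allow, which is the motivation the commit message gives for the refactor.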

Why are the changes needed?

MERGE INTO already supports nested type coercion when the source has fewer struct fields than the target. INSERT INTO WITH SCHEMA EVOLUTION lacked this capability, causing errors for legitimate schema-evolution workflows where older sources omit newer nested fields.

Does this PR introduce any user-facing change?

Yes. When spark.sql.insertNestedTypeCoercion.enabled is set to true (default false), INSERT INTO ... WITH SCHEMA EVOLUTION fills missing nested struct fields with null instead of failing.
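
To make the user-facing effect concrete, here is a toy Python model of by-name resolution with and without the flag. It is a sketch under simplifying assumptions: schemas are plain dicts rather than Spark `StructType`s, column defaults are ignored (only null is filled), and the `fill_missing_fields` helper and its error message are hypothetical.

```python
from typing import Any, Dict, Optional

# Toy schema model (not Spark's StructType): {field_name: nested_schema_or_None},
# where None marks a leaf (non-struct) field.
Schema = Dict[str, Optional["Schema"]]


def fill_missing_fields(
    source: Dict[str, Any], target: Schema, coercion_enabled: bool
) -> Dict[str, Any]:
    """Align `source` to `target` by name, null-filling fields the source omits."""
    out: Dict[str, Any] = {}
    for name, nested in target.items():
        if name not in source:
            if not coercion_enabled:
                # Simplification: without the flag, any missing field is an error.
                raise ValueError(f"Cannot find data for field {name!r}")
            out[name] = None
        elif nested is not None and source[name] is not None:
            # Struct field present in the source: recurse into it.
            out[name] = fill_missing_fields(source[name], nested, coercion_enabled)
        else:
            out[name] = source[name]
    return out


target = {"id": None, "info": {"name": None, "email": None}}
old_row = {"id": 1, "info": {"name": "a"}}  # an older source without `email`
filled = fill_missing_fields(old_row, target, coercion_enabled=True)
# filled == {"id": 1, "info": {"name": "a", "email": None}}
```

With `coercion_enabled=False` the same call raises, which corresponds to the failure the PR describes for the default configuration.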

How was this patch tested?

Cherry-picked cleanly (no conflicts) from #55427 onto current branch-4.2 (bd8872a0cc7).

Original PR test plan:

  • Added comprehensive positive/negative tests in InsertIntoTests.scala
  • All matched tests passed on master

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4)

szehon-ho added 10 commits May 19, 2026 17:26
…SERT INTO WITH SCHEMA EVOLUTION

Add support for INSERT INTO WITH SCHEMA EVOLUTION to fill missing nested struct
fields with null (or column defaults) when the source has fewer fields than the
target, mirroring existing MERGE INTO behavior.

Changes:
- Add spark.sql.insertNestedTypeCoercion.enabled config flag (default false)
- Refactor TableOutputResolver.resolveOutputColumns to accept DefaultValueFillMode
  enum directly instead of two overlapping boolean parameters
- Enable RECURSE mode for V2 inserts when both schema evolution and the config
  flag are active
- Add comprehensive tests for all scenarios
…on coercion

Propagate fillDefaultValue through resolveArrayType and resolveMapType by-position
paths; use applyColumnMetadata for trailing default fills; clarify Analyzer and
SQLConf docs; extend DefaultValueFillMode scaladoc; fix by-name negative test
(with USE_NULLS_FOR_MISSING_DEFAULT_COLUMN_VALUES disabled) and add by-position
array/map nested struct tests.
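
As a rough illustration of what propagating the fill flag through array and map paths means, the Python sketch below resolves values by position and passes the same `fill` argument into array elements and map values. The type encoding (`"leaf"`, `("struct", ...)`, `("array", ...)`, `("map", ...)`) and the `resolve_by_position` function are assumptions for this example, not Spark's resolver.

```python
from typing import Any


def resolve_by_position(value: Any, target: Any, fill: bool) -> Any:
    """Pad short positional structs with None, at any nesting depth, when `fill` is set."""
    if target == "leaf" or value is None:
        return value
    kind, inner = target
    if kind == "array":
        # Pass `fill` down to the elements instead of dropping it.
        return [resolve_by_position(v, inner, fill) for v in value]
    if kind == "map":
        # Only map values are resolved in this toy model.
        return {k: resolve_by_position(v, inner, fill) for k, v in value.items()}
    # kind == "struct": `inner` is the ordered list of target field types.
    if len(value) > len(inner):
        raise ValueError("source struct has more fields than the target")
    if len(value) < len(inner) and not fill:
        raise ValueError("source struct has fewer fields than the target")
    resolved = [resolve_by_position(v, t, fill) for v, t in zip(value, inner)]
    return tuple(resolved + [None] * (len(inner) - len(value)))


array_of_structs = ("array", ("struct", ["leaf", "leaf"]))
padded = resolve_by_position([(1,), (2, "x")], array_of_structs, fill=True)
# padded == [(1, None), (2, "x")]
```

If the array or map branch ignored `fill`, structs nested inside collections would fail even though the same struct at top level would be padded.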
…s no default

When insert nested coercion + schema evolution fills by-position trailing
columns/fields, getDefaultValueExprOrNullLit can return None (e.g. nullable
column with useNullsForMissingDefaultColumnValues=false and no explicit
DEFAULT). The previous flatMap silently dropped those targets; mirror the
by-name path by throwing INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA.

Add regression tests for top-level and nested struct by-position cases.
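
The bug described above can be reproduced in miniature. In this Python sketch, `default_for` plays the role of `getDefaultValueExprOrNullLit` and returns `None` when no default expression is available; the function names, the `"NULL"` string standing in for a null literal, and the `CannotFindDataError` class are all hypothetical.

```python
from typing import Any, Callable, List, Optional, Sequence


class CannotFindDataError(Exception):
    """Stand-in for INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA."""


def fill_trailing_buggy(
    values: Sequence[Any],
    target_cols: Sequence[str],
    default_for: Callable[[str], Optional[Any]],
) -> List[Any]:
    """Mimics a flatMap over Option: targets without a default silently vanish."""
    filled = list(values)
    for col in target_cols[len(values):]:
        default = default_for(col)
        if default is not None:
            filled.append(default)
    return filled


def fill_trailing_fixed(
    values: Sequence[Any],
    target_cols: Sequence[str],
    default_for: Callable[[str], Optional[Any]],
) -> List[Any]:
    """Raises instead of dropping a target column that has no default."""
    filled = list(values)
    for col in target_cols[len(values):]:
        default = default_for(col)
        if default is None:
            raise CannotFindDataError(col)
        filled.append(default)
    return filled


defaults = {"c": "NULL"}  # only column `c` has a usable default in this example
cols = ["a", "b", "c", "d"]
short_output = fill_trailing_buggy([1, 2], cols, defaults.get)
# short_output has 3 entries for 4 target columns: `d` was dropped without an error
```

The fixed variant turns that silent arity mismatch into an explicit error at the first column that cannot be filled.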
…umn drops

After fixing trailing by-position default fill (throw when no default), add:
- Post-check on resolveColumnsByPosition: result length must match expected arity
- enforceFullOutput on reorderColumnsByName and nested struct/array/map resolvers:
  INSERT (resolveOutputColumns) throws on incomplete resolution; MERGE resolveUpdate
  keeps enforceFullOutput=false so getOrElse fallback semantics are unchanged.

Scalastyle: argcount off for the three nested resolver methods.
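
The two safeguards in this commit can be sketched as below. This is an illustrative Python model only: `reorder_by_name`, `check_arity`, and `IncompleteResolutionError` are hypothetical names, and returning `None` stands in for the Scala `Option` that the MERGE path falls back on.

```python
from typing import Any, Dict, List, Optional, Sequence


class IncompleteResolutionError(Exception):
    """Hypothetical stand-in for the error raised on incomplete output."""


def reorder_by_name(
    source: Dict[str, Any],
    target_names: Sequence[str],
    enforce_full_output: bool,
) -> Optional[List[Any]]:
    """Reorder source values into target order; handle gaps per `enforce_full_output`."""
    resolved = [source[name] for name in target_names if name in source]
    if len(resolved) == len(target_names):
        return resolved
    if enforce_full_output:
        # INSERT-style behavior: incomplete resolution is an error.
        missing = [name for name in target_names if name not in source]
        raise IncompleteResolutionError(f"missing fields: {missing}")
    # MERGE-style behavior: signal "not resolved" and let the caller fall back.
    return None


def check_arity(resolved: Sequence[Any], expected: int) -> List[Any]:
    """Post-check: the resolved output must have exactly `expected` entries."""
    if len(resolved) != expected:
        raise IncompleteResolutionError(
            f"expected {expected} columns, got {len(resolved)}"
        )
    return list(resolved)
```

Keeping the flag off for the MERGE path preserves its existing fallback, while INSERT gets a hard failure instead of a silently shortened row.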

Align MERGE assignment resolution with INSERT: resolveStructType/Array/Map
now pass enforceFullOutput=true so incomplete nested resolution throws
instead of falling back via Option.

MergeInto schema evolution suites (824 tests across Group/Delta SQL+Scala)
and MergeIntoDataFrameSuite nested struct tests pass.

Revert the 2c56ade change that made incomplete nested resolution throw for
MERGE/UPDATE, restoring addError-based cast failure messages in
AlignMergeAssignmentsSuite and AlignUpdateAssignmentsSuite. Align two INSERT
negative tests with checkError for CANNOT_FIND_DATA.

@szehon-ho changed the title from "[SPARK-56550][SQL] Support source with fewer columns/fields in INSERT INTO WITH SCHEMA EVOLUTION (branch-4.2)" to "[SPARK-56550][SQL][4.2] Support source with fewer columns/fields in INSERT INTO WITH SCHEMA EVOLUTION" on May 20, 2026

@szehon-ho
Member Author

@huaxingao @cloud-fan is it ok for this to go into 4.2 as well? Thanks

@huaxingao
Contributor

@szehon-ho @cloud-fan

I plan to cut RC1 tomorrow. Can you evaluate the risk of merging this at the last minute? Would it be okay to wait for 4.2.1?

@cloud-fan
Contributor

Per the dev list, the RC1 cut is delayed to the end of this week, and this PR is needed to fix the behavior of an unreleased feature (flag off by default), so there is nearly no risk.

@cloud-fan
Contributor

thanks, merging to 4.2!

cloud-fan pushed a commit that referenced this pull request May 20, 2026
…NSERT INTO WITH SCHEMA EVOLUTION

Closes #56002 from szehon-ho/insert-schema-evolution-missing-fields-4.2.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this May 20, 2026

3 participants