[SPARK-56550][SQL][4.2] Support source with fewer columns/fields in INSERT INTO WITH SCHEMA EVOLUTION #56002
Closed
szehon-ho wants to merge 10 commits into
…SERT INTO WITH SCHEMA EVOLUTION
Add support for INSERT INTO WITH SCHEMA EVOLUTION to fill missing nested struct fields with null (or column defaults) when the source has fewer fields than the target, mirroring existing MERGE INTO behavior.
Changes:
- Add spark.sql.insertNestedTypeCoercion.enabled config flag (default false)
- Refactor TableOutputResolver.resolveOutputColumns to accept DefaultValueFillMode enum directly instead of two overlapping boolean parameters
- Enable RECURSE mode for V2 inserts when both schema evolution and the config flag are active
- Add comprehensive tests for all scenarios
…on coercion
Propagate fillDefaultValue through the resolveArrayType and resolveMapType by-position paths; use applyColumnMetadata for trailing default fills; clarify Analyzer and SQLConf docs; extend DefaultValueFillMode scaladoc; fix the by-name negative test (with USE_NULLS_FOR_MISSING_DEFAULT_COLUMN_VALUES disabled) and add by-position array/map nested struct tests.
…s no default
When insert nested coercion plus schema evolution fills by-position trailing columns/fields, getDefaultValueExprOrNullLit can return None (e.g. a nullable column with useNullsForMissingDefaultColumnValues=false and no explicit DEFAULT). The previous flatMap silently dropped those targets; mirror the by-name path by throwing INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA instead. Add regression tests for the top-level and nested struct by-position cases.
…umn drops
After fixing trailing by-position default fill (throw when no default), add:
- Post-check on resolveColumnsByPosition: result length must match expected arity
- enforceFullOutput on reorderColumnsByName and nested struct/array/map resolvers: INSERT (resolveOutputColumns) throws on incomplete resolution; MERGE resolveUpdate keeps enforceFullOutput=false so getOrElse fallback semantics are unchanged.
Scalastyle: argcount off for the three nested resolver methods.
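The two commits above (throwing when a trailing target has no default, then checking the result arity) can be illustrated with a small toy model. This is a plain-Python sketch under stated assumptions, not Spark code: `Column`, `default_expr_or_null`, `fill_trailing_dropping`, and `fill_trailing_strict` are hypothetical names, expressions are represented as strings, and only the error-class text is taken from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional

NULL = "NULL"  # stand-in for a null literal expression


@dataclass(frozen=True)
class Column:
    """Hypothetical target column: name, nullability, optional DEFAULT text."""
    name: str
    nullable: bool = True
    default: Optional[str] = None


def default_expr_or_null(col: Column, use_nulls_for_missing: bool) -> Optional[str]:
    """Toy analogue of a lookup that may find no fill expression at all."""
    if col.default is not None:
        return col.default
    if col.nullable and use_nulls_for_missing:
        return NULL
    return None  # no explicit DEFAULT and null fill is disabled


def fill_trailing_dropping(values: List[str], target: List[Column], use_nulls: bool) -> List[str]:
    """Models the old behavior described above: unfillable targets vanish."""
    fills = (default_expr_or_null(c, use_nulls) for c in target[len(values):])
    return list(values) + [e for e in fills if e is not None]


def fill_trailing_strict(values: List[str], target: List[Column], use_nulls: bool) -> List[str]:
    """Models the fixed behavior: fail loudly, then verify the arity."""
    out = list(values)
    for col in target[len(values):]:
        expr = default_expr_or_null(col, use_nulls)
        if expr is None:
            raise ValueError(f"INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA: {col.name}")
        out.append(expr)
    if len(out) != len(target):  # post-check: one output per target column
        raise ValueError(f"expected {len(target)} columns, got {len(out)}")
    return out


target = [Column("a"), Column("b", default="42"), Column("c")]
print(fill_trailing_dropping(["x"], target, use_nulls=False))  # ['x', '42']
print(fill_trailing_strict(["x"], target, use_nulls=True))     # ['x', '42', 'NULL']
```

In this model the dropping variant returns two expressions for a three-column target, which is the kind of silent mismatch the arity post-check is meant to catch; the strict variant raises on column `c` when null fill is disabled.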
Align MERGE assignment resolution with INSERT: resolveStructType/Array/Map now pass enforceFullOutput=true so incomplete nested resolution throws instead of falling back via Option. MergeInto schema evolution suites (824 tests across Group/Delta SQL+Scala) and MergeIntoDataFrameSuite nested struct tests pass.
Revert the 2c56ade change that made incomplete nested resolution throw for MERGE/UPDATE, restoring addError-based cast failure messages in AlignMergeAssignmentsSuite and AlignUpdateAssignmentsSuite. Align two INSERT negative tests with checkError for CANNOT_FIND_DATA.
Member
Author
@huaxingao @cloud-fan is it OK for this to go into 4.2 as well? Thanks.
Contributor
I plan to cut RC1 tomorrow. Can you evaluate the risk of merging this at the last minute? Would it be okay to wait for 4.2.1?
cloud-fan approved these changes May 20, 2026
Contributor
From the dev list, the RC1 cut is delayed to the end of this week, and this PR is needed to fix the behavior of an unreleased feature (flag off by default), so there is almost no risk.
Contributor
Thanks, merging to 4.2!
cloud-fan pushed a commit that referenced this pull request May 20, 2026
…NSERT INTO WITH SCHEMA EVOLUTION

## Summary
Backport of #55427 to `branch-4.2`.

Adds support for `INSERT INTO ... WITH SCHEMA EVOLUTION` to fill missing nested struct fields with null (or column defaults) when the source has fewer fields than the target table, mirroring existing `MERGE INTO` behavior gated by `spark.sql.mergeNestedTypeCoercion.enabled`.

Key changes:
- New config: `spark.sql.insertNestedTypeCoercion.enabled` (internal, default `false`)
- Refactor `TableOutputResolver.resolveOutputColumns` to use `DefaultValueFillMode` (`NONE`, `FILL`, `RECURSE`)
- Enable `RECURSE` mode for V2 inserts when schema evolution and the coercion flag are both enabled
- 17 new tests in `InsertIntoSchemaEvolutionTests` (via `InsertIntoTests.scala`)

## Why are the changes needed?
`MERGE INTO` already supports nested type coercion when the source has fewer struct fields than the target. `INSERT INTO WITH SCHEMA EVOLUTION` lacked this capability, causing errors for legitimate schema-evolution workflows where older sources omit newer nested fields.

## Does this PR introduce _any_ user-facing change?
Yes. When `spark.sql.insertNestedTypeCoercion.enabled` is set to `true` (default `false`), `INSERT INTO ... WITH SCHEMA EVOLUTION` fills missing nested struct fields with null instead of failing.

## How was this patch tested?
Cherry-picked from #55427 onto current `branch-4.2` (`bd8872a0cc7`) with clean cherry-picks (no conflicts).

Original PR test plan:
- Added comprehensive positive/negative tests in `InsertIntoTests.scala`
- All matched tests passed on master

## Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4)

Closes #56002 from szehon-ho/insert-schema-evolution-missing-fields-4.2.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
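To make the three fill modes concrete, here is a toy model of by-name resolution in plain Python. It is a sketch under assumptions, not the `TableOutputResolver` implementation: the description names the modes `NONE`, `FILL`, and `RECURSE` but does not spell out their exact semantics, so the reading used here (`FILL` fills only missing top-level columns, `RECURSE` also fills missing fields inside nested structs) is an assumption. `FillMode` and `resolve_by_name` are hypothetical names, rows and schemas are plain dicts, and every fill is null rather than a column default.

```python
from enum import Enum


class FillMode(Enum):
    NONE = "none"        # assumed: never fill a missing column or field
    FILL = "fill"        # assumed: fill missing top-level columns only
    RECURSE = "recurse"  # assumed: also fill missing nested struct fields


def resolve_by_name(source, target, mode, top_level=True):
    """Align a source row (dict) to a target schema (dict).

    In `target`, a leaf column maps to None and a struct maps to a nested
    dict of its fields. Missing entries are filled with None when the mode
    allows it; otherwise resolution fails.
    """
    resolved = {}
    for name, nested in target.items():
        if name in source:
            value = source[name]
            if isinstance(nested, dict) and isinstance(value, dict):
                # Descend into the struct so its missing fields are handled too.
                resolved[name] = resolve_by_name(value, nested, mode, top_level=False)
            else:
                resolved[name] = value
            continue
        may_fill = mode is FillMode.RECURSE or (mode is FillMode.FILL and top_level)
        if not may_fill:
            raise ValueError(f"CANNOT_FIND_DATA: {name}")
        resolved[name] = None
    return resolved


target = {"id": None, "info": {"name": None, "email": None}}
old_row = {"id": 1, "info": {"name": "a"}}  # older source without info.email

print(resolve_by_name(old_row, target, FillMode.RECURSE))
# {'id': 1, 'info': {'name': 'a', 'email': None}}
```

Under this reading, the same row fails in `FILL` and `NONE` modes because `info.email` is a nested field, which corresponds to the pre-existing error that the new flag is meant to avoid.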