[SPARK-56942][SQL] Widen DSv2 row-id resolution to support nested columns#55981
Open
xupefei wants to merge 2 commits into
…umns DSv2 connectors that implement SupportsDelta currently must use a top-level column for `rowId()`. If a connector returns a multi-segment field reference (e.g. a nested struct field, or `_metadata.row_index` on a file-source-backed table), analysis fails with a ClassCastException because Spark calls `V2ExpressionUtils.resolveRefs[AttributeReference]`, while nested references resolve to `Alias(GetStructField(...))`. Widen `RewriteRowLevelCommand.resolveRowIdAttrs` and `WriteDelta.rowIdAttrsResolved` to resolve as `NamedExpression` and flatten back via `.toAttribute`. This supports both flat and nested row-id columns; flat-column behavior is unchanged. Tests added in `V2ExpressionUtilsSuite` cover the nested case, the flat case, and demonstrate that the previous `[AttributeReference]` cast would throw for nested references. Co-authored-by: Isaac
What changes were proposed in this pull request?
DSv2 connectors that implement `SupportsDelta` currently must use a top-level column for `rowId()`. If a connector returns a multi-segment field reference (e.g. a nested struct field, or `_metadata.row_index` on a file-source-backed table), analysis fails with a `ClassCastException` because Spark calls `V2ExpressionUtils.resolveRefs[AttributeReference]`, while nested references resolve to `Alias(GetStructField(...))`.

Widen `RewriteRowLevelCommand.resolveRowIdAttrs` and `WriteDelta.rowIdAttrsResolved` to resolve as `NamedExpression` and flatten back via `.toAttribute`. This supports both flat and nested row-id columns; flat-column behavior is unchanged.
Why are the changes needed?
To unblock Delta Lake DSv2 connectors that identify rows by file-source metadata such as `(_metadata.file_path, _metadata.row_index)`.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
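New tests in `V2ExpressionUtilsSuite` cover the nested case and the flat case, and demonstrate that the previous `[AttributeReference]` cast throws for nested references.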
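
For reviewers who want to see the nested and flat cases side by side, here is a minimal, dependency-free sketch. It does not use Spark: `AttributeReference`, `Alias`, `GetStructField`, and `NamedExpression` below are toy stand-ins that only borrow the Catalyst names, and `resolveRef`, `resolveNarrow`, and `resolveWide` are hypothetical helpers invented for illustration. The sketch assumes a multi-segment reference resolves to an alias over struct-field extraction, as described in this PR.

```scala
object RowIdResolutionSketch {
  // Toy stand-ins for Catalyst expressions; NOT Spark's real classes.
  sealed trait Expression
  sealed trait NamedExpression extends Expression {
    def name: String
    def toAttribute: AttributeReference
  }
  final case class AttributeReference(name: String) extends NamedExpression {
    def toAttribute: AttributeReference = this
  }
  final case class GetStructField(child: Expression, field: String) extends Expression
  final case class Alias(child: Expression, name: String) extends NamedExpression {
    // An alias flattens to an attribute that carries the alias name.
    def toAttribute: AttributeReference = AttributeReference(name)
  }

  // Hypothetical resolver: a single-segment reference becomes an attribute,
  // a multi-segment reference becomes Alias(GetStructField(...)).
  def resolveRef(parts: Seq[String]): NamedExpression = parts match {
    case Seq() =>
      throw new IllegalArgumentException("empty field reference")
    case Seq(single) =>
      AttributeReference(single)
    case head +: rest =>
      val root: Expression = AttributeReference(head)
      val extracted = rest.foldLeft(root)((child, field) => GetStructField(child, field))
      Alias(extracted, parts.last)
  }

  // Narrow shape: cast every resolved reference to AttributeReference.
  // Throws ClassCastException for nested references in this toy model.
  def resolveNarrow(refs: Seq[Seq[String]]): Seq[AttributeReference] =
    refs.map(ref => resolveRef(ref).asInstanceOf[AttributeReference])

  // Widened shape: resolve as NamedExpression, then flatten via toAttribute.
  def resolveWide(refs: Seq[Seq[String]]): Seq[AttributeReference] =
    refs.map(ref => resolveRef(ref).toAttribute)
}
```

In this toy model, `resolveWide` accepts both `Seq("id")` and `Seq("_metadata", "row_index")`, while `resolveNarrow` only succeeds for the flat reference. The real change lives in `RewriteRowLevelCommand.resolveRowIdAttrs` and `WriteDelta.rowIdAttrsResolved`; the sketch only mirrors the idea.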
Was this patch authored or co-authored using generative AI tooling?
Yes, generated by Claude.