Skip to content

[SPARK-51262][SQL] Fix exceptAll after dropDuplicates with subset#55905

Open
shrirangmhalgi wants to merge 3 commits into
apache:masterfrom
shrirangmhalgi:SPARK-51262-except-all-not-working-with-drop-duplicates
Open

[SPARK-51262][SQL] Fix exceptAll after dropDuplicates with subset#55905
shrirangmhalgi wants to merge 3 commits into
apache:masterfrom
shrirangmhalgi:SPARK-51262-except-all-not-working-with-drop-duplicates

Conversation

@shrirangmhalgi
Copy link
Copy Markdown

What changes were proposed in this pull request?

Reorder ReplaceDeduplicateWithAggregate before RewriteExceptAll in the "Replace Operators" optimizer batch.

Why are the changes needed?

dropDuplicates("id", "name").exceptAll(other) throws INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND at execution time. The root cause is that RewriteExceptAll captures attribute references from left.output before ReplaceDeduplicateWithAggregate has replaced the Deduplicate node with an Aggregate(First(...)). The First() alias creates new exprIds that don't match what RewriteExceptAll baked into its Generate node.

Does this PR introduce any user-facing change?

Yes. exceptAll (and intersectAll) now work correctly after dropDuplicates with a column subset.

How was this patch tested?

Added a test in DataFrameSetOperationsSuite verifying exceptAll, except, and intersectAll after dropDuplicates(subset).

Was this patch authored or co-authored using generative AI tooling?

Yes.

ReplaceDeduplicateWithAggregate replaces Deduplicate with an Aggregate using First() for non-key columns, creating new attribute exprIds. When RewriteExceptAll ran first in the same optimizer batch, it captured the original exprIds in its Generate node. After ReplaceDeduplicateWithAggregate rewrote the Deduplicate, the Generate still referenced the old exprIds, causing INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND at execution time.

Fix: reorder ReplaceDeduplicateWithAggregate before RewriteExceptAll in the Replace Operators batch so Deduplicate is already an Aggregate when RewriteExceptAll processes the plan.
@shrirangmhalgi
Copy link
Copy Markdown
Author

@holdenk / @dongjoon-hyun Could you please review

ReplaceExceptWithFilter,
ReplaceExceptWithAntiJoin,
ReplaceDistinctWithAggregate,
ReplaceDeduplicateWithAggregate),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document this dependency relation?

Copy link
Copy Markdown
Author

@shrirangmhalgi shrirangmhalgi May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @holdenk for the review. I added a comment explaining the dependency.

Comment on lines +1663 to +1672
assert(result.count() === 1)
assert(result.collect().head.getInt(0) === 2)

// Also verify except (non-all) works
val result2 = deduped.except(df2)
assert(result2.count() === 1)

// intersectAll should also work
val result3 = deduped.intersectAll(df2)
assert(result3.count() <= 1)
Copy link
Copy Markdown

@acruise acruise May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit silly but it might be nice to check that the correct values survive, not just the expected number of values ;)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call - updated the test to assert actual row values (id, name, value) for all three operations. Thanks!

Copy link
Copy Markdown
Author

@shrirangmhalgi shrirangmhalgi May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that the test dataframe can cause non deterministic rows to be picked up. To avoid test flakiness i modified the test data to produce deterministic results keeping all the dataframe rows unique. 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants