[SPARK-51262][SQL] Fix exceptAll after dropDuplicates with subset #55905
shrirangmhalgi wants to merge 3 commits into
Conversation
ReplaceDeduplicateWithAggregate replaces Deduplicate with an Aggregate using First() for non-key columns, creating new attribute exprIds. When RewriteExceptAll ran first in the same optimizer batch, it captured the original exprIds in its Generate node. After ReplaceDeduplicateWithAggregate rewrote the Deduplicate, the Generate still referenced the old exprIds, causing INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND at execution time. Fix: reorder ReplaceDeduplicateWithAggregate before RewriteExceptAll in the Replace Operators batch so Deduplicate is already an Aggregate when RewriteExceptAll processes the plan.
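To make the ordering dependency concrete, here is a toy sketch of the failure mode. The classes below (`Attr`, `ExprIdDemo`, `replaceDeduplicate`, `resolve`) are hypothetical stand-ins, not Spark's real Catalyst internals: one "rule" mints fresh exprIds for non-key columns (as `ReplaceDeduplicateWithAggregate` does via `First()` aliases), and a "node" that captured references before that rewrite can no longer resolve them, mirroring the `INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND` failure.

```scala
object ExprIdDemo {
  // Minimal stand-in for an attribute reference: a name plus a unique exprId.
  case class Attr(name: String, exprId: Long)

  private var nextId = 1000L
  private def freshId(): Long = { nextId += 1; nextId }

  // Analogue of ReplaceDeduplicateWithAggregate: key columns keep their
  // exprIds, non-key columns are wrapped in First(...) aliases and so
  // receive brand-new exprIds.
  def replaceDeduplicate(output: Seq[Attr], keys: Set[String]): Seq[Attr] =
    output.map(a => if (keys.contains(a.name)) a else a.copy(exprId = freshId()))

  // Analogue of the Generate node built by RewriteExceptAll: the references
  // it captured must resolve against the child's *current* output by exprId.
  def resolve(captured: Seq[Attr], current: Seq[Attr]): Either[String, Seq[Attr]] = {
    val byId = current.map(a => a.exprId -> a).toMap
    val missing = captured.filterNot(a => byId.contains(a.exprId))
    if (missing.isEmpty) Right(captured.map(a => byId(a.exprId)))
    else Left("ATTRIBUTE_NOT_FOUND: " +
      missing.map(a => s"${a.name}#${a.exprId}").mkString(", "))
  }

  def main(args: Array[String]): Unit = {
    val dedupOutput = Seq(Attr("id", 1), Attr("name", 2), Attr("value", 3))

    // Buggy order: references are captured first, then the Deduplicate is
    // rewritten and "value" gets a fresh exprId, so resolution fails.
    val capturedEarly = dedupOutput
    val rewritten = replaceDeduplicate(dedupOutput, Set("id", "name"))
    println(resolve(capturedEarly, rewritten)) // Left(ATTRIBUTE_NOT_FOUND: value#3)

    // Fixed order: rewrite the Deduplicate first, then capture references.
    println(resolve(rewritten, rewritten).isRight) // true
  }
}
```

Running `main` shows the captured reference `value#3` failing to resolve once the rewrite has replaced it with a fresh exprId, while capturing after the rewrite succeeds, which is exactly the effect of moving `ReplaceDeduplicateWithAggregate` ahead of `RewriteExceptAll` in the batch.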
@holdenk / @dongjoon-hyun Could you please review?
```scala
ReplaceExceptWithFilter,
ReplaceExceptWithAntiJoin,
ReplaceDistinctWithAggregate,
ReplaceDeduplicateWithAggregate),
```
Can we document this dependency relation?
Thank you @holdenk for the review. I added a comment explaining the dependency.
```scala
assert(result.count() === 1)
assert(result.collect().head.getInt(0) === 2)

// Also verify except (non-all) works
val result2 = deduped.except(df2)
assert(result2.count() === 1)

// intersectAll should also work
val result3 = deduped.intersectAll(df2)
assert(result3.count() <= 1)
```
A bit silly but it might be nice to check that the correct values survive, not just the expected number of values ;)
Good call - updated the test to assert actual row values (id, name, value) for all three operations. Thanks!
I realized that the test DataFrame could cause non-deterministic rows to be picked up. To avoid test flakiness, I modified the test data to produce deterministic results, keeping all DataFrame rows unique. 😊
What changes were proposed in this pull request?
Reorder `ReplaceDeduplicateWithAggregate` before `RewriteExceptAll` in the "Replace Operators" optimizer batch.

Why are the changes needed?

`dropDuplicates("id", "name").exceptAll(other)` throws `INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND` at execution time. The root cause is that `RewriteExceptAll` captures attribute references from `left.output` before `ReplaceDeduplicateWithAggregate` has replaced the Deduplicate node with an `Aggregate(First(...))`. The `First()` alias creates new exprIds that don't match what `RewriteExceptAll` baked into its Generate node.

Does this PR introduce any user-facing change?
Yes.
`exceptAll` (and `intersectAll`) now work correctly after `dropDuplicates` with a column subset.

How was this patch tested?
Added a test in `DataFrameSetOperationsSuite` verifying `exceptAll`, `except`, and `intersectAll` after `dropDuplicates(subset)`.

Was this patch authored or co-authored using generative AI tooling?
Yes.