fix: UNIQUE constraint with NULLs incorrectly collapses GROUP BY groups#21715
fix: UNIQUE constraint with NULLs incorrectly collapses GROUP BY groups#21715xiedeyantu wants to merge 1 commit into
Conversation
nuno-faria
left a comment
There was a problem hiding this comment.
Thanks @xiedeyantu. While this fixes the original issue, it now treats all nulls as different when aggregating, which I don't think is correct. For example:
CREATE TABLE t(a INT, b INT, c INT, UNIQUE(a));
INSERT INTO t VALUES (1, 10, 100), (NULL, 20, 200), (NULL, 30, 300);
SELECT a, SUM(c) AS total FROM t GROUP BY a;
+------+-------+
| a | total |
+------+-------+
| 1 | 100 |
| NULL | 200 |
| NULL | 300 |
+------+-------+
@nuno-faria Thank you for your comprehensive analysis. I have refactored the code and added the test cases you provided; please review it again. |
nuno-faria
left a comment
There was a problem hiding this comment.
Thanks @xiedeyantu, I have some extra comments.
cc: @neilconway perhaps you would like to review this as well.
neilconway
left a comment
There was a problem hiding this comment.
What about taking a different approach here: I'd argue a UNIQUE constraint on a nullable column should not be an FD in the first place. Rather than producing such constraints and then filtering them out in the various places that use the FDs, why not change new_from_constraints to not return an FD in this circumstance to begin with?
I've tried making the modifications based on your suggestions; please check if they meet your requirements. |
nuno-faria
left a comment
There was a problem hiding this comment.
Thanks @xiedeyantu, LGTM. I leave just some minor requests below.
|
This error should be unrelated to PR. |
|
@nuno-faria Please let me know if there’s anything else I need to do. |
LGTM. @neilconway would you mind reviewing again? |
Thanks for your review @nuno-faria ! Let's wait for @neilconway 's comments. |
neilconway
left a comment
There was a problem hiding this comment.
After this PR, do we still need the concept of a functional dependency with nullable as true? I wonder if we can get eliminate that concept entirely.
| // Recursive queries do not expose the anchor's functional dependencies to | ||
| // the outer schema — the recursive term can produce rows that violate | ||
| // those dependencies, so they are intentionally dropped here. |
There was a problem hiding this comment.
Seems a bit fragile to leave FDs on plan.schema() but only strip them out here when they are being used. Should there be FDs on plan.schema() in the first place?
BTW this might be affected by / related to #22037
|
|
||
| # left semi join should propagate constraint of left side as is. | ||
| query IRR | ||
| # left semi join with a nullable UNIQUE key cannot safely propagate the |
There was a problem hiding this comment.
I wonder if it would be better to mark the UNIQUE column as NOT NULL, so we don't lose the intent of the original test. (Here and below)
@neilconway To be honest, my initial goal was to fix this with minimal changes. Removing nullable would likely require a much wider refactor. I've addressed your other comments, but I'm unsure how to proceed here. Would it be better to leave this for a future cleanup? |
|
@xiedeyantu Yeah, I think it's fine to defer that cleanup to a future PR. Can you resolve the merge conflicts? |
I have resolved the conflict. |
Which issue does this PR close?
Rationale for this change
DataFusion was deriving functional dependencies too aggressively in two cases:
UNIQUEconstraints were being treated as functional dependencies even when one or more key columns were nullable. Under SQL semantics,UNIQUEallows multipleNULLvalues, so such constraints do not guarantee row-level uniqueness and cannot be used to eliminateGROUP BYexpressions safely.Recursive CTEs could inherit functional dependencies from the static term (anchor) even though the recursive term may produce rows that violate those dependencies. This could cause downstream optimizer rules to make incorrect uniqueness-based rewrites.
What changes are included in this PR?
This PR tightens functional-dependency derivation and propagation:
FunctionalDependencies::new_from_constraintsnow takes per-field nullability and only derives a functional dependency from aUNIQUEconstraint when all determinant columns are non-nullable.UNIQUEconstraints are omitted entirely instead of being recorded as nullable functional dependencies.GROUP BYare preserved so they can still be used safely by later planning stages.GROUP BYover nullableUNIQUEcolumns with multipleNULLvaluesFunctionalDependencies::new_from_constraintsAPI change.Are these changes tested?
Yes. This PR adds regression coverage for nullable-
UNIQUEGROUP BYcases, recursive CTE functional-dependency propagation, and join cases that distinguish nullable from non-null unique keys.Are there any user-facing changes?
Yes. Queries that rely on grouping or distinctness analysis will no longer assume uniqueness from nullable
UNIQUEconstraints, and recursive CTE outputs will no longer inherit invalid functional dependencies from the anchor term. This prevents incorrect optimizer rewrites and fixes wrong results in affected queries.