Skip to content

fix: UNIQUE constraint with NULLs incorrectly collapses GROUP BY groups#21715

Open
xiedeyantu wants to merge 1 commit into
apache:mainfrom
xiedeyantu:fd
Open

fix: UNIQUE constraint with NULLs incorrectly collapses GROUP BY groups#21715
xiedeyantu wants to merge 1 commit into
apache:mainfrom
xiedeyantu:fd

Conversation

@xiedeyantu
Copy link
Copy Markdown
Member

@xiedeyantu xiedeyantu commented Apr 18, 2026

Which issue does this PR close?

Rationale for this change

DataFusion was deriving functional dependencies too aggressively in two cases:

  1. UNIQUE constraints were being treated as functional dependencies even when one or more key columns were nullable. Under SQL semantics, UNIQUE allows multiple NULL values, so such constraints do not guarantee row-level uniqueness and cannot be used to eliminate GROUP BY expressions safely.

  2. Recursive CTEs could inherit functional dependencies from the static term (anchor) even though the recursive term may produce rows that violate those dependencies. This could cause downstream optimizer rules to make incorrect uniqueness-based rewrites.

What changes are included in this PR?

This PR tightens functional-dependency derivation and propagation:

  • FunctionalDependencies::new_from_constraints now takes per-field nullability and only derives a functional dependency from a UNIQUE constraint when all determinant columns are non-nullable.
  • Nullable UNIQUE constraints are omitted entirely instead of being recorded as nullable functional dependencies.
  • Aggregate output functional dependencies for GROUP BY are preserved so they can still be used safely by later planning stages.
  • Recursive CTE plans now clear inherited functional dependencies from the anchor schema, so outer consumers do not assume uniqueness that the recursive term may break.
  • Added regression coverage for:
    • GROUP BY over nullable UNIQUE columns with multiple NULL values
    • join cases where non-null unique keys can still support expansion, but nullable unique keys cannot
    • recursive CTEs where anchor-term uniqueness must not leak into the full recursive result
  • Added an upgrade note for the FunctionalDependencies::new_from_constraints API change.

Are these changes tested?

Yes. This PR adds regression coverage for nullable-UNIQUE GROUP BY cases, recursive CTE functional-dependency propagation, and join cases that distinguish nullable from non-null unique keys.

Are there any user-facing changes?

Yes. Queries that rely on grouping or distinctness analysis will no longer assume uniqueness from nullable UNIQUE constraints, and recursive CTE outputs will no longer inherit invalid functional dependencies from the anchor term. This prevents incorrect optimizer rewrites and fixes wrong results in affected queries.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate logical-expr Logical plan and expressions substrait Changes to the substrait crate labels Apr 18, 2026
Copy link
Copy Markdown
Contributor

@nuno-faria nuno-faria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xiedeyantu. While this fixes the original issue, it now treats all nulls as different when aggregating, which I don't think is correct. For example:

CREATE TABLE t(a INT, b INT, c INT, UNIQUE(a));
INSERT INTO t VALUES (1, 10, 100), (NULL, 20, 200), (NULL, 30, 300);
SELECT a, SUM(c) AS total FROM t GROUP BY a;
+------+-------+
| a    | total |
+------+-------+
| 1    | 100   |
| NULL | 200   |
| NULL | 300   |
+------+-------+

@github-actions github-actions Bot added sql SQL Planner and removed substrait Changes to the substrait crate labels Apr 19, 2026
@xiedeyantu
Copy link
Copy Markdown
Member Author

Thanks @xiedeyantu. While this fixes the original issue, it now treats all nulls as different when aggregating, which I don't think is correct. For example:

CREATE TABLE t(a INT, b INT, c INT, UNIQUE(a));
INSERT INTO t VALUES (1, 10, 100), (NULL, 20, 200), (NULL, 30, 300);
SELECT a, SUM(c) AS total FROM t GROUP BY a;
+------+-------+
| a    | total |
+------+-------+
| 1    | 100   |
| NULL | 200   |
| NULL | 300   |
+------+-------+

@nuno-faria Thank you for your comprehensive analysis. I have refactored the code and added the test cases you provided; please review it again.

Copy link
Copy Markdown
Contributor

@nuno-faria nuno-faria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xiedeyantu, I have some extra comments.
cc: @neilconway perhaps you would like to review this as well.

Comment thread datafusion/expr/src/logical_plan/plan.rs Outdated
Comment thread datafusion/sql/src/select.rs
Comment thread datafusion/sqllogictest/test_files/group_by.slt
Copy link
Copy Markdown
Contributor

@neilconway neilconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about taking a different approach here: I'd argue a UNIQUE constraint on a nullable column should not be an FD in the first place. Rather than producing such constraints and then filtering them out in the various places that use the FDs, why not change new_from_constraints to not return an FD in this circumstance to begin with?

Comment thread datafusion/sqllogictest/test_files/group_by.slt
@github-actions github-actions Bot added optimizer Optimizer rules and removed sql SQL Planner labels May 5, 2026
@xiedeyantu
Copy link
Copy Markdown
Member Author

What about taking a different approach here: I'd argue a UNIQUE constraint on a nullable column should not be an FD in the first place. Rather than producing such constraints and then filtering them out in the various places that use the FDs, why not change new_from_constraints to not return an FD in this circumstance to begin with?

I've tried making the modifications based on your suggestions; please check if they meet your requirements.

Copy link
Copy Markdown
Contributor

@nuno-faria nuno-faria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xiedeyantu, LGTM. I leave just some minor requests below.

Comment thread datafusion/common/src/functional_dependencies.rs
Comment thread datafusion/sqllogictest/test_files/cte.slt Outdated
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 9, 2026
@xiedeyantu
Copy link
Copy Markdown
Member Author

This error should be unrelated to PR.

Run cargo test --profile ci -p datafusion-ffi --lib --tests --features integration-tests
  cargo test --profile ci -p datafusion-ffi --lib --tests --features integration-tests
  shell: /bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    CACHE_ON_FAILURE: false
    CARGO_INCREMENTAL: 0
    RUST_BACKTRACE: 1
    RUSTFLAGS: -C debuginfo=line-tables-only -C incremental=false
error: error: unexpected argument 'test' found

Usage: rustup-init[EXE] [OPTIONS]

For more information, try '--help'.


Stack backtrace:
   0: <std::backtrace::Backtrace>::capture
   1: anyhow::kind::Adhoc::new
   2: rustup::cli::setup_mode::main::{{closure}}::{{closure}}
   3: rustup_init::run_rustup_inner::{{closure}}::{{closure}}
   4: rustup_init::run_rustup::{{closure}}::{{closure}}
   5: rustup_init::main::{{closure}}
   6: tokio::runtime::runtime::Runtime::block_on
   7: rustup_init::main
   8: std::sys::backtrace::__rust_begin_short_backtrace
   9: _main

@xiedeyantu
Copy link
Copy Markdown
Member Author

@nuno-faria Please let me know if there’s anything else I need to do.

@nuno-faria
Copy link
Copy Markdown
Contributor

@nuno-faria Please let me know if there’s anything else I need to do.

LGTM. @neilconway would you mind reviewing again?

@xiedeyantu
Copy link
Copy Markdown
Member Author

@nuno-faria Please let me know if there’s anything else I need to do.

LGTM. @neilconway would you mind reviewing again?

Thanks for your review @nuno-faria ! Let's wait for @neilconway 's comments.

Copy link
Copy Markdown
Contributor

@neilconway neilconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this PR, do we still need the concept of a functional dependency with nullable as true? I wonder if we can get eliminate that concept entirely.

Comment thread datafusion/common/src/functional_dependencies.rs
Comment thread datafusion/expr/src/logical_plan/plan.rs Outdated
Comment thread datafusion/common/src/functional_dependencies.rs Outdated
Comment thread datafusion/expr/src/logical_plan/plan.rs
Comment on lines +2440 to +2442
// Recursive queries do not expose the anchor's functional dependencies to
// the outer schema — the recursive term can produce rows that violate
// those dependencies, so they are intentionally dropped here.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit fragile to leave FDs on plan.schema() but only strip them out here when they are being used. Should there be FDs on plan.schema() in the first place?

BTW this might be affected by / related to #22037


# left semi join should propagate constraint of left side as is.
query IRR
# left semi join with a nullable UNIQUE key cannot safely propagate the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to mark the UNIQUE column as NOT NULL, so we don't lose the intent of the original test. (Here and below)

@github-actions github-actions Bot added the proto Related to proto crate label May 19, 2026
@xiedeyantu
Copy link
Copy Markdown
Member Author

After this PR, do we still need the concept of a functional dependency with nullable as true? I wonder if we can get eliminate that concept entirely.

@neilconway To be honest, my initial goal was to fix this with minimal changes. Removing  nullable  would likely require a much wider refactor. I've addressed your other comments, but I'm unsure how to proceed here. Would it be better to leave this for a future cleanup?

@neilconway
Copy link
Copy Markdown
Contributor

@xiedeyantu Yeah, I think it's fine to defer that cleanup to a future PR. Can you resolve the merge conflicts?

@xiedeyantu
Copy link
Copy Markdown
Member Author

@xiedeyantu Yeah, I think it's fine to defer that cleanup to a future PR. Can you resolve the merge conflicts?

I have resolved the conflict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate documentation Improvements or additions to documentation logical-expr Logical plan and expressions optimizer Optimizer rules proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect query results for GROUP BY with UNIQUE constraint

3 participants