Eliminate deterministic group by keys with deterministic transformations by Dandandan · Pull Request #20706 · apache/datafusion

Dandandan · 2026-03-04T20:42:36Z

Which issue does this PR close?

Helps with [EPIC] Make DataFusion the top of the ClickBench Parquet leaderboard #18489

Rationale for this change

Make queries go faster like this randomly selected one:

SELECT "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3, COUNT(*) AS c FROM hits GROUP BY "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Dandandan · 2026-03-04T20:42:46Z

run benchmark clickbench_partitioned

alamb-ghbot · 2026-03-04T20:42:58Z

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing extend_eliminate_group_by_constant (66c0be5) to d025869 diff using: clickbench_partitioned
Results will be posted here when complete

Dandandan · 2026-03-04T20:55:31Z

datafusion/sqllogictest/test_files/clickbench.slt

+04)------ProjectionExec: expr=[CAST(ClientIP@0 AS Int64) as __common_expr_1, ClientIP@0 as ClientIP, count(Int64(1))@1 as count(Int64(1))]
+05)--------AggregateExec: mode=FinalPartitioned, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))]
+06)----------RepartitionExec: partitioning=Hash([ClientIP@0], 4), input_partitions=1
+07)------------AggregateExec: mode=Partial, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))]


Nice thing it not only removes the groupby keys but moves the projections above, minimizing the data usage/shuffle.

yeah this is pretty clever

alamb-ghbot · 2026-03-04T21:16:15Z

🤖: Benchmark completed

Details

Comparing HEAD and extend_eliminate_group_by_constant
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ extend_eliminate_group_by_constant ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     2.59 ms │                            2.66 ms │     no change │
│ QQuery 1  │    50.72 ms │                           50.14 ms │     no change │
│ QQuery 2  │   169.89 ms │                          166.64 ms │     no change │
│ QQuery 3  │   171.18 ms │                          172.68 ms │     no change │
│ QQuery 4  │  1116.68 ms │                         1128.04 ms │     no change │
│ QQuery 5  │  1284.36 ms │                         1348.13 ms │     no change │
│ QQuery 6  │     6.91 ms │                            7.47 ms │  1.08x slower │
│ QQuery 7  │    56.24 ms │                           54.30 ms │     no change │
│ QQuery 8  │  1500.74 ms │                         1539.20 ms │     no change │
│ QQuery 9  │  1916.70 ms │                         1915.08 ms │     no change │
│ QQuery 10 │   346.19 ms │                          345.17 ms │     no change │
│ QQuery 11 │   388.59 ms │                          405.59 ms │     no change │
│ QQuery 12 │  1217.38 ms │                         1274.37 ms │     no change │
│ QQuery 13 │  2016.43 ms │                         2079.13 ms │     no change │
│ QQuery 14 │  1218.48 ms │                         1291.54 ms │  1.06x slower │
│ QQuery 15 │  1231.84 ms │                         1300.62 ms │  1.06x slower │
│ QQuery 16 │  2612.25 ms │                         2659.38 ms │     no change │
│ QQuery 17 │  2596.97 ms │                         2627.83 ms │     no change │
│ QQuery 18 │  5808.85 ms │                         5262.26 ms │ +1.10x faster │
│ QQuery 19 │   126.75 ms │                          129.45 ms │     no change │
│ QQuery 20 │  1921.73 ms │                         1986.90 ms │     no change │
│ QQuery 21 │  2224.81 ms │                         2280.39 ms │     no change │
│ QQuery 22 │  3871.75 ms │                         3915.31 ms │     no change │
│ QQuery 23 │ 26771.92 ms │                        11270.95 ms │ +2.38x faster │
│ QQuery 24 │   198.45 ms │                          207.28 ms │     no change │
│ QQuery 25 │   451.06 ms │                          452.95 ms │     no change │
│ QQuery 26 │   212.57 ms │                          205.08 ms │     no change │
│ QQuery 27 │  2731.10 ms │                         2704.78 ms │     no change │
│ QQuery 28 │ 23371.76 ms │                        24374.44 ms │     no change │
│ QQuery 29 │  1057.35 ms │                         1094.92 ms │     no change │
│ QQuery 30 │  1252.16 ms │                         1269.18 ms │     no change │
│ QQuery 31 │  1367.01 ms │                         1393.19 ms │     no change │
│ QQuery 32 │  4957.31 ms │                         4127.53 ms │ +1.20x faster │
│ QQuery 33 │  5946.44 ms │                         5309.57 ms │ +1.12x faster │
│ QQuery 34 │  6191.01 ms │                         5963.83 ms │     no change │
│ QQuery 35 │  2019.73 ms │                         1234.79 ms │ +1.64x faster │
│ QQuery 36 │   185.15 ms │                          193.38 ms │     no change │
│ QQuery 37 │    71.49 ms │                           76.63 ms │  1.07x slower │
│ QQuery 38 │   114.20 ms │                          116.22 ms │     no change │
│ QQuery 39 │   335.72 ms │                          346.31 ms │     no change │
│ QQuery 40 │    40.83 ms │                           41.64 ms │     no change │
│ QQuery 41 │    40.32 ms │                           37.22 ms │ +1.08x faster │
│ QQuery 42 │    31.63 ms │                           30.98 ms │     no change │
└───────────┴─────────────┴────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                                 ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                                 │ 109205.22ms │
│ Total Time (extend_eliminate_group_by_constant)   │  92393.16ms │
│ Average Time (HEAD)                               │   2539.66ms │
│ Average Time (extend_eliminate_group_by_constant) │   2148.68ms │
│ Queries Faster                                    │           6 │
│ Queries Slower                                    │           4 │
│ Queries with No Change                            │          33 │
│ Queries with Failure                              │           0 │
└───────────────────────────────────────────────────┴─────────────┘

Dandandan · 2026-03-04T21:17:27Z

│ QQuery 35 │ 2019.73 ms │ 1234.79 ms │ +1.64x faster │

Random improvement!

alamb · 2026-03-04T21:36:20Z

Make queries go faster like this randomly selected one:

Now you are quoting the DuckDB guys! We have entered a new era

alamb

Thank you @Dandandan -- this is really nice and clever

I reviewed this code carefully (and also had codex do the same) and I believe it is

Correct
Plausibly useful for more than just ClickBench queries (e.g. for sql out of automated tools)

Specifically I think pure deterministic scalar expressions of a 0 or 1 columns can always be removed

alamb · 2026-03-04T21:45:16Z

datafusion/optimizer/src/eliminate_group_by_constant.rs

+/// Checks if a GROUP BY expression is redundant (can be removed without
+/// changing grouping semantics). An expression is redundant if it is a
+/// deterministic function of constants and columns already present as bare
+/// column references in the GROUP BY.


I think a more common term for this and something that might be more technically precise for "determinsitic function' would be "functionally dependent"

So instead of

An expression is redundant if it is a
deterministic function of constants and columns already present as bare
column references in the GROUP BY."

Maybe osmething like

An expression is redundant if it is a
it is functionally dependent (e.g. a function of constants and columns already present as bare
column references in the GROUP BY."

alamb · 2026-03-04T21:48:35Z

datafusion/sqllogictest/test_files/clickbench.slt

+04)------ProjectionExec: expr=[CAST(ClientIP@0 AS Int64) as __common_expr_1, ClientIP@0 as ClientIP, count(Int64(1))@1 as count(Int64(1))]
+05)--------AggregateExec: mode=FinalPartitioned, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))]
+06)----------RepartitionExec: partitioning=Hash([ClientIP@0], 4), input_partitions=1
+07)------------AggregateExec: mode=Partial, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))]


yeah this is pretty clever

Eliminate group by (a-1), (a - 2)

66c0be5

github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 4, 2026

Fmt

c22396a

Dandandan changed the title ~~Eliminate deterministic group by keys~~ Eliminate deterministic group by keys with deterministic transformations Mar 4, 2026

Dandandan commented Mar 4, 2026

View reviewed changes

Dandandan marked this pull request as ready for review March 4, 2026 21:17

alamb approved these changes Mar 4, 2026

View reviewed changes

alamb added the performance Make DataFusion faster label Mar 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eliminate deterministic group by keys with deterministic transformations#20706

Eliminate deterministic group by keys with deterministic transformations#20706
Dandandan wants to merge 2 commits intoapache:mainfrom
Dandandan:extend_eliminate_group_by_constant

Dandandan commented Mar 4, 2026 •

edited

Loading

Uh oh!

Dandandan commented Mar 4, 2026

Uh oh!

alamb-ghbot commented Mar 4, 2026

Uh oh!

Dandandan Mar 4, 2026

Uh oh!

alamb Mar 4, 2026

Uh oh!

alamb-ghbot commented Mar 4, 2026

Uh oh!

Dandandan commented Mar 4, 2026 •

edited

Loading

Uh oh!

alamb commented Mar 4, 2026

Uh oh!

alamb left a comment

Uh oh!

alamb Mar 4, 2026

Uh oh!

alamb Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Dandandan commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Dandandan commented Mar 4, 2026

Uh oh!

alamb-ghbot commented Mar 4, 2026

Uh oh!

Dandandan Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

alamb Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

alamb-ghbot commented Mar 4, 2026

Uh oh!

Dandandan commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb commented Mar 4, 2026

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

alamb Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Dandandan commented Mar 4, 2026 •

edited

Loading

Dandandan commented Mar 4, 2026 •

edited

Loading