Skip to content

Eliminate deterministic group by keys with deterministic transformations#20706

Open
Dandandan wants to merge 2 commits intoapache:mainfrom
Dandandan:extend_eliminate_group_by_constant
Open

Eliminate deterministic group by keys with deterministic transformations#20706
Dandandan wants to merge 2 commits intoapache:mainfrom
Dandandan:extend_eliminate_group_by_constant

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Mar 4, 2026

Which issue does this PR close?

Rationale for this change

Make queries go faster like this randomly selected one:

SELECT "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3, COUNT(*) AS c FROM hits GROUP BY "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3 

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Dandandan
Copy link
Contributor Author

run benchmark clickbench_partitioned

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 4, 2026
@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing extend_eliminate_group_by_constant (66c0be5) to d025869 diff using: clickbench_partitioned
Results will be posted here when complete

@Dandandan Dandandan changed the title Eliminate deterministic group by keys Eliminate deterministic group by keys with deterministic transformations Mar 4, 2026
04)------ProjectionExec: expr=[CAST(ClientIP@0 AS Int64) as __common_expr_1, ClientIP@0 as ClientIP, count(Int64(1))@1 as count(Int64(1))]
05)--------AggregateExec: mode=FinalPartitioned, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))]
06)----------RepartitionExec: partitioning=Hash([ClientIP@0], 4), input_partitions=1
07)------------AggregateExec: mode=Partial, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thing it not only removes the groupby keys but moves the projections above, minimizing the data usage/shuffle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is pretty clever

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

Comparing HEAD and extend_eliminate_group_by_constant
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ extend_eliminate_group_by_constant ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     2.59 ms │                            2.66 ms │     no change │
│ QQuery 1  │    50.72 ms │                           50.14 ms │     no change │
│ QQuery 2  │   169.89 ms │                          166.64 ms │     no change │
│ QQuery 3  │   171.18 ms │                          172.68 ms │     no change │
│ QQuery 4  │  1116.68 ms │                         1128.04 ms │     no change │
│ QQuery 5  │  1284.36 ms │                         1348.13 ms │     no change │
│ QQuery 6  │     6.91 ms │                            7.47 ms │  1.08x slower │
│ QQuery 7  │    56.24 ms │                           54.30 ms │     no change │
│ QQuery 8  │  1500.74 ms │                         1539.20 ms │     no change │
│ QQuery 9  │  1916.70 ms │                         1915.08 ms │     no change │
│ QQuery 10 │   346.19 ms │                          345.17 ms │     no change │
│ QQuery 11 │   388.59 ms │                          405.59 ms │     no change │
│ QQuery 12 │  1217.38 ms │                         1274.37 ms │     no change │
│ QQuery 13 │  2016.43 ms │                         2079.13 ms │     no change │
│ QQuery 14 │  1218.48 ms │                         1291.54 ms │  1.06x slower │
│ QQuery 15 │  1231.84 ms │                         1300.62 ms │  1.06x slower │
│ QQuery 16 │  2612.25 ms │                         2659.38 ms │     no change │
│ QQuery 17 │  2596.97 ms │                         2627.83 ms │     no change │
│ QQuery 18 │  5808.85 ms │                         5262.26 ms │ +1.10x faster │
│ QQuery 19 │   126.75 ms │                          129.45 ms │     no change │
│ QQuery 20 │  1921.73 ms │                         1986.90 ms │     no change │
│ QQuery 21 │  2224.81 ms │                         2280.39 ms │     no change │
│ QQuery 22 │  3871.75 ms │                         3915.31 ms │     no change │
│ QQuery 23 │ 26771.92 ms │                        11270.95 ms │ +2.38x faster │
│ QQuery 24 │   198.45 ms │                          207.28 ms │     no change │
│ QQuery 25 │   451.06 ms │                          452.95 ms │     no change │
│ QQuery 26 │   212.57 ms │                          205.08 ms │     no change │
│ QQuery 27 │  2731.10 ms │                         2704.78 ms │     no change │
│ QQuery 28 │ 23371.76 ms │                        24374.44 ms │     no change │
│ QQuery 29 │  1057.35 ms │                         1094.92 ms │     no change │
│ QQuery 30 │  1252.16 ms │                         1269.18 ms │     no change │
│ QQuery 31 │  1367.01 ms │                         1393.19 ms │     no change │
│ QQuery 32 │  4957.31 ms │                         4127.53 ms │ +1.20x faster │
│ QQuery 33 │  5946.44 ms │                         5309.57 ms │ +1.12x faster │
│ QQuery 34 │  6191.01 ms │                         5963.83 ms │     no change │
│ QQuery 35 │  2019.73 ms │                         1234.79 ms │ +1.64x faster │
│ QQuery 36 │   185.15 ms │                          193.38 ms │     no change │
│ QQuery 37 │    71.49 ms │                           76.63 ms │  1.07x slower │
│ QQuery 38 │   114.20 ms │                          116.22 ms │     no change │
│ QQuery 39 │   335.72 ms │                          346.31 ms │     no change │
│ QQuery 40 │    40.83 ms │                           41.64 ms │     no change │
│ QQuery 41 │    40.32 ms │                           37.22 ms │ +1.08x faster │
│ QQuery 42 │    31.63 ms │                           30.98 ms │     no change │
└───────────┴─────────────┴────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                                 ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                                 │ 109205.22ms │
│ Total Time (extend_eliminate_group_by_constant)   │  92393.16ms │
│ Average Time (HEAD)                               │   2539.66ms │
│ Average Time (extend_eliminate_group_by_constant) │   2148.68ms │
│ Queries Faster                                    │           6 │
│ Queries Slower                                    │           4 │
│ Queries with No Change                            │          33 │
│ Queries with Failure                              │           0 │
└───────────────────────────────────────────────────┴─────────────┘

@Dandandan
Copy link
Contributor Author

Dandandan commented Mar 4, 2026

│ QQuery 35 │ 2019.73 ms │ 1234.79 ms │ +1.64x faster │

Random improvement!

@Dandandan Dandandan marked this pull request as ready for review March 4, 2026 21:17
@alamb
Copy link
Contributor

alamb commented Mar 4, 2026

Make queries go faster like this randomly selected one:

Now you are quoting the DuckDB guys! We have entered a new era

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Dandandan -- this is really nice and clever

I reviewed this code carefully (and also had codex do the same) and I believe it is

  1. Correct
  2. Plausibly useful for more than just ClickBench queries (e.g. for sql out of automated tools)

Specifically I think pure deterministic scalar expressions of a 0 or 1 columns can always be removed

/// Checks if a GROUP BY expression is redundant (can be removed without
/// changing grouping semantics). An expression is redundant if it is a
/// deterministic function of constants and columns already present as bare
/// column references in the GROUP BY.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more common term for this and something that might be more technically precise for "determinsitic function' would be "functionally dependent"

So instead of

An expression is redundant if it is a
deterministic function of constants and columns already present as bare
column references in the GROUP BY."

Maybe osmething like

An expression is redundant if it is a
it is functionally dependent (e.g. a function of constants and columns already present as bare
column references in the GROUP BY."

04)------ProjectionExec: expr=[CAST(ClientIP@0 AS Int64) as __common_expr_1, ClientIP@0 as ClientIP, count(Int64(1))@1 as count(Int64(1))]
05)--------AggregateExec: mode=FinalPartitioned, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))]
06)----------RepartitionExec: partitioning=Hash([ClientIP@0], 4), input_partitions=1
07)------------AggregateExec: mode=Partial, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is pretty clever

@alamb alamb added the performance Make DataFusion faster label Mar 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants