Eliminate deterministic group by keys with deterministic transformations#20706
Eliminate deterministic group by keys with deterministic transformations#20706Dandandan wants to merge 2 commits intoapache:mainfrom
Conversation
|
run benchmark clickbench_partitioned |
|
🤖 |
| 04)------ProjectionExec: expr=[CAST(ClientIP@0 AS Int64) as __common_expr_1, ClientIP@0 as ClientIP, count(Int64(1))@1 as count(Int64(1))] | ||
| 05)--------AggregateExec: mode=FinalPartitioned, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))] | ||
| 06)----------RepartitionExec: partitioning=Hash([ClientIP@0], 4), input_partitions=1 | ||
| 07)------------AggregateExec: mode=Partial, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))] |
There was a problem hiding this comment.
Nice thing it not only removes the groupby keys but moves the projections above, minimizing the data usage/shuffle.
|
🤖: Benchmark completed Details
|
|
Random improvement! |
Now you are quoting the DuckDB guys! We have entered a new era |
alamb
left a comment
There was a problem hiding this comment.
Thank you @Dandandan -- this is really nice and clever
I reviewed this code carefully (and also had codex do the same) and I believe it is
- Correct
- Plausibly useful for more than just ClickBench queries (e.g. for sql out of automated tools)
Specifically I think pure deterministic scalar expressions of a 0 or 1 columns can always be removed
| /// Checks if a GROUP BY expression is redundant (can be removed without | ||
| /// changing grouping semantics). An expression is redundant if it is a | ||
| /// deterministic function of constants and columns already present as bare | ||
| /// column references in the GROUP BY. |
There was a problem hiding this comment.
I think a more common term for this and something that might be more technically precise for "determinsitic function' would be "functionally dependent"
So instead of
An expression is redundant if it is a
deterministic function of constants and columns already present as bare
column references in the GROUP BY."
Maybe osmething like
An expression is redundant if it is a
it is functionally dependent (e.g. a function of constants and columns already present as bare
column references in the GROUP BY."
| 04)------ProjectionExec: expr=[CAST(ClientIP@0 AS Int64) as __common_expr_1, ClientIP@0 as ClientIP, count(Int64(1))@1 as count(Int64(1))] | ||
| 05)--------AggregateExec: mode=FinalPartitioned, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))] | ||
| 06)----------RepartitionExec: partitioning=Hash([ClientIP@0], 4), input_partitions=1 | ||
| 07)------------AggregateExec: mode=Partial, gby=[ClientIP@0 as ClientIP], aggr=[count(Int64(1))] |
Which issue does this PR close?
Rationale for this change
Make queries go faster like this randomly selected one:
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?