Skip to content

physical_optimizer: preserve_file_partitions when num file groups < target_partitions#21533

Open
jayshrivastava wants to merge 2 commits intoapache:mainfrom
jayshrivastava:js/remove-repartition
Open

physical_optimizer: preserve_file_partitions when num file groups < target_partitions#21533
jayshrivastava wants to merge 2 commits intoapache:mainfrom
jayshrivastava:js/remove-repartition

Conversation

@jayshrivastava
Copy link
Copy Markdown
Contributor

@jayshrivastava jayshrivastava commented Apr 10, 2026

Rationale for this change

datafusion.optimizer.preserve_file_partitions would not actually preserve the file partitions when the number of file groups is less than the target_partitions. This is unexpected behavior. If a user wants to preserve file partitions, it is because they want to avoid repartitions.

Before

  ProjectionExec
    AggregateExec: mode=FinalPartitioned, gby=[f_dkey, timestamp]
      RepartitionExec: partitioning=Hash([f_dkey, timestamp], 4), input_partitions=3
        AggregateExec: mode=Partial, gby=[f_dkey, timestamp]
          DataSourceExec: file_groups=3, projection=[timestamp, value, f_dkey]

After

  ProjectionExec
    AggregateExec: mode=SinglePartitioned, gby=[f_dkey, timestamp]
      DataSourceExec: file_groups=3, projection=[timestamp, value, f_dkey]

What changes are included in this PR?

This change fixes that by updating 1 line in the enforce_distribution optimizer rule.

Are these changes tested?

Yes. In the preserve_file_partitions SLT.

Are there any user-facing changes?

No.

…arget_partitions

`datafusion.optimizer.preserve_file_partitions` would not actually preserve the file partitions when the number of file groups is less than the target_partitions. This is unexpected behavior. If a user wants to preserve file partitions, it is because they want to avoid repartitions.

This change fixes that by updating 1 line in the `enforce_distribution` optimizer rule. It also adds a regression test.

Before
```
  ProjectionExec
    AggregateExec: mode=FinalPartitioned, gby=[f_dkey, timestamp]
      RepartitionExec: partitioning=Hash([f_dkey, timestamp], 4), input_partitions=3
        AggregateExec: mode=Partial, gby=[f_dkey, timestamp]
          DataSourceExec: file_groups=3, projection=[timestamp, value, f_dkey]
```
After
```
  ProjectionExec
    AggregateExec: mode=SinglePartitioned, gby=[f_dkey, timestamp]
      DataSourceExec: file_groups=3, projection=[timestamp, value, f_dkey]
```
@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 10, 2026
@jayshrivastava jayshrivastava marked this pull request as ready for review April 10, 2026 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant