Skip to content

Introduce morsel-driven Parquet scan #20529

@Dandandan

Description

@Dandandan

Is your feature request related to a problem or challenge?

Current parelllization of Parquet scan is bounded by the thread that has the most data / is the slowest to execute, which means in the case of data skew (driven by either larger partitions or less selective filters during pruning / filter pushdown..., variable object store latency), the parallelism will be significantly limited.

Describe the solution you'd like

We can change the strategy by morsel-driven parallelism like described in https://db.in.tum.de/~leis/papers/morsels.pdf.

Doing so is faster for a lot of queries, when there is an amount of skew (such as clickbench) and we have enough row filters to spread out the work.
For clickbench_partitioned / clickbench_pushdown it seems up to ~2x as fast for some queries, on a 10 core machine.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformanceMake DataFusion faster

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions