-
Notifications
You must be signed in to change notification settings - Fork 2k
Labels
enhancementNew feature or requestNew feature or requestperformanceMake DataFusion fasterMake DataFusion faster
Description
Is your feature request related to a problem or challenge?
Current parelllization of Parquet scan is bounded by the thread that has the most data / is the slowest to execute, which means in the case of data skew (driven by either larger partitions or less selective filters during pruning / filter pushdown..., variable object store latency), the parallelism will be significantly limited.
Describe the solution you'd like
We can change the strategy by morsel-driven parallelism like described in https://db.in.tum.de/~leis/papers/morsels.pdf.
Doing so is faster for a lot of queries, when there is an amount of skew (such as clickbench) and we have enough row filters to spread out the work.
For clickbench_partitioned / clickbench_pushdown it seems up to ~2x as fast for some queries, on a 10 core machine.
Describe alternatives you've considered
No response
Additional context
No response
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or requestperformanceMake DataFusion fasterMake DataFusion faster