Skip loading Parquet page index when row-group statistics already prove it cannot prune #22857

Open
RatulDawar wants to merge 4 commits into apache:main from RatulDawar:fix/skip-page-index-when-fully-matched

@RatulDawar

Contributor

Which issue does this PR close?

Rationale for this change

The Parquet opener was loading the page index (ColumnIndex + OffsetIndex) before row-group statistics pruning. When all surviving row groups are fully matched by row-group statistics (for example, IS NOT NULL on a non-null column), page index I/O cannot prune further and is wasted.

What changes are included in this PR?

  • Reorder the opener state machine: PrepareFilters → PruneWithStatistics → LoadPageIndex? → LoadBloomFilters
  • Skip load_page_index when there is no page-pruning predicate, no surviving row groups, or every surviving row group is fully matched
  • Add unit and integration tests for the gate and the fully-matched IS NOT NULL case
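
The gate and the reordered phases described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RowGroupPruningOutcome`, `OpenerPhase`, `should_load_page_index`, and `phase_after_statistics` are hypothetical names chosen for the example, and the real opener state machine carries considerably more state.

```rust
/// Hypothetical summary of row-group statistics pruning for one file.
/// Illustrative only; this is not the actual DataFusion type.
struct RowGroupPruningOutcome {
    /// One entry per surviving row group. `true` means statistics prove
    /// that every row in that row group satisfies the predicate.
    fully_matched: Vec<bool>,
}

/// Opener phases in the order listed in this PR (illustrative enum).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OpenerPhase {
    PrepareFilters,
    PruneWithStatistics,
    LoadPageIndex,
    LoadBloomFilters,
}

/// Returns `true` only when loading the page index could still prune pages.
fn should_load_page_index(
    has_page_pruning_predicate: bool,
    outcome: &RowGroupPruningOutcome,
) -> bool {
    // Without a page-pruning predicate the page index is never consulted.
    if !has_page_pruning_predicate {
        return false;
    }
    // No surviving row groups: there is nothing left to prune.
    if outcome.fully_matched.is_empty() {
        return false;
    }
    // If every surviving row group is fully matched, page-level pruning
    // cannot remove any rows, so the I/O would be wasted.
    outcome.fully_matched.iter().any(|&matched| !matched)
}

/// Picks the phase that follows `PruneWithStatistics`.
fn phase_after_statistics(
    has_page_pruning_predicate: bool,
    outcome: &RowGroupPruningOutcome,
) -> OpenerPhase {
    if should_load_page_index(has_page_pruning_predicate, outcome) {
        OpenerPhase::LoadPageIndex
    } else {
        OpenerPhase::LoadBloomFilters
    }
}
```

Because the decision depends on the outcome of statistics pruning, the page index load has to come after `PruneWithStatistics`, which is why the reorder and the gate go together.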

Are these changes tested?

  • cargo test -p datafusion-datasource-parquet should_load
  • cargo test -p datafusion-datasource-parquet page_index_skip
  • cargo test -p datafusion-datasource-parquet opener::test::test_page_pruning
  • cargo test -p datafusion --test parquet_integration
  • cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings

Are there any user-facing changes?

No user-facing API changes. This reduces unnecessary Parquet page index I/O during scan planning when row-group statistics already prove no further pruning is possible.

Made with Cursor

RatulDawar and others added 3 commits June 9, 2026 00:39
…prune.

Reorder the opener so row-group statistics pruning runs before the page
index load, and skip that I/O when every surviving row group is fully matched.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions github-actions Bot added the datasource Changes to the datasource crate label Jun 9, 2026
Resolve opener test conflicts after upstream moved opener.rs to opener/mod.rs.

Co-authored-by: Cursor <cursoragent@cursor.com>
@RatulDawar

Contributor Author

A related question came up while implementing this: can we skip page index I/O per row group (e.g. load the index only for RGs that aren't fully matched)?

I checked arrow-rs (parquet 58.3.0 + latest main), but per-RG page index APIs don't seem to be available. We could take that on as a next step after this PR (though I'm not sure whether per-RG page index skipping would be that beneficial).
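
For illustration only, the selection step of that per-RG idea could look like the sketch below. It assumes a hypothetical helper, `row_groups_needing_page_index`, and only computes which row groups would still need their page index; it does not call any arrow-rs API, since no per-RG loading API appears to exist yet.

```rust
/// Given `(row_group_index, fully_matched)` pairs for the surviving row
/// groups, returns the indices of the row groups whose page index could
/// still prune pages. Hypothetical helper, not part of DataFusion or arrow-rs.
fn row_groups_needing_page_index(surviving: &[(usize, bool)]) -> Vec<usize> {
    surviving
        .iter()
        // Fully matched row groups cannot be pruned further at page level.
        .filter(|(_, fully_matched)| !*fully_matched)
        .map(|(row_group_index, _)| *row_group_index)
        .collect()
}
```

An empty result corresponds to the file-level skip implemented in this PR.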

Labels

datasource Changes to the datasource crate

Development

Successfully merging this pull request may close these issues.

Skip loading the Parquet page index when row-group statistics already prove it cannot prune