GH-36283: [C++] Fix predicate pushdown for nullable columns with range statistics #48716

rynewang · 2026-01-03T20:55:52Z

Rationale for this change

Predicate pushdown for Parquet did not work for nullable columns whose statistics span a range
(min ≠ max), which describes the vast majority of real-world data. As a result, row groups were read
even when the predicate could definitively exclude them.

The root cause: Parquet statistics for nullable columns generate guarantees of the form:
or_(and_(field >= min, field <= max), is_null(field))

However, Inequality::ExtractOne() only handled single comparisons inside or_(..., is_null), not
and_(...) expressions. This meant no inequalities were extracted and SimplifyWithGuarantee() could not
simplify predicates.

This affected all predicates on nullable columns:

- Comparisons: equal, less, greater, less_equal, greater_equal
- Set membership: is_in

See: #36283

What changes are included in this PR?

Added ExpandNullableRangeGuarantees() which transforms:
or_(and_(A, B), is_null(x))
into:
[or_(A, is_null(x)), or_(B, is_null(x))]

This expansion is logically valid because (A ∧ B) ∨ C ≡ (A ∨ C) ∧ (B ∨ C). Each expanded guarantee can
then be processed by existing simplification logic.
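
The distributive identity above can be checked exhaustively. The following is a minimal sketch in ordinary two-valued logic; the function names are made up for illustration, and it does not model null (three-valued) semantics or use any Arrow API.

```cpp
// Left-hand side of the identity: (A AND B) OR C.
bool OriginalGuarantee(bool a, bool b, bool c) { return (a && b) || c; }

// Right-hand side: (A OR C) AND (B OR C), i.e. both expanded guarantees hold.
bool ExpandedGuarantees(bool a, bool b, bool c) { return (a || c) && (b || c); }

// Compares both sides over all 8 truth assignments of (A, B, C).
bool DistributivityHolds() {
  for (int bits = 0; bits < 8; ++bits) {
    const bool a = (bits & 1) != 0;
    const bool b = (bits & 2) != 0;
    const bool c = (bits & 4) != 0;
    if (OriginalGuarantee(a, b, c) != ExpandedGuarantees(a, b, c)) return false;
  }
  return true;
}
```

Here A and B stand for the two range bounds and C for is_null(x).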

The function also handles the reversed form or_(is_null(x), and_(...)).
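
To make the shape of the transformation concrete, here is a rough sketch on a toy expression tree. Everything in it (Expr, Leaf, Call, ToString, ExpandNullableRange) is hypothetical and exists only for this illustration; it is not arrow::compute::Expression and not the code added in this PR.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression node (assumption for illustration, NOT Arrow's Expression).
struct Expr {
  std::string op;                           // "or", "and", "is_null", or a leaf label
  std::vector<std::shared_ptr<Expr>> args;  // empty for leaves
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr Leaf(std::string label) {
  return std::make_shared<Expr>(Expr{std::move(label), {}});
}
ExprPtr Call(std::string op, std::vector<ExprPtr> args) {
  return std::make_shared<Expr>(Expr{std::move(op), std::move(args)});
}

// Renders an expression as text, e.g. or(x >= 0, is_null(x)).
std::string ToString(const ExprPtr& e) {
  if (e->args.empty()) return e->op;
  std::string out = e->op + "(";
  for (std::size_t i = 0; i < e->args.size(); ++i) {
    if (i > 0) out += ", ";
    out += ToString(e->args[i]);
  }
  return out + ")";
}

// Splits or(and(A, B, ...), is_null(x)) into one or(member, is_null(x)) per
// member of the and(...). Any other shape is returned unchanged.
std::vector<ExprPtr> ExpandNullableRange(const ExprPtr& guarantee) {
  if (guarantee->op != "or" || guarantee->args.size() != 2) return {guarantee};
  ExprPtr range = guarantee->args[0];
  ExprPtr null_check = guarantee->args[1];
  // Normalize the reversed form or(is_null(x), and(...)).
  if (range->op == "is_null" && null_check->op == "and") std::swap(range, null_check);
  if (range->op != "and" || null_check->op != "is_null") return {guarantee};

  std::vector<ExprPtr> expanded;
  for (const ExprPtr& member : range->args) {
    expanded.push_back(Call("or", {member, null_check}));
  }
  return expanded;
}
```

In this toy model, or(and(x >= 0, x <= 5), is_null(x)) becomes two guarantees, each containing a single comparison, which is the shape the description says the existing extraction logic already handles.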

Are these changes tested?

Yes. Added two new test cases:

- SimplifyWithNullableRangeGuarantee - tests all comparison operators with nullable range guarantees
- SimplifyIsInWithNullableRangeGuarantee - tests is_in with nullable range guarantees

Both tests fail without the fix and pass with it.

Are there any user-facing changes?

No API changes. Users will see improved query performance when filtering nullable columns in Parquet
files, as row groups can now be correctly skipped based on min/max statistics.
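
As a rough illustration of why min/max statistics allow a row group to be skipped, the sketch below models the decision for a single integer column. ColumnStats, CompareOp, and CanSkipRowGroup are hypothetical names invented for this example, not Arrow or Parquet types. It assumes the row group has at least one non-null value, and that rows where the column is null never pass a comparison filter.

```cpp
#include <cstdint>

// Toy per-row-group statistics (assumption for illustration only).
struct ColumnStats {
  std::int64_t min;
  std::int64_t max;
  bool has_nulls;  // not consulted below: null rows cannot satisfy a comparison
};

enum class CompareOp { kEqual, kLess, kLessEqual, kGreater, kGreaterEqual };

// Returns true when no row in the group can satisfy `column <op> value`,
// judging only by the [min, max] range of the non-null values.
bool CanSkipRowGroup(const ColumnStats& stats, CompareOp op, std::int64_t value) {
  switch (op) {
    case CompareOp::kEqual:
      return value < stats.min || value > stats.max;
    case CompareOp::kLess:
      return stats.min >= value;
    case CompareOp::kLessEqual:
      return stats.min > value;
    case CompareOp::kGreater:
      return stats.max <= value;
    case CompareOp::kGreaterEqual:
      return stats.max < value;
  }
  return false;  // unknown operator: be conservative and read the row group
}
```

For example, with min = 0 and max = 5, a filter column > 10 can skip the row group whether or not it contains nulls, which is the case this PR describes as previously missed.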

GitHub Issue: parquet pushdown predicate dataset.field.isin() much slower than or '|' #36283
Fixes: [Bug] Pyiceberg row filter expression "In" takes longer to query than using "EqualTo" iceberg-python#1295

…h range statistics

Prior to this change, predicate pushdown was broken for nullable columns with range statistics (min != max). Parquet generates guarantees of the form:
or_(and_(field >= min, field <= max), is_null(field))

However, Inequality::ExtractOne() only handled single comparisons inside or_(..., is_null), not and_(...) expressions. This meant no inequalities were extracted and SimplifyWithGuarantee() could not simplify predicates.

This affected ALL predicates on nullable columns:
- Comparisons: equal, less, greater, less_equal, greater_equal
- Set membership: is_in

The fix adds ExpandNullableRangeGuarantees() which transforms:
or_(and_(A, B), is_null(x))
into:
[or_(A, is_null(x)), or_(B, is_null(x))]

This expansion is valid because (A AND B) OR C ≡ (A OR C) AND (B OR C). Each expanded guarantee can then be processed by existing simplification logic.

Added tests for both comparison operators and is_in with nullable range guarantees.

github-actions · 2026-01-03T20:56:16Z

⚠️ GitHub issue #36283 has been automatically assigned in GitHub to PR creator.

github-actions · 2026-01-03T23:19:13Z

⚠️ GitHub issue #36283 has been automatically assigned in GitHub to PR creator.

github-actions bot added the Component: C++ and awaiting review labels Jan 3, 2026

rynewang mentioned this pull request Jan 3, 2026

parquet pushdown predicate dataset.field.isin() much slower than or '|' #36283

Open
