GH-36283: [C++] Fix predicate pushdown for nullable columns with range statistics #48716
+195
−0
Rationale for this change
Predicate pushdown for Parquet was broken for nullable columns with range statistics (min ≠ max),
a case that covers the vast majority of real-world data. As a result, row groups were read even when
the predicate could definitively exclude them.
The root cause: Parquet statistics for nullable columns generate guarantees of the form:
or_(and_(field >= min, field <= max), is_null(field))
However, Inequality::ExtractOne() only handled single comparisons inside or_(..., is_null), not
and_(...) expressions. This meant no inequalities were extracted and SimplifyWithGuarantee() could not
simplify predicates.
This affected all predicates on nullable columns.
See: #36283
What changes are included in this PR?
Added ExpandNullableRangeGuarantees() which transforms:
or_(and_(A, B), is_null(x))
into:
[or_(A, is_null(x)), or_(B, is_null(x))]
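This expansion is logically valid because (A ∧ B) ∨ C ≡ (A ∨ C) ∧ (B ∨ C), with the expanded list read
as a conjunction of guarantees. Each expanded guarantee has the single-comparison form
or_(..., is_null(x)), so it can be processed by the existing simplification logic.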
Also handles the reversed form or_(is_null(x), and_(...)).
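The transformation can be sketched on a toy expression tree. This is an illustration only: the Expr type, the helper constructors, and the name ExpandNullableRange are hypothetical stand-ins, not Arrow's compute::Expression API or the actual ExpandNullableRangeGuarantees() implementation, and the sketch does not check that the is_null field matches the field used in the comparisons.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression tree (hypothetical; NOT arrow::compute::Expression).
struct Expr;
using ExprPtr = std::shared_ptr<const Expr>;

struct Expr {
  enum class Kind { kLeaf, kIsNull, kAnd, kOr };
  Kind kind;
  std::string text;           // comparison text for kLeaf, field name for kIsNull
  std::vector<ExprPtr> args;  // the two operands of kAnd / kOr
};

ExprPtr Leaf(std::string text) {
  return std::make_shared<const Expr>(Expr{Expr::Kind::kLeaf, std::move(text), {}});
}
ExprPtr IsNull(std::string field) {
  return std::make_shared<const Expr>(Expr{Expr::Kind::kIsNull, std::move(field), {}});
}
ExprPtr And(ExprPtr a, ExprPtr b) {
  return std::make_shared<const Expr>(
      Expr{Expr::Kind::kAnd, "", {std::move(a), std::move(b)}});
}
ExprPtr Or(ExprPtr a, ExprPtr b) {
  return std::make_shared<const Expr>(
      Expr{Expr::Kind::kOr, "", {std::move(a), std::move(b)}});
}

std::string ToString(const ExprPtr& e) {
  switch (e->kind) {
    case Expr::Kind::kLeaf:
      return e->text;
    case Expr::Kind::kIsNull:
      return "is_null(" + e->text + ")";
    case Expr::Kind::kAnd:
      return "and_(" + ToString(e->args[0]) + ", " + ToString(e->args[1]) + ")";
    case Expr::Kind::kOr:
      return "or_(" + ToString(e->args[0]) + ", " + ToString(e->args[1]) + ")";
  }
  return "";
}

// Expands or_(and_(A, B), is_null(x)), in either operand order, into
// {or_(A, is_null(x)), or_(B, is_null(x))}. Any other shape is returned unchanged.
std::vector<ExprPtr> ExpandNullableRange(const ExprPtr& guarantee) {
  if (guarantee->kind != Expr::Kind::kOr) return {guarantee};
  assert(guarantee->args.size() == 2);
  ExprPtr lhs = guarantee->args[0];
  ExprPtr rhs = guarantee->args[1];
  // Normalize the reversed form or_(is_null(x), and_(...)).
  if (lhs->kind == Expr::Kind::kIsNull) std::swap(lhs, rhs);
  if (lhs->kind != Expr::Kind::kAnd || rhs->kind != Expr::Kind::kIsNull) {
    return {guarantee};
  }
  std::vector<ExprPtr> expanded;
  for (const ExprPtr& conjunct : lhs->args) {
    expanded.push_back(Or(conjunct, rhs));  // distribute is_null over each conjunct
  }
  return expanded;
}
```

Calling ExpandNullableRange on Or(And(Leaf("x >= 1"), Leaf("x <= 9")), IsNull("x")) yields two guarantees, each of the single-comparison shape that the existing extraction logic is described as handling. A guarantee that does not match the pattern comes back as a one-element list.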
Are these changes tested?
Yes. Two new test cases were added; both fail without the fix and pass with it.
Are there any user-facing changes?
No API changes. Users will see improved query performance when filtering nullable columns in Parquet
files, as row groups can now be correctly skipped based on min/max statistics.
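To make the skipping decision concrete, here is a minimal sketch for one predicate shape, field > threshold. The ColumnStats struct and CanSkipGreaterThan function are hypothetical and are not part of Arrow. The sketch assumes the row group has valid min/max statistics and that a filter drops rows where the predicate evaluates to null, so the presence of nulls does not by itself force a read.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-row-group statistics for one nullable int64 column.
struct ColumnStats {
  int64_t min;     // smallest non-null value in the row group
  int64_t max;     // largest non-null value in the row group
  bool has_nulls;  // whether the row group contains any null values
};

// Returns true if no row in the group can satisfy `field > threshold`.
// Non-null values are bounded by max; null rows are assumed to be dropped by
// the filter, so has_nulls is intentionally not consulted.
bool CanSkipGreaterThan(const ColumnStats& stats, int64_t threshold) {
  assert(stats.min <= stats.max);
  return stats.max <= threshold;
}
```

With statistics {min = 10, max = 20, has_nulls = true}, the predicate field > 20 allows the row group to be skipped, while field > 19 does not.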
Fixes: [Bug] Pyiceberg row filter expression "In" takes longer to query than using "EqualTo" iceberg-python#1295