@rynewang rynewang commented Jan 3, 2026

Rationale for this change

Predicate pushdown for Parquet was broken for nullable columns with range statistics (min ≠ max),
a case that covers the vast majority of real-world data. As a result, row groups were read even when
their statistics proved that the predicate could not match any row.

The root cause: Parquet statistics for nullable columns generate guarantees of the form:
or_(and_(field >= min, field <= max), is_null(field))

However, Inequality::ExtractOne() only handled a single comparison inside or_(..., is_null), not an
and_(...) of comparisons. No inequalities were extracted from such guarantees, so
SimplifyWithGuarantee() could not simplify predicates against them.
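
To make the failure mode concrete, here is a minimal Python sketch of an extractor that accepts only a bare comparison or `or_(comparison, is_null)`. It is a toy model and not Arrow's actual C++ code: the tuple-based expression trees and the name `toy_extract_one` are assumptions made for illustration.

```python
# Toy expression trees as nested tuples. This is NOT Arrow's representation.
def cmp(op, field, value):
    return ("cmp", op, field, value)

def and_(a, b):
    return ("and", a, b)

def or_(a, b):
    return ("or", a, b)

def is_null(field):
    return ("is_null", field)

def toy_extract_one(guarantee):
    """Return (op, field, value, nullable), or None for unsupported shapes."""
    if guarantee[0] == "cmp":
        _, op, field, value = guarantee
        return (op, field, value, False)
    if guarantee[0] == "or":
        _, lhs, rhs = guarantee
        # Only a single comparison paired with is_null on the same field.
        if lhs[0] == "cmp" and rhs[0] == "is_null" and lhs[2] == rhs[1]:
            return (lhs[1], lhs[2], lhs[3], True)
    return None

single = or_(cmp(">=", "x", 0), is_null("x"))
ranged = or_(and_(cmp(">=", "x", 0), cmp("<=", "x", 50)), is_null("x"))

print(toy_extract_one(single))  # ('>=', 'x', 0, True)
print(toy_extract_one(ranged))  # None: the and_(...) shape is not recognized
```

The range-shaped guarantee yields nothing, so in this toy model a simplifier downstream would have no bounds to work with.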

This affected all predicates on nullable columns:

  • Comparisons: equal, less, greater, less_equal, greater_equal
  • Set membership: is_in

See: #36283

What changes are included in this PR?

Added ExpandNullableRangeGuarantees() which transforms:
or_(and_(A, B), is_null(x))
into:
[or_(A, is_null(x)), or_(B, is_null(x))]

This expansion is logically valid because OR distributes over AND: (A ∧ B) ∨ C ≡ (A ∨ C) ∧ (B ∨ C).
The resulting list is read as a conjunction, so each expanded guarantee can then be processed by the
existing simplification logic.

Also handles the reversed form or_(is_null(x), and_(...)).
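
The transformation can be sketched in a few lines of Python. This is an illustrative model only: the tuple-based trees, the `leaf` nodes, and the function name `expand_nullable_range_guarantee` are assumptions, and the real change lives in Arrow's C++ expression code. The sketch also brute-forces the equivalence over all boolean assignments.

```python
from itertools import product

def and_(a, b):
    return ("and", a, b)

def or_(a, b):
    return ("or", a, b)

def expand_nullable_range_guarantee(guarantee):
    """Split or_(and_(A, B), is_null(x)) into [or_(A, ...), or_(B, ...)].

    Accepts either operand order; any other shape is returned unchanged
    as a one-element list.
    """
    if guarantee[0] != "or":
        return [guarantee]
    _, lhs, rhs = guarantee
    if lhs[0] == "is_null" and rhs[0] == "and":
        lhs, rhs = rhs, lhs  # normalize the reversed form
    if lhs[0] == "and" and rhs[0] == "is_null":
        _, a, b = lhs
        return [or_(a, rhs), or_(b, rhs)]
    return [guarantee]

def evaluate(expr, env):
    """Evaluate a tree, treating every non-and/or node as an opaque boolean."""
    if expr[0] == "and":
        return evaluate(expr[1], env) and evaluate(expr[2], env)
    if expr[0] == "or":
        return evaluate(expr[1], env) or evaluate(expr[2], env)
    return env[expr]

A, B, N = ("leaf", "A"), ("leaf", "B"), ("is_null", "x")
original = or_(and_(A, B), N)
expanded = expand_nullable_range_guarantee(original)

# The conjunction of the expanded guarantees matches the original everywhere.
for a, b, n in product([False, True], repeat=3):
    env = {A: a, B: b, N: n}
    assert evaluate(original, env) == all(evaluate(e, env) for e in expanded)
```

Normalizing the reversed form up front keeps a single code path for the split, which is one plausible way to cover both operand orders.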

Are these changes tested?

Yes. Added two new test cases:

  • SimplifyWithNullableRangeGuarantee - tests all comparison operators with nullable range guarantees
  • SimplifyIsInWithNullableRangeGuarantee - tests is_in with nullable range guarantees

Both tests fail without the fix and pass with it.

Are there any user-facing changes?

No API changes. Users will see improved query performance when filtering nullable columns in Parquet
files, as row groups can now be correctly skipped based on min/max statistics.
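
As a rough illustration of the skipping decision, the sketch below checks whether a comparison or `is_in` filter can match any non-null value within a row group's min/max range. It is a simplified model, not Arrow's implementation: the function names are hypothetical, and it assumes that rows where the field is null never pass the filter, so nulls do not prevent skipping.

```python
def can_skip_row_group(op, value, stat_min, stat_max):
    """True when no value in [stat_min, stat_max] satisfies `field op value`."""
    if op == "==":
        return value < stat_min or value > stat_max
    if op == "<":
        return stat_min >= value
    if op == "<=":
        return stat_min > value
    if op == ">":
        return stat_max <= value
    if op == ">=":
        return stat_max < value
    raise ValueError(f"unsupported operator: {op}")

def can_skip_is_in(values, stat_min, stat_max):
    """is_in can be skipped only if every candidate value is out of range."""
    return all(can_skip_row_group("==", v, stat_min, stat_max) for v in values)

# Hypothetical row group with statistics min=0, max=50 on a nullable column.
print(can_skip_row_group(">", 100, 0, 50))   # True
print(can_skip_row_group("==", 25, 0, 50))   # False
print(can_skip_is_in([60, 70], 0, 50))       # True
print(can_skip_is_in([10, 60], 0, 50))       # False
```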

…h range statistics

Prior to this change, predicate pushdown was broken for nullable columns
with range statistics (min != max). Parquet generates guarantees of the form:

  or_(and_(field >= min, field <= max), is_null(field))

However, Inequality::ExtractOne() only handled single comparisons inside
or_(..., is_null), not and_(...) expressions. This meant no inequalities
were extracted and SimplifyWithGuarantee() could not simplify predicates.

This affected ALL predicates on nullable columns:
- Comparisons: equal, less, greater, less_equal, greater_equal
- Set membership: is_in

The fix adds ExpandNullableRangeGuarantees() which transforms:
  or_(and_(A, B), is_null(x))
into:
  [or_(A, is_null(x)), or_(B, is_null(x))]

This expansion is valid because (A AND B) OR C ≡ (A OR C) AND (B OR C).
Each expanded guarantee can then be processed by existing simplification
logic.

Added tests for both comparison operators and is_in with nullable range
guarantees.

github-actions bot commented Jan 3, 2026

⚠️ GitHub issue #36283 has been automatically assigned in GitHub to PR creator.

Development

Successfully merging this pull request may close these issues.

[Bug] Pyiceberg row filter expression "In" takes longer to query than using "EqualTo"
