Open
Description
Summary
Part 4 of ORC predicate pushdown (#48986). Depends on #49361.
Add a filters parameter to pyarrow.orc.read_table() for API parity with Parquet's read_table(). This makes ORC predicate pushdown accessible to Python users without requiring the lower-level Dataset API.
Changes
python/pyarrow/orc.py:
Add filters parameter to read_table(). When specified, delegate to the Dataset API:
def read_table(source, columns=None, filesystem=None, filters=None):
    if filters is not None:
        import pyarrow.dataset as ds
        filter_expr = filters
        if not isinstance(filters, ds.Expression):
            filter_expr = ds.filters_to_expression(filters)
        dataset = ds.dataset(source, format='orc', filesystem=filesystem)
        return dataset.to_table(columns=columns, filter=filter_expr)
    # ... existing non-filter path unchanged
Supported filter formats:
- Expression format: ds.field('id') > 100
- DNF tuple format: [('id', '>', 100)] (Parquet-compatible)
- Supported operators: ==, !=, <, >, <=, >=, in, not in
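To make the DNF tuple semantics concrete, here is a minimal pure-Python sketch of how such a filter selects rows. It is an illustration only, not the conversion code used by pyarrow: the names `_OPS`, `normalize_dnf`, and `row_matches` are hypothetical, and the nested-list form (OR of AND-groups) is assumed to follow the convention of Parquet's `filters` argument.

```python
import operator

# Operators listed in this proposal, mapped to plain-Python predicates.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def normalize_dnf(filters):
    """Return the filter as a list of AND-groups (a list of lists of tuples)."""
    if filters and isinstance(filters[0], tuple):
        # A flat list of tuples is a single AND-group.
        return [list(filters)]
    return [list(group) for group in filters]


def row_matches(row, filters):
    """True if the row satisfies every predicate of at least one AND-group."""
    return any(
        all(_OPS[op](row[col], val) for col, op, val in group)
        for group in normalize_dnf(filters)
    )


rows = [{"id": i, "value": i * 10} for i in (50, 150, 250, 1500)]

# Flat list: predicates are combined with AND.
and_filter = [("id", ">", 100), ("id", "<", 200)]
# Nested list: AND-groups are combined with OR (assumed Parquet convention).
or_filter = [[("id", "<", 100)], [("id", "in", {250, 1500})]]

and_ids = [r["id"] for r in rows if row_matches(r, and_filter)]
or_ids = [r["id"] for r in rows if row_matches(r, or_filter)]
```

With these sample rows, `and_ids` keeps only the row between 100 and 200, while `or_ids` keeps the rows matched by either group.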
No Cython changes. This is pure Python, reusing existing Dataset API bindings and the filters_to_expression() utility already used by Parquet.
Examples
import pyarrow.orc as orc
import pyarrow.dataset as ds
# Expression format
table = orc.read_table('data.orc', filters=ds.field('id') > 1000)
# DNF tuple format
table = orc.read_table('data.orc', filters=[('id', '>', 1000)])
# Multiple conditions (AND)
table = orc.read_table('data.orc', filters=[('id', '>', 100), ('id', '<', 200)])
# With column projection
table = orc.read_table('data.orc', columns=['id', 'value'],
                       filters=[('id', '>', 1000)])
Tests
Tests in python/pyarrow/tests/test_orc.py:
- Expression format smoke test
- DNF tuple format smoke test
- Integration with column projection
- Correctness validation: filtered result matches post-filter of full read
- filters=None preserves existing behavior
Component(s)
Python