[Python][Dataset] Add filters parameter to orc.read_table() for predicate pushdown #49363

@cbb330

Summary

Part 4 of ORC predicate pushdown (#48986). Depends on #49361.

Add a filters parameter to pyarrow.orc.read_table() for API parity with Parquet's read_table(). This makes ORC predicate pushdown accessible to Python users without requiring the lower-level Dataset API.

Changes

python/pyarrow/orc.py:

Add a filters parameter to read_table(). When filters is not None, read_table() delegates to the Dataset API:

def read_table(source, columns=None, filesystem=None, filters=None):
    if filters is not None:
        import pyarrow.dataset as ds
        filter_expr = filters
        if not isinstance(filters, ds.Expression):
            filter_expr = ds.filters_to_expression(filters)
        dataset = ds.dataset(source, format='orc', filesystem=filesystem)
        return dataset.to_table(columns=columns, filter=filter_expr)
    # ... existing non-filter path unchanged

Supported filter formats:

  • Expression format: ds.field('id') > 100
  • DNF tuple format: [('id', '>', 100)] (Parquet-compatible)
  • Operators accepted in DNF tuples: ==, !=, <, >, <=, >=, in, not in
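To make the DNF semantics concrete, here is a plain-Python reference evaluator. It is only an illustration, not how pyarrow evaluates filters. It assumes the Parquet convention carries over unchanged: a flat list of tuples is a conjunction (AND), and a list of lists is a disjunction (OR) of conjunctions. The names `matches` and `_OPS` are hypothetical, and null handling is ignored.

```python
import operator

# Map each DNF operator string to a Python predicate.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def matches(row, filters):
    """Return True if `row` (a dict) satisfies `filters` given in DNF form."""
    # A flat list of tuples is a single conjunction; wrap it so both
    # shapes are handled uniformly as an OR of ANDs.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    return any(
        all(_OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in filters
    )


rows = [{"id": 50}, {"id": 150}, {"id": 250}]
between = [("id", ">", 100), ("id", "<", 200)]        # id > 100 AND id < 200
outside = [[("id", "<", 100)], [("id", ">", 200)]]    # id < 100 OR id > 200

print([r["id"] for r in rows if matches(r, between)])  # [150]
print([r["id"] for r in rows if matches(r, outside)])  # [50, 250]
```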

No Cython changes. This is pure Python, reusing existing Dataset API bindings and the filters_to_expression() utility already used by Parquet.

Examples

import pyarrow.orc as orc
import pyarrow.dataset as ds

# Expression format
table = orc.read_table('data.orc', filters=ds.field('id') > 1000)

# DNF tuple format
table = orc.read_table('data.orc', filters=[('id', '>', 1000)])

# Multiple conditions (AND)
table = orc.read_table('data.orc', filters=[('id', '>', 100), ('id', '<', 200)])

# With column projection
table = orc.read_table('data.orc', columns=['id', 'value'],
                       filters=[('id', '>', 1000)])

Tests

Tests in python/pyarrow/tests/test_orc.py:

  • Expression format smoke test
  • DNF tuple format smoke test
  • Integration with column projection
  • Correctness validation: filtered result matches post-filter of full read
  • filters=None preserves existing behavior

Component(s)

Python
