Filtering corrupts data in column containing an array #49392

@brett-patterson-ent

Description

Describe the bug, including details regarding any error messages, version, and platform.

We originally saw this issue when reading parquet files into Pandas DataFrames via PyArrow: a column containing arrays of floats has corrupted values when a filter is applied at read time. We've narrowed it down to the below example (without any I/O or Pandas involved). Note that the error shown in the script below only happens with large values of N. Here are the N values that I tested:

  • 1,000 - ok
  • 10,000 - ok
  • 100,000 - ok
  • 250,000 - ok
  • 500,000 - fail
  • 1,000,000 - fail
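For scale, the sizes implied by the tested N values can be worked out from the `ARRAY_LEN = 2000` used in the script. The sketch below is plain arithmetic on the numbers in this report; the helper name `value_buffer_size` is hypothetical, and the comparison against 32-bit limits is an assumption about what might be relevant, not a confirmed diagnosis.

```python
ARRAY_LEN = 2000        # taken from the script in this report
BYTES_PER_DOUBLE = 8    # float64 element size
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
UINT32_RANGE = 2**32    # number of values an unsigned 32-bit integer can hold


def value_buffer_size(n):
    """Return (element count, byte size) of n rows of ARRAY_LEN doubles."""
    elements = n * ARRAY_LEN
    return elements, elements * BYTES_PER_DOUBLE


for n in (1_000, 10_000, 100_000, 250_000, 500_000, 1_000_000):
    elements, nbytes = value_buffer_size(n)
    print(
        f"N={n:>9,}  elements={elements:>13,}  bytes={nbytes:>14,}  "
        f"elements>int32_max={elements > INT32_MAX}  "
        f"bytes>=2**32={nbytes >= UINT32_RANGE}"
    )
```

For every tested N the element count stays below 2**31 - 1, while the byte size of the float data first exceeds 2**32 between N = 250,000 and N = 500,000, which is also where the failures start. That coincidence may be worth checking, but it is not evidence of the cause on its own.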

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc

N = 500_000
ARRAY_LEN = 2000

ids = np.arange(N)
texts = [f"Row {i} with data" for i in range(N)]

rng = np.random.default_rng(42)
matrix = rng.random((N, ARRAY_LEN))
matrix[:, 0] = ids
numbers = [matrix[i] for i in range(N)]

tbl = pa.table({"id": ids, "text": texts, "numbers": numbers})

print("PYARROW VERSION:", pa.__version__)
print()

print("ORIGINAL DATA")
print(ids[N - 1])
print(numbers[N - 1].tolist()[:5])
print()

print("SLICED DATA")
print(tbl.slice(N - 1, 1))
print()

print("FILTERED DATA")
print(tbl.filter(pc.field("id") == N - 1))

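Output (generated on Ubuntu 22.04 x86_64 with pyarrow==23.0.1). The sliced row matches the original data. The filtered row has the correct id and text, but its numbers array is different: its first element should be 499999.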

PYARROW VERSION: 23.0.1

ORIGINAL DATA
499999
[499999.0, 0.2806802660498191, 0.18948458094650322, 0.6611584406407851, 0.340530752637791]

SLICED DATA
pyarrow.Table
id: int64
text: string
numbers: list<item: double>
  child 0, item: double
----
id: [[499999]]
text: [["Row 499999 with data"]]
numbers: [[[499999,0.2806802660498191,0.18948458094650322,0.6611584406407851,0.340530752637791,...,0.19918275933231844,0.42906946186903017,0.49644347191463034,0.3171420306034032,0.13584405454197468]]]

FILTERED DATA
pyarrow.Table
id: int64
text: string
numbers: list<item: double>
  child 0, item: double
----
id: [[],[],...,[],[499999]]
text: [[],[],...,[],["Row 499999 with data"]]
numbers: [[],[],...,[],[[0.31442923271553835,0.6938060356899268,0.6428265846122176,0.45896565050138827,0.5739393526702229,...,0.13894123671983727,0.47783950795209007,0.7710005399634996,0.6678959811701984,0.7366509797101941]]]

Component(s)

Python
