Describe the bug, including details regarding any error messages, version, and platform.
We originally saw this issue when reading Parquet files into Pandas DataFrames via PyArrow: a column containing arrays of floats comes back with corrupted values when a filter is applied at read time. We've narrowed it down to the example below, which involves no I/O and no Pandas. Note that the error shown in the script only happens with large values of N. Here are the N values that I tested:
- 1,000 - ok
- 10,000 - ok
- 100,000 - ok
- 250,000 - ok
- 500,000 - fail
- 1,000,000 - fail
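For scale, here is a rough sketch of how large the float data in the "numbers" column is at each tested N, assuming 2000 float64 values (8 bytes each) per row as in the script below. It only does arithmetic. The comparison against 2**32 bytes is an observation about where the pass/fail boundary happens to fall, not a confirmed cause.

```python
# Approximate size of the float data behind the "numbers" column for each
# tested N. Assumes 2000 float64 values (8 bytes each) per row.
ARRAY_LEN = 2000
BYTES_PER_FLOAT64 = 8

tested = [1_000, 10_000, 100_000, 250_000, 500_000, 1_000_000]
sizes = {n: n * ARRAY_LEN * BYTES_PER_FLOAT64 for n in tested}

for n, nbytes in sizes.items():
    # 2**32 is used here only as a reference point (assumption, see above).
    marker = "above" if nbytes > 2**32 else "below"
    print(f"N={n:>9,}: {nbytes:>14,} bytes ({marker} 2**32)")
```

The last passing case (N = 250,000) is about 4.0e9 bytes, just under 2**32, while the first failing case (N = 500,000) is about 8.0e9 bytes.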
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
N = 500_000
ARRAY_LEN = 2000
ids = np.arange(N)
texts = [f"Row {i} with data" for i in range(N)]
rng = np.random.default_rng(42)
matrix = rng.random((N, ARRAY_LEN))
matrix[:, 0] = ids
numbers = [matrix[i] for i in range(N)]
tbl = pa.table({"id": ids, "text": texts, "numbers": numbers})
print("PYARROW VERSION:", pa.__version__)
print()
print("ORIGINAL DATA")
print(ids[N - 1])
print(numbers[N - 1].tolist()[:5])
print()
print("SLICED DATA")
print(tbl.slice(N - 1, 1))
print()
print("FILTERED DATA")
print(tbl.filter(pc.field("id") == N - 1))
Output (generated on Ubuntu 22.04 x86_64 with pyarrow==23.0.1):
PYARROW VERSION: 23.0.1
ORIGINAL DATA
499999
[499999.0, 0.2806802660498191, 0.18948458094650322, 0.6611584406407851, 0.340530752637791]
SLICED DATA
pyarrow.Table
id: int64
text: string
numbers: list<item: double>
child 0, item: double
----
id: [[499999]]
text: [["Row 499999 with data"]]
numbers: [[[499999,0.2806802660498191,0.18948458094650322,0.6611584406407851,0.340530752637791,...,0.19918275933231844,0.42906946186903017,0.49644347191463034,0.3171420306034032,0.13584405454197468]]]
FILTERED DATA
pyarrow.Table
id: int64
text: string
numbers: list<item: double>
child 0, item: double
----
id: [[],[],...,[],[499999]]
text: [[],[],...,[],["Row 499999 with data"]]
numbers: [[],[],...,[],[[0.31442923271553835,0.6938060356899268,0.6428265846122176,0.45896565050138827,0.5739393526702229,...,0.13894123671983727,0.47783950795209007,0.7710005399634996,0.6678959811701984,0.7366509797101941]]]
Component(s)
Python