Skip to content

GH-3449: Fix BinaryStats int overflow#3448

Open
Jiayi-Wang-db wants to merge 2 commits intoapache:masterfrom
Jiayi-Wang-db:fix-binary-stats-overflow
Open

GH-3449: Fix BinaryStats int overflow#3448
Jiayi-Wang-db wants to merge 2 commits intoapache:masterfrom
Jiayi-Wang-db:fix-binary-stats-overflow

Conversation

@Jiayi-Wang-db
Copy link
Contributor

@Jiayi-Wang-db Jiayi-Wang-db commented Mar 16, 2026

Rationale for this change

BinaryStatistics.isSmallerThan() computes min.length() + max.length() as int + int. When a binary column contains values >= 2^30 bytes (1 GB), this sum overflows to a negative number, which is then incorrectly evaluated as smaller than MAX_STATS_SIZE (4096). This causes the full multi-GB min/max statistics to be serialized into the Thrift footer instead of being dropped, resulting in a corrupted Parquet file whose footer exceeds the 4-byte length field and cannot be read back.

What changes are included in this PR?

Promote int arithmetic to long before addition in BinaryStatistics.isSmallerThan() and isSmallerThanWithTruncation() to prevent integer overflow when binary values are very large.

Are these changes tested?

Yes. Added a unit test in TestStatistics that creates a 1 GB binary value and asserts isSmallerThan(4096) correctly returns false.

Are there any user-facing changes?

Parquet files with very large binary values (>= 1 GB) will no longer have corrupted footers. The oversized statistics are now correctly dropped, producing a valid and readable file.

Closes #3449

@Jiayi-Wang-db Jiayi-Wang-db changed the title Fix BinaryStats int overflow GH-3449: Fix BinaryStats int overflow Mar 16, 2026
PrimitiveType type = Types.required(BINARY).named("test_binary");
Statistics<?> stats = Statistics.getBuilderForReading(type).build();

byte[] largeValue = new byte[1_073_741_824]; // 2^30 = 1 GB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to allocate such large memory in the test?

Use Binary.fromConstantByteArray with a fake large length instead of
allocating a real 1 GB byte array. The test only exercises length()
arithmetic in isSmallerThan, so the backing bytes are never accessed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix BinaryStats int overflow

2 participants