
[core] Fix OOM when writing/compacting table with large records #7621

Open

yugan95 wants to merge 1 commit into apache:master from yugan95:record-0410

Conversation

Contributor

@yugan95 yugan95 commented Apr 10, 2026

Purpose

Linked issue: close #7620
Fix OOM when writing table with large records (100MB+) and many buckets (e.g. 256) due to unbounded buffer growth in sort, merge and compaction paths. Each bucket's writer independently holds its own sort buffer, merge channels, and compaction readers. When a large record inflates an internal reuse buffer, that bloated buffer is retained per-bucket, causing memory usage to quickly exceed available heap.

Heap dump analysis identified four independent root causes:

1. Sort path — RowHelper internal buffer never shrinks

RowHelper.reuseWriter grows its internal MemorySegment list for large records, but BinaryRowWriter.reset() only resets the cursor without releasing oversized segments. Additionally, InternalRowSerializer.serialize() can exit via EOFException (a normal signal when the sort buffer is full), skipping any cleanup of the bloated buffer.
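The difference between a plain `reset()` and a shrinking reset can be sketched as follows. This is a minimal illustration of the idea behind the fix, not Paimon's actual `RowHelper`: the segment size, the 4 MB threshold location, and all names here are assumptions. In the real change, `resetIfTooLarge()` is additionally invoked from a `finally` block so the `EOFException` exit path cannot skip it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a segment-backed reuse buffer whose reset() keeps
// oversized segments alive, versus resetIfTooLarge() which releases them.
public class ReusableBuffer {
    static final int SEGMENT_SIZE = 32 * 1024;              // 32 KB per segment (assumed)
    static final long SHRINK_THRESHOLD = 4L * 1024 * 1024;  // 4 MB threshold from the PR

    private final List<byte[]> segments = new ArrayList<>();
    private int cursor;

    long capacity() {
        return (long) segments.size() * SEGMENT_SIZE;
    }

    // Simulate serializing a record of numBytes: grow the segment list as needed.
    void write(int numBytes) {
        while (capacity() < (long) cursor + numBytes) {
            segments.add(new byte[SEGMENT_SIZE]);
        }
        cursor += numBytes;
    }

    // Original behavior: only the cursor is reset; a 100 MB record leaves
    // 100 MB of segments pinned in memory for the lifetime of the writer.
    void reset() {
        cursor = 0;
    }

    // Fixed behavior: reset the cursor AND drop segments beyond the threshold,
    // so one large record cannot permanently inflate this bucket's buffer.
    void resetIfTooLarge() {
        cursor = 0;
        while (capacity() > SHRINK_THRESHOLD) {
            segments.remove(segments.size() - 1);
        }
    }

    public static void main(String[] args) {
        ReusableBuffer buf = new ReusableBuffer();
        buf.write(100 * 1024 * 1024);   // one 100 MB record inflates the buffer
        buf.resetIfTooLarge();          // oversized segments are released
        System.out.println(buf.capacity());
    }
}
```

With 256 buckets, the difference between retaining and releasing a 100 MB buffer per writer is the difference between ~25 GB of pinned heap and a bounded ~1 GB.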

2. Merge path — BinaryRowSerializer.deserialize(reuse) only grows, never shrinks

Each merge channel holds a BinaryRow reuse instance. When a large record is deserialized, the backing MemorySegment grows to fit it but is never shrunk for subsequent small records. With max-num-file-handles (default 128) channels each retaining a 100MB+ buffer, memory usage explodes.
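The shrink-on-deserialize idea can be sketched like this. It is an illustrative reduction of the fix, not `BinaryRowSerializer`'s real code: the helper name, the single-array model, and the minimum allocation are assumptions; only the 4 MB threshold comes from the PR.

```java
// Hypothetical sketch: choose the buffer for the next deserialized record.
// Keep the reuse buffer only while it is both big enough and not bloated.
public class ShrinkOnReuse {
    static final int SHRINK_THRESHOLD = 4 * 1024 * 1024; // 4 MB, per the PR
    static final int MIN_ALLOC = 1024;                   // assumed minimum allocation

    static byte[] bufferFor(byte[] reuse, int recordSize) {
        if (reuse.length >= recordSize && reuse.length <= SHRINK_THRESHOLD) {
            return reuse; // small enough: keep reusing across records
        }
        // Either too small for this record, or bloated by an earlier large
        // record: reallocate at the actual size so the old buffer can be GC'd.
        return new byte[Math.max(recordSize, MIN_ALLOC)];
    }

    public static void main(String[] args) {
        byte[] bloated = new byte[100 * 1024 * 1024]; // left over from a 100 MB record
        byte[] next = bufferFor(bloated, 200);        // next record is small
        System.out.println(next.length);
    }
}
```

Without the shrink, each of the 128 default merge channels can independently hold such a bloated buffer, which is why memory explodes during external merge.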

3. Compaction read path — HeapBytesVector.reserveBytes() integer overflow

reserveBytes() computes the new capacity as newCapacity * 2 using plain int multiplication. Once newCapacity exceeds ~1.07 billion bytes, the product wraps past Integer.MAX_VALUE to a negative value, causing NegativeArraySizeException or silent data corruption.
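A minimal sketch of overflow-safe capacity growth, assuming names chosen here for illustration (the real fix lives inside `HeapBytesVector.reserveBytes()`). Note that doubling wraps negative regardless of whether it is written as `* 2` or `<< 1`; the cap at `MAX_ARRAY_SIZE` is what actually makes the growth safe:

```java
// Hypothetical sketch of the overflow-safe growth policy described in the PR.
public class SafeGrowth {
    // Largest array size safely allocatable on most JVMs (object header slack).
    static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    static int grownCapacity(int current, int required) {
        if (required > MAX_ARRAY_SIZE) {
            // Clear error instead of NegativeArraySizeException downstream.
            throw new OutOfMemoryError(
                    "Required capacity exceeds max array size: " + required);
        }
        int doubled = current << 1;
        // A wrapped (negative) or over-cap result is clamped to the maximum.
        if (doubled < 0 || doubled > MAX_ARRAY_SIZE) {
            doubled = MAX_ARRAY_SIZE;
        }
        return Math.max(doubled, required);
    }

    public static void main(String[] args) {
        // 1.2e9 * 2 would wrap negative; the clamp keeps growth valid.
        System.out.println(grownCapacity(1_200_000_000, 1_300_000_000));
    }
}
```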

4. Parquet write — statistics and page-size-check config not passed through

RowDataParquetBuilder does not pass through parquet.statistics.truncate.length, parquet.columnindex.truncate.length, parquet.page.size.row.check.min, and parquet.page.size.row.check.max. Without these, users cannot tune Parquet behavior for large-record scenarios, leading to multi-GB pages and bloated footers.
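The pass-through pattern can be sketched as a config lookup with a fallback. The four key names come from the PR; the helper name, the map standing in for Paimon's options object, and the default values below are illustrative assumptions, not verified Parquet defaults:

```java
import java.util.Map;

// Hypothetical sketch: read a user-supplied Parquet tuning option, falling
// back to a default when the key is absent from the table configuration.
public class ParquetTuning {
    static int orDefault(String value, int dflt) {
        return value == null ? dflt : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        // Stand-in for the table's options; only one key is set by the user.
        Map<String, String> conf = Map.of("parquet.page.size.row.check.min", "10");
        int minRowCheck = orDefault(conf.get("parquet.page.size.row.check.min"), 100);
        int maxRowCheck = orDefault(conf.get("parquet.page.size.row.check.max"), 10_000);
        int statsTruncate =
                orDefault(conf.get("parquet.statistics.truncate.length"), Integer.MAX_VALUE);
        System.out.println(minRowCheck + " " + maxRowCheck + " " + statsTruncate);
    }
}
```

Lowering the page-size-check bounds matters for large records because Parquet only checks page size every N rows; with 100 MB records, the default check interval can produce multi-GB pages before the first check fires.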

Changes

  1. RowHelper: add resetIfTooLarge() — release internal buffer when segments exceed 4MB
  2. InternalRowSerializer: call resetIfTooLarge() in finally block of serialize() and serializeToPages() to handle EOFException exit path
  3. BinaryRowSerializer: add shrink logic in deserialize(reuse) — reallocate when existing buffer > 4MB threshold
  4. HeapBytesVector: use bit-shift (<< 1) instead of * 2, cap at MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8, throw clear error on overflow
  5. RowDataParquetBuilder: pass through statistics.truncate.length, columnindex.truncate.length, min-row-count-for-page-size-check, max-row-count-for-page-size-check from config

Tests

  • RowHelperTest — validates resetIfTooLarge() releases oversized buffers (> 4MB) and preserves small ones
  • BinaryRowSerializerShrinkTest — validates deserialize(reuse) shrinks oversized buffers and preserves small ones
  • HeapBytesVectorReserveBytesTest — validates overflow-safe reserveBytes() growth and data correctness

API and Format

N/A — no public API or format changes.

Documentation

N/A



Development

Successfully merging this pull request may close these issues.

[Bug] OOM when writing table with large records (100MB+) due to unbounded buffer growth in sort, merge and compaction paths
