ExternalSorter branches on sort_in_place_threshold_bytes (default 1MB) in in_mem_sort_stream():
Below 1MB: concatenate all buffered batches into one RecordBatch, sort in place
Above 1MB: sort each batch individually, then streaming-merge them
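In toy form, the branch looks roughly like the sketch below. This is not the actual DataFusion code: batches are plain `Vec<i32>` standing in for `RecordBatch`es, the size estimate is a simple `rows * size_of` product, and the k-way merge is a minimal binary-heap stand-in for `StreamingMergeBuilder`.

```rust
/// Toy model of the `in_mem_sort_stream()` branch (illustrative only).
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1024 * 1024; // 1MB default

fn sort_buffered(batches: Vec<Vec<i32>>) -> Vec<i32> {
    let total_bytes: usize = batches
        .iter()
        .map(|b| b.len() * std::mem::size_of::<i32>())
        .sum();

    if total_bytes < SORT_IN_PLACE_THRESHOLD_BYTES {
        // Concat path: one large allocation (a ~2x transient memory
        // spike, like concat_batches), then a single in-place sort.
        let mut all: Vec<i32> = batches.into_iter().flatten().collect();
        all.sort_unstable();
        all
    } else {
        // Merge path: sort each (small) batch individually, then
        // k-way merge the sorted runs. Fan-in = number of batches.
        let sorted: Vec<Vec<i32>> = batches
            .into_iter()
            .map(|mut b| {
                b.sort_unstable();
                b
            })
            .collect();
        k_way_merge(sorted)
    }
}

/// Minimal heap-based k-way merge of sorted runs.
fn k_way_merge(runs: Vec<Vec<i32>>) -> Vec<i32> {
    use std::cmp::Reverse;
    use std::collections::BinaryHeap;
    let mut heap: BinaryHeap<Reverse<(i32, usize, usize)>> = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, i, 0)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((v, run, pos))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[run].get(pos + 1) {
            heap.push(Reverse((next, run, pos + 1)));
        }
    }
    out
}
```

Both paths produce the same sorted output; they differ only in peak memory shape and per-row overhead.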
This threshold was introduced in May 2023 by @tustvold in #6163 ("Adaptive in-memory sort") with the comment: "This is a very rough heuristic and likely could be refined further." It was later extracted to a config option by @alamb in #7130 with the same 1MB default. The default hasn't changed since, though the surrounding sort architecture has evolved significantly: multi-level merge (#15700), chunked sort output (#19494), IncrementalSortIterator (#20314), and PartialSortExec (#9125).
Problem
The sort-each-batch-then-merge path dominates real workloads because typical in-memory buffer sizes exceed 1MB. In this path, each batch (often 1024–8192 rows) is sorted individually via lexsort_to_indices and then merged via StreamingMergeBuilder. This means:
Per-batch sort kernels can't amortize overhead. Row-format sorting (e.g., MSD radix sort on RowConverter output, arrow-rs#9683) is 2–3x faster than lexsort_to_indices at 32K+ rows, but at 1K–8K rows the RowConverter encoding cost dominates. The sort-then-merge path never gives these kernels enough rows to benefit. #21525 attempted to integrate the radix sort kernel into ExternalSorter and saw no improvement for this reason.
The concat path is gated on memory, not row count. The 1MB threshold is a memory proxy, but the actual concern is the temporary 2x memory spike from concat_batches (plus RowConverter allocation on top). A row-count or batch-count heuristic might be a better fit.
The sort benchmark doesn't exercise the merge path. The benchmark produces 8 partitions of ~12 batches at 1024 rows each. In the sort partitioned variant, each partition's ~12K rows (~100KB for integers) falls well below the 1MB threshold, so it always takes the concat-and-sort-in-place path. This means benchmark results don't reflect the sort-then-merge path that dominates at larger data sizes.
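The arithmetic behind that claim is quick to check. The sketch below assumes 8-byte integer values; the real benchmark's row width may differ, but even generous estimates stay far below the threshold.

```rust
// Back-of-envelope sizing for the benchmark partitions described above
// (assumed 8-byte integer values; illustrative, not measured).
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1024 * 1024;

fn partition_bytes(batches: usize, rows_per_batch: usize, bytes_per_value: usize) -> usize {
    batches * rows_per_batch * bytes_per_value
}
```

12 batches x 1024 rows x 8 bytes is 98,304 bytes (~96KB), roughly a tenth of the 1MB threshold, so the concat-and-sort path is taken every time.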
Prior art: DuckDB
For comparison, DuckDB's sort redesign encodes into its normalized key format as data arrives during the sink phase, accumulating into large thread-local sorted runs. The encoding cost is amortized across the entire input stream, and sorting happens once per run on already-encoded data. This avoids the small-batch problem entirely — by the time sorting begins, each thread has a single large run to sort.
Possible directions
These aren't mutually exclusive:
Raise or rethink the threshold. The 1MB limit was chosen conservatively. With IncrementalSortIterator (added in #20314) now yielding sorted output in chunks, the peak memory of the concat-and-sort path may be more manageable than it was in 2023. Could we raise it, or gate on row count instead?
Coalesce batches before sorting in the merge path. When the merge path is taken, we could concatenate small batches into larger ones (e.g., 32K–64K rows) before sorting, giving row-format kernels enough rows to amortize encoding. This also reduces the merge fan-in, which is related to #7181 (cascaded merge for large fan-in). This doesn't require concatenating the entire buffer — just local coalescing.
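A minimal sketch of that local coalescing, again on toy `Vec<i32>` batches. The 32K-row target is illustrative, not a tuned value; the point is that fan-in drops from the number of buffered batches to the number of coalesced runs.

```rust
/// Illustrative target; real code would likely derive this from config.
const TARGET_COALESCED_ROWS: usize = 32 * 1024;

/// Greedily packs adjacent small batches into runs of roughly
/// TARGET_COALESCED_ROWS rows and sorts each run once. The merge
/// fan-in becomes the number of returned runs, not the number of
/// input batches, and each sort sees enough rows to amortize
/// per-invocation overhead.
fn coalesce_and_sort(batches: Vec<Vec<i32>>) -> Vec<Vec<i32>> {
    let mut runs = Vec::new();
    let mut current: Vec<i32> = Vec::new();
    for batch in batches {
        current.extend(batch);
        if current.len() >= TARGET_COALESCED_ROWS {
            current.sort_unstable();
            runs.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        current.sort_unstable();
        runs.push(current);
    }
    runs
}
```

With 100 batches of 1024 rows, this yields 4 sorted runs instead of 100, so the downstream merge does far fewer comparisons per row.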
Incremental Rows encoding. RowConverter::append already supports incrementally extending a Rows buffer across batches. ExternalSorter could maintain a Rows alongside its in_mem_batches, calling append as each batch arrives (similar to DuckDB's approach). At sort time, the encoding is already done — you just sort the accumulated Rows and use the indices to reorder the original batches. The tradeoff is higher memory during accumulation (raw batches + encoded rows), but encoding cost is fully amortized and radix sort gets a large contiguous run to work with.
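The pattern can be modeled without arrow at all. The stand-in below encodes `i32` keys into order-preserving big-endian bytes (the real mechanism would be arrow-row's `RowConverter::append` producing `Rows`); everything else — append per arriving batch, then a single argsort over the accumulated keys at sort time — mirrors the proposal.

```rust
/// Toy stand-in for an incrementally extended `Rows` buffer.
struct IncrementalEncoder {
    /// (batch index, row index) for each encoded row.
    locations: Vec<(usize, usize)>,
    /// Accumulated order-preserving keys, one 4-byte key per row.
    keys: Vec<[u8; 4]>,
}

impl IncrementalEncoder {
    fn new() -> Self {
        Self { locations: Vec::new(), keys: Vec::new() }
    }

    /// Encode one batch as it arrives, extending the accumulated
    /// buffer (the analogue of calling `append` per batch).
    fn append(&mut self, batch_idx: usize, batch: &[i32]) {
        for (row, &v) in batch.iter().enumerate() {
            // Flipping the sign bit makes unsigned byte order match
            // signed integer order, so bytewise comparison sorts
            // correctly -- the core trick of normalized-key encodings.
            let key = ((v as u32) ^ 0x8000_0000).to_be_bytes();
            self.keys.push(key);
            self.locations.push((batch_idx, row));
        }
    }

    /// At sort time the encoding is already paid for: sort indices by
    /// the accumulated keys and return (batch, row) pairs in order,
    /// ready to gather rows from the original batches.
    fn sorted_locations(&self) -> Vec<(usize, usize)> {
        let mut idx: Vec<usize> = (0..self.keys.len()).collect();
        idx.sort_unstable_by_key(|&i| self.keys[i]);
        idx.into_iter().map(|i| self.locations[i]).collect()
    }
}
```

Because the keys live in one contiguous buffer spanning all batches, a radix sort would finally see a single large run, which is exactly what the per-batch path denies it.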
Fix the benchmark. Increase BATCH_SIZE and/or INPUT_SIZE in benches/sort.rs so that sort partitioned exercises the sort-then-merge path.
Related issues
batch_size instead of Emit::All (large batches flowing into ExternalSorter cause memory pressure)