Revisit ExternalSorter sort strategy and sort_in_place_threshold_bytes #21543

@mbutrovich

Description

Context

ExternalSorter branches on sort_in_place_threshold_bytes (default 1MB) in in_mem_sort_stream():

  • Below 1MB: concatenate all buffered batches into one RecordBatch, sort in place
  • Above 1MB: sort each batch individually, then streaming-merge them
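In simplified form, the branch looks like the sketch below. This is an illustration only, not DataFusion's actual code: plain i32 vectors stand in for RecordBatches, a naive heap-based k-way merge stands in for StreamingMergeBuilder, and the function names are invented for the example.

```rust
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1024 * 1024; // default 1MB

/// Illustrative stand-in for in_mem_sort_stream's branch: below the byte
/// threshold, concatenate and sort once; above it, sort each batch and merge.
fn sort_buffered(batches: Vec<Vec<i32>>) -> Vec<i32> {
    let total_bytes: usize = batches
        .iter()
        .map(|b| b.len() * std::mem::size_of::<i32>())
        .sum();
    if total_bytes < SORT_IN_PLACE_THRESHOLD_BYTES {
        // Concat path: one large buffer, one sort (temporary ~2x memory).
        let mut all: Vec<i32> = batches.into_iter().flatten().collect();
        all.sort_unstable();
        all
    } else {
        // Merge path: sort each small batch individually, then k-way merge.
        let runs: Vec<Vec<i32>> = batches
            .into_iter()
            .map(|mut b| {
                b.sort_unstable();
                b
            })
            .collect();
        merge_runs(runs)
    }
}

/// Naive binary-heap k-way merge over already-sorted runs.
fn merge_runs(runs: Vec<Vec<i32>>) -> Vec<i32> {
    use std::cmp::Reverse;
    use std::collections::BinaryHeap;
    let mut heap: BinaryHeap<Reverse<(i32, usize, usize)>> = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, i, 0)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((v, run, idx))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[run].get(idx + 1) {
            heap.push(Reverse((next, run, idx + 1)));
        }
    }
    out
}
```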

This threshold was introduced in May 2023 by @tustvold in #6163 ("Adaptive in-memory sort") with the comment: "This is a very rough heuristic and likely could be refined further." It was later extracted to a config option by @alamb in #7130 with the same 1MB default. The default hasn't changed since, though the surrounding sort architecture has evolved significantly: multi-level merge (#15700), chunked sort output (#19494), IncrementalSortIterator (#20314), and PartialSortExec (#9125).

Problem

The sort-each-batch-then-merge path dominates real workloads because typical in-memory buffer sizes exceed 1MB. In this path, each batch (often 1024–8192 rows) is sorted individually via lexsort_to_indices and then merged via StreamingMergeBuilder. This means:

  1. Per-batch sort kernels can't amortize overhead. Row-format sorting (e.g., the MSD radix sort on RowConverter output from arrow-rs#9683) is 2–3x faster than lexsort_to_indices at 32K+ rows, but at 1K–8K rows the RowConverter encoding cost dominates. The sort-then-merge path never gives these kernels enough rows to benefit: #21525 attempted to integrate the radix sort kernel into ExternalSorter and saw no improvement for exactly this reason.

  2. The concat path is gated on memory, not row count. The 1MB threshold is a memory proxy, but the actual concern is the temporary 2x memory spike from concat_batches (plus RowConverter allocation on top). A row-count or batch-count heuristic might be a better fit.

  3. The sort benchmark doesn't exercise the merge path. The benchmark produces 8 partitions of ~12 batches at 1024 rows each. In the sort partitioned variant, each partition holds ~12K rows (~100KB for integers), well below the 1MB threshold, so it always takes the concat-and-sort-in-place path. Benchmark results therefore don't reflect the sort-then-merge path that dominates at larger data sizes.
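Point 2 can be made concrete with some arithmetic. A byte budget admits very different row counts depending on row width, while the sort kernel's break-even point (~32K rows for the radix sort above) is expressed in rows. The helper below is illustrative only, not a DataFusion API:

```rust
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1024 * 1024; // current default

/// How many rows the byte threshold admits to the concat path for a given
/// encoded row width. Narrow rows get far more rows than wide rows for the
/// same byte budget, even though the sort kernel's needs scale with row count.
fn rows_admitted(row_width_bytes: usize) -> usize {
    SORT_IN_PLACE_THRESHOLD_BYTES / row_width_bytes
}
```

For 8-byte rows the 1MB budget admits 131,072 rows (well past a ~32K break-even), but for 512-byte rows it admits only 2,048, which is why a row-count gate could behave quite differently from the byte gate.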

Prior art: DuckDB

For comparison, DuckDB's sort redesign encodes into its normalized key format as data arrives during the sink phase, accumulating into large thread-local sorted runs. The encoding cost is amortized across the entire input stream, and sorting happens once per run on already-encoded data. This avoids the small-batch problem entirely — by the time sorting begins, each thread has a single large run to sort.

Possible directions

These aren't mutually exclusive:

  • Raise or rethink the threshold. The 1MB limit was chosen conservatively. With IncrementalSortIterator (added in #20314) now yielding sorted output in chunks, the peak memory of the concat-and-sort path may be more manageable than it was in 2023. Could we raise it, or gate on row count instead?

  • Coalesce batches before sorting in the merge path. When the merge path is taken, we could concatenate small batches into larger ones (e.g., 32K–64K rows) before sorting, giving row-format kernels enough rows to amortize encoding. This also reduces the merge fan-in, which relates to #7181 (cascaded merge/tree for large fan-in). This doesn't require concatenating the entire buffer — just local coalescing.

  • Incremental Rows encoding. RowConverter::append already supports incrementally extending a Rows buffer across batches. ExternalSorter could maintain a Rows alongside its in_mem_batches, calling append as each batch arrives (similar to DuckDB's approach). At sort time, the encoding is already done — you just sort the accumulated Rows and use the indices to reorder the original batches. The tradeoff is higher memory during accumulation (raw batches + encoded rows), but encoding cost is fully amortized and radix sort gets a large contiguous run to work with.

  • Fix the benchmark. Increase BATCH_SIZE and/or INPUT_SIZE in benches/sort.rs so that sort partitioned exercises the sort-then-merge path.
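The coalescing direction can be sketched as below. Again this is a std-only illustration (plain i32 vectors standing in for RecordBatches, names invented for the example), not a proposed DataFusion implementation:

```rust
/// Greedily group adjacent small batches until a target row count is
/// reached, concatenate each group, and sort it as one run. The merge
/// fan-in drops from batches.len() to roughly total_rows / target_rows,
/// and each sort sees enough rows to amortize row-format encoding.
fn coalesce_and_sort(batches: Vec<Vec<i32>>, target_rows: usize) -> Vec<Vec<i32>> {
    let mut runs = Vec::new();
    let mut current: Vec<i32> = Vec::new();
    for batch in batches {
        current.extend(batch);
        if current.len() >= target_rows {
            current.sort_unstable();
            runs.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        // Final partial group.
        current.sort_unstable();
        runs.push(current);
    }
    runs
}
```

With the benchmark's shape (12 batches of 1024 rows) and a 4096-row target, this turns a 12-way merge into a 3-way merge of 4096-row sorted runs.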
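The incremental-encoding direction is sketched below with a plain Vec of (key, location) pairs standing in for arrow's Rows buffer, since this example avoids pulling in the arrow crate; the struct and method names are invented for illustration, and insert_batch plays the role of calling RowConverter::append per arriving batch:

```rust
/// Accumulates sort keys incrementally as batches arrive, so that at sort
/// time the encoding is already done and only one large argsort remains.
struct IncrementalSorter {
    batches: Vec<Vec<i32>>,              // raw buffered batches
    encoded: Vec<(i32, (usize, usize))>, // (key, (batch, row)) — stand-in for Rows
}

impl IncrementalSorter {
    fn new() -> Self {
        Self { batches: Vec::new(), encoded: Vec::new() }
    }

    /// Append this batch's keys to the accumulated buffer (the analogue of
    /// extending a Rows via RowConverter::append). Memory cost during
    /// accumulation is raw batches + encoded keys.
    fn insert_batch(&mut self, batch: Vec<i32>) {
        let b = self.batches.len();
        for (r, &key) in batch.iter().enumerate() {
            self.encoded.push((key, (b, r)));
        }
        self.batches.push(batch);
    }

    /// Sort the single accumulated run of keys, then use the (batch, row)
    /// locations to gather the output order from the original batches.
    fn sorted(self) -> Vec<i32> {
        let IncrementalSorter { batches, mut encoded } = self;
        encoded.sort_unstable_by_key(|&(key, _)| key);
        encoded.into_iter().map(|(_, (b, r))| batches[b][r]).collect()
    }
}
```

Because all keys live in one contiguous buffer at sort time, a radix-style kernel would see one large run rather than many 1K–8K row fragments, which is the property the per-batch path lacks.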
