Revisit ExternalSorter sort strategy and sort_in_place_threshold_bytes #21543

@mbutrovich

Description

Context

ExternalSorter branches on sort_in_place_threshold_bytes (default 1MB) in in_mem_sort_stream():

  • Below 1MB: concatenate all buffered batches into one RecordBatch, sort in place
  • Above 1MB: sort each batch individually, then streaming-merge them
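In simplified form, the branch looks like the sketch below. This is an illustration only, not DataFusion's actual code: plain i32 vectors stand in for RecordBatches, a naive heap-based k-way merge stands in for StreamingMergeBuilder, and the function names are invented for the example.

```rust
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1024 * 1024; // default 1MB

/// Illustrative stand-in for in_mem_sort_stream's branch: below the byte
/// threshold, concatenate and sort once; above it, sort each batch and merge.
fn sort_buffered(batches: Vec<Vec<i32>>) -> Vec<i32> {
    let total_bytes: usize = batches
        .iter()
        .map(|b| b.len() * std::mem::size_of::<i32>())
        .sum();
    if total_bytes < SORT_IN_PLACE_THRESHOLD_BYTES {
        // Concat path: one large buffer, one sort (temporary ~2x memory).
        let mut all: Vec<i32> = batches.into_iter().flatten().collect();
        all.sort_unstable();
        all
    } else {
        // Merge path: sort each small batch individually, then k-way merge.
        let runs: Vec<Vec<i32>> = batches
            .into_iter()
            .map(|mut b| {
                b.sort_unstable();
                b
            })
            .collect();
        merge_runs(runs)
    }
}

/// Naive binary-heap k-way merge over already-sorted runs.
fn merge_runs(runs: Vec<Vec<i32>>) -> Vec<i32> {
    use std::cmp::Reverse;
    use std::collections::BinaryHeap;
    let mut heap: BinaryHeap<Reverse<(i32, usize, usize)>> = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, i, 0)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((v, run, idx))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[run].get(idx + 1) {
            heap.push(Reverse((next, run, idx + 1)));
        }
    }
    out
}
```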

This threshold was introduced in May 2023 by @tustvold in #6163 ("Adaptive in-memory sort") with the comment: "This is a very rough heuristic and likely could be refined further." It was later extracted to a config option by @alamb in #7130 with the same 1MB default. The default hasn't changed since, though the surrounding sort architecture has evolved significantly: multi-level merge (#15700), chunked sort output (#19494), IncrementalSortIterator (#20314), and PartialSortExec (#9125).

Problem

The sort-each-batch-then-merge path dominates real workloads because typical in-memory buffer sizes exceed 1MB. In this path, each batch (often 1024–8192 rows) is sorted individually via lexsort_to_indices and then merged via StreamingMergeBuilder. This means:

  1. Per-batch sort kernels can't amortize overhead. Row-format sorting (e.g., the MSD radix sort on RowConverter output from arrow-rs#9683) is 2–3x faster than lexsort_to_indices at 32K+ rows, but at 1K–8K rows the RowConverter encoding cost dominates. The sort-then-merge path never gives these kernels enough rows to benefit: #21525 attempted to integrate the radix sort kernel into ExternalSorter and saw no improvement for exactly this reason.

  2. The concat path is gated on memory, not row count. The 1MB threshold is a memory proxy, but the actual concern is the temporary 2x memory spike from concat_batches (plus RowConverter allocation on top). A row-count or batch-count heuristic might be a better fit.

  3. The sort benchmark doesn't exercise the merge path. The benchmark produces 8 partitions of ~12 batches at 1024 rows each. In the sort partitioned variant, each partition holds ~12K rows (~100KB for integers), well below the 1MB threshold, so it always takes the concat-and-sort-in-place path. Benchmark results therefore don't reflect the sort-then-merge path that dominates at larger data sizes.
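Point 2 can be made concrete with some arithmetic. A byte budget admits very different row counts depending on row width, while the sort kernel's break-even point (~32K rows for the radix sort above) is expressed in rows. The helper below is illustrative only, not a DataFusion API:

```rust
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1024 * 1024; // current default

/// How many rows the byte threshold admits to the concat path for a given
/// encoded row width. Narrow rows get far more rows than wide rows for the
/// same byte budget, even though the sort kernel's needs scale with row count.
fn rows_admitted(row_width_bytes: usize) -> usize {
    SORT_IN_PLACE_THRESHOLD_BYTES / row_width_bytes
}
```

For 8-byte rows the 1MB budget admits 131,072 rows (well past a ~32K break-even), but for 512-byte rows it admits only 2,048, which is why a row-count gate could behave quite differently from the byte gate.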

Prior art: DuckDB

For comparison, DuckDB's sort redesign encodes into its normalized key format as data arrives during the sink phase, accumulating into large thread-local sorted runs. The encoding cost is amortized across the entire input stream, and sorting happens once per run on already-encoded data. This avoids the small-batch problem entirely — by the time sorting begins, each thread has a single large run to sort.

Possible directions

These aren't mutually exclusive:

  • Raise or rethink the threshold. The 1MB limit was chosen conservatively. With IncrementalSortIterator (added in #20314) now yielding sorted output in chunks, the peak memory of the concat-and-sort path may be more manageable than it was in 2023. Could we raise it, or gate on row count instead?

  • Coalesce batches before sorting in the merge path. When the merge path is taken, we could concatenate small batches into larger ones (e.g., 32K–64K rows) before sorting, giving row-format kernels enough rows to amortize encoding. This also reduces the merge fan-in, which relates to #7181 (cascaded merge/tree for large fan-in). This doesn't require concatenating the entire buffer — just local coalescing.

  • Incremental Rows encoding. RowConverter::append already supports incrementally extending a Rows buffer across batches. ExternalSorter could maintain a Rows alongside its in_mem_batches, calling append as each batch arrives (similar to DuckDB's approach). At sort time, the encoding is already done — you just sort the accumulated Rows and use the indices to reorder the original batches. The tradeoff is higher memory during accumulation (raw batches + encoded rows), but encoding cost is fully amortized and radix sort gets a large contiguous run to work with.

  • Fix the benchmark. Increase BATCH_SIZE and/or INPUT_SIZE in benches/sort.rs so that sort partitioned exercises the sort-then-merge path.
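The coalescing direction can be sketched as below. Again this is a std-only illustration (plain i32 vectors standing in for RecordBatches, names invented for the example), not a proposed DataFusion implementation:

```rust
/// Greedily group adjacent small batches until a target row count is
/// reached, concatenate each group, and sort it as one run. The merge
/// fan-in drops from batches.len() to roughly total_rows / target_rows,
/// and each sort sees enough rows to amortize row-format encoding.
fn coalesce_and_sort(batches: Vec<Vec<i32>>, target_rows: usize) -> Vec<Vec<i32>> {
    let mut runs = Vec::new();
    let mut current: Vec<i32> = Vec::new();
    for batch in batches {
        current.extend(batch);
        if current.len() >= target_rows {
            current.sort_unstable();
            runs.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        // Final partial group.
        current.sort_unstable();
        runs.push(current);
    }
    runs
}
```

With the benchmark's shape (12 batches of 1024 rows) and a 4096-row target, this turns a 12-way merge into a 3-way merge of 4096-row sorted runs.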
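The incremental-encoding direction is sketched below with a plain Vec of (key, location) pairs standing in for arrow's Rows buffer, since this example avoids pulling in the arrow crate; the struct and method names are invented for illustration, and insert_batch plays the role of calling RowConverter::append per arriving batch:

```rust
/// Accumulates sort keys incrementally as batches arrive, so that at sort
/// time the encoding is already done and only one large argsort remains.
struct IncrementalSorter {
    batches: Vec<Vec<i32>>,              // raw buffered batches
    encoded: Vec<(i32, (usize, usize))>, // (key, (batch, row)) — stand-in for Rows
}

impl IncrementalSorter {
    fn new() -> Self {
        Self { batches: Vec::new(), encoded: Vec::new() }
    }

    /// Append this batch's keys to the accumulated buffer (the analogue of
    /// extending a Rows via RowConverter::append). Memory cost during
    /// accumulation is raw batches + encoded keys.
    fn insert_batch(&mut self, batch: Vec<i32>) {
        let b = self.batches.len();
        for (r, &key) in batch.iter().enumerate() {
            self.encoded.push((key, (b, r)));
        }
        self.batches.push(batch);
    }

    /// Sort the single accumulated run of keys, then use the (batch, row)
    /// locations to gather the output order from the original batches.
    fn sorted(self) -> Vec<i32> {
        let IncrementalSorter { batches, mut encoded } = self;
        encoded.sort_unstable_by_key(|&(key, _)| key);
        encoded.into_iter().map(|(_, (b, r))| batches[b][r]).collect()
    }
}
```

Because all keys live in one contiguous buffer at sort time, a radix-style kernel would see one large run rather than many 1K–8K row fragments, which is the property the per-batch path lacks.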
