Skip to content

feat: add gather_time and coalesce_time metrics to shuffle write#3734

Draft
andygrove wants to merge 2 commits intoapache:mainfrom
andygrove:shuffle-write-profiling
Draft

feat: add gather_time and coalesce_time metrics to shuffle write#3734
andygrove wants to merge 2 commits intoapache:mainfrom
andygrove:shuffle-write-profiling

Conversation

@andygrove
Copy link
Member

Which issue does this PR close?

No issue. This adds observability to the shuffle write path.

Rationale for this change

The shuffle write path has existing metrics for repart_time (partition ID computation), encode_time (IPC + compression), and write_time (disk I/O), but a large portion of shuffle write time was unaccounted for. Profiling TPC-H 100GB revealed that interleave_record_batch (the gather step that pulls rows from buffered batches into per-partition output) accounts for 55% of total shuffle write time, but this was invisible in the existing metrics.

What changes are included in this PR?

Adds two new timing metrics to the shuffle write path:

  • gather_time: Time spent in interleave_record_batch, which gathers rows from buffered batches into per-partition batches during the write phase. This is the most expensive step in the shuffle write path (55.4% of total time in TPC-H 100GB benchmarks).
  • coalesce_time: Time spent in BatchCoalescer, which merges small batches before IPC serialization (1.4% of total time).

These metrics are wired through from the Rust native code to Spark SQL metrics, appearing alongside existing shuffle metrics in the Spark UI.

TPC-H 100GB Shuffle Write Breakdown (with new metrics)

Component Time % of shuffle write
gather/interleave 702.9s 55.4%
encoding + compression 302.9s 23.9%
write to disk 75.8s 6.0%
repartition (hash) 74.1s 5.8%
batch coalescing 17.4s 1.4%
unaccounted 95.2s 7.5%

The timing overhead is negligible (~0.3s out of 1268s total, using Instant::now() which is ~50ns per call).

How are these changes tested?

Existing shuffle tests (CometNativeShuffleSuite, CometExecSuite) continue to pass. The metrics are additive — they don't change any behavior, only add timing instrumentation to existing code paths. Verified with TPC-H 100GB benchmark that metrics are correctly reported and sum to expected totals.

Pass coalesce_time to BufBatchWriter::write and flush calls across all
callers. Add gather_time instrumentation around PartitionedBatchIterator
iteration in both multi_partition shuffle_write_partition and
partition_writer spill to measure interleave_record_batch time.
@andygrove andygrove marked this pull request as draft March 19, 2026 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant