Skip to content

Compact more aggressively in TopK based upon memory usage#20381

Open
cetra3 wants to merge 1 commit intoapache:mainfrom
pydantic:topk_memory_batch
Open

Compact more aggressively in TopK based upon memory usage#20381
cetra3 wants to merge 1 commit intoapache:mainfrom
pydantic:topk_memory_batch

Conversation

@cetra3
Copy link
Contributor

@cetra3 cetra3 commented Feb 16, 2026

Which issue does this PR close?

Sort of addresses #19386

Rationale for this change

Compaction in TopK values currently uses some hard set heuristics to decide when to compact. Instead we can use the memory size of the batches as a bound.

What changes are included in this PR?

Adjusts TopK compaction to compact more aggressively, based upon memory size.

Are these changes tested?

Yes and a test has been added.

Are there any user-facing changes?

No

Benchmarks

I'm struggling (in this PR and other PRs...) to get some good reliable benchmarks through.

However with this one it seems like it's mostly the same speed as main or a little faster in some cases:

+ critcmp main topk_memory_batch
group                                                             main                                   topk_memory_batch
-----                                                             ----                                   -----------------
aggregate 10000000 time-series rows                               1.05     45.5±2.13ms        ? ?/sec    1.00     43.5±2.09ms        ? ?/sec
aggregate 10000000 worst-case rows                                1.06     43.9±1.12ms        ? ?/sec    1.00     41.5±1.03ms        ? ?/sec
distinct 10000000 rows asc [TopK]                                 1.00      4.3±0.25ms        ? ?/sec    1.00      4.3±0.07ms        ? ?/sec
distinct 10000000 rows asc [no TopK]                              1.00     42.9±1.83ms        ? ?/sec    1.06     45.5±1.65ms        ? ?/sec
distinct 10000000 rows desc [TopK]                                1.00      4.3±0.07ms        ? ?/sec    1.01      4.3±0.06ms        ? ?/sec
distinct 10000000 rows desc [no TopK]                             1.00     42.4±1.48ms        ? ?/sec    1.05     44.7±1.88ms        ? ?/sec
top k=10 aggregate 10000000 time-series rows                      1.13     10.6±0.54ms        ? ?/sec    1.00      9.4±0.64ms        ? ?/sec
top k=10 aggregate 10000000 time-series rows [Utf8View]           1.08     11.0±0.65ms        ? ?/sec    1.00     10.2±0.48ms        ? ?/sec
top k=10 aggregate 10000000 worst-case rows                       1.04     16.8±1.47ms        ? ?/sec    1.00     16.1±1.44ms        ? ?/sec
top k=10 aggregate 10000000 worst-case rows [Utf8View]            1.10     17.9±1.51ms        ? ?/sec    1.00     16.2±1.16ms        ? ?/sec
top k=10 string aggregate 10000000 time-series rows [Utf8View]    1.11      9.1±0.36ms        ? ?/sec    1.00      8.2±0.35ms        ? ?/sec
top k=10 string aggregate 10000000 time-series rows [Utf8]        1.12      7.8±0.48ms        ? ?/sec    1.00      6.9±0.26ms        ? ?/sec
top k=10 string aggregate 10000000 worst-case rows [Utf8View]     1.06      8.6±0.21ms        ? ?/sec    1.00      8.1±0.20ms        ? ?/sec
top k=10 string aggregate 10000000 worst-case rows [Utf8]         1.07      7.4±0.17ms        ? ?/sec    1.00      6.9±0.13ms        ? ?/sec

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Feb 16, 2026
@Dandandan
Copy link
Contributor

run benchmarks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments