Optimize non fixed size segmented reduce for small segments using max_segment_size #7718

Merged
srinivasyadav18 merged 18 commits into NVIDIA:main from srinivasyadav18:opt_non_fixed_seg_reduce
Mar 10, 2026

@srinivasyadav18
Contributor

Description

closes #6898
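
For readers unfamiliar with the terms in the title, the following host-side sketch illustrates what a segmented reduction over non-fixed-size segments computes, and what a maximum segment size means for a given set of offsets. It is only a reference illustration: the function names `segmented_sum_reference` and `max_segment_size_of` are hypothetical, and how this PR's device code actually consumes `max_segment_size` is not shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not CUB code.
// Sums each segment [begin_offsets[s], end_offsets[s]) of `in`.
inline std::vector<double> segmented_sum_reference(
  const std::vector<double>& in,
  const std::vector<int>& begin_offsets,
  const std::vector<int>& end_offsets)
{
  std::vector<double> out(begin_offsets.size(), 0.0);
  for (std::size_t seg = 0; seg < begin_offsets.size(); ++seg)
  {
    for (int i = begin_offsets[seg]; i < end_offsets[seg]; ++i)
    {
      out[seg] += in[static_cast<std::size_t>(i)];
    }
  }
  return out;
}

// Hypothetical helper: the largest segment length implied by the offsets.
// Any value >= this result is a valid upper bound on the segment size.
inline int max_segment_size_of(const std::vector<int>& begin_offsets,
                               const std::vector<int>& end_offsets)
{
  int result = 0;
  for (std::size_t seg = 0; seg < begin_offsets.size(); ++seg)
  {
    result = std::max(result, end_offsets[seg] - begin_offsets[seg]);
  }
  return result;
}
```

With input `{1, 2, 3, 4, 5, 6}`, begin offsets `{0, 2, 3}`, and end offsets `{2, 3, 6}`, the segments have lengths 2, 1, and 3, so the per-segment sums are 3, 3, and 15 and the maximum segment size is 3.
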

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@srinivasyadav18 srinivasyadav18 requested review from a team as code owners February 19, 2026 02:20
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Feb 19, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Feb 19, 2026
@srinivasyadav18 srinivasyadav18 changed the title from "Opt non fixed size segmented reduce for small segments using max_segment_size" to "Optimize non fixed size segmented reduce for small segments using max_segment_size" Feb 19, 2026

Comment thread cub/cub/device/dispatch/tuning/tuning_reduce.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_segmented_reduce.cuh
Comment thread cub/cub/device/dispatch/dispatch_segmented_reduce.cuh Outdated
Comment thread cub/cub/device/dispatch/dispatch_segmented_reduce.cuh Outdated
Comment thread cub/test/catch2_test_device_segmented_reduce_max_seg_size.cu Outdated
Comment thread cub/test/catch2_test_device_segmented_reduce_max_seg_size.cu
Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated
Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated

Contributor

@bernhardmgruber bernhardmgruber left a comment

I think this PR is massively complicated by the fact that the segmented reduction dispatch was already refactored to the new tuning API, while the fixed-size segmented dispatch was not. I strongly suggest refactoring the fixed-size dispatch first (#7641) and then rebasing this PR.

Comment thread cub/benchmarks/bench/segmented_reduce/variable_argmax.cu Outdated
Comment thread cub/benchmarks/bench/segmented_reduce/variable_argmax.cu Outdated
Comment thread cub/cub/device/dispatch/tuning/tuning_segmented_reduce.cuh Outdated
Comment thread c/parallel/src/segmented_reduce.cu
Comment thread cub/benchmarks/bench/segmented_reduce/base.cuh
Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated
Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated
Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated
Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated
Comment thread cub/cub/device/dispatch/tuning/tuning_segmented_reduce.cuh Outdated

Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated
Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_segmented_reduce.cuh Outdated
Contributor

Nit: I believe `int(...)` is unnecessary here, since the `block_threads` member of the `segmented_reduce_policy` struct is already of type `int`.
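
A minimal sketch of the point in this nit, assuming a simplified stand-in for the policy struct. The real `segmented_reduce_policy` is not reproduced here; the struct layout, the `items_per_thread` member, and both helper functions below are hypothetical and exist only to show that a functional cast to `int` on a member that is already `int` changes nothing.

```cpp
#include <type_traits>

// Hypothetical stand-in for the policy struct discussed in the review.
struct segmented_reduce_policy
{
  int block_threads;
  int items_per_thread; // illustrative member, assumed for this sketch
};

// The member is already an int, so no conversion is needed at the use site.
static_assert(std::is_same<decltype(segmented_reduce_policy::block_threads), int>::value,
              "block_threads is expected to be int");

// With the redundant cast, as in the reviewed line.
constexpr int tile_size_with_cast(segmented_reduce_policy p)
{
  return int(p.block_threads) * p.items_per_thread;
}

// Without the cast: same type, same value.
constexpr int tile_size(segmented_reduce_policy p)
{
  return p.block_threads * p.items_per_thread;
}

static_assert(tile_size({256, 4}) == tile_size_with_cast({256, 4}),
              "the cast does not change the result");
```

Both functions return the same value for every input, which is why the cast can be dropped without affecting behavior.
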

uses cuda iterators in benchmark
uses proper init for thrust vectors
clean up docs in kernel
disables sass check for c parallel segmented reduce test

Comment thread cub/cub/device/dispatch/tuning/tuning_segmented_reduce.cuh Outdated
Comment thread cub/cub/device/dispatch/tuning/tuning_segmented_reduce.cuh Outdated
Comment thread cub/cub/device/dispatch/tuning/tuning_segmented_reduce.cuh Outdated
Comment thread cub/cub/device/dispatch/tuning/tuning_segmented_reduce.cuh
Comment thread cub/cub/device/dispatch/tuning/tuning_segmented_reduce.cuh Outdated
Comment thread cub/cub/device/dispatch/dispatch_segmented_reduce.cuh
Comment thread cub/test/catch2_test_device_segmented_reduce_max_seg_size.cu Outdated
Comment thread cub/test/catch2_test_device_segmented_reduce_max_seg_size.cu Outdated
Comment thread cub/test/catch2_test_device_segmented_reduce_max_seg_size.cu Outdated
Comment thread cub/test/catch2_test_device_segmented_reduce_max_seg_size.cu
Comment thread cub/cub/device/dispatch/tuning/tuning_segmented_reduce.cuh Outdated
Comment thread cub/test/catch2_test_device_segmented_reduce_max_seg_size.cu Outdated

@srinivasyadav18
Contributor Author

srinivasyadav18 commented Mar 4, 2026

Performance Report:

Small, medium, and large segment reduction code paths, using the default max_segment_size of 0

NVIDIA RTX A6000 (SM 86)

argmax, T{ct}=F64, OffsetT=I32: no regressions, mostly noise

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  MaxSegmentSize  |  GuaranteedMaxSegSize  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------------|------------------------|------------|-------------|------------|-------------|---------------|---------|----------|
|   F64   |      I32      |      2^16      |       2^1        |           0            | 432.990 us |       3.31% | 433.910 us |       2.67% |      0.920 us |   0.21% |   SAME   |
|   F64   |      I32      |      2^20      |       2^1        |           0            |   6.683 ms |       0.14% |   6.693 ms |       0.43% |     10.359 us |   0.16% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^1        |           0            | 107.251 ms |       0.00% | 106.703 ms |       0.16% |   -548.173 us |  -0.51% |   FAST   |
|   F64   |      I32      |      2^28      |       2^1        |           0            |    1.728 s |       0.00% |    1.713 s |       0.17% | -14510.783 us |  -0.84% |   FAST   |
|   F64   |      I32      |      2^16      |       2^2        |           0            | 257.541 us |       0.20% | 256.362 us |       0.19% |     -1.179 us |  -0.46% |   FAST   |
|   F64   |      I32      |      2^20      |       2^2        |           0            |   4.057 ms |       0.01% |   4.039 ms |       0.01% |    -17.453 us |  -0.43% |   FAST   |
|   F64   |      I32      |      2^24      |       2^2        |           0            |  64.998 ms |       0.15% |  64.530 ms |       0.00% |   -467.709 us |  -0.72% |   FAST   |
|   F64   |      I32      |      2^28      |       2^2        |           0            |    1.040 s |       0.00% |    1.039 s |       0.33% |  -1922.648 us |  -0.18% |   FAST   |
|   F64   |      I32      |      2^16      |       2^3        |           0            | 144.858 us |       0.34% | 144.804 us |       0.33% |     -0.053 us |  -0.04% |   SAME   |
|   F64   |      I32      |      2^20      |       2^3        |           0            |   2.263 ms |       0.02% |   2.263 ms |       0.01% |      0.064 us |   0.00% |   SAME   |
|   F64   |      I32      |      2^24      |       2^3        |           0            |  36.267 ms |       0.23% |  36.133 ms |       0.00% |   -134.580 us |  -0.37% |   FAST   |
|   F64   |      I32      |      2^28      |       2^3        |           0            | 580.976 ms |       0.00% | 578.035 ms |       0.00% |  -2940.444 us |  -0.51% |   FAST   |
|   F64   |      I32      |      2^16      |       2^4        |           0            |  79.513 us |       0.62% |  79.280 us |       0.62% |     -0.232 us |  -0.29% |   SAME   |
|   F64   |      I32      |      2^20      |       2^4        |           0            |   1.208 ms |       0.02% |   1.202 ms |       0.02% |     -6.125 us |  -0.51% |   FAST   |
|   F64   |      I32      |      2^24      |       2^4        |           0            |  19.221 ms |       0.00% |  19.124 ms |       0.00% |    -97.276 us |  -0.51% |   FAST   |
|   F64   |      I32      |      2^28      |       2^4        |           0            | 307.590 ms |       0.00% | 307.901 ms |       0.26% |    310.581 us |   0.10% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^5        |           0            |  43.993 us |       0.66% |  44.116 us |       0.65% |      0.123 us |   0.28% |   SAME   |
|   F64   |      I32      |      2^20      |       2^5        |           0            | 623.977 us |       0.08% | 625.161 us |       0.08% |      1.184 us |   0.19% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^5        |           0            |   9.911 ms |       0.01% |   9.932 ms |       0.00% |     21.381 us |   0.22% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^5        |           0            | 158.466 ms |       0.00% | 158.805 ms |       0.00% |    338.517 us |   0.21% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^6        |           0            |  25.094 us |       2.03% |  25.390 us |       1.67% |      0.296 us |   1.18% |   SAME   |
|   F64   |      I32      |      2^20      |       2^6        |           0            | 323.644 us |       0.11% | 324.395 us |       0.13% |      0.751 us |   0.23% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^6        |           0            |   5.096 ms |       0.01% |   5.108 ms |       0.01% |     11.315 us |   0.22% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^6        |           0            |  81.542 ms |       0.32% |  81.584 ms |       0.00% |     41.983 us |   0.05% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^7        |           0            |  15.581 us |       3.23% |  15.384 us |       3.48% |     -0.197 us |  -1.27% |   SAME   |
|   F64   |      I32      |      2^20      |       2^7        |           0            | 170.501 us |       0.30% | 169.341 us |       0.29% |     -1.160 us |  -0.68% |   FAST   |
|   F64   |      I32      |      2^24      |       2^7        |           0            |   2.654 ms |       0.02% |   2.638 ms |       0.02% |    -15.899 us |  -0.60% |   FAST   |
|   F64   |      I32      |      2^28      |       2^7        |           0            |  42.375 ms |       0.00% |  42.126 ms |       0.00% |   -249.206 us |  -0.59% |   FAST   |
|   F64   |      I32      |      2^16      |       2^8        |           0            |  10.917 us |       4.31% |  10.607 us |       4.61% |     -0.310 us |  -2.84% |   SAME   |
|   F64   |      I32      |      2^20      |       2^8        |           0            |  93.845 us |       0.56% |  92.140 us |       0.61% |     -1.705 us |  -1.82% |   FAST   |
|   F64   |      I32      |      2^24      |       2^8        |           0            |   1.413 ms |       0.04% |   1.393 ms |       0.04% |    -19.705 us |  -1.39% |   FAST   |
|   F64   |      I32      |      2^28      |       2^8        |           0            |  22.466 ms |       0.01% |  22.206 ms |       0.16% |   -259.863 us |  -1.16% |   FAST   |
|   F64   |      I32      |      2^16      |       2^9        |           0            |   8.512 us |       5.44% |   8.495 us |       5.28% |     -0.017 us |  -0.20% |   SAME   |
|   F64   |      I32      |      2^20      |       2^9        |           0            |  56.145 us |       0.99% |  54.719 us |       1.08% |     -1.426 us |  -2.54% |   FAST   |
|   F64   |      I32      |      2^24      |       2^9        |           0            | 785.992 us |       0.08% | 767.307 us |       0.08% |    -18.685 us |  -2.38% |   FAST   |
|   F64   |      I32      |      2^28      |       2^9        |           0            |  12.470 ms |       0.02% |  12.174 ms |       0.01% |   -296.167 us |  -2.37% |   FAST   |
|   F64   |      I32      |      2^16      |       2^10       |           0            |   7.738 us |       6.61% |   7.728 us |       6.48% |     -0.009 us |  -0.12% |   SAME   |
|   F64   |      I32      |      2^20      |       2^10       |           0            |  37.404 us |       1.71% |  36.292 us |       1.76% |     -1.112 us |  -2.97% |   FAST   |
|   F64   |      I32      |      2^24      |       2^10       |           0            | 470.617 us |       0.15% | 454.965 us |       0.33% |    -15.651 us |  -3.33% |   FAST   |
|   F64   |      I32      |      2^28      |       2^10       |           0            |   7.428 ms |       0.44% |   7.226 ms |       0.39% |   -201.595 us |  -2.71% |   FAST   |
|   F64   |      I32      |      2^16      |       2^11       |           0            |   9.314 us |       3.55% |   9.440 us |       4.46% |      0.126 us |   1.35% |   SAME   |
|   F64   |      I32      |      2^20      |       2^11       |           0            |  30.691 us |       2.83% |  29.973 us |       2.60% |     -0.717 us |  -2.34% |   SAME   |
|   F64   |      I32      |      2^24      |       2^11       |           0            | 311.504 us |       0.44% | 305.707 us |       0.93% |     -5.798 us |  -1.86% |   FAST   |
|   F64   |      I32      |      2^28      |       2^11       |           0            |   4.914 ms |       0.57% |   4.819 ms |       0.81% |    -94.584 us |  -1.92% |   FAST   |
|   F64   |      I32      |      2^16      |       2^12       |           0            |  11.573 us |       4.46% |  11.908 us |       4.24% |      0.335 us |   2.89% |   SAME   |
|   F64   |      I32      |      2^20      |       2^12       |           0            |  27.884 us |       3.66% |  29.398 us |       2.02% |      1.513 us |   5.43% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^12       |           0            | 233.317 us |       1.17% | 231.084 us |       1.40% |     -2.233 us |  -0.96% |   SAME   |
|   F64   |      I32      |      2^28      |       2^12       |           0            |   3.624 ms |       1.00% |   3.591 ms |       1.08% |    -33.234 us |  -0.92% |   SAME   |
|   F64   |      I32      |      2^16      |       2^13       |           0            |  14.936 us |       3.55% |  15.065 us |       3.87% |      0.129 us |   0.86% |   SAME   |
|   F64   |      I32      |      2^20      |       2^13       |           0            |  26.840 us |       1.99% |  26.814 us |       1.98% |     -0.027 us |  -0.10% |   SAME   |
|   F64   |      I32      |      2^24      |       2^13       |           0            | 210.930 us |       0.82% | 212.404 us |       0.94% |      1.473 us |   0.70% |   SAME   |
|   F64   |      I32      |      2^28      |       2^13       |           0            |   3.073 ms |       1.46% |   3.072 ms |       1.56% |     -1.205 us |  -0.04% |   SAME   |
|   F64   |      I32      |      2^16      |       2^14       |           0            |  20.891 us |       3.66% |  22.084 us |       2.73% |      1.193 us |   5.71% |   SLOW   |
|   F64   |      I32      |      2^20      |       2^14       |           0            |  28.639 us |       2.22% |  29.059 us |       2.20% |      0.419 us |   1.46% |   SAME   |
|   F64   |      I32      |      2^24      |       2^14       |           0            | 217.009 us |       1.44% | 218.065 us |       1.48% |      1.056 us |   0.49% |   SAME   |
|   F64   |      I32      |      2^28      |       2^14       |           0            |   3.071 ms |       1.71% |   3.082 ms |       1.97% |     10.940 us |   0.36% |   SAME   |
|   F64   |      I32      |      2^16      |       2^15       |           0            |  25.440 us |       3.03% |  26.451 us |       2.85% |      1.011 us |   3.97% |   SLOW   |
|   F64   |      I32      |      2^20      |       2^15       |           0            |  39.473 us |       2.15% |  40.026 us |       1.86% |      0.553 us |   1.40% |   SAME   |
|   F64   |      I32      |      2^24      |       2^15       |           0            | 226.313 us |       2.38% | 229.214 us |       2.55% |      2.900 us |   1.28% |   SAME   |
|   F64   |      I32      |      2^28      |       2^15       |           0            |   3.092 ms |       1.85% |   3.102 ms |       2.01% |     10.282 us |   0.33% |   SAME   |
|   F64   |      I32      |      2^16      |       2^16       |           0            |  40.179 us |       1.71% |  41.387 us |       2.48% |      1.208 us |   3.01% |   SLOW   |
|   F64   |      I32      |      2^20      |       2^16       |           0            |  68.584 us |       1.34% |  69.706 us |       1.37% |      1.122 us |   1.64% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^16       |           0            | 238.269 us |       3.20% | 239.995 us |       0.56% |      1.726 us |   0.72% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^16       |           0            |   3.123 ms |       1.58% |   3.135 ms |       1.82% |     12.358 us |   0.40% |   SAME   |
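
As a reading aid for the table above: `Diff` is `Cmp Time - Ref Time` and `%Diff` is that difference divided by `Ref Time`. The rows shown are also consistent with a simple rule for the `Status` column, in which a row counts as `SAME` when the absolute `%Diff` does not exceed the smaller of the two noise values, and as `FAST` or `SLOW` otherwise. The sketch below encodes that rule as an assumption inferred from the rows, not as the comparison tool's actual implementation; `compare_times` and `comparison` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <string>

// Hypothetical helper, not taken from the benchmark tooling.
struct comparison
{
  double diff;        // cmp_time - ref_time, in the unit of the inputs
  double frac_diff;   // diff / ref_time (0.01 means 1%)
  std::string status; // "SAME", "FAST", or "SLOW"
};

// Noise values are fractions (for example 0.0331 for 3.31%).
// Assumed rule: a difference within the smaller noise value is "SAME".
inline comparison compare_times(double ref_time, double ref_noise,
                                double cmp_time, double cmp_noise)
{
  const double diff      = cmp_time - ref_time;
  const double frac_diff = diff / ref_time;
  const double min_noise = std::min(ref_noise, cmp_noise);

  std::string status = "SAME";
  if (std::abs(frac_diff) > min_noise)
  {
    status = (diff < 0.0) ? "FAST" : "SLOW";
  }
  return {diff, frac_diff, status};
}
```

For example, the 2^20-element, 2^12-segment row has a reference time of 27.884 us (3.66% noise) and a compare time of 29.398 us (2.02% noise). The difference is about 1.51 us, or about 5.43% of the reference, which exceeds the smaller noise value of 2.02%, so the row is marked `SLOW`.
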

sum, T{ct}=F64, OffsetT=I32: no regressions, mostly noise

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  MaxSegmentSize  |  GuaranteedMaxSegSize  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------------|------------------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   F64   |      I32      |      2^16      |       2^1        |           0            | 299.228 us |       2.89% | 302.101 us |       2.62% |    2.873 us |   0.96% |   SAME   |
|   F64   |      I32      |      2^20      |       2^1        |           0            |   4.589 ms |       0.01% |   4.628 ms |       0.37% |   39.271 us |   0.86% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^1        |           0            |  73.782 ms |       0.28% |  73.828 ms |       0.16% |   45.311 us |   0.06% |   SAME   |
|   F64   |      I32      |      2^28      |       2^1        |           0            |    1.187 s |       0.18% |    1.187 s |       0.23% | -865.347 us |  -0.07% |   SAME   |
|   F64   |      I32      |      2^16      |       2^2        |           0            | 179.187 us |       0.18% | 179.495 us |       0.25% |    0.308 us |   0.17% |   SAME   |
|   F64   |      I32      |      2^20      |       2^2        |           0            |   2.791 ms |       0.02% |   2.796 ms |       0.02% |    5.015 us |   0.18% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^2        |           0            |  44.565 ms |       0.02% |  44.642 ms |       0.02% |   77.415 us |   0.17% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^2        |           0            | 715.420 ms |       0.33% | 719.208 ms |       0.04% |    3.787 ms |   0.53% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^3        |           0            | 102.024 us |       0.48% | 102.209 us |       0.43% |    0.185 us |   0.18% |   SAME   |
|   F64   |      I32      |      2^20      |       2^3        |           0            |   1.563 ms |       0.03% |   1.567 ms |       0.03% |    3.601 us |   0.23% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^3        |           0            |  24.927 ms |       0.02% |  24.975 ms |       0.01% |   47.515 us |   0.19% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^3        |           0            | 398.874 ms |       0.06% | 400.128 ms |       0.26% |    1.254 ms |   0.31% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^4        |           0            |  57.161 us |       0.67% |  57.713 us |       0.84% |    0.551 us |   0.96% |   SLOW   |
|   F64   |      I32      |      2^20      |       2^4        |           0            | 831.222 us |       0.05% | 839.229 us |       0.06% |    8.008 us |   0.96% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^4        |           0            |  13.195 ms |       0.02% |  13.317 ms |       0.03% |  121.857 us |   0.92% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^4        |           0            | 211.196 ms |       0.05% | 213.127 ms |       0.05% |    1.931 ms |   0.91% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^5        |           0            |  32.401 us |       1.51% |  32.868 us |       1.12% |    0.467 us |   1.44% |   SLOW   |
|   F64   |      I32      |      2^20      |       2^5        |           0            | 430.792 us |       0.11% | 434.703 us |       0.11% |    3.910 us |   0.91% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^5        |           0            |   6.807 ms |       0.01% |   6.870 ms |       0.01% |   62.816 us |   0.92% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^5        |           0            | 108.824 ms |       0.04% | 109.827 ms |       0.04% |    1.003 ms |   0.92% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^6        |           0            |  19.694 us |       2.12% |  19.918 us |       2.54% |    0.224 us |   1.14% |   SAME   |
|   F64   |      I32      |      2^20      |       2^6        |           0            | 222.077 us |       0.19% | 224.189 us |       0.15% |    2.112 us |   0.95% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^6        |           0            |   3.466 ms |       0.03% |   3.497 ms |       0.01% |   31.405 us |   0.91% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^6        |           0            |  55.328 ms |       0.04% |  55.827 ms |       0.03% |  498.873 us |   0.90% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^7        |           0            |  12.685 us |       3.82% |  12.848 us |       4.06% |    0.163 us |   1.28% |   SAME   |
|   F64   |      I32      |      2^20      |       2^7        |           0            | 113.893 us |       0.37% | 114.965 us |       0.38% |    1.072 us |   0.94% |   SLOW   |
|   F64   |      I32      |      2^24      |       2^7        |           0            |   1.751 ms |       0.03% |   1.766 ms |       0.03% |   14.797 us |   0.84% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^7        |           0            |  27.946 ms |       0.03% |  28.180 ms |       0.04% |  233.165 us |   0.83% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^8        |           0            |   8.569 us |       7.03% |   8.682 us |       6.94% |    0.113 us |   1.32% |   SAME   |
|   F64   |      I32      |      2^20      |       2^8        |           0            |  60.553 us |       0.67% |  61.440 us |       0.00% |    0.887 us |   1.46% |   ????   |
|   F64   |      I32      |      2^24      |       2^8        |           0            | 885.989 us |       0.07% | 895.721 us |       0.07% |    9.732 us |   1.10% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^8        |           0            |  14.110 ms |       0.10% |  14.222 ms |       0.04% |  112.133 us |   0.79% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^9        |           0            |   6.187 us |       4.50% |   6.287 us |       5.63% |    0.100 us |   1.62% |   SAME   |
|   F64   |      I32      |      2^20      |       2^9        |           0            |  33.874 us |       1.62% |  34.257 us |       1.72% |    0.383 us |   1.13% |   SAME   |
|   F64   |      I32      |      2^24      |       2^9        |           0            | 459.912 us |       0.12% | 463.963 us |       0.11% |    4.051 us |   0.88% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^9        |           0            |   7.289 ms |       0.07% |   7.351 ms |       0.07% |   61.853 us |   0.85% |   SLOW   |
|   F64   |      I32      |      2^16      |       2^10       |           0            |   5.968 us |       7.03% |   5.958 us |       7.58% |   -0.011 us |  -0.18% |   SAME   |
|   F64   |      I32      |      2^20      |       2^10       |           0            |  23.399 us |       2.06% |  23.523 us |       2.07% |    0.124 us |   0.53% |   SAME   |
|   F64   |      I32      |      2^24      |       2^10       |           0            | 253.131 us |       0.53% | 254.684 us |       0.57% |    1.553 us |   0.61% |   SLOW   |
|   F64   |      I32      |      2^28      |       2^10       |           0            |   4.019 ms |       0.81% |   4.038 ms |       0.69% |   19.156 us |   0.48% |   SAME   |
|   F64   |      I32      |      2^16      |       2^11       |           0            |   6.266 us |       5.61% |   6.347 us |       6.55% |    0.082 us |   1.30% |   SAME   |
|   F64   |      I32      |      2^20      |       2^11       |           0            |  21.695 us |       2.68% |  21.805 us |       2.77% |    0.111 us |   0.51% |   SAME   |
|   F64   |      I32      |      2^24      |       2^11       |           0            | 201.555 us |       0.48% | 201.714 us |       0.53% |    0.158 us |   0.08% |   SAME   |
|   F64   |      I32      |      2^28      |       2^11       |           0            |   3.056 ms |       2.21% |   3.051 ms |       1.93% |   -4.644 us |  -0.15% |   SAME   |
|   F64   |      I32      |      2^16      |       2^12       |           0            |   7.283 us |       5.81% |   7.368 us |       5.58% |    0.085 us |   1.17% |   SAME   |
|   F64   |      I32      |      2^20      |       2^12       |           0            |  21.864 us |       2.23% |  21.773 us |       2.21% |   -0.091 us |  -0.42% |   SAME   |
|   F64   |      I32      |      2^24      |       2^12       |           0            | 201.261 us |       0.58% | 201.376 us |       0.52% |    0.115 us |   0.06% |   SAME   |
|   F64   |      I32      |      2^28      |       2^12       |           0            |   3.045 ms |       1.84% |   3.050 ms |       2.13% |    4.842 us |   0.16% |   SAME   |
|   F64   |      I32      |      2^16      |       2^13       |           0            |   8.259 us |       4.94% |   8.270 us |       5.37% |    0.011 us |   0.13% |   SAME   |
|   F64   |      I32      |      2^20      |       2^13       |           0            |  21.456 us |       1.89% |  21.536 us |       1.84% |    0.080 us |   0.37% |   SAME   |
|   F64   |      I32      |      2^24      |       2^13       |           0            | 202.011 us |       0.60% | 202.082 us |       0.59% |    0.071 us |   0.03% |   SAME   |
|   F64   |      I32      |      2^28      |       2^13       |           0            |   3.054 ms |       2.22% |   3.055 ms |       2.25% |    0.556 us |   0.02% |   SAME   |
|   F64   |      I32      |      2^16      |       2^14       |           0            |  10.232 us |       4.25% |  10.126 us |       4.20% |   -0.106 us |  -1.04% |   SAME   |
|   F64   |      I32      |      2^20      |       2^14       |           0            |  22.086 us |       2.30% |  21.955 us |       2.27% |   -0.131 us |  -0.59% |   SAME   |
|   F64   |      I32      |      2^24      |       2^14       |           0            | 203.555 us |       0.77% | 203.691 us |       0.82% |    0.136 us |   0.07% |   SAME   |
|   F64   |      I32      |      2^28      |       2^14       |           0            |   3.059 ms |       2.18% |   3.059 ms |       2.14% |   -0.411 us |  -0.01% |   SAME   |
|   F64   |      I32      |      2^16      |       2^15       |           0            |  11.851 us |       4.30% |  11.943 us |       4.15% |    0.092 us |   0.78% |   SAME   |
|   F64   |      I32      |      2^20      |       2^15       |           0            |  23.618 us |       1.82% |  23.627 us |       1.99% |    0.010 us |   0.04% |   SAME   |
|   F64   |      I32      |      2^24      |       2^15       |           0            | 207.015 us |       1.23% | 207.253 us |       1.31% |    0.238 us |   0.12% |   SAME   |
|   F64   |      I32      |      2^28      |       2^15       |           0            |   3.064 ms |       1.69% |   3.071 ms |       2.15% |    7.193 us |   0.23% |   SAME   |
|   F64   |      I32      |      2^16      |       2^16       |           0            |  17.059 us |       2.90% |  16.734 us |       2.84% |   -0.325 us |  -1.91% |   SAME   |
|   F64   |      I32      |      2^20      |       2^16       |           0            |  30.058 us |       1.95% |  30.139 us |       1.83% |    0.080 us |   0.27% |   SAME   |
|   F64   |      I32      |      2^24      |       2^16       |           0            | 213.066 us |       0.64% | 212.957 us |       0.62% |   -0.109 us |  -0.05% |   SAME   |
|   F64   |      I32      |      2^28      |       2^16       |           0            |   3.096 ms |       1.80% |   3.102 ms |       2.13% |    5.194 us |   0.17% |   SAME   |

sum, T{ct}=I32, OffsetT=I32: roughly 3-5% regressions on small and medium segments, which are already at < 5% SOL (speed of light)

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  MaxSegmentSize  |  GuaranteedMaxSegSize  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------------|------------------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I32   |      I32      |      2^16      |       2^1        |           0            |  95.039 us |       0.47% |  97.805 us |       0.51% |   2.766 us |   2.91% |   SLOW   |
|   I32   |      I32      |      2^20      |       2^1        |           0            |   1.432 ms |       1.55% |   1.484 ms |       1.81% |  51.645 us |   3.61% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^1        |           0            |  22.713 ms |       0.15% |  23.497 ms |       0.25% | 783.818 us |   3.45% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^1        |           0            | 366.016 ms |       0.21% | 379.459 ms |       0.30% |  13.443 ms |   3.67% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^2        |           0            |  57.676 us |       0.82% |  59.692 us |       0.76% |   2.015 us |   3.49% |   SLOW   |
|   I32   |      I32      |      2^20      |       2^2        |           0            | 867.866 us |       0.06% | 898.720 us |       0.06% |  30.854 us |   3.56% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^2        |           0            |  13.822 ms |       0.01% |  14.355 ms |       0.18% | 532.975 us |   3.86% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^2        |           0            | 221.955 ms |       0.06% | 231.240 ms |       0.21% |   9.285 ms |   4.18% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^3        |           0            |  33.792 us |       0.00% |  35.115 us |       1.31% |   1.323 us |   3.92% |   SLOW   |
|   I32   |      I32      |      2^20      |       2^3        |           0            | 488.862 us |       0.11% | 506.678 us |       0.12% |  17.816 us |   3.64% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^3        |           0            |   7.749 ms |       0.20% |   8.062 ms |       0.08% | 313.457 us |   4.05% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^3        |           0            | 124.178 ms |       0.15% | 129.140 ms |       0.21% |   4.962 ms |   4.00% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^4        |           0            |  20.040 us |       2.52% |  20.695 us |       2.24% |   0.655 us |   3.27% |   SLOW   |
|   I32   |      I32      |      2^20      |       2^4        |           0            | 262.468 us |       0.17% | 271.921 us |       0.24% |   9.453 us |   3.60% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^4        |           0            |   4.134 ms |       0.33% |   4.295 ms |       0.24% | 161.606 us |   3.91% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^4        |           0            |  66.198 ms |       0.17% |  69.006 ms |       0.39% |   2.808 ms |   4.24% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^5        |           0            |  12.366 us |       2.39% |  12.953 us |       3.75% |   0.587 us |   4.75% |   SLOW   |
|   I32   |      I32      |      2^20      |       2^5        |           0            | 138.086 us |       0.28% | 143.083 us |       0.36% |   4.998 us |   3.62% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^5        |           0            |   2.148 ms |       0.12% |   2.252 ms |       0.53% | 103.734 us |   4.83% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^5        |           0            |  34.582 ms |       0.24% |  36.016 ms |       0.06% |   1.435 ms |   4.15% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^6        |           0            |   8.401 us |       4.69% |   8.738 us |       5.82% |   0.338 us |   4.02% |   SAME   |
|   I32   |      I32      |      2^20      |       2^6        |           0            |  72.855 us |       0.49% |  74.937 us |       0.63% |   2.082 us |   2.86% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^6        |           0            |   1.108 ms |       0.27% |   1.150 ms |       0.55% |  42.315 us |   3.82% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^6        |           0            |  17.766 ms |       0.33% |  18.386 ms |       0.30% | 620.102 us |   3.49% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^7        |           0            |   6.208 us |       4.40% |   6.431 us |       7.01% |   0.223 us |   3.59% |   SAME   |
|   I32   |      I32      |      2^20      |       2^7        |           0            |  38.896 us |       0.82% |  40.027 us |       0.94% |   1.131 us |   2.91% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^7        |           0            | 568.151 us |       0.77% | 587.338 us |       1.09% |  19.186 us |   3.38% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^7        |           0            |   9.102 ms |       0.35% |   9.445 ms |       0.37% | 343.163 us |   3.77% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^8        |           0            |   4.972 us |       8.27% |   5.136 us |       5.90% |   0.164 us |   3.30% |   SAME   |
|   I32   |      I32      |      2^20      |       2^8        |           0            |  22.528 us |       0.00% |  22.773 us |       1.92% |   0.245 us |   1.09% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^8        |           0            | 294.407 us |       1.32% | 304.885 us |       1.97% |  10.478 us |   3.56% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^8        |           0            |   4.730 ms |       0.53% |   4.899 ms |       0.76% | 168.848 us |   3.57% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^9        |           0            |   4.577 us |      11.03% |   4.764 us |      10.43% |   0.188 us |   4.10% |   SAME   |
|   I32   |      I32      |      2^20      |       2^9        |           0            |  14.418 us |       2.38% |  14.766 us |       3.29% |   0.348 us |   2.41% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^9        |           0            | 156.072 us |       0.98% | 159.647 us |       1.37% |   3.574 us |   2.29% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^9        |           0            |   2.479 ms |       0.93% |   2.551 ms |       0.96% |  71.818 us |   2.90% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^10       |           0            |   5.351 us |       7.82% |   5.364 us |       8.01% |   0.013 us |   0.24% |   SAME   |
|   I32   |      I32      |      2^20      |       2^10       |           0            |  12.572 us |       3.62% |  12.758 us |       3.92% |   0.186 us |   1.48% |   SAME   |
|   I32   |      I32      |      2^24      |       2^10       |           0            | 112.268 us |       0.57% | 113.339 us |       0.65% |   1.071 us |   0.95% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^10       |           0            |   1.709 ms |       1.72% |   1.726 ms |       1.59% |  17.889 us |   1.05% |   SAME   |
|   I32   |      I32      |      2^16      |       2^11       |           0            |   5.617 us |       8.98% |   5.592 us |       9.03% |  -0.024 us |  -0.43% |   SAME   |
|   I32   |      I32      |      2^20      |       2^11       |           0            |  12.150 us |       3.12% |  12.226 us |       2.78% |   0.076 us |   0.62% |   SAME   |
|   I32   |      I32      |      2^24      |       2^11       |           0            | 104.724 us |       0.58% | 104.815 us |       0.60% |   0.091 us |   0.09% |   SAME   |
|   I32   |      I32      |      2^28      |       2^11       |           0            |   1.543 ms |       2.95% |   1.540 ms |       2.76% |  -2.551 us |  -0.17% |   SAME   |
|   I32   |      I32      |      2^16      |       2^12       |           0            |   5.738 us |       8.74% |   5.866 us |       7.97% |   0.127 us |   2.22% |   SAME   |
|   I32   |      I32      |      2^20      |       2^12       |           0            |  12.280 us |       2.20% |  12.431 us |       3.23% |   0.151 us |   1.23% |   SAME   |
|   I32   |      I32      |      2^24      |       2^12       |           0            | 104.518 us |       0.57% | 104.591 us |       0.68% |   0.073 us |   0.07% |   SAME   |
|   I32   |      I32      |      2^28      |       2^12       |           0            |   1.539 ms |       2.99% |   1.537 ms |       2.85% |  -2.096 us |  -0.14% |   SAME   |
|   I32   |      I32      |      2^16      |       2^13       |           0            |   6.003 us |       6.57% |   6.144 us |       0.00% |   0.141 us |   2.34% |   ????   |
|   I32   |      I32      |      2^20      |       2^13       |           0            |  12.544 us |       3.55% |  12.679 us |       3.83% |   0.135 us |   1.08% |   SAME   |
|   I32   |      I32      |      2^24      |       2^13       |           0            | 104.749 us |       0.61% | 104.719 us |       0.60% |  -0.030 us |  -0.03% |   SAME   |
|   I32   |      I32      |      2^28      |       2^13       |           0            |   1.537 ms |       2.90% |   1.540 ms |       3.07% |   2.855 us |   0.19% |   SAME   |
|   I32   |      I32      |      2^16      |       2^14       |           0            |   6.949 us |       6.40% |   7.013 us |       6.01% |   0.064 us |   0.93% |   SAME   |
|   I32   |      I32      |      2^20      |       2^14       |           0            |  13.147 us |       2.94% |  13.102 us |       3.13% |  -0.045 us |  -0.35% |   SAME   |
|   I32   |      I32      |      2^24      |       2^14       |           0            | 105.483 us |       0.63% | 105.510 us |       0.53% |   0.027 us |   0.03% |   SAME   |
|   I32   |      I32      |      2^28      |       2^14       |           0            |   1.541 ms |       2.99% |   1.542 ms |       3.04% |   0.660 us |   0.04% |   SAME   |
|   I32   |      I32      |      2^16      |       2^15       |           0            |   8.164 us |       4.46% |   8.244 us |       4.29% |   0.080 us |   0.98% |   SAME   |
|   I32   |      I32      |      2^20      |       2^15       |           0            |  14.662 us |       3.22% |  14.644 us |       3.16% |  -0.018 us |  -0.12% |   SAME   |
|   I32   |      I32      |      2^24      |       2^15       |           0            | 106.360 us |       0.64% | 106.335 us |       0.65% |  -0.025 us |  -0.02% |   SAME   |
|   I32   |      I32      |      2^28      |       2^15       |           0            |   1.549 ms |       2.82% |   1.550 ms |       2.87% |   0.880 us |   0.06% |   SAME   |
|   I32   |      I32      |      2^16      |       2^16       |           0            |  10.166 us |       3.95% |  10.264 us |       3.46% |   0.098 us |   0.97% |   SAME   |
|   I32   |      I32      |      2^20      |       2^16       |           0            |  17.323 us |       2.11% |  17.236 us |       2.45% |  -0.088 us |  -0.51% |   SAME   |
|   I32   |      I32      |      2^24      |       2^16       |           0            | 107.208 us |       0.70% | 107.083 us |       0.69% |  -0.126 us |  -0.12% |   SAME   |
|   I32   |      I32      |      2^28      |       2^16       |           0            |   1.596 ms |       2.81% |   1.593 ms |       2.68% |  -2.879 us |  -0.18% |   SAME   |
argmax T{ct}=I32 OffsetT=I32 - 3% regressions on small and medium segments, which are already at < 5% SOL
|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  MaxSegmentSize  |  GuaranteedMaxSegSize  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------------|------------------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I32   |      I32      |      2^16      |       2^1        |           0            | 120.364 us |       0.42% | 122.413 us |       0.41% |   2.050 us |   1.70% |   SLOW   |
|   I32   |      I32      |      2^20      |       2^1        |           0            |   1.780 ms |       1.37% |   1.817 ms |       1.05% |  36.577 us |   2.05% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^1        |           0            |  28.376 ms |       0.29% |  29.061 ms |       0.29% | 684.188 us |   2.41% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^1        |           0            | 457.679 ms |       0.23% | 470.346 ms |       0.34% |  12.668 ms |   2.77% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^2        |           0            |  71.057 us |       0.68% |  73.014 us |       0.62% |   1.957 us |   2.75% |   SLOW   |
|   I32   |      I32      |      2^20      |       2^2        |           0            |   1.078 ms |       0.05% |   1.107 ms |       0.05% |  28.218 us |   2.62% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^2        |           0            |  17.235 ms |       0.16% |  17.746 ms |       0.03% | 510.669 us |   2.96% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^2        |           0            | 276.427 ms |       0.22% | 284.911 ms |       0.30% |   8.484 ms |   3.07% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^3        |           0            |  41.595 us |       1.18% |  42.652 us |       1.16% |   1.056 us |   2.54% |   SLOW   |
|   I32   |      I32      |      2^20      |       2^3        |           0            | 606.212 us |       0.07% | 624.418 us |       0.27% |  18.206 us |   3.00% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^3        |           0            |   9.645 ms |       0.21% |   9.963 ms |       0.24% | 318.093 us |   3.30% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^3        |           0            | 154.748 ms |       0.15% | 159.708 ms |       0.29% |   4.960 ms |   3.21% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^4        |           0            |  24.251 us |       1.96% |  24.720 us |       1.46% |   0.469 us |   1.94% |   SLOW   |
|   I32   |      I32      |      2^20      |       2^4        |           0            | 323.995 us |       0.15% | 338.753 us |       0.92% |  14.758 us |   4.56% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^4        |           0            |   5.129 ms |       0.23% |   5.322 ms |       0.51% | 193.571 us |   3.77% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^4        |           0            |  82.383 ms |       0.35% |  85.480 ms |       0.42% |   3.096 ms |   3.76% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^5        |           0            |  14.804 us |       3.40% |  15.130 us |       3.01% |   0.325 us |   2.20% |   SAME   |
|   I32   |      I32      |      2^20      |       2^5        |           0            | 170.604 us |       0.29% | 174.819 us |       0.36% |   4.215 us |   2.47% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^5        |           0            |   2.675 ms |       0.46% |   2.772 ms |       0.50% |  97.344 us |   3.64% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^5        |           0            |  42.926 ms |       0.40% |  44.392 ms |       0.02% |   1.466 ms |   3.42% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^6        |           0            |   9.798 us |       5.18% |  10.046 us |       4.36% |   0.248 us |   2.53% |   SAME   |
|   I32   |      I32      |      2^20      |       2^6        |           0            |  90.135 us |       0.26% |  92.106 us |       0.56% |   1.971 us |   2.19% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^6        |           0            |   1.394 ms |       0.58% |   1.433 ms |       0.56% |  39.509 us |   2.83% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^6        |           0            |  22.324 ms |       0.25% |  22.910 ms |       0.28% | 586.602 us |   2.63% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^7        |           0            |   7.102 us |       4.99% |   7.294 us |       4.49% |   0.192 us |   2.70% |   SAME   |
|   I32   |      I32      |      2^20      |       2^7        |           0            |  48.642 us |       1.04% |  49.852 us |       1.10% |   1.209 us |   2.49% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^7        |           0            | 726.412 us |       0.76% | 751.533 us |       1.15% |  25.122 us |   3.46% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^7        |           0            |  11.683 ms |       0.40% |  12.065 ms |       0.43% | 381.657 us |   3.27% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^8        |           0            |   5.804 us |       8.60% |   5.915 us |       7.24% |   0.111 us |   1.91% |   SAME   |
|   I32   |      I32      |      2^20      |       2^8        |           0            |  28.326 us |       1.67% |  28.852 us |       1.87% |   0.526 us |   1.86% |   SLOW   |
|   I32   |      I32      |      2^24      |       2^8        |           0            | 392.862 us |       1.76% | 404.859 us |       2.05% |  11.996 us |   3.05% |   SLOW   |
|   I32   |      I32      |      2^28      |       2^8        |           0            |   6.331 ms |       0.63% |   6.528 ms |       0.80% | 196.934 us |   3.11% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^9        |           0            |   5.450 us |       8.77% |   5.517 us |       8.85% |   0.067 us |   1.23% |   SAME   |
|   I32   |      I32      |      2^20      |       2^9        |           0            |  19.336 us |       1.93% |  19.632 us |       2.06% |   0.296 us |   1.53% |   SAME   |
|   I32   |      I32      |      2^24      |       2^9        |           0            | 235.140 us |       2.39% | 240.575 us |       2.53% |   5.435 us |   2.31% |   SAME   |
|   I32   |      I32      |      2^28      |       2^9        |           0            |   3.815 ms |       1.09% |   3.906 ms |       1.14% |  91.290 us |   2.39% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^10       |           0            |   6.333 us |       6.12% |   6.442 us |       7.22% |   0.109 us |   1.73% |   SAME   |
|   I32   |      I32      |      2^20      |       2^10       |           0            |  15.535 us |       2.63% |  15.723 us |       3.00% |   0.188 us |   1.21% |   SAME   |
|   I32   |      I32      |      2^24      |       2^10       |           0            | 156.830 us |       2.40% | 158.991 us |       2.26% |   2.161 us |   1.38% |   SAME   |
|   I32   |      I32      |      2^28      |       2^10       |           0            |   2.520 ms |       1.43% |   2.577 ms |       1.47% |  56.647 us |   2.25% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^11       |           0            |   8.214 us |       5.71% |   8.029 us |       6.43% |  -0.185 us |  -2.25% |   SAME   |
|   I32   |      I32      |      2^20      |       2^11       |           0            |  14.887 us |       3.24% |  14.824 us |       3.32% |  -0.063 us |  -0.43% |   SAME   |
|   I32   |      I32      |      2^24      |       2^11       |           0            | 123.651 us |       1.60% | 124.777 us |       1.89% |   1.127 us |   0.91% |   SAME   |
|   I32   |      I32      |      2^28      |       2^11       |           0            |   1.918 ms |       1.11% |   1.941 ms |       1.12% |  23.070 us |   1.20% |   SLOW   |
|   I32   |      I32      |      2^16      |       2^12       |           0            |  10.430 us |       5.10% |  10.737 us |       4.69% |   0.307 us |   2.94% |   SAME   |
|   I32   |      I32      |      2^20      |       2^12       |           0            |  15.319 us |       2.45% |  15.506 us |       2.52% |   0.187 us |   1.22% |   SAME   |
|   I32   |      I32      |      2^24      |       2^12       |           0            | 108.896 us |       0.59% | 108.975 us |       0.62% |   0.078 us |   0.07% |   SAME   |
|   I32   |      I32      |      2^28      |       2^12       |           0            |   1.567 ms |       2.42% |   1.565 ms |       2.27% |  -1.359 us |  -0.09% |   SAME   |
|   I32   |      I32      |      2^16      |       2^13       |           0            |  11.493 us |       5.11% |  11.292 us |       4.98% |  -0.201 us |  -1.75% |   SAME   |
|   I32   |      I32      |      2^20      |       2^13       |           0            |  16.382 us |       2.84% |  16.338 us |       3.16% |  -0.044 us |  -0.27% |   SAME   |
|   I32   |      I32      |      2^24      |       2^13       |           0            | 108.935 us |       0.67% | 108.752 us |       0.67% |  -0.183 us |  -0.17% |   SAME   |
|   I32   |      I32      |      2^28      |       2^13       |           0            |   1.547 ms |       2.95% |   1.546 ms |       2.85% |  -1.858 us |  -0.12% |   SAME   |
|   I32   |      I32      |      2^16      |       2^14       |           0            |  15.088 us |       3.28% |  13.893 us |       4.02% |  -1.195 us |  -7.92% |   FAST   |
|   I32   |      I32      |      2^20      |       2^14       |           0            |  18.519 us |       2.42% |  17.982 us |       2.78% |  -0.538 us |  -2.90% |   FAST   |
|   I32   |      I32      |      2^24      |       2^14       |           0            | 111.082 us |       0.71% | 110.828 us |       0.69% |  -0.254 us |  -0.23% |   SAME   |
|   I32   |      I32      |      2^28      |       2^14       |           0            |   1.550 ms |       2.91% |   1.544 ms |       2.53% |  -6.161 us |  -0.40% |   SAME   |
|   I32   |      I32      |      2^16      |       2^15       |           0            |  18.822 us |       3.67% |  18.281 us |       3.31% |  -0.541 us |  -2.88% |   SAME   |
|   I32   |      I32      |      2^20      |       2^15       |           0            |  26.684 us |       1.57% |  25.841 us |       1.76% |  -0.843 us |  -3.16% |   FAST   |
|   I32   |      I32      |      2^24      |       2^15       |           0            | 114.883 us |       1.08% | 114.281 us |       1.17% |  -0.601 us |  -0.52% |   SAME   |
|   I32   |      I32      |      2^28      |       2^15       |           0            |   1.548 ms |       2.07% |   1.555 ms |       2.70% |   7.483 us |   0.48% |   SAME   |
|   I32   |      I32      |      2^16      |       2^16       |           0            |  29.053 us |       3.66% |  27.047 us |       3.15% |  -2.007 us |  -6.91% |   FAST   |
|   I32   |      I32      |      2^20      |       2^16       |           0            |  46.523 us |       1.19% |  43.592 us |       1.34% |  -2.931 us |  -6.30% |   FAST   |
|   I32   |      I32      |      2^24      |       2^16       |           0            | 117.022 us |       0.80% | 116.438 us |       0.85% |  -0.584 us |  -0.50% |   SAME   |
|   I32   |      I32      |      2^28      |       2^16       |           0            |   1.636 ms |       2.49% |   1.638 ms |       2.46% |   2.322 us |   0.14% |   SAME   |
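
For readers checking the tables by hand, the Diff, %Diff, and Status columns can be reproduced with a little arithmetic. The sketch below is an assumption inferred from the rows shown here, not the comparison tool's actual code: `Row`, `Status`, and `classify` are hypothetical names, and the rule "SAME when |%Diff| is within the smaller of the two noise values" merely matches the SAME/SLOW/FAST rows above. It does not explain the `????` status.

```cpp
// Hypothetical helper types; times must be given in the same unit (e.g. both in us).
enum class Status { same, fast, slow };

struct Row
{
  double ref_time;      // "Ref Time"
  double ref_noise_pct; // "Ref Noise" in percent
  double cmp_time;      // "Cmp Time"
  double cmp_noise_pct; // "Cmp Noise" in percent
};

constexpr double abs_val(double x) { return x < 0 ? -x : x; }
constexpr double min_val(double a, double b) { return a < b ? a : b; }

// "Diff" column: positive means the compared build is slower.
constexpr double diff(const Row& r) { return r.cmp_time - r.ref_time; }

// "%Diff" column: difference relative to the reference time.
constexpr double pct_diff(const Row& r) { return 100.0 * diff(r) / r.ref_time; }

// Assumed rule: a change within the smaller noise value counts as SAME,
// otherwise the sign of the difference decides FAST or SLOW.
constexpr Status classify(const Row& r)
{
  return abs_val(pct_diff(r)) <= min_val(r.ref_noise_pct, r.cmp_noise_pct) ? Status::same
       : diff(r) < 0                                                       ? Status::fast
                                                                           : Status::slow;
}

// argmax, 2^16 elements, MaxSegmentSize 2^16: 29.053 us -> 27.047 us
static_assert(classify(Row{29.053, 3.66, 27.047, 3.15}) == Status::fast, "");
```

Small differences in the last digit against the table are expected, because the table values are themselves rounded.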

Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated
@bernhardmgruber
Contributor

I had a quick call with @srinivasyadav18 and here are some notes:

  • This PR adds code paths to support small and medium segments using different algorithms
  • If max_segment_size == 0 then the current implementation will always take the large segments code path, which was the status quo before this PR
  • We pass max_segment_size == 0 everywhere in the CUB API and in CCCL.C, so the public API retains the status quo implementation (a future PR will make the new code paths reachable)
  • New tests exercise the new code paths
  • The cause of the regressions in the benchmarks above is not known, but it could be the increased code size of the kernel or a change in shared memory requirements.
  • Where we see regressions, those are for runs where the overall performance is already very bad
  • Enabling the small and medium code paths in the future will yield massive improvements
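
The selection logic described in these notes can be sketched as follows. This is a minimal illustration, not the code from this PR: `SegmentPath`, `select_path`, and both cutoff constants are hypothetical, and the real thresholds and dispatch mechanism in CUB may differ. The one behavior taken from the notes is that `max_segment_size == 0` always selects the large-segment path.

```cpp
#include <cstdint>

// Hypothetical names, for illustration only.
enum class SegmentPath { small, medium, large };

// Assumed cutoffs; the actual values used by the tuning policies are not stated here.
constexpr std::int64_t small_segment_limit  = 16;
constexpr std::int64_t medium_segment_limit = 256;

// A value of 0 (or less) means no upper bound is known, so the pre-existing
// large-segment path is kept. Otherwise the bound picks the smallest path that fits.
constexpr SegmentPath select_path(std::int64_t max_segment_size)
{
  return (max_segment_size <= 0)                      ? SegmentPath::large
       : (max_segment_size <= small_segment_limit)    ? SegmentPath::small
       : (max_segment_size <= medium_segment_limit)   ? SegmentPath::medium
                                                      : SegmentPath::large;
}

// The public API passes 0 today, so it keeps the status quo behavior.
static_assert(select_path(0) == SegmentPath::large, "0 retains the large-segment path");
```

With this shape, callers that cannot bound their segment sizes lose nothing, while callers that can supply a bound opt in to the specialized paths.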

Comment thread cub/cub/device/dispatch/kernels/kernel_segmented_reduce.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_segmented_reduce.cuh Outdated
Comment thread cub/cub/device/dispatch/dispatch_segmented_reduce.cuh
Contributor

@bernhardmgruber bernhardmgruber left a comment

Implementation looks ok to me.

Comment thread cub/benchmarks/bench/segmented_reduce/variable_argmax.cu Outdated
Comment thread cub/benchmarks/bench/segmented_reduce/variable_base.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_segmented_reduce.cuh Outdated
@github-actions
Contributor

github-actions Bot commented Mar 9, 2026

🥳 CI Workflow Results

🟩 Finished in 2h 18m: Pass: 100%/255 | Total: 8d 19h | Max: 2h 18m | Hits: 70%/157047

@srinivasyadav18 srinivasyadav18 enabled auto-merge (squash) March 10, 2026 00:40
@srinivasyadav18
Contributor Author

pre-commit.ci autofix

@srinivasyadav18 srinivasyadav18 merged commit 93902ac into NVIDIA:main Mar 10, 2026
272 checks passed
Development

Successfully merging this pull request may close these issues.

Optimize device_segment_reduce for small and medium varaible segment size's

5 participants