[Optimization] GPU vector-pack writes across elementwise and geometric operators #166
🟢 Net win: broad GPU speedups, no real regressions

Vector-pack writes deliver large wins across the elementwise and geometric operators (11 of 15 categories faster, medians from -5% to -41%), with the U8 fast paths seeing the biggest gains (BrightnessContrast/CustomCrop/Flip up to -86%). The four neutral operators (WarpAffine, WarpPerspective, BilateralFilter, Histogram) are unchanged as expected, and the handful of "+" rows are sub-noise F32 jitter.

GPU target gfx1100 · baseline = upstream

Per-operator summary
Spread by interpolation × border (categories with >5% spread)
Notable regressions
Notable improvements
Analysis

This PR is unambiguously net-positive. All 500 configs matched (no added/removed configs, so no [...]).

The "regressions" are not real signal. Every "+" row is an F32 case at the smallest batch sizes (16-128 images at 1080×1920), with deltas of +3-8% on sub-0.5 ms kernels, well within run-to-run jitter for these short F32 paths that don't benefit from the U8 packing. CopyMakeBorder's positive rows are confined to F32 borders; its U8 paths still improve substantially (CONSTANT median -34%).

The four neutral operators are exactly the ones the PR doesn't touch on the write side: WarpAffine, WarpPerspective, and BilateralFilter are gather/compute-bound (packing writes doesn't help, consistent with prior findings that packing hurts gather-bound warps; here it's correctly left as a no-op), and Histogram is reduction-bound. Their ~0% deltas confirm the change is well-scoped and introduces no overhead where it can't help.

Recommend merge.
Summary
Introduces `packed_apply.hpp`, a reusable helper for wide-vector (16-byte) GPU stores, and applies it across all elementwise and geometric operators. This amortizes per-thread setup and eliminates sub-word store penalties for small pixel types (notably `uchar3`), yielding measurable throughput gains on memory-bound kernels.

Changes
New helper: `include/kernels/device/packed_apply.hpp`

- `ApplyPackedGather`: scattered reads (e.g. geometric transforms) + packed output store
- `ApplyPackedTransform`: packed read + packed store for same-coordinate elementwise ops
- `PackedGrid<T>` / `PackedBlock()`: launch helpers that size grids so each thread handles a `PackWidth<T>`-pixel run

Kernel updates
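As a rough host-side illustration of the run-per-thread pattern these kernel updates rely on, the sketch below derives a pack width from the 16-byte store size mentioned in the summary and walks a buffer one run at a time. It is plain C++, not device code, and `pack_width`, `threads_per_row`, and `apply_packed_transform` are hypothetical names chosen for this example; the actual `PackWidth<T>`, `PackedGrid<T>`, and `ApplyPackedTransform` in `packed_apply.hpp` may be defined differently, particularly for multi-channel pixel types such as `uchar3`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: one packed store covers 16 bytes, as stated in the PR summary.
constexpr std::size_t kPackBytes = 16;

// Hypothetical stand-in for PackWidth<T>: how many elements of T fit in one
// 16-byte store (never less than one element).
template <typename T>
constexpr std::size_t pack_width() {
    return sizeof(T) >= kPackBytes ? 1 : kPackBytes / sizeof(T);
}

constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Hypothetical stand-in for the grid sizing: number of threads needed along a
// row when each thread owns one run of pack_width<T>() elements.
template <typename T>
constexpr std::size_t threads_per_row(std::size_t width) {
    return ceil_div(width, pack_width<T>());
}

// Host loop that mimics a packed elementwise kernel: the outer loop plays the
// role of the thread index, the inner loop is the per-thread run. The std::min
// clamp is the tail guard for sizes that are not a multiple of the run length.
template <typename T, typename Op>
void apply_packed_transform(const std::vector<T>& src, std::vector<T>& dst, Op op) {
    assert(dst.size() == src.size());
    constexpr std::size_t run = pack_width<T>();
    const std::size_t threads = ceil_div(src.size(), run);
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t begin = t * run;
        const std::size_t end = std::min(begin + run, src.size());
        for (std::size_t i = begin; i < end; ++i) {
            dst[i] = op(src[i]);
        }
    }
}
```

Under these assumptions, a 1920-element `uint8_t` row needs 120 threads rather than 1920, and a 1921-element `float` row needs 481, the last of which handles a one-element tail.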
- `brightness_contrast`, `convert_to`, `cvt_color`, `flip`, `gamma_contrast`, `normalize`, `thresholding`, `composite`, `custom_crop`: migrated to `ApplyPackedGather` or `ApplyPackedTransform`
- `resize`, `rotate`, `remap`, `copy_make_border`, `warp_affine`, `warp_perspective`: use `ApplyPackedGather` for packed output writes

Reverted experiments