[Optimization] GPU vector-pack writes across elementwise and geometric operators by zacharyvincze · Pull Request #166 · ROCm/rocCV

zacharyvincze · 2026-06-23T21:24:32Z

Summary

Introduces packed_apply.hpp, a reusable helper for wide-vector (16-byte) GPU stores, and applies it across all elementwise and geometric operators. This amortizes per-thread setup and eliminates sub-word store penalties for small pixel types (notably uchar3), yielding measurable throughput gains on memory-bound kernels.

Changes

New helper — include/kernels/device/packed_apply.hpp

ApplyPackedGather — scattered reads (e.g. geometric transforms) + packed output store
ApplyPackedTransform — packed read + packed store for same-coordinate elementwise ops
PackedGrid<T> / PackedBlock() — launch helpers that size grids so each thread handles a PackWidth<T>-pixel run
Runtime alignment guard: falls back to scalar stores on unaligned rows/tails
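
To make the launch-sizing idea concrete, here is a minimal host-side sketch of how a per-type pack width and a matching thread count could be derived. It is not the code in packed_apply.hpp: `pack_width_sketch`, `threads_per_row`, `blocks_for`, and `Pixel3` are hypothetical names, and the rule used here (the shortest pixel run that fills a whole number of 16-byte stores) is an assumption based only on the 16-byte store width stated in the summary.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>  // std::lcm (C++17)

// Assumption: 16-byte vector stores, as stated in the PR summary.
constexpr std::size_t kStoreBytes = 16;

// Hypothetical 3-byte pixel standing in for uchar3.
struct Pixel3 { unsigned char x, y, z; };
static_assert(sizeof(Pixel3) == 3);

// Hypothetical pack width: the shortest run of pixels whose total size is a
// whole number of 16-byte stores.
template <typename T>
constexpr std::size_t pack_width_sketch() {
    return std::lcm(sizeof(T), kStoreBytes) / sizeof(T);
}

// Threads needed along one row when each thread owns one run of `pack` pixels.
constexpr std::size_t threads_per_row(std::size_t width, std::size_t pack) {
    return (width + pack - 1) / pack;  // ceiling division
}

// Blocks needed to cover `threads` threads with `block` threads per block.
inline std::size_t blocks_for(std::size_t threads, std::size_t block) {
    assert(block > 0);
    return (threads + block - 1) / block;
}

static_assert(pack_width_sketch<unsigned char>() == 16);  // 16 bytes, 1 store
static_assert(pack_width_sketch<Pixel3>() == 16);         // 48 bytes, 3 stores
static_assert(pack_width_sketch<float>() == 4);           // 16 bytes, 1 store
static_assert(threads_per_row(1920, 16) == 120);
```

Under these assumptions a 1920-pixel row of 3-byte pixels needs 120 threads instead of 1920, which is where the amortized per-thread setup would come from.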

Kernel updates

All elementwise operators (brightness_contrast, convert_to, cvt_color, flip, gamma_contrast, normalize, thresholding, composite, custom_crop) migrated to ApplyPackedGather or ApplyPackedTransform
Geometric operators (resize, rotate, remap, copy_make_border, warp_affine, warp_perspective) use ApplyPackedGather for packed output writes
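
The gather-then-packed-store pattern, together with the alignment guard, can be emulated on the host as follows. This is a sketch under stated assumptions rather than the PR's implementation: `write_run_sketch`, `flip_row_sketch`, `is_aligned`, and `Pixel3` are hypothetical names, the run length of 16 three-byte pixels is assumed, and `std::memcpy` of 16-byte chunks stands in for a GPU vector store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Pixel3 { unsigned char x, y, z; };  // hypothetical stand-in for uchar3

constexpr std::size_t kStoreBytes = 16;  // store width from the PR summary
constexpr std::size_t kRun = 16;         // assumed run: 16 pixels = 48 bytes = 3 stores

inline bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kStoreBytes == 0;
}

// Emulates one thread. Pixels are gathered one at a time through `fetch`
// (scattered reads), staged locally, then written as whole 16-byte chunks.
// Unaligned destinations and row tails fall back to per-pixel stores.
// Returns true when the packed path was taken.
template <typename Fetch>
bool write_run_sketch(Pixel3* dst_row, std::size_t width, std::size_t x0, Fetch fetch) {
    assert(x0 < width);
    Pixel3* dst = dst_row + x0;
    const bool packed = (x0 + kRun <= width) && is_aligned(dst);
    if (packed) {
        Pixel3 staged[kRun];
        for (std::size_t i = 0; i < kRun; ++i) staged[i] = fetch(x0 + i);
        for (std::size_t b = 0; b < sizeof(staged); b += kStoreBytes) {
            std::memcpy(reinterpret_cast<unsigned char*>(dst) + b,
                        reinterpret_cast<const unsigned char*>(staged) + b,
                        kStoreBytes);
        }
    } else {
        for (std::size_t x = x0; x < width && x < x0 + kRun; ++x) dst_row[x] = fetch(x);
    }
    return packed;
}

// Example gather: horizontal flip of one row, one run per emulated thread.
inline void flip_row_sketch(const Pixel3* src, Pixel3* dst, std::size_t width) {
    for (std::size_t x0 = 0; x0 < width; x0 += kRun) {
        write_run_sketch(dst, width, x0,
                         [=](std::size_t x) { return src[width - 1 - x]; });
    }
}
```

With a 40-pixel row in a 16-byte-aligned buffer, the runs starting at x = 0 and x = 16 take the packed path and the 8-pixel tail starting at x = 32 takes the scalar path. Only the output side is packed here, which matches the description of geometric operators keeping scattered reads.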

Reverted experiments

Multi-input packed reads+writes (e5613d0, reverted in 24e3f2f) and a host-side launch-latency reduction for Composite (664e8aa, reverted in 26a324a) were explored and reverted. Gather-bound (scattered-read) kernels did not benefit from packed input reads.

zacharyvincze · 2026-06-24T00:09:17Z

🟢 Net win — broad GPU speedups, no real regressions

Vector-pack writes deliver large wins across the elementwise and geometric operators (11 of 15 categories faster, medians from -5% to -41%), with the U8 fast paths seeing the biggest gains (best cases: BrightnessContrast -86%, CustomCrop -82%, Flip -74%). The four neutral operators (WarpAffine, WarpPerspective, BilateralFilter, Histogram) are unchanged, and the handful of "+" rows are F32 and RGBA8 cases with deltas under +8% that look like run-to-run jitter rather than real regressions.

GPU target gfx1100 · baseline = upstream ROCm/rocCV develop · candidate = zacharyvincze:zv/optimization/geometric-transform-optimization · 500 configs (500 matched, no added/removed) · GPU clocks locked (perf high)

Per-operator summary

Operator	Configs	Median Δ	Range	Class
Resize	15	-41.3%	-45.3% … -25.5%	🟢
CustomCrop	20	-39.1%	-81.5% … +3.8%	🟢
Normalize	15	-36.8%	-63.9% … -6.2%	🟢
Threshold	15	-34.7%	-57.6% … -12.2%	🟢
CvtColor	5	-31.5%	-37.5% … -30.9%	🟢
BrightnessContrast	75	-26.9%	-86.3% … +2.3%	🟢
Rotate	60	-23.0%	-32.1% … -7.4%	🟢
CopyMakeBorder	75	-18.1%	-43.9% … +7.6%	🟢
GammaContrast	15	-18.0%	-23.4% … -1.6%	🟢
Flip	20	-13.4%	-74.3% … +3.9%	🟢
Composite	10	-5.4%	-49.0% … +0.3%	🟢
WarpAffine	75	-0.0%	-0.9% … +1.6%	⚪
WarpPerspective	75	-0.0%	-1.4% … +0.3%	⚪
BilateralFilter	15	-0.0%	-0.9% … +0.2%	⚪
Histogram	10	+0.4%	-0.2% … +0.9%	⚪

Spread by interpolation × border (categories with >5% spread)

Operator	Interp	Border	n	Median Δ	Range
CopyMakeBorder	—	CONSTANT	15	-33.9%	-43.9% … -1.0%
CopyMakeBorder	—	REFLECT	15	-15.8%	-34.4% … +3.6%
CopyMakeBorder	—	REFLECT101	15	-17.8%	-34.0% … +3.4%
CopyMakeBorder	—	REPLICATE	15	-24.6%	-33.9% … +4.8%
CopyMakeBorder	—	WRAP	15	-17.5%	-34.6% … +7.6%
Resize	LINEAR	—	15	-41.3%	-45.3% … -25.5%
Rotate	CUBIC	—	20	-16.1%	-21.1% … -7.4%
Rotate	LINEAR	—	20	-23.0%	-28.4% … -19.2%
Rotate	NEAREST	—	20	-29.1%	-32.1% … -28.1%

Notable regressions

Operator	Config	Baseline (ms)	Candidate (ms)	Δ
CopyMakeBorder	16x1080x1920 WRAP FMT_F32	0.432	0.464	+7.6%
CopyMakeBorder	16x1080x1920 REPLICATE FMT_F32	0.439	0.460	+4.8%
Flip	32x1080x1920 FMT_F32	0.764	0.793	+3.9%
CustomCrop	128x1080x1920 FMT_F32	0.215	0.223	+3.8%
CustomCrop	128x1080x1920 FMT_RGBA8	0.214	0.222	+3.7%
CopyMakeBorder	16x1080x1920 REFLECT FMT_F32	0.442	0.458	+3.6%
Flip	32x1080x1920 FMT_RGBA8	0.764	0.791	+3.5%
CopyMakeBorder	16x1080x1920 REFLECT101 FMT_F32	0.447	0.462	+3.4%

Notable improvements

Operator	Config	Baseline (ms)	Candidate (ms)	Δ
BrightnessContrast	16x1080x1920 FMT_U8	0.365	0.050	-86.3%
BrightnessContrast	16x1080x1920 FMT_U8	0.367	0.050	-86.2%
BrightnessContrast	16x1080x1920 FMT_U8	0.290	0.049	-83.2%
CustomCrop	128x1080x1920 FMT_U8	0.159	0.029	-81.5%
CustomCrop	64x1080x1920 FMT_U8	0.087	0.020	-76.6%
Flip	16x1080x1920 FMT_U8	0.311	0.080	-74.3%
BrightnessContrast	64x1080x1920 FMT_U8	1.497	0.391	-73.9%
BrightnessContrast	64x1080x1920 FMT_U8	1.497	0.391	-73.9%
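
For reading the Δ columns, the following sketch shows the relative-change formula that the tables appear to use. The formula is inferred from the listed baseline and candidate times, not stated in the report, and the function names are hypothetical. Because the millisecond values are rounded to three decimals, recomputed percentages can differ from the reported ones by a few tenths of a point.

```cpp
#include <cassert>
#include <cmath>

// Relative change in kernel time, in percent.
// Negative means the candidate is faster than the baseline.
inline double delta_percent(double baseline_ms, double candidate_ms) {
    assert(baseline_ms > 0.0);
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0;
}

// The same comparison expressed as a speedup factor (baseline / candidate).
inline double speedup(double baseline_ms, double candidate_ms) {
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

For the first BrightnessContrast row, `delta_percent(0.365, 0.050)` gives about -86.3 and `speedup(0.365, 0.050)` gives about 7.3, so a Δ of -86% corresponds to a kernel that is roughly seven times faster.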

Analysis

This PR is net-positive. All 500 configs matched (none added or removed, so there are no match_status concerns), and the operators touched by the vector-pack change show consistent, large median speedups: Resize (-41%), CustomCrop (-39%), Normalize (-37%), Threshold (-35%), CvtColor (-32%), BrightnessContrast (-27%), and Rotate (-23%). Resize, Normalize, Threshold, CvtColor, and Rotate improve on every config; CustomCrop and BrightnessContrast each have a few configs that land slightly positive (up to +3.8% and +2.3%). The most dramatic wins are concentrated in the packed U8 write paths, where kernel time for the listed elementwise configs drops by 74-86%.

The "regressions" are unlikely to be real signal. Six of the eight listed "+" rows are F32 and the other two are RGBA8, all at 1080×1920 with batch sizes of 16 to 128 and deltas of +3.4% to +7.6% on kernels that run in under 0.8 ms. These formats already store at least 4 bytes per pixel, so they have less to gain from packing than U8, and differences of this size are plausibly run-to-run jitter. CopyMakeBorder's listed positive rows are all F32; its median still improves for every border mode (CONSTANT median -34%).

The four neutral operators show no measurable change. WarpAffine, WarpPerspective, and BilateralFilter are gather- or compute-bound, so packed output writes do not move their runtime (consistent with the earlier finding that packed reads did not help gather-bound kernels), and Histogram is reduction-bound. Their ~0% deltas indicate the change is well-scoped and adds no measurable overhead where it cannot help. Recommend merge.

zacharyvincze added 9 commits June 23, 2026 00:55

Add packed reads/writes to resize operator

06bd74a

Apply packed reads across all GPU kernels

d6896a2

Remove pragma unrolls

93124eb

Implement ApplyPackedGather/ApplyPackedTransform to distinguish between packed reads and packed reads+writes

1dcc232

Implement multi-input packed reads+writes mechanism

e5613d0

Revert "Implement multi-input packed reads+writes mechanism"

24e3f2f

This reverts commit e5613d0.

Reduce host launch latency for Composite

664e8aa

Revert "Reduce host launch latency for Composite"

26a324a

This reverts commit 664e8aa.

Simplify packed_apply.hpp implementations

9099c94

Merge branch 'develop' into zv/optimization/geometric-transform-optimization

2743b34

zacharyvincze added 2 commits June 23, 2026 19:24

Add fast path for U8/RGB8/RGBA8 float -> uchar SaturateCast

11b4730

Use SaturateCast for composite device kernel

e64b19a
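
For readers unfamiliar with the term, a float to 8-bit saturating cast clamps the value to the representable range before converting. The sketch below shows one common formulation. It is not rocCV's SaturateCast and says nothing about what the fast path in 11b4730 actually does: the function names are hypothetical, and the rounding mode and NaN handling are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Assumed semantics: clamp to [0, 255], round to nearest, map NaN to 0.
inline std::uint8_t saturate_u8_sketch(float v) {
    if (!(v > 0.0f)) return 0;    // negatives, zero, and NaN
    if (v >= 255.0f) return 255;  // large values and +infinity
    // v is now strictly between 0 and 255, so the rounded result fits.
    return static_cast<std::uint8_t>(std::nearbyint(v));
}

// Converts a row of floats, for example the output of a blend or a
// brightness/contrast computation, to 8-bit samples.
inline void saturate_row_sketch(const float* src, std::uint8_t* dst, std::size_t n) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < n; ++i) dst[i] = saturate_u8_sketch(src[i]);
}
```

Checking NaN and the lower bound in a single comparison keeps the branch count low, which is the kind of detail a type-specific fast path would typically exploit.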

zacharyvincze wants to merge 12 commits into ROCm:develop from zacharyvincze:zv/optimization/geometric-transform-optimization
