
imageUtilKernels.cu: optimize initAttentionMaskKernel #28

Open
vc-zju wants to merge 1 commit into NVIDIA:main from vc-zju:main

vc-zju commented Jan 26, 2026

What does this PR do?

Type of change: Optimization

Overview:
This PR changes the loop order in initAttentionMaskKernel (imageUtilKernels.cu). Taking Qwen2.5-VL as an example, the kernel time drops from 2.8 ms to 0.4 ms for a 640*640 image input.
Before: [image]

After: [image]
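
The diff itself is not reproduced in this description, so what follows is only a minimal host-side C++ sketch of what a loop-order change of this kind can look like. It is not the code from imageUtilKernels.cu. It assumes a hypothetical row-major seqLen x seqLen mask that is filled per segment from cumulative boundaries; the names cuSeqlens, buildMaskColumnFirst and buildMaskRowFirst are illustrative. Both functions write the same mask and differ only in which index moves fastest, that is, in whether consecutive writes are adjacent in memory. Whether this is the exact reason for the speedup reported above is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model, NOT the kernel from imageUtilKernels.cu.
// mask is a row-major seqLen x seqLen buffer. cuSeqlens holds cumulative
// segment boundaries, e.g. {0, 2, 5} means tokens [0, 2) and [2, 5).
// Entry (row, col) is 1 only when both tokens lie in the same segment.

// Order A: within a segment the column index is the outer loop, so
// consecutive writes are seqLen elements apart (strided access).
std::vector<std::uint8_t> buildMaskColumnFirst(const std::vector<int>& cuSeqlens,
                                               int seqLen) {
    assert(seqLen >= 0);
    assert(cuSeqlens.empty() || cuSeqlens.back() <= seqLen);
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(seqLen) * seqLen, 0);
    for (std::size_t s = 0; s + 1 < cuSeqlens.size(); ++s) {
        for (int col = cuSeqlens[s]; col < cuSeqlens[s + 1]; ++col) {
            for (int row = cuSeqlens[s]; row < cuSeqlens[s + 1]; ++row) {
                mask[static_cast<std::size_t>(row) * seqLen + col] = 1;
            }
        }
    }
    return mask;
}

// Order B: the two inner loops are swapped, so the column index moves
// fastest and consecutive writes touch adjacent elements of a row.
std::vector<std::uint8_t> buildMaskRowFirst(const std::vector<int>& cuSeqlens,
                                            int seqLen) {
    assert(seqLen >= 0);
    assert(cuSeqlens.empty() || cuSeqlens.back() <= seqLen);
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(seqLen) * seqLen, 0);
    for (std::size_t s = 0; s + 1 < cuSeqlens.size(); ++s) {
        for (int row = cuSeqlens[s]; row < cuSeqlens[s + 1]; ++row) {
            for (int col = cuSeqlens[s]; col < cuSeqlens[s + 1]; ++col) {
                mask[static_cast<std::size_t>(row) * seqLen + col] = 1;
            }
        }
    }
    return mask;
}
```

Calling both functions with the same boundaries, for example {0, 2, 5} and seqLen 5, yields identical masks. In a CUDA kernel the analogous choice is which index varies across neighbouring threads or loop iterations. The timings quoted in this PR come from the real kernel, not from this sketch.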

Testing

Profile the kernel with nsys before and after the change; the difference in kernel time is visible in the report.

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No, the existing tests are reused
  • Did you add or update any necessary documentation?: Not needed
  • Did you update Changelog?: No

…ing loop index

Signed-off-by: Tiekai Bi <tiekaib@nvidia.com>
vc-zju requested a review from a team January 26, 2026 12:03
fans-nv assigned fans-nv and vc-zju and unassigned fans-nv Jan 26, 2026
fans-nv self-requested a review January 26, 2026 12:11

fans-nv commented Jan 26, 2026

Would you mind sharing which platform you collected the performance data on?
Very nice catch on the perf issue, and thanks for the contribution.

fans-nv commented Jan 26, 2026

As with other community PRs, I need to cherry-pick this to the internal branch and then publish the change to main along with the next release.

Author

vc-zju commented Jan 26, 2026

Sure, I collected it on Jetson Thor. I can also move this to the internal GitLab if you think that is needed.
