Conversation

@WolfWings

The existing code has a series of eight sequential unrolled PEXTRW instructions, a construct that compilers generally cannot detect and collapse into a single MOVDQU instruction.

Manually placing the optimized unaligned-store intrinsic there is therefore a substantial performance win for the SSE path, with identical output.

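The pattern being replaced can be sketched as follows. This is a hypothetical illustration of the before/after shapes, not the project's actual code; the function names are invented, and the eight `_mm_extract_epi16` calls are what compile down to the eight PEXTRW instructions described above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Before: eight unrolled 16-bit extracts, one PEXTRW each.
   (Illustrative only; names are hypothetical.) */
static void store_8x16_extract(int16_t *dst, __m128i v)
{
    dst[0] = (int16_t)_mm_extract_epi16(v, 0);
    dst[1] = (int16_t)_mm_extract_epi16(v, 1);
    dst[2] = (int16_t)_mm_extract_epi16(v, 2);
    dst[3] = (int16_t)_mm_extract_epi16(v, 3);
    dst[4] = (int16_t)_mm_extract_epi16(v, 4);
    dst[5] = (int16_t)_mm_extract_epi16(v, 5);
    dst[6] = (int16_t)_mm_extract_epi16(v, 6);
    dst[7] = (int16_t)_mm_extract_epi16(v, 7);
}

/* After: one unaligned 128-bit store, a single MOVDQU. */
static void store_8x16_movdqu(int16_t *dst, __m128i v)
{
    _mm_storeu_si128((__m128i *)dst, v);
}
```

Both routines write the same sixteen bytes; only the instruction count differs.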
@WolfWings
Author

Some recent versions of Clang can recognize this construct, but older ones cannot, and no version of GCC I was able to test could make this optimization.

I chose the storeu form because loadu is already used elsewhere rather than juggling alignment guarantees, so accepting the occasional extra cycle of latency to match the existing code design seemed appropriate, and it also minimizes the size of the change.
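The alignment trade-off behind that choice can be illustrated with a small sketch (hypothetical names, not the project's code): `_mm_storeu_si128` (MOVDQU) accepts any address, whereas the aligned form `_mm_store_si128` (MOVDQA) would fault on a pointer that is not 16-byte aligned.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Illustrative sketch: the unaligned store tolerates any address, so
   the caller never has to prove 16-byte alignment of the destination.
   Using _mm_store_si128 on buf + 2 below would fault instead. */
static void fill_misaligned(uint8_t *buf)
{
    __m128i v = _mm_set1_epi16(0x1234);
    /* buf + 2 is deliberately not 16-byte aligned. */
    _mm_storeu_si128((__m128i *)(buf + 2), v);
}
```

On modern x86 cores the MOVDQU penalty for an address that happens to be aligned is essentially nil, which is why the occasional extra cycle mentioned above is an acceptable cost for the simpler code.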
