
win: re-enable and fix cuDNN performance #3242

Merged
zcbenz merged 3 commits into ml-explore:main from dhiltgen:win_cuDNN_fix
Mar 13, 2026

Conversation

@dhiltgen
Contributor

Proposed changes

Populating a fresh cuDNN CUDA graph for each layer, and adding that new graph to the overall MLX CUDA graph, is costly under WDDM. To resolve this we cache the graph (the first call performs the expensive populate_cuda_graph; subsequent calls only patch pointers via update_cuda_graph), and we also cache the subgraph key to avoid recomputing it (the kernel attribute queries involved incur WDDM round-trip overhead).
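The caching pattern described above can be sketched in isolation. This is a minimal illustration, not the actual MLX code: `CachedGraph`, `get_or_build`, and the counters are hypothetical stand-ins for the real cuDNN graph object and the populate_cuda_graph / update_cuda_graph calls.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a captured cuDNN CUDA graph plus counters
// that make the expensive vs. cheap paths observable.
struct CachedGraph {
  int handle;          // placeholder for the real graph handle
  int populate_calls;  // times the expensive build path ran
  int update_calls;    // times only device pointers were patched
};

// First call for a key pays the expensive populate; subsequent calls
// with the same key take the cheap pointer-update path.
CachedGraph& get_or_build(std::unordered_map<std::string, CachedGraph>& cache,
                          const std::string& key) {
  auto it = cache.find(key);
  if (it == cache.end()) {
    // Expensive path: analogous to populate_cuda_graph.
    it = cache.emplace(key, CachedGraph{42, 1, 0}).first;
  } else {
    // Cheap path: analogous to update_cuda_graph.
    it->second.update_calls++;
  }
  return it->second;
}
```

The win comes from the second branch: under WDDM, every fresh graph build and insertion pays driver round-trip costs, so amortizing the build across calls with the same shape is what recovers the prefill throughput.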

SDPACacheKey has bool fields adjacent to int64_t arrays, which introduces padding bytes for alignment. The BytesKey constructor memsets everything to zero, but the aggregate init cache_key.pod = {...} creates a stack temporary with uninitialized padding, and the compiler's trivial copy-assignment copies the entire struct, including the garbage padding, over the zeroed bytes. Since BytesKey uses memcmp for equality, every SDPA call produces a unique key and therefore a cache miss.
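A reduced sketch of the hazard and the safe pattern, assuming a hypothetical `PodKey` layout (not the real SDPACacheKey): a bool next to an int64_t array forces padding bytes, and any path that copies a whole temporary over the key can carry uninitialized padding with it. Zeroing the object first and then assigning fields individually keeps the padding deterministic, so memcmp-based equality is reliable.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical reduction of the layout described above: the bool is
// followed by 7 padding bytes so that the int64_t array is 8-byte
// aligned. memcmp compares those padding bytes too.
struct PodKey {
  bool causal;       // 1 byte + 7 padding bytes
  int64_t shape[4];  // 8-byte aligned
};

// Safe construction: zero the whole object (padding included), then
// write each field in place. Avoids the aggregate-init temporary whose
// padding bytes are indeterminate.
PodKey make_key(bool causal, const int64_t (&shape)[4]) {
  PodKey k;
  std::memset(&k, 0, sizeof(k));  // zeroes the padding bytes
  k.causal = causal;
  std::memcpy(k.shape, shape, sizeof(k.shape));
  return k;
}
```

With this pattern, two keys built from the same values are byte-identical, so a memcmp-keyed cache actually hits; with the aggregate-init temporary, identical logical keys could differ in their padding bytes and miss every time.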

Results (RTX 5090, Windows 11 WDDM, mlx_lm benchmark -p 2048 -g 128):

| Model | Metric | main | After fix | Change |
| --- | --- | --- | --- | --- |
| Llama-3.2-3B-4bit | Prefill (tok/s) | 1,228 | 2,436 | +98% |
| Llama-3.2-3B-4bit | Gen (tok/s) | 371 | 385 | +4% |
| Qwen3-8B-4bit | Prefill (tok/s) | 494 | 917 | +86% |
| Qwen3-8B-4bit | Gen (tok/s) | 220 | 229 | +4% |
| Llama-3.2-3B-bf16 | Prefill (tok/s) | 19,157 | 19,347 | +1% |
| Llama-3.2-3B-bf16 | Gen (tok/s) | 200 | 199 | ~0% |
| Qwen3-8B-bf16 | Prefill (tok/s) | 9,254 | 9,253 | ~0% |
| Qwen3-8B-bf16 | Gen (tok/s) | 91 | 91 | ~0% |

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Collaborator

@zcbenz zcbenz left a comment


This is not the first time we were bitten by the padding bytes, thanks a lot for the awesome fix!

Comment thread on mlx/backend/cuda/scaled_dot_product_attention.cpp (outdated)
@zcbenz zcbenz merged commit 7adfc83 into ml-explore:main Mar 13, 2026
16 checks passed


2 participants