Add mmq device table for RDNA3.5 #25

Open
Annieren wants to merge 2 commits into gfx11 from annier.mmq-device-table

@Annieren Annieren commented Jun 17, 2026

Overview

Add mmq device table for RDNA3.5.

  • get_mmq_y_host — returns 64 for RDNA3.5 (host).
  • get_mmq_y_device — returns 64 under #if defined(RDNA3_5) (device).
  • mmq_get_nwarps_host — returns 4 for RDNA3.5 (host).
  • mmq_get_nwarps_device — returns 4 under #if defined(RDNA3_5) (device).
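
As a rough illustration of the four entries listed above, the following sketch pairs a runtime host-side selection with a compile-time device-side selection. It is not the code in the diff: every name ending in `_sketch` and the `SKETCH_RDNA3_5` macro are hypothetical stand-ins (the real code keys off `GGML_CUDA_CC_IS_RDNA3_5(cc)` on the host and `#if defined(RDNA3_5)` on the device), and the fallback values 128 and 8 are the ones the review in this thread reports for the previous path.

```cpp
// Sketch only: *_sketch names and SKETCH_RDNA3_5 are hypothetical stand-ins,
// not the functions or macros in mmq.cuh.

// Host side: the choice is made at runtime from the device's compute capability.
// The bool stands in for a check such as GGML_CUDA_CC_IS_RDNA3_5(cc).
constexpr int get_mmq_y_host_sketch(bool is_rdna3_5) {
    return is_rdna3_5 ? 64 : 128;  // 128: previous value reported in the review
}
constexpr int mmq_get_nwarps_host_sketch(bool is_rdna3_5) {
    return is_rdna3_5 ? 4 : 8;     // 8: previous value reported in the review
}

// Device side: the choice is made at compile time via the preprocessor.
// Pretend this translation unit is being compiled for an RDNA3.5 target.
#define SKETCH_RDNA3_5 1

constexpr int get_mmq_y_device_sketch() {
#if defined(SKETCH_RDNA3_5)
    return 64;
#else
    return 128;
#endif
}
constexpr int mmq_get_nwarps_device_sketch() {
#if defined(SKETCH_RDNA3_5)
    return 4;
#else
    return 8;
#endif
}

// Host and device must agree for the same target.
static_assert(get_mmq_y_host_sketch(true) == get_mmq_y_device_sketch(),
              "mmq_y differs between host and device");
static_assert(mmq_get_nwarps_host_sketch(true) == mmq_get_nwarps_device_sketch(),
              "nwarps differs between host and device");
```

The two `static_assert` lines only hold in this sketch because the macro and the `true` argument describe the same target; in the real code the host value depends on the device found at runtime.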

Additional information

27 models, all Q4_K_M, were tested on gfx1151. They cover both dense and MoE (gemma4_26b_a4b, qwen35_35b_a3b, qwen3_30b_a3b) and range from small (qwen25_05b, smollm2_17b, gemma2_2b) to large (qwen3_17b). Prefill at n=128 gets the largest boost: many models see a +14% to +18% improvement. At longer sequences (512–4096), most models see a consistent +2% to +8% prefill improvement. The change only touches the mmq nwarps and mmq_y_max, which do not affect mmvq's performance, so decode is essentially neutral, with no regression.

A PPL check has been performed; results for all 27 models are bit-identical.

Requirements

@Annieren Annieren requested a review from jimw567 June 17, 2026 06:21

jimw567 commented Jun 17, 2026

Collaborator

Code Review

Summary of Changes

Adds RDNA3.5 (gfx115x) specific tuning to the CUDA/HIP MMQ (quantized matmul) path in ggml/src/ggml-cuda/mmq.cuh:

  • get_mmq_y_host / get_mmq_y_device: tile height mmq_y set to 64 for RDNA3.5 (was 128 via the generic AMD/else paths).
  • mmq_get_nwarps_host / mmq_get_nwarps_device: warps-per-block set to 4 for RDNA3.5 (was 8 — host: 256/warp_size with warp_size 32; device: the AMD_WMMA_AVAILABLE branch).
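
To make the before/after numbers concrete, here is the arithmetic as a small sketch. The constant names are hypothetical, and the threads-per-block figures assume the block size is simply nwarps × warp_size, which this thread does not state explicitly.

```cpp
// Sketch only: constant names are hypothetical, values come from the summary above.
constexpr int warp_size_sketch = 32;   // warp size cited for this path

// Previous host-side value: 256 / warp_size.
constexpr int old_nwarps_sketch = 256 / warp_size_sketch;   // 8
constexpr int new_nwarps_sketch = 4;                        // RDNA3.5 value in this PR

// Assumption: threads per block = nwarps * warp_size.
constexpr int old_threads_sketch = old_nwarps_sketch * warp_size_sketch;   // 256
constexpr int new_threads_sketch = new_nwarps_sketch * warp_size_sketch;   // 128

// Tile height: generic value before, RDNA3.5 value after.
constexpr int old_mmq_y_sketch = 128;
constexpr int new_mmq_y_sketch = 64;

static_assert(old_nwarps_sketch == 2 * new_nwarps_sketch, "nwarps is halved");
static_assert(old_mmq_y_sketch == 2 * new_mmq_y_sketch, "mmq_y is halved");
```

Under that assumption, both the tile height and the thread count per block are halved for RDNA3.5.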

Checklist: summary present ✅, unit tests ⏭️ (perf-tuning constants only, no testable logic — validation is the benchmark + PPL data in the description).

Potential Issues

  • Host/device agreement (the thing that matters here): MMQ requires the host (grid/shared-mem sizing) and device (kernel) to compute identical mmq_y and nwarps, or launches break. Both pairs match for RDNA3.5 — 64/64 and 4/4 — so this is correct. ✅
  • Branch ordering: the new #if defined(RDNA3_5) is placed before the RDNA1 and AMD_MFMA/WMMA branches, and the host GGML_CUDA_CC_IS_RDNA3_5(cc) guard sits before the generic AMD ternary. Correct specific-before-general ordering. ✅
  • Macro assumptions: correctness depends on RDNA3_5 (device) being defined for gfx115x builds and GGML_CUDA_CC_IS_RDNA3_5(cc) (host) existing. Both are pre-existing in the tree, so this is consistent with how RDNA1 is already handled — just flagging it as the load-bearing assumption.
  • No regression risk to the mmvq path: these constants only feed MMQ, consistent with the description's "neutral decode" claim.
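
The branch-ordering point can be illustrated with a minimal sketch. `DeviceSketch` and both functions are hypothetical, and 128 is used as a placeholder for every non-RDNA3.5 path; the only purpose is to show why the specific check has to come before the generic AMD one.

```cpp
// Sketch only: DeviceSketch and both functions are hypothetical illustrations.
struct DeviceSketch {
    bool is_amd;
    bool is_rdna3_5;   // an RDNA3.5 device is also an AMD device
};

// Specific-before-general: the RDNA3.5 check is evaluated first.
constexpr int mmq_y_specific_first_sketch(DeviceSketch d) {
    return d.is_rdna3_5 ? 64 : 128;   // 128: placeholder for the generic paths
}

// General-before-specific: the generic AMD branch shadows the RDNA3.5 one,
// so the tuned value is unreachable for RDNA3.5 devices.
constexpr int mmq_y_general_first_sketch(DeviceSketch d) {
    return d.is_amd ? 128 : (d.is_rdna3_5 ? 64 : 128);
}

constexpr DeviceSketch gfx1151_sketch{true, true};

static_assert(mmq_y_specific_first_sketch(gfx1151_sketch) == 64,
              "tuned value is selected");
static_assert(mmq_y_general_first_sketch(gfx1151_sketch) == 128,
              "tuned value is shadowed by the generic branch");
```

With the wrong ordering the code still compiles and runs; it just silently falls back to the generic value, which is why the ordering is worth checking in review.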

Suggestions for Improvement

  • Consider a one-line comment on the 64 / 4 constants noting they're empirically tuned for RDNA3.5 (prefill +14–18% at n=128), so a future reader doesn't "simplify" them back into the generic AMD path.

Overall: small, low-risk, well-validated tuning change — 27 models benchmarked on gfx1151 with bit-identical PPL and no decode regression. LGTM.

@jimw567 jimw567 left a comment

looks good
