Support neox partial RoPE (head_dim=256) for Qwen3.5 #7043
Open
wangna11BD wants to merge 2 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
Motivation
Add inference support for the Qwen3.5 model to FastDeploy. Qwen3.5 uses neox-style partial rotary position embedding (partial RoPE) with head_dim=256 and partial_rotary_factor=0.25 (i.e. rotary_dim=64), a configuration the framework did not previously support. This PR adds the corresponding GPU kernel and Python-layer support, and also fixes a shared-memory bug in the quantized KV-cache write kernels.

Modifications
1. New Qwen3.5 partial neox RoPE CUDA kernel (custom_ops/gpu_ops/append_attn/qwen3_rope.h)
- GQAVariableLengthRotarySplitKernel_Qwen3: migrated from the .cu file; head_dim=128, Qwen3 full interleaved RoPE.
- GQAVariableLengthNeoxPartialRotarySplitKernel_Qwen3_5: new; head_dim=256, Qwen3.5 neox-style partial RoPE. Only the [0, rotary_dim=64) slice is rotated (rotate_half semantics); the [rotary_dim, head_dim) slice is passed through unchanged.
- New DISPATCH_GQA_ROPE_HEAD_DIM macro that dispatches uniformly between the head_dim=128 (Qwen3) and head_dim=256 (Qwen3.5) paths.
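The neox-style partial rotation described above can be sketched in NumPy (a reference sketch, not the CUDA kernel; `neox_partial_rope` is an illustrative name, and the cos/sin layout assumes the common neox convention of duplicating the angles across the two halves of the rotary slice):

```python
import numpy as np

def neox_partial_rope(x, cos, sin, rotary_dim=64):
    """Neox-style partial RoPE: rotate x[..., :rotary_dim] with
    rotate_half semantics, pass x[..., rotary_dim:] through unchanged."""
    rot, passthrough = x[..., :rotary_dim], x[..., rotary_dim:]
    half = rotary_dim // 2
    # rotate_half: (x1, x2) -> (-x2, x1)
    rotated = np.concatenate([-rot[..., half:], rot[..., :half]], axis=-1)
    out = rot * cos + rotated * sin
    return np.concatenate([out, passthrough], axis=-1)

# toy example matching the Qwen3.5 shape: head_dim=256, rotary_dim=64
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 256))
ang = rng.standard_normal(32)              # one angle per rotary pair
cos = np.concatenate([np.cos(ang)] * 2)    # duplicated neox-style
sin = np.concatenate([np.sin(ang)] * 2)
y = neox_partial_rope(x, cos, sin)
assert y.shape == x.shape
assert np.allclose(y[..., 64:], x[..., 64:])   # pass-through region unchanged
```

Because each pair is a pure rotation, the norm of the rotary slice is preserved, which is a convenient sanity check for the kernel output as well.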
2. Routing-logic update (custom_ops/gpu_ops/append_attn/gqa_rope_write_cache.cu)
- New head_dim==256 branch: rotary_dim is inferred automatically from the embedding tensor shape, and the call is routed to the Qwen3.5 kernel.
- The AppendCacheKV<data_t, 128, 64> calls are unified under the DISPATCH_GQA_ROPE_HEAD_DIM macro, so the KV-cache write path now supports head_dim=256.
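The shape-based routing can be sketched as follows (a hedged Python sketch of the C++ branch; `select_rope_kernel` and the `[2, max_seq_len, rotary_dim]` embedding layout are assumptions for illustration, not the actual signatures in the .cu file):

```python
import numpy as np

def select_rope_kernel(rope_emb, head_dim):
    """Pick a RoPE path from the embedding tensor shape rather than
    hard-coding rotary_dim (illustrative only).
    rope_emb is assumed shaped [2, max_seq_len, rotary_dim] (cos/sin stacked)."""
    rotary_dim = rope_emb.shape[-1]
    if head_dim == 256:
        # partial RoPE: rotary_dim (e.g. 64) < head_dim
        return "qwen3_5_neox_partial", rotary_dim
    elif head_dim == 128:
        return "qwen3_interleaved", rotary_dim
    raise ValueError(f"unsupported head_dim {head_dim}")

rope_emb = np.zeros((2, 1024, 64))
kernel, rotary_dim = select_rope_kernel(rope_emb, head_dim=256)
```

Inferring rotary_dim from the tensor shape keeps the dispatch correct even if partial_rotary_factor changes upstream.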
3. Python-layer RoPE fix and MRoPE support (fastdeploy/model_executor/layers/rotary_embedding.py)
- QwenRotaryEmbedding previously accepted a partial_rotary_factor argument but never applied it; it is now applied correctly: rotary_dim = int(head_dim * partial_rotary_factor).
- New mrope_section parameter to support Qwen3.5-VL multimodal RoPE, plus a new apply_interleaved_mrope method that merges the T/H/W position-frequency groups in an interleaved fashion.
- get_rope_impl now reads mrope_section from rope_parameters and passes it to QwenRotaryEmbedding.
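An interleaved T/H/W merge of the kind apply_interleaved_mrope performs can be sketched like this (an assumption-laden sketch: the function name suffix, the round-robin pattern, and the example section sizes are illustrative; only the idea that mrope_section gives per-axis channel counts comes from the PR):

```python
import numpy as np

def interleaved_mrope_sketch(freqs, mrope_section):
    """freqs: [3, seq_len, dim] position frequencies for the T/H/W axes.
    mrope_section: channels owned by each axis, e.g. [16, 24, 24] (sum == dim).
    Channels are handed out round-robin T, H, W, T, H, W, ... skipping any
    axis whose section is already exhausted (pattern is illustrative)."""
    _, seq_len, dim = freqs.shape
    assert sum(mrope_section) == dim
    out = np.empty((seq_len, dim))
    remaining = list(mrope_section)
    axis = 0
    for ch in range(dim):
        while remaining[axis % 3] == 0:   # skip exhausted axes
            axis += 1
        out[:, ch] = freqs[axis % 3, :, ch]
        remaining[axis % 3] -= 1
        axis += 1
    return out

# tag each axis with a constant so the interleaving is visible: T=0, H=1, W=2
freqs = np.stack([np.full((8, 64), v) for v in (0.0, 1.0, 2.0)])
merged = interleaved_mrope_sketch(freqs, [16, 24, 24])
```

With the constant-tagged input, the first three channels come from T, H, W in turn, and each axis contributes exactly its section's worth of channels.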
4. Shared-memory bug fixes (custom_ops/gpu_ops/append_attn/encoder_write_cache_with_rope_impl.cuh)
- The static __shared__ array declarations in the append_write_cache_kv_c8_qkv, append_write_cache_kv_c8_qkv_dynamic, and append_write_cache_kv_c4_qkv kernels are changed to dynamic shared memory (extern __shared__ char dyn_smem_buf[]), avoiding compile-time or runtime problems at large HEAD_DIM.
- The CascadeAppendWriteCacheKVC8QKV and CascadeAppendWriteCacheKVC4QKV launchers incorrectly passed 0 as the shared-memory size at kernel launch; they now pass the properly computed smem_size.
5. Unit tests (tests/layers/test_qwen35_rope.py)
- TestQwenRotaryEmbedding: verifies that partial_rotary_factor actually takes effect, plus output shapes, cos/sin numerical correctness, and various MRoPE-section scenarios.
- TestGqaRopeWriteCacheQwen35: end-to-end CUDA kernel tests, covering Qwen3.5 (head_dim=256) neox partial RoPE numerical correctness (compared against a rotate_half reference implementation), verification that V is not rotated, verification that the pass-through region is unchanged, and a Qwen3 (head_dim=128) regression test.

Usage or Command
python tests/layers/test_qwen35_rope.py

Accuracy Tests
The outputs of the new CUDA kernel are compared numerically against a PyTorch rotate_half reference implementation in the test_neox_partial_rope_correctness test case, with accuracy verified at atol=1e-2.
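The structure of such a comparison can be illustrated in NumPy (a sketch only: NumPy stands in for both the kernel output and the PyTorch reference, and the half-precision cast simulates the kind of precision gap that motivates a loose atol=1e-2 tolerance):

```python
import numpy as np

def rotate_half_ref(x, cos, sin):
    """rotate_half-style RoPE reference, written with NumPy."""
    half = x.shape[-1] // 2
    rotated = np.concatenate([-x[..., half:], x[..., :half]], axis=-1)
    return x * cos + rotated * sin

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
ang = rng.standard_normal(32)
cos = np.concatenate([np.cos(ang)] * 2)
sin = np.concatenate([np.sin(ang)] * 2)

ref = rotate_half_ref(x, cos, sin)  # float64 "golden" reference
# simulate a lower-precision kernel path by round-tripping through float16
approx = rotate_half_ref(x.astype(np.float16),
                         cos.astype(np.float16),
                         sin.astype(np.float16)).astype(np.float64)
assert np.allclose(approx, ref, atol=1e-2)
```

For float16 inputs of order 1, per-element rounding error is around 1e-3, so atol=1e-2 leaves comfortable headroom while still catching real logic errors such as rotating the wrong slice.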
Checklist
- PR type tagged as [Feature].
- pre-commit run before commit.