[XPU] Refactor get_padding_offset to single kernel. #7029

Jiajun-Ji wants to merge 3 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
Pull request overview
This PR refactors the XPU get_padding_offset into a single-kernel implementation, aligning with the GPU-side approach and reducing the overhead of multiple kernel launches.
Changes:
- Adjust the XPU Python-side get_padding_offset call signature: pass only input_ids/seq_len/cpu_token_num.
- Update the XPU custom-op/plugin interface: remove the cum_offsets/padding_offset input/output dependencies; outputs become ids_remove_padding + batch_id_per_token + cum_offsets_out + cu_seqlens_*.
- Merge the padding_offset computation and remove_padding logic on Kunlun3 (XPU3) into a single XPU kernel.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/xpu_pre_and_post_process.py | Update how the XPU preprocessing flow calls get_padding_offset and which arguments it passes |
| custom_ops/xpu_ops/src/plugin/src/wrapper/get_padding_offset.cpp | Plugin wrapper changes: single kernel call, interface parameters reordered/trimmed |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/get_padding_offset.xpu | New Kunlun3 single-kernel implementation (computes cu_seqlens/cum_offsets + remove padding + batch_id mapping) |
| custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Update the exported get_padding_offset interface signature in the plugin header |
| custom_ops/xpu_ops/src/ops/pybind/pybind.cc | Update the get_padding_offset Python signature exposed via pybind |
| custom_ops/xpu_ops/src/ops/get_padding_offset.cc | Update the Paddle extension-side op implementation and OP registration: remove inputs, add the cpu_token_num attribute/parameter |
```diff
@@ -148,10 +149,10 @@
 (
     ids_remove_padding,
     cum_offsets,
     batch_id_per_token,
     cu_seqlens_q,
     cu_seqlens_k,
-) = get_padding_offset(input_ids, cum_offsets_now, token_num, seq_lens_this_time)
+) = get_padding_offset(input_ids, seq_lens_this_time, token_num_cpu)
```
Here token_num_cpu = paddle.sum(...).cpu() yields a CPU Tensor, not a Python scalar. But the third parameter of the XPU get_padding_offset binding is now int64_t cpu_token_num, so passing a Tensor raises a type error during pybind conversion and crashes at runtime. Consider extracting a scalar from the Tensor (e.g. int(paddle.sum(seq_lens_this_time).item())) and avoiding the duplicate sum computation.
```python
cum_offsets_now = paddle.cumsum(max_len - seq_lens_this_time, dtype="int32")
token_num = paddle.sum(seq_lens_this_time)
token_num_cpu = paddle.sum(seq_lens_this_time).cpu()

if use_speculate_method:
```
cum_offsets_now and token_num are no longer used on the non-speculate branch, yet the cumsum/sum are computed on every call, and paddle.sum(...).cpu() is computed once more on top of that, introducing unnecessary XPU computation and (once the bug above is fixed) synchronization overhead. Consider moving cum_offsets_now/token_num into the use_speculate_method branch and reusing the result of a single sum.
```diff
-cum_offsets_now = paddle.cumsum(max_len - seq_lens_this_time, dtype="int32")
-token_num = paddle.sum(seq_lens_this_time)
-token_num_cpu = paddle.sum(seq_lens_this_time).cpu()
-if use_speculate_method:
+token_num = paddle.sum(seq_lens_this_time)
+token_num_cpu = token_num.cpu()
+if use_speculate_method:
+    cum_offsets_now = paddle.cumsum(
+        max_len - seq_lens_this_time, dtype="int32"
+    )
```
```cpp
__shared__ int sm_seq_lens[MAX_BATCH_SIZE];
__shared__ int sm_cum_seq_len;
__simd__ __shared__ int buffer_cu_seqlens[64];

if (cid == 0) {
    GM2SM(seq_lens, sm_seq_lens, sizeof(int) * bs);
}
```
The kernel's sm_seq_lens is a fixed-size shared array of MAX_BATCH_SIZE=1024, but nothing guarantees or checks bs <= 1024. Once bs exceeds that value, GM2SM(seq_lens, sm_seq_lens, sizeof(int)*bs) writes out of bounds, producing wrong results or even crashing the kernel. Consider adding an explicit assertion/error return on the wrapper side, or switching the shared buffer to an implementation that covers arbitrary bs (e.g. chunked loading or a dynamic strategy).
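As a minimal sketch of the suggested wrapper-side guard (the helper name and error convention are assumptions; the real wrapper would report the error through the plugin's error-code path rather than a sentinel):

```cpp
#include <cstdio>

// Assumed to mirror the kernel's fixed shared-buffer capacity.
constexpr int MAX_BATCH_SIZE = 1024;

// Hypothetical wrapper-side check: reject batch sizes that the kernel's
// fixed-size shared buffer cannot hold, instead of launching the kernel
// and writing out of bounds inside GM2SM.
int check_batch_size(int bs) {
    if (bs <= 0 || bs > MAX_BATCH_SIZE) {
        std::fprintf(stderr,
                     "get_padding_offset: bs=%d out of range (max %d)\n",
                     bs, MAX_BATCH_SIZE);
        return -1;  // caller should propagate an error, not launch
    }
    return 0;
}
```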
```diff
 for (int bi = clusterid; bi < bs; bi += nclusters) {
     int cum_seq_len = 0;
     for (int i = cid; i <= bi; i += ncores) {
         cum_seq_len += sm_seq_lens[i];
     }
     buffer_cu_seqlens[cid] = cum_seq_len;
     mfence();
     sync_all();

     if (cid == 0) {
-        int cum_seq_len = (i + 1) * max_seq_len - cum_offsets_lm[1];
-        mfence_lm();
-        LM2GM_ASYNC(cum_offsets_lm, cum_offsets_out + i, sizeof(int));
-        LM2GM_ASYNC(&cum_seq_len, cu_seqlens_q + i + 1, sizeof(int));
-        LM2GM(&cum_seq_len, cu_seqlens_k + i + 1, sizeof(int));
+        cum_seq_len =
+            primitive_reduce_sum_sm(buffer_cu_seqlens, min(bi + 1, ncores));
+
+        LM2GM_ASYNC(&cum_seq_len, cu_seqlens_q + bi + 1, sizeof(int));
+        LM2GM_ASYNC(&cum_seq_len, cu_seqlens_k + bi + 1, sizeof(int));
+
+        int cum_offset = bi * max_seq_len - (cum_seq_len - sm_seq_lens[bi]);
+        LM2GM(&cum_offset, cum_offsets_out + bi, sizeof(int));
+
+        sm_cum_seq_len = cum_seq_len;
     }
     mfence();
     sync_all();
```
The current implementation recomputes cum_seq_len for each bi by iterating over seq_lens[0..bi] (plus one reduce per bi), so overall complexity is O(bs^2); compared with the old implementation's O(1) indexing via cum_offsets, this noticeably amplifies overhead at large batch sizes. Consider doing a one-shot prefix sum (scan) in shared/GM and then reading cum_seq_len[bi] directly, or at least avoid re-accumulating from 0 for each bi.
```diff
 (
     ids_remove_padding,
     cum_offsets,
     batch_id_per_token,
     cu_seqlens_q,
     cu_seqlens_k,
-) = get_padding_offset(input_ids, cum_offsets_now, token_num, seq_lens_this_time)
+) = get_padding_offset(input_ids, seq_lens_this_time, token_num_cpu)
```
The Modifications / Usage or Command / Accuracy Tests sections of the PR description are currently empty, but this PR changes core XPU kernel behavior. Please add the concrete changes, how to reproduce/benchmark them, and at least one set of accuracy-alignment or regression results, to ease later review and rollback triage.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           develop   #7029   +/-   ##
==========================================
  Coverage         ?    73.63%
==========================================
  Files            ?       399
  Lines            ?     56412
  Branches         ?      8919
==========================================
  Hits             ?     41538
  Misses           ?     11920
  Partials         ?      2954
```

Flags with carried forward coverage won't be shown.
```cpp
    sync_all();

    for (int bi = clusterid; bi < bs; bi += nclusters) {
        int cum_seq_len = 0;
        for (int i = cid; i <= bi; i += ncores) {
            cum_seq_len += sm_seq_lens[i];
        }
        buffer_cu_seqlens[cid] = cum_seq_len;
        mfence();
        sync_all();
```
The kernel calls sync_all() multiple times inside a clusterid-strided for loop. If sync_all is a cross-cluster global barrier (other kernels in this repo generally only use it where every cluster executes the same number of barriers), clusters with different iteration counts will deadlock or hang. Consider using sync_cluster() if only intra-cluster synchronization is needed, or restructure so that all clusters execute a consistent number of barriers (e.g. a global tid/nthreads parallelization instead of a clusterid-stride loop).
```cpp
__shared__ int sm_seq_lens[MAX_BATCH_SIZE];
__shared__ int sm_cum_seq_len;
__simd__ __shared__ int buffer_cu_seqlens[64];
```
sm_seq_lens uses a fixed-size MAX_BATCH_SIZE=1024 shared buffer without any upper-bound check on bs; when bs > 1024, GM2SM writes past the shared memory buffer, which is undefined behavior. Add an explicit check/assertion (bs <= MAX_BATCH_SIZE) at the start of the wrapper or kernel, or switch to chunked loading / a dynamic allocation scheme.
```cpp
// Ensure bs does not exceed the shared memory buffer capacity
if (bs > MAX_BATCH_SIZE) {
    return;
}
```
```cpp
std::vector<int64_t> input_ids_shape = input_ids.shape();
const int bsz = seq_len.shape()[0];
const int seq_length = input_ids_shape[1];
auto cum_offsets_out = cum_offsets.copy_to(cum_offsets.place(), false);
auto cpu_token_num = token_num.copy_to(paddle::CPUPlace(), false);
const int token_num_data = static_cast<int>(cpu_token_num);
```
Directly static_cast-ing cpu_token_num truncates whenever the value exceeds the int range, which then corrupts the output tensor shape allocation and can cause out-of-bounds access. Validate cpu_token_num's range (>= 0, <= bsz*seq_length, <= INT_MAX), and preferably keep int64_t in the downstream logic to avoid implicit overflow.
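A minimal sketch of the suggested validation as a standalone helper (the name and the sentinel return are hypothetical; the real wrapper would raise a Paddle error instead):

```cpp
#include <climits>
#include <cstdint>

// Hypothetical range-checked handling of the token count read back from the
// CPU tensor. Returns -1 for any out-of-range value instead of silently
// truncating via static_cast<int>.
int64_t checked_token_num(int64_t cpu_token_num, int bsz, int seq_length) {
    const int64_t max_tokens = static_cast<int64_t>(bsz) * seq_length;
    if (cpu_token_num < 0 || cpu_token_num > max_tokens ||
        cpu_token_num > INT_MAX) {
        return -1;  // caller should report an error, not launch the kernel
    }
    return cpu_token_num;  // kept as int64_t to avoid implicit overflow
}
```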
EmmonsCurse left a comment:
LGTM~ Skip coverage check as it mainly relies on XPU end-to-end tests.
Motivation
Rewrite get_padding_offset kernel to align with GPU implementation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.