[Feature]Support reorder ids to split prefill and decodes #5779
base: develop
Conversation
root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.

Thanks for your contribution!
Codecov Report
❌ Patch coverage is … . Additional details and impacted files:

@@           Coverage Diff            @@
##           develop    #5779   +/-  ##
=========================================
  Coverage         ?   66.48%
=========================================
  Files            ?      348
  Lines            ?    44749
  Branches         ?     6867
=========================================
  Hits             ?    29753
  Misses           ?    12806
  Partials         ?     2190

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
| self.model_inputs["input_ids"][idx : idx + 1, :input_length] = np.array([5] * input_length) | ||
| self.model_inputs["eos_token_id"][:] = np.array([2], dtype="int64").reshape(-1, 1) | ||
| self.seq_lens_this_time_buffer[idx : idx + 1] = input_length | ||
| self.model_inputs["seq_lens_this_time_buffer"][idx : idx + 1] = input_length |
Here `model_inputs` is already an object; is it reasonable that dict-style key-value access still coexists with it?
Originally `seq_lens_this_time_buffer` in this logic was a member variable of `MTPProposer`; now it has been merged back into `model_inputs`. Does this have any other impact?
The changes would touch too many places, so the key-based access interface is kept. `gpu_model_runner` uses `InputBatch`, and MTP's object is its own independent `ProposerInputBatch`. The difference between a member variable of `MTPProposer` and a member variable of `ProposerInputBatch` should just be one extra layer of wrapping. Also, `seq_lens_this_time_buffer` is strongly tied to the req id, so it can only take part in the reordering if it lives inside `InputBatch`.
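To illustrate this point, here is a minimal sketch (class and method names such as `InputBatchSketch` and `reorder` are hypothetical, not the PR's actual API): keeping per-request tensors like `seq_lens_this_time_buffer` inside the batch, behind dict-style key access, lets a single permutation reorder everything consistently with the request ids.

```python
import numpy as np


class InputBatchSketch:
    """Container for per-request tensors with dict-style key access."""

    def __init__(self, max_num_seqs: int):
        # Per-request tensors live inside the batch so a single permutation
        # reorders all of them consistently with the request ids.
        self._tensors = {
            "seq_lens_this_time_buffer": np.zeros([max_num_seqs], dtype="int32"),
            "input_ids": np.zeros([max_num_seqs, 8], dtype="int64"),
        }

    # Dict-style access, kept for backward compatibility with existing call sites.
    def __getitem__(self, key):
        return self._tensors[key]

    def __setitem__(self, key, value):
        self._tensors[key] = value

    def reorder(self, perm):
        # Apply one permutation to every per-request tensor (rows = requests).
        for key, value in self._tensors.items():
            self._tensors[key] = value[perm]


batch = InputBatchSketch(max_num_seqs=4)
batch["seq_lens_this_time_buffer"][:] = [3, 1, 5, 1]
batch.reorder(np.array([2, 0, 1, 3]))  # e.g. prefills first, then decodes
print(batch["seq_lens_this_time_buffer"])  # [5 3 1 1]
```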
  req_len = len(req_dicts)

  self.model_inputs["num_running_requests"] = num_running_requests
  self.model_inputs["running_requests_ids"] = range(num_running_requests)
Same as above.
- # self.model_inputs["seq_lens_this_time"] = self.seq_lens_this_time_buffer[:num_running_requests]
- self.model_inputs["seq_lens_this_time"] = self.seq_lens_this_time_buffer
+ # self.model_inputs["seq_lens_this_time"] = self.model_inputs["seq_lens_this_time_buffer"][:num_running_requests]
+ self.model_inputs.seq_lens_this_time = self.model_inputs["seq_lens_this_time_buffer"]
Same as above.
  self.proposer = NgramProposer(self.fd_config)
  elif self.speculative_method == "mtp":
- self.share_inputs["seq_lens_this_time"] = self.seq_lens_this_time_buffer
+ self.share_inputs["seq_lens_this_time"] = self.share_inputs["seq_lens_this_time_buffer"]
Same question as for `MTPProposer`.
This PR adds a lot of lines; unit-test coverage needs to be filled in, especially for input_batch.py.
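As a hedged sketch of the kind of test that could cover input_batch.py, reusing the illustrative `InputBatchSketch` class from the earlier sketch (not the real API):

```python
import numpy as np


def test_reorder_keeps_tensors_consistent():
    batch = InputBatchSketch(max_num_seqs=3)
    batch["seq_lens_this_time_buffer"][:] = [5, 1, 2]
    batch["input_ids"][:, 0] = [10, 20, 30]
    batch.reorder(np.array([0, 2, 1]))  # prefills (5, 2) first, decode (1) last
    assert list(batch["seq_lens_this_time_buffer"]) == [5, 2, 1]
    # Rows of every other per-request tensor moved with the same permutation.
    assert list(batch["input_ids"][:, 0]) == [10, 30, 20]
```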
  image_features_list.append(paddle.concat(merge_image_features, axis=0))
  for _, index in req_idx_img_index_map.items():
      if index != -1:
          self.share_inputs["image_features_list"][idx] = image_features_list[index]
Could you explain: was the earlier shape mismatch because this only looped once, or was it some other problem?
Right, you can think of it as only looping once. Also, video inputs used to not be bound to a req_id, which made reordering very difficult. Now a new list is bound to req_id, so every append into `image_features_list` is the image features of one specific req_id.
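A rough sketch of that binding idea (all names here are illustrative, not the PR's code): once features are keyed by req_id, reordering the batch is just walking the req_ids in their new order.

```python
import numpy as np

# req_id -> image feature array; each entry holds exactly one request's features.
image_features_by_req = {}


def on_request(req_id, merged_image_features):
    # "Every append into image_features_list is one req_id's image features."
    image_features_by_req[req_id] = merged_image_features


def gather_in_order(req_ids_in_new_order):
    # After a reorder, features travel with their req_id instead of relying
    # on a fragile positional index into a shared list.
    return [image_features_by_req[r] for r in req_ids_in_new_order if r in image_features_by_req]


on_request("req-0", np.zeros([4, 16]))
on_request("req-2", np.ones([2, 16]))
print(len(gather_in_order(["req-2", "req-0", "req-1"])))  # 2 (req-1 has no images)
```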
  )
  )

  if self.encoder_cache is not None:
Same as above.
  img_index = img_index + 1
  inputs = request.multimodal_inputs
  if self.encoder_cache is not None:
      if envs.FD_ENABLE_MAX_PREFILL:
It feels like the encoder_cache path here also needs `feature_position_list_batches` to record each request's position information?
Yes, it does. Unit tests have been added; the remaining coverage gap is that the `get_attention_meta` function is not hit.
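For what the per-request position bookkeeping might look like conceptually, a hedged sketch (the name `feature_position_list_batches` comes from the conversation above; the tuple layout is an assumption for illustration only):

```python
# (req_id, start offset, number of feature tokens) per request, so the
# feature slice can be located again after the batch has been reordered.
feature_position_list_batches = []


def record_feature_positions(req_id, start, length):
    feature_position_list_batches.append((req_id, start, length))


record_feature_positions("req-0", 0, 128)
record_feature_positions("req-1", 128, 64)
```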

Motivation
PR from #5194
To better integrate third-party Attention backends, the inputs need to be reordered so that prefill tokens and decode tokens are separated. This PR adds that reordering support, currently covering the basic scenario and the speculative-decoding scenario.
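As a rough illustration of the reordering idea (the arrays and variable names below are illustrative, not the PR's code): requests with seq_len > 1 are prefills and seq_len == 1 are decodes, and a permutation groups one kind before the other.

```python
import numpy as np

seq_lens_this_time = np.array([1, 7, 1, 3, 1])  # a mixed prefill/decode batch
prefill_idx = np.where(seq_lens_this_time > 1)[0]
decode_idx = np.where(seq_lens_this_time == 1)[0]
perm = np.concatenate([prefill_idx, decode_idx])
print(perm)                      # [1 3 0 2 4]
print(seq_lens_this_time[perm])  # [7 3 1 1 1] -- prefills first, then decodes
```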
Modifications
P/D reordering currently supports only the CUDA backend.
1. Add an `InputBatch` structure to manage `gpu_model_runner`'s share_inputs and a `ProposerInputBatch` structure to manage MTP's share_inputs, and add `reorder_split_prefill_and_decode` and `condense` functions to support reordering (see the sketch after this list).
2. Merge develop.
3. Add a req_id -> img_features mapping for each VL request to make reordering easier.
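A hedged sketch of what `reorder_split_prefill_and_decode` and `condense` might do conceptually; the real signatures in input_batch.py may differ.

```python
import numpy as np


def condense(seq_lens):
    # Keep only running slots (seq_len > 0) so requests are contiguous.
    return np.where(seq_lens > 0)[0]


def reorder_split_prefill_and_decode(seq_lens):
    # Permutation that puts prefill requests (seq_len > 1) before decodes.
    return np.concatenate([np.where(seq_lens > 1)[0], np.where(seq_lens == 1)[0]])


seq_lens = np.array([1, 4, 0, 2])   # slot 2 has finished
kept = condense(seq_lens)           # [0 1 3]
order = reorder_split_prefill_and_decode(seq_lens[kept])
print(kept[order])                  # [1 3 0]: prefills (4, 2) first, decode (1) last
```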
Usage or Command
Add the class variable `enable_ids_reorder` to your AttentionBackend and set it to True to enable P/D reordering.
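A minimal usage sketch; only the `enable_ids_reorder` class variable comes from the PR description, the class name is a stand-in for the real base class.

```python
class MyAttentionBackend:  # in practice: subclass the actual AttentionBackend
    # Opting in: the model runner will reorder ids so prefill tokens and
    # decode tokens arrive split apart before this backend is called.
    enable_ids_reorder = True
```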
Accuracy Tests
Checklist
- Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax].
- Format your code: run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.