[Optimization] Merge Text processor #7030
luukunn wants to merge 8 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
Pull request overview
This PR merges and unifies the text-side processor implementations: it lifts the request/response handling logic previously split between DataProcessor and Ernie4_5Processor into a shared base class, and adds a new TextProcessor that covers both the auto and ernie4_5 tokenizer paths with a single class, simplifying the dispatch logic in InputPreprocessor.create_processor().
Changes:
- Add BaseTextProcessor (fastdeploy/input/base_processor.py), centralising request/response handling and shared utility methods.
- Add TextProcessor and adjust InputPreprocessor.create_processor(): non-V1, text-only scenarios now uniformly route through TextProcessor (tokenizer_type=auto/ernie4_5).
- Turn Ernie4_5Processor into a deprecated wrapper class, and add tokenizer_type to the VL processor so it fits the unified incremental-decoding logic; update the related unit-test mocks accordingly.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/input/test_preprocess.py | Update routing unit tests: the mock target changes from DataProcessor to TextProcessor. |
| fastdeploy/input/text_processor.py | Introduce TextProcessor; DataProcessor now inherits the new shared base class and drops its local request/response handling implementation. |
| fastdeploy/input/preprocess.py | Text-only, non-V1 paths now uniformly use TextProcessor, choosing tokenizer_type by architecture. |
| fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py | Add tokenizer_type="ernie4_5" so the inherited incremental-decoding branch is selected correctly. |
| fastdeploy/input/ernie4_5_processor.py | Turn Ernie4_5Processor into a deprecated wrapper that forwards to TextProcessor(tokenizer_type='ernie4_5'). |
| fastdeploy/input/base_processor.py | Add the shared base class BaseTextProcessor, implementing common request/response handling and utility methods. |
Comments suppressed due to low confidence (1)
fastdeploy/input/text_processor.py:248
DataProcessor now inherits BaseTextProcessor, and BaseTextProcessor.process_response_dict_* assumes that ids2tokens() always returns a 3-tuple (delta_text, previous_token_ids, previous_texts). However, the current DataProcessor.ids2tokens() still returns a single string in the envs.FD_USE_HF_TOKENIZER=True branch, which breaks tuple unpacking during streaming / non-streaming response handling (e.g. delta_text, _, previous_texts = ...). Consider having DataProcessor reuse BaseTextProcessor.ids2tokens() directly (deleting the override), or changing the HF branch to return the same 3-tuple as the base class.
```python
class DataProcessor(BaseTextProcessor):
    def __init__(self, model_name_or_path, reasoning_parser_obj=None, tool_parser_obj=None):
        """
        Initializes the DecodeStatus object.
```
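A minimal, self-contained sketch of the contract mismatch described in this comment. The function bodies and names below are illustrative stand-ins, not the real FastDeploy signatures: the point is only that a bare-string return cannot satisfy a caller that unpacks three values.

```python
def ids2tokens_old(token_ids, task_id):
    # Old DataProcessor HF-tokenizer branch: returns a bare string
    return "decoded text"

def ids2tokens_new(token_ids, task_id):
    # Base-class contract: (delta_text, previous_token_ids, previous_texts)
    return "decoded text", [], ""

# The base class unpacks three values, so the old return shape fails:
try:
    delta_text, _, previous_texts = ids2tokens_old([1, 2], "task-0")
except ValueError as exc:
    error = str(exc)  # a "values to unpack" error, not a decoded result

# The 3-tuple shape unpacks cleanly:
delta_text, _, previous_texts = ids2tokens_new([1, 2], "task-0")
```

This is why the review suggests either deleting the override or aligning the HF branch's return shape with the base class.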
```
    * tool_parser result never updates ``outputs["text"]``; only ``tool_calls`` is
      set. This matches DataProcessor behaviour.
```

This design note does not match the implementation: the docstring says tool_parser results never update outputs["text"], but process_response_dict_streaming() assigns outputs["text"] = tool_call_delta_message.content when it parses tool_calls. Either update the documentation (distinguishing streaming from non-streaming behaviour) or adjust the implementation to match the documented contract.

Suggested change:

```
    * In non-streaming mode, tool_parser results never update ``outputs["text"]``;
      only ``tool_calls`` is set, matching ``DataProcessor`` behaviour. In
      streaming mode, ``process_response_dict_streaming()`` may temporarily set
      ``outputs["text"]`` to the current ``tool_call_delta_message.content`` while
      parsing tool calls.
```
```python
if token_id == think_start_id:
    started = True
    ended = False
    in_thinking = True
elif token_id == think_end_id and in_thinking:
    ended = True
    in_thinking = False
```

tokens_after_start is initialised to 0 but never incremented while iterating token_list, so think_prompt_tokens_after_start is always 0. This field is read by fastdeploy/model_executor/logits_processor/thinking_budget.py to update the reasoning-budget state, and a constant 0 distorts the budget logic. Consider counting non-marker tokens after the <think> marker, or removing the field so it does not mislead.

Suggested change:

```python
if token_id == think_start_id:
    # Enter thinking segment; do not count the start token itself
    started = True
    ended = False
    in_thinking = True
elif token_id == think_end_id and in_thinking:
    # Leave thinking segment; do not count the end token itself
    ended = True
    in_thinking = False
elif in_thinking:
    # Count tokens inside the thinking segment, excluding start/end markers
    tokens_after_start += 1
```
```python
tokenizer_type = "ernie4_5" if ErnieArchitectures.contains_ernie_arch(architecture) else "auto"
self.processor = TextProcessor(
    model_name_or_path=self.model_name_or_path,
    tokenizer_type=tokenizer_type,
    reasoning_parser_obj=reasoning_parser_obj,
    tool_parser_obj=tool_parser_obj,
```

This adds a branch that selects tokenizer_type (ernie4_5 vs auto) by architecture and routes through TextProcessor, but the current unit tests only cover the "non-Ernie, non-multimodal" path. Consider adding a case covering an Ernie architecture (e.g. an ERNIE4.5-related architecture string) that asserts tokenizer_type is set to ernie4_5 and verifies the TextProcessor constructor arguments.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #7030   +/-   ##
==========================================
  Coverage         ?   72.98%
==========================================
  Files            ?      400
  Lines            ?    56289
  Branches         ?     8884
==========================================
  Hits             ?    41081
  Misses           ?    12300
  Partials         ?     2908
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
```
    * ``__init__`` only initialises response-handling state (decode_status,
      model_status_dict, tool_parser_dict). Tokeniser setup is the responsibility
      of each subclass. Subclasses that do not call ``super().__init__()`` must
      initialise those three attributes themselves.
```

The "Key design decisions" section in the file header says "__init__ only initialises response-handling state… Tokeniser setup is the responsibility of each subclass", but the current BaseTextProcessor.__init__ loads GenerationConfig, calls _load_tokenizer, and initialises the eos/pad/tokenizer.pad_token_id state. The description contradicts the implementation and is likely to mislead future maintainers. Update the note to match the actual initialisation flow.

Suggested change:

```
    * ``__init__`` performs the full initialisation sequence: it loads the
      generation config (if available), calls ``_load_tokenizer`` to construct
      the tokenizer, configures EOS / pad token ids, and initialises the core
      response-handling state (``decode_status``, ``model_status_dict``,
      ``tool_parser_dict``).
    * Concrete subclasses are responsible for implementing ``_load_tokenizer``
      and ``text2ids``. Subclasses that override ``__init__`` must call
      ``super().__init__(...)`` or replicate this initialisation logic
      themselves.
```
```python
with patch("fastdeploy.input.utils.process_stop_token_ids", lambda *args, **kwargs: None):
    processed = processor.process_request_dict(request, max_model_len=16)
    self.assertEqual(
```

This also patches fastdeploy.input.utils.process_stop_token_ids, but TextDataProcessor.process_request_dict is actually implemented in fastdeploy.input.base_processor.BaseTextProcessor, which does `from fastdeploy.input.utils import process_stop_token_ids` and calls it by the local name. Patching utils therefore does not affect the reference inside BaseTextProcessor, and the test may still execute the real logic. Patch fastdeploy.input.base_processor.process_stop_token_ids instead (and adjust the other similar patches in this file).
```python
self.gen_patcher = patch(f"{TEXT_PROCESSOR_PATH}.GenerationConfig.from_pretrained", return_value=MagicMock())
self.tokenizer_patcher = patch(
```

The patch target in setUp has been changed to fastdeploy.input.text_processor.GenerationConfig.from_pretrained, but TextProcessor/Ernie4_5Processor now import and call GenerationConfig.from_pretrained in fastdeploy.input.base_processor. The current patch may not take effect, causing the unit test to unexpectedly read a real generation_config. Change the patch target to fastdeploy.input.base_processor.GenerationConfig.from_pretrained (or whichever module path is actually called).
```diff
@@ -18,16 +18,10 @@
 from collections import OrderedDict
```

The PR description is still mostly template placeholders (Motivation / Modifications / Usage or Command / Accuracy Tests are unfilled, and the checklist is unchecked), which makes it hard for reviewers to confirm the motivation for merging the text processors, the scope of behaviour changes, and how the change was verified. Please fill in: why the merge is needed, whether there are compatibility or behaviour differences, and how to regression-test (at least the unit tests/commands or key scenarios). Also, per the template, add a space after the tag in the title (e.g. [Optimization] Merge Text processor).
```python
if envs.FD_USE_HF_TOKENIZER:
    if task_id not in self.decode_status:
        # [all_token_ids, list_of_deltas, full_accumulated_string]
        self.decode_status[task_id] = [[], [], ""]
    status = self.decode_status[task_id]
    status[0].extend(token_id)
    decode_str = self.tokenizer.batch_decode(
        [status[0]],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    if isinstance(decode_str, list) and len(decode_str):
        new_str = decode_str[0].replace(status[2], "", 1)
        status[1].append(new_str)
        status[2] = decode_str[0]
    else:
        new_str = ""
    # Return consistent three-tuple; previous_token_ids not available.
    return new_str, [], status[2]
```

In the HF tokenizer branch, BaseTextProcessor.ids2tokens returns the updated accumulated text (status[2]) as the third value. Per the docstring semantics, it should return previous_texts, i.e. the accumulated text before the update; otherwise the previous_texts + delta_text concatenation in process_response_dict_streaming duplicates the current delta, so reasoning/tool streaming parsing receives a wrong full_text. Save the old value before updating status[2] and return the old value.
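A minimal sketch of the suggested fix, assuming the same decode_status layout as the snippet above ([all_token_ids, deltas, accumulated_text]). The tokenizer and surrounding class are stubbed out as a plain decode_fn, so the names here are illustrative only.

```python
def hf_ids2tokens(decode_status, task_id, token_ids, decode_fn):
    """Return (delta_text, previous_token_ids, previous_texts)."""
    if task_id not in decode_status:
        # [all_token_ids, list_of_deltas, full_accumulated_string]
        decode_status[task_id] = [[], [], ""]
    status = decode_status[task_id]
    status[0].extend(token_ids)
    full_text = decode_fn(status[0])
    previous_texts = status[2]           # save BEFORE updating status[2]
    new_str = full_text.replace(previous_texts, "", 1)
    status[1].append(new_str)
    status[2] = full_text
    # previous_texts + new_str now reconstructs full_text exactly once,
    # so streaming callers no longer double-append the current delta.
    return new_str, [], previous_texts
```

With a toy decoder that maps token id i to the letter chr(ord("a") + i), two successive calls for ids [0, 1] then [2] return deltas "ab" then "c", and the third value of the second call is "ab" (the pre-update text), not "abc".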
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- [ ] PR title tag (choose from): [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- [ ] Run pre-commit before commit.
- [ ] For a fix targeting the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.