
[Optimization] Merge Text processor #7030

Open
luukunn wants to merge 8 commits into PaddlePaddle:develop from luukunn:merge_processor_2

Conversation

@luukunn
Collaborator

@luukunn luukunn commented Mar 26, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please state the reason in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings March 26, 2026 08:31
@paddle-bot

paddle-bot bot commented Mar 26, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR merges/unifies the text-side processor implementations: the request/response handling logic previously scattered across DataProcessor and Ernie4_5Processor is lifted into a common base class, and a new TextProcessor covers both the auto and ernie4_5 tokenizer paths with a single class, simplifying the dispatch logic in InputPreprocessor.create_processor().

Changes:

  • Added BaseTextProcessor (fastdeploy/input/base_processor.py), centralising request/response handling and shared utility methods.
  • Added TextProcessor and adjusted InputPreprocessor.create_processor(): non-V1, text-only scenarios now uniformly go through TextProcessor (tokenizer_type=auto/ernie4_5).
  • Turned Ernie4_5Processor into a deprecated wrapper class, added tokenizer_type to the VL processor to fit the unified incremental-decoding logic, and updated the related unit-test mocks accordingly.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

File | Description
tests/input/test_preprocess.py | Updated routing unit tests: the mock target changed from DataProcessor to TextProcessor.
fastdeploy/input/text_processor.py | Introduced TextProcessor; DataProcessor now inherits the new common base class and drops its local request/response handling implementation.
fastdeploy/input/preprocess.py | Text-only, non-V1 paths now uniformly use TextProcessor, selecting tokenizer_type by architecture.
fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py | Added tokenizer_type="ernie4_5" so the inherited decoding branch is selected correctly.
fastdeploy/input/ernie4_5_processor.py | Ernie4_5Processor becomes a deprecated wrapper that forwards to TextProcessor(tokenizer_type='ernie4_5').
fastdeploy/input/base_processor.py | Added common base class BaseTextProcessor implementing shared request/response handling and utility methods.
Comments suppressed due to low confidence (1)

fastdeploy/input/text_processor.py:248

  • DataProcessor now inherits from BaseTextProcessor, whose process_response_dict_* methods assume ids2tokens() always returns a 3-tuple (delta_text, previous_token_ids, previous_texts). However, the current DataProcessor.ids2tokens() still returns a single string in the envs.FD_USE_HF_TOKENIZER=True branch, so streaming / non-streaming response handling will fail at unpacking (e.g. delta_text, _, previous_texts = ...). Consider having DataProcessor reuse BaseTextProcessor.ids2tokens() directly (removing the override), or changing the HF branch to return the same 3-tuple as the base class.
class DataProcessor(BaseTextProcessor):
    def __init__(self, model_name_or_path, reasoning_parser_obj=None, tool_parser_obj=None):
        """
            Initializes the DecodeStatus object.
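To illustrate the unpacking failure the comment above describes, here is a minimal, self-contained sketch; the function names are illustrative stand-ins, not the real FastDeploy implementations:

```python
def ids2tokens_hf_branch(ids):
    # Current HF branch behaviour: returns a bare string.
    return "decoded"

def ids2tokens_base(ids):
    # Base-class contract: (delta_text, previous_token_ids, previous_texts).
    return "decoded", [], ""

# BaseTextProcessor.process_response_dict_* unpacks three values:
try:
    delta_text, _, previous_texts = ids2tokens_hf_branch([1])
    raised = False
except ValueError:  # the string unpacks character-by-character and fails
    raised = True
assert raised

delta_text, _, previous_texts = ids2tokens_base([1])  # unpacks cleanly
assert delta_text == "decoded" and previous_texts == ""
```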

Comment on lines +39 to +40
* tool_parser result never updates ``outputs["text"]``; only ``tool_calls`` is
set. This matches DataProcessor behaviour.

Copilot AI Mar 26, 2026


The design note here is inconsistent with the actual implementation: the documentation says the tool_parser result never updates outputs["text"], but process_response_dict_streaming() assigns tool_call_delta_message.content to outputs["text"] when tool_calls are parsed. Either update the documentation (distinguishing streaming from non-streaming behaviour) or adjust the implementation to match the documented contract.

Suggested change
* tool_parser result never updates ``outputs["text"]``; only ``tool_calls`` is
set. This matches DataProcessor behaviour.
* In non-streaming mode, tool_parser results never update ``outputs["text"]``;
only ``tool_calls`` is set, matching ``DataProcessor`` behaviour. In
streaming mode, ``process_response_dict_streaming()`` may temporarily set
``outputs["text"]`` to the current ``tool_call_delta_message.content`` while
parsing tool calls.

Comment on lines +497 to +503
if token_id == think_start_id:
    started = True
    ended = False
    in_thinking = True
elif token_id == think_end_id and in_thinking:
    ended = True
    in_thinking = False

Copilot AI Mar 26, 2026


tokens_after_start is initialised to 0 but never incremented while iterating over token_list, so think_prompt_tokens_after_start is always 0. This field is read by fastdeploy/model_executor/logits_processor/thinking_budget.py to update the reasoning-budget state, and a constant 0 distorts the budget logic. Consider counting non-marker tokens once <think> has been seen, or removing the field to avoid misleading readers.

Suggested change
if token_id == think_start_id:
    started = True
    ended = False
    in_thinking = True
elif token_id == think_end_id and in_thinking:
    ended = True
    in_thinking = False
if token_id == think_start_id:
    # Enter thinking segment; do not count the start token itself
    started = True
    ended = False
    in_thinking = True
elif token_id == think_end_id and in_thinking:
    # Leave thinking segment; do not count the end token itself
    ended = True
    in_thinking = False
elif in_thinking:
    # Count tokens inside the thinking segment, excluding start/end markers
    tokens_after_start += 1
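The suggested counting logic can be checked in isolation. A runnable sketch, using illustrative token ids for the <think>/</think> markers:

```python
def count_thinking_tokens(token_list, think_start_id, think_end_id):
    # Mirrors the suggested loop: count tokens inside the thinking
    # segment, excluding the start/end marker tokens themselves.
    started = ended = in_thinking = False
    tokens_after_start = 0
    for token_id in token_list:
        if token_id == think_start_id:
            started, ended, in_thinking = True, False, True
        elif token_id == think_end_id and in_thinking:
            ended, in_thinking = True, False
        elif in_thinking:
            tokens_after_start += 1
    return started, ended, tokens_after_start

# 100 = <think>, 101 = </think> (illustrative ids, not the real vocab)
assert count_thinking_tokens([5, 100, 7, 8, 101, 9], 100, 101) == (True, True, 2)
assert count_thinking_tokens([5, 9], 100, 101) == (False, False, 0)
```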

Comment on lines +88 to 93
tokenizer_type = "ernie4_5" if ErnieArchitectures.contains_ernie_arch(architecture) else "auto"
self.processor = TextProcessor(
    model_name_or_path=self.model_name_or_path,
    tokenizer_type=tokenizer_type,
    reasoning_parser_obj=reasoning_parser_obj,
    tool_parser_obj=tool_parser_obj,

Copilot AI Mar 26, 2026


This adds a branch that selects tokenizer_type (ernie4_5 vs auto) by architecture and routes through TextProcessor, but the current unit tests only cover the non-Ernie, non-multimodal path. Consider adding a case for an Ernie architecture (e.g. an ERNIE4.5-related architecture string) that asserts tokenizer_type is set to ernie4_5 and verifies the TextProcessor constructor arguments.
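A hedged sketch of the selection logic such a test would exercise; the predicate below is a local stand-in for ErnieArchitectures.contains_ernie_arch (the real check lives in FastDeploy and may match differently), and the architecture strings are illustrative:

```python
def contains_ernie_arch(architecture: str) -> bool:
    # Stand-in predicate for ErnieArchitectures.contains_ernie_arch.
    return "Ernie4_5" in architecture

def select_tokenizer_type(architecture: str) -> str:
    # Mirrors the branch under review: ernie4_5 for Ernie archs, else auto.
    return "ernie4_5" if contains_ernie_arch(architecture) else "auto"

assert select_tokenizer_type("Ernie4_5ForCausalLM") == "ernie4_5"
assert select_tokenizer_type("Qwen2ForCausalLM") == "auto"
```

A real unit test would instead mock TextProcessor in fastdeploy.input.preprocess and assert the tokenizer_type keyword it was constructed with.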

@codecov-commenter

codecov-commenter commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 85.68129% with 62 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@25d64ef). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/input/base_processor.py 86.34% 30 Missing and 23 partials ⚠️
fastdeploy/input/preprocess.py 42.85% 3 Missing and 1 partial ⚠️
fastdeploy/input/text_processor.py 90.32% 1 Missing and 2 partials ⚠️
fastdeploy/input/ernie4_5_processor.py 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7030   +/-   ##
==========================================
  Coverage           ?   72.98%           
==========================================
  Files              ?      400           
  Lines              ?    56289           
  Branches           ?     8884           
==========================================
  Hits               ?    41081           
  Misses             ?    12300           
  Partials           ?     2908           
Flag Coverage Δ
GPU 72.98% <85.68%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@luukunn luukunn changed the title Merge processor 2 [Optimization]Merge Text processor Mar 27, 2026
Copilot AI review requested due to automatic review settings March 27, 2026 08:10

This comment was marked as resolved.

Copilot AI review requested due to automatic review settings March 27, 2026 11:31
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comment on lines +26 to +30
* ``__init__`` only initialises response-handling state (decode_status,
model_status_dict, tool_parser_dict). Tokeniser setup is the responsibility
of each subclass. Subclasses that do not call ``super().__init__()`` must
initialise those three attributes themselves.

Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "Key design decisions" note at the top of the file says "__init__ only initialises response-handling state… Tokeniser setup is the responsibility of each subclass", but the current BaseTextProcessor.__init__ loads GenerationConfig, calls _load_tokenizer, and initialises eos/pad/tokenizer.pad_token_id, so the description no longer matches the implementation and may mislead future maintainers. Update the note to match the current initialisation flow (or change the code to match the note).

Suggested change
* ``__init__`` only initialises response-handling state (decode_status,
model_status_dict, tool_parser_dict). Tokeniser setup is the responsibility
of each subclass. Subclasses that do not call ``super().__init__()`` must
initialise those three attributes themselves.
* ``__init__`` performs the full initialisation sequence: it loads the
generation config (if available), calls ``_load_tokenizer`` to construct
the tokenizer, configures EOS / pad token ids, and initialises the core
response-handling state (``decode_status``, ``model_status_dict``,
``tool_parser_dict``).
* Concrete subclasses are responsible for implementing ``_load_tokenizer``
and ``text2ids``. Subclasses that override ``__init__`` must call
``super().__init__(...)`` or replicate this initialisation logic
themselves.

Comment on lines +945 to 947
with patch("fastdeploy.input.utils.process_stop_token_ids", lambda *args, **kwargs: None):
    processed = processor.process_request_dict(request, max_model_len=16)
self.assertEqual(

Copilot AI Mar 27, 2026


This again patches fastdeploy.input.utils.process_stop_token_ids, but TextDataProcessor.process_request_dict is actually implemented in fastdeploy.input.base_processor.BaseTextProcessor, which does `from fastdeploy.input.utils import process_stop_token_ids` and calls it via the local name. Patching utils therefore does not affect the reference inside BaseTextProcessor, and the test may still execute the real logic. Consider patching fastdeploy.input.base_processor.process_stop_token_ids instead (and adjusting the other similar patches in this file).
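This is the standard "where to patch" rule from unittest.mock: patch the name in the namespace where it is looked up, not where it is defined. A self-contained demo, with fake_utils and fake_base_processor standing in for fastdeploy.input.utils and fastdeploy.input.base_processor:

```python
import sys
import types
from unittest.mock import patch

# Stand-in for fastdeploy.input.utils
utils = types.ModuleType("fake_utils")
utils.process_stop_token_ids = lambda req: "real"
sys.modules["fake_utils"] = utils

# Stand-in for fastdeploy.input.base_processor, which does
# `from ...utils import process_stop_token_ids`, copying the binding
# into its own namespace:
base = types.ModuleType("fake_base_processor")
base.process_stop_token_ids = utils.process_stop_token_ids

def handle(req):
    # Looks the helper up through base_processor's namespace,
    # as module-level code would via its globals.
    return base.process_stop_token_ids(req)

base.handle = handle
sys.modules["fake_base_processor"] = base

# Patching the defining module does NOT affect the copied binding:
with patch("fake_utils.process_stop_token_ids", lambda req: "mock"):
    assert base.handle(None) == "real"

# Patching the using module's binding does:
with patch("fake_base_processor.process_stop_token_ids", lambda req: "mock"):
    assert base.handle(None) == "mock"
```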

Comment on lines +142 to 143
self.gen_patcher = patch(f"{TEXT_PROCESSOR_PATH}.GenerationConfig.from_pretrained", return_value=MagicMock())
self.tokenizer_patcher = patch(

Copilot AI Mar 27, 2026


The patch target in setUp was changed to fastdeploy.input.text_processor.GenerationConfig.from_pretrained, but TextProcessor/Ernie4_5Processor now import and call GenerationConfig.from_pretrained in fastdeploy.input.base_processor. The current patch probably does not take effect, so the unit test may unexpectedly read a real generation_config. Consider changing the patch target to fastdeploy.input.base_processor.GenerationConfig.from_pretrained (or whichever module actually makes the call).

@@ -18,16 +18,10 @@
from collections import OrderedDict

Copilot AI Mar 27, 2026


The PR description is still mostly template placeholders (Motivation / Modifications / Usage / Accuracy Tests are unfilled and the checklist is unticked), which makes it hard for reviewers to confirm the motivation for merging the text processors, the scope of behavioural changes, and how they were validated. Please add: why the merge is needed, whether there are compatibility/behaviour differences, and how to run regressions (at least unit tests/commands or key scenario checks). Also, the title should follow the template format with a space after the tag (e.g. [Optimization] Merge Text processor).

Comment on lines +198 to +216
if envs.FD_USE_HF_TOKENIZER:
    if task_id not in self.decode_status:
        # [all_token_ids, list_of_deltas, full_accumulated_string]
        self.decode_status[task_id] = [[], [], ""]
    status = self.decode_status[task_id]
    status[0].extend(token_id)
    decode_str = self.tokenizer.batch_decode(
        [status[0]],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    if isinstance(decode_str, list) and len(decode_str):
        new_str = decode_str[0].replace(status[2], "", 1)
        status[1].append(new_str)
        status[2] = decode_str[0]
    else:
        new_str = ""
    # Return consistent three-tuple; previous_token_ids not available.
    return new_str, [], status[2]

Copilot AI Mar 27, 2026


In the HF-tokenizer branch, BaseTextProcessor.ids2tokens returns the updated accumulated text (status[2]) as the third value. Per the docstring it should return previous_texts, i.e. the accumulated text before the update; otherwise process_response_dict_streaming will duplicate the current delta when it computes previous_texts + delta_text, feeding the reasoning/tool streaming parsers a wrong full_text. Save the old value before updating status[2] and return that instead.
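A minimal sketch of the suggested fix: save the accumulated text before updating it and return the saved value as previous_texts. FakeTokenizer and the decode_status layout below are illustrative stand-ins for the real HF tokenizer and BaseTextProcessor state:

```python
class FakeTokenizer:
    VOCAB = {1: "Hello", 2: " world", 3: "!"}

    def batch_decode(self, batch, **kwargs):
        # Joins known token ids into text; illustration only.
        return ["".join(self.VOCAB[t] for t in ids) for ids in batch]

class Processor:
    def __init__(self):
        self.tokenizer = FakeTokenizer()
        self.decode_status = {}

    def ids2tokens(self, token_ids, task_id):
        # [all_token_ids, list_of_deltas, full_accumulated_string]
        status = self.decode_status.setdefault(task_id, [[], [], ""])
        previous_texts = status[2]  # saved BEFORE the update
        status[0].extend(token_ids)
        full_text = self.tokenizer.batch_decode(
            [status[0]], skip_special_tokens=True
        )[0]
        delta_text = full_text[len(previous_texts):]
        status[1].append(delta_text)
        status[2] = full_text
        # Three-tuple per the base-class contract; previous_token_ids is
        # not tracked in this sketch, so an empty list is returned.
        return delta_text, [], previous_texts

p = Processor()
assert p.ids2tokens([1], "t1") == ("Hello", [], "")
assert p.ids2tokens([2, 3], "t1") == (" world!", [], "Hello")
# previous_texts + delta_text now reconstructs full_text without duplication.
```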
