From 2eae4078ca7771403c39d1d15e3c5d1c715b596a Mon Sep 17 00:00:00 2001
From: pxc
Date: Tue, 30 Dec 2025 15:23:41 +0800
Subject: [PATCH 1/7] add tinker backend doc

---
 docs/sphinx_doc/source/index.rst              |   1 +
 .../source/tutorial/example_tinker_backend.md | 214 ++++++++++++++++++
 docs/sphinx_doc/source_zh/index.rst           |   1 +
 .../tutorial/example_tinker_backend.md        | 213 +++++++++++++++++
 examples/tinker/README.md                     |  39 +---
 pyproject.toml                                |   2 +-
 scripts/docker/Dockerfile.megatron            |   1 +
 scripts/docker/Dockerfile.uv                  |   4 +-
 8 files changed, 435 insertions(+), 40 deletions(-)
 create mode 100644 docs/sphinx_doc/source/tutorial/example_tinker_backend.md
 create mode 100644 docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md

diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst
index 265b7b59aa..e232fa8155 100644
--- a/docs/sphinx_doc/source/index.rst
+++ b/docs/sphinx_doc/source/index.rst
@@ -42,6 +42,7 @@ Welcome to Trinity-RFT's documentation!
    tutorial/example_react.md
    tutorial/example_search_email.md
    tutorial/example_dpo.md
+   tutorial/example_tinker_backend.md
    tutorial/example_megatron.md
    tutorial/example_data_functionalities.md
    tutorial/example_dataset_perspective.md
diff --git a/docs/sphinx_doc/source/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
new file mode 100644
index 0000000000..7efdf060de
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
@@ -0,0 +1,214 @@
# Tinker Backend

```{note}
This example demonstrates how to use Trinity-RFT with the [Tinker](https://thinkingmachines.ai/tinker/) backend, which enables model training on devices **without GPUs**.
```

## Setup Instructions

### 1. API Key Configuration

Before starting Ray, you must set the `TRINITY_API_KEY` environment variable to your Tinker API key so that Trinity-RFT can access Tinker's API:

```bash
export TRINITY_API_KEY=your_tinker_api_key
ray start --head
```

### 2. Configuration File

Configure the Tinker backend in your YAML configuration file by setting the `model.tinker` parameters as shown below:

```yaml
model:
  tinker:
    enable: true
    base_model: null
    rank: 32
    seed: null
    train_mlp: true
    train_attn: true
    train_unembed: true
```

#### Explanation of Configuration Parameters

- **`tinker`**: Tinker-specific configuration section. **Important**: When Tinker is enabled, any LoRA configuration settings (`model.lora_configs`) will be ignored.
  - **`enable`**: Whether to activate the Tinker backend. Default: `false`
  - **`base_model`**: Path to the base model for Tinker. If not specified (`null`), it defaults to the `model_path` defined elsewhere in your config
  - **`rank`**: The LoRA rank, which controls the size of the adaptation matrices. Default: `32`
  - **`seed`**: Random seed for reproducible Tinker operations. If not specified (`null`), no specific seed is set
  - **`train_mlp`**: Whether to train the MLP (feed-forward) layers. Default: `true`
  - **`train_attn`**: Whether to train the attention layers. Default: `true`
  - **`train_unembed`**: Whether to train the unembedding (output) layer. Default: `true`


## Usage

Once configured, Trinity-RFT works with the Tinker backend just like it does with the standard veRL backend. Start training with:

```bash
trinity run --config tinker.yaml # Replace with your actual config file path
```

### Important Limitations of the Tinker Backend

1. The **entropy loss** is computed differently from the veRL backend, so its values are not directly comparable.
2. **Algorithms that require `compute_advantage_in_trainer=true` are currently NOT supported**, including:
   - PPO (`algorithm.algorithm_type=ppo`)
   - Reinforce++ (`algorithm.algorithm_type=reinforceplusplus`)
   - RLOO (`algorithm.algorithm_type=rloo`)
   - On-policy distillation (`algorithm.algorithm_type=on_policy_distill`)

   Algorithms such as `grpo`, `opmd`, and `sft` are supported; more will be added in the future.

3. **Multi-stage training** is currently not supported; we will add it in the future.

> 💡 A complete example configuration file is available at [`tinker.yaml`](tinker.yaml).


## Results on the Llama-3.2-3B Model

We trained the **Llama-3.2-3B** model on the **GSM8K** dataset using both the **Tinker** and **veRL** backends. Below are the full configuration files used in our experiments.

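Both runs share the same algorithm, data, and synchronization settings; only the backend-specific sections differ. For orientation, the settings common to both experiments are condensed below. This is a sketch extracted from the two full files that follow, not a runnable config on its own, and the comments reflect our reading of the fields:

```yaml
# Settings shared by the Tinker and veRL runs (condensed from the full configs below)
algorithm:
  algorithm_type: grpo     # GRPO is within the Tinker backend's supported set
  repeat_times: 8          # 8 rollouts per task for group-relative advantages
  optimizer:
    lr: 1.0e-05
buffer:
  batch_size: 96
  total_epochs: 1          # a single pass over the GSM8K training split
  explorer_input:
    taskset:
      path: openai/gsm8k
      split: train
synchronizer:
  sync_method: checkpoint  # explorer reloads weights from saved checkpoints
  sync_interval: 1
  sync_offset: 1           # the one-step offset echoed in the "off1" run names
```
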
Click to expand: Tinker Backend Configuration + +```yaml +mode: both +project: Trinity-RFT-gsm8k +group: alignment-tinker +name: tinker-llama3.2-3B-off1 +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} +algorithm: + algorithm_type: grpo + repeat_times: 8 + kl_loss_fn_args: + kl_coef: 0.0 + optimizer: + lr: 1.0e-05 +model: + model_path: meta-llama/Llama-3.2-3B + max_prompt_tokens: 1024 + max_response_tokens: 2048 + custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" + tinker: + enable: true + base_model: meta-llama/Llama-3.2-3B +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + batch_size: 96 + total_epochs: 1 + explorer_input: + taskset: + name: taskset + storage_type: file + path: openai/gsm8k + split: train + subset_name: main + format: + prompt_key: question + response_key: answer + default_workflow_type: math_workflow + trainer_input: + experience_buffer: + name: experience_buffer + storage_type: queue +explorer: + runner_per_model: 16 + rollout_model: + engine_num: 4 + seed: 42 +trainer: + save_interval: 100 + grad_clip: 1.0 +monitor: + monitor_type: wandb +synchronizer: + sync_method: checkpoint + sync_style: fixed + sync_interval: 1 + sync_offset: 1 + sync_timeout: 1200 +``` + +
+ + +
Click to expand: veRL Backend Configuration (LoRA) + +```yaml +mode: both +project: Trinity-RFT-gsm8k +group: alignment-tinker +name: verl-llama3.2-3B-lora-off1 +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} +algorithm: + algorithm_type: grpo + repeat_times: 8 + kl_loss_fn_args: + kl_coef: 0.0 + optimizer: + lr: 1.0e-05 +data_processor: {} +model: + model_path: meta-llama/Llama-3.2-3B + max_prompt_tokens: 1024 + max_response_tokens: 2048 + custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" + lora_configs: + - name: lora + lora_rank: 32 + lora_alpha: 32 +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + batch_size: 96 + total_epochs: 1 + explorer_input: + taskset: + name: taskset + storage_type: file + path: openai/gsm8k + split: train + subset_name: main + format: + prompt_key: question + response_key: answer + default_workflow_type: math_workflow + trainer_input: + experience_buffer: + name: experience_buffer + storage_type: queue +explorer: + runner_per_model: 16 + rollout_model: + engine_num: 4 + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + dtype: bfloat16 + seed: 42 +trainer: + trainer_type: verl + save_interval: 100 + grad_clip: 1.0 +monitor: + monitor_type: wandb +synchronizer: + sync_method: checkpoint + sync_style: fixed + sync_interval: 1 + sync_offset: 1 + sync_timeout: 1200 +``` + +
+ +### Observations + +Since Llama-3.2-3B is a base (non-instruct-tuned) model, it has limited ability to follow formatting instructions. Additionally, we trained for only **one epoch**. As a result, both backends achieved final rewards just slightly above 0.1. Nonetheless, the training curves show a clear upward trend in reward, indicating successful learning. The results are visualized below: + +![Training Rewards on GSM8K](../../docs/sphinx_doc/assets/tinker-gsm8k.png) diff --git a/docs/sphinx_doc/source_zh/index.rst b/docs/sphinx_doc/source_zh/index.rst index 798ded6181..09105b12bb 100644 --- a/docs/sphinx_doc/source_zh/index.rst +++ b/docs/sphinx_doc/source_zh/index.rst @@ -40,6 +40,7 @@ tutorial/example_react.md tutorial/example_search_email.md tutorial/example_dpo.md + tutorial/example_tinker_backend.md tutorial/example_megatron.md tutorial/example_data_functionalities.md tutorial/example_dataset_perspective.md diff --git a/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md new file mode 100644 index 0000000000..609837332f --- /dev/null +++ b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md @@ -0,0 +1,213 @@ +# Tinker 后端 + +```{note} +本示例演示了如何在 Trinity-RFT 中使用 [Tinker](https://thinkingmachines.ai/tinker/),从而在**无 GPU**的设备上进行模型训练。 +``` + +## 安装与配置 + +### 1. API Key 配置 + +在启动 Ray 之前,必须将 `TRINITY_API_KEY` 环境变量设置为你的 Tinker API 密钥,以便正确访问 Tinker 的 API: + +```bash +export TRINITY_API_KEY=your_tinker_api_key +ray start --head +``` + +### 2. 配置文件 + +在 YAML 配置文件中通过如下方式设置 `model.tinker` 参数以启用 Tinker 后端: + +```yaml +model: + tinker: + enable: true + base_model: null + rank: 32 + seed: null + train_mlp: true + train_attn: true + train_unembed: true +``` + +#### 配置参数说明 + +- **`tinker`**:Tinker 专用配置部分。**注意**:启用 Tinker 后,所有 LoRA 配置(`model.lora_configs`)将被忽略。 + - **`enable`**:是否启用 Tinker 后端。默认值:`false` + - **`base_model`**:Tinker 的基础模型路径。如果未指定(`null`),则默认为配置中其他位置的 `model_path` + - **`rank`**:LoRA 的秩,控制适应矩阵的大小。默认值:`32` + - **`seed`**:Tinker 操作的随机种子。未指定(`null`)时不设定特定种子 + - **`train_mlp`**:是否训练 MLP(前馈)层。默认值:`true` + - **`train_attn`**:是否训练注意力层。默认值:`true` + - **`train_unembed`**:是否训练输出(unembedding)层。默认值:`true` + + +## 使用方法 + +配置完成后,Trinity-RFT 使用 Tinker 后端的方式与标准 veRL 后端一致。启动训练命令如下: + +```bash +trinity run --config tinker.yaml # 请替换为你的实际配置文件路径 +``` + +### Tinker 后端的功能限制 + +1. **熵损失(entropy loss)** 与 veRL 后端不完全一致。 +2. **不支持 `compute_advantage_in_trainer=true` 的算法**,包括: + - PPO(`algorithm.algorithm_type=ppo`) + - Reinforce++(`algorithm.algorithm_type=reinforceplusplus`) + - RLOO(`algorithm.algorithm_type=rloo`) + - On-policy distillation(`algorithm.algorithm_type=on_policy_distill`) + + 目前支持 `grpo`, `opmd`, `sft` 等算法,未来会支持更多算法。 + +3. **暂不支持多阶段训练**,后续会添加该功能。 + +> 💡 完整的示例配置文件见 [`tinker.yaml`](tinker.yaml)。 + + +## Llama-3.2-3B 模型实验结果 + +我们在 **GSM8K** 数据集上,分别使用 **Tinker** 和 **veRL** 后端对 **Llama-3.2-3B** 模型进行了训练。以下为实验中使用的完整配置文件。 + +
点击展开:Tinker 后端配置 + +```yaml +mode: both +project: Trinity-RFT-gsm8k +group: alignment-tinker +name: tinker-llama3.2-3B-off1 +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} +algorithm: + algorithm_type: grpo + repeat_times: 8 + kl_loss_fn_args: + kl_coef: 0.0 + optimizer: + lr: 1.0e-05 +model: + model_path: meta-llama/Llama-3.2-3B + max_prompt_tokens: 1024 + max_response_tokens: 2048 + custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" + tinker: + enable: true + base_model: meta-llama/Llama-3.2-3B +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + batch_size: 96 + total_epochs: 1 + explorer_input: + taskset: + name: taskset + storage_type: file + path: openai/gsm8k + split: train + subset_name: main + format: + prompt_key: question + response_key: answer + default_workflow_type: math_workflow + trainer_input: + experience_buffer: + name: experience_buffer + storage_type: queue +explorer: + runner_per_model: 16 + rollout_model: + engine_num: 4 + seed: 42 +trainer: + save_interval: 100 + grad_clip: 1.0 +monitor: + monitor_type: wandb +synchronizer: + sync_method: checkpoint + sync_style: fixed + sync_interval: 1 + sync_offset: 1 + sync_timeout: 1200 +``` + +
+ + +
点击展开:veRL 后端配置(LoRA) + +```yaml +mode: both +project: Trinity-RFT-gsm8k +group: alignment-tinker +name: verl-llama3.2-3B-lora-off1 +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} +algorithm: + algorithm_type: grpo + repeat_times: 8 + kl_loss_fn_args: + kl_coef: 0.0 + optimizer: + lr: 1.0e-05 +data_processor: {} +model: + model_path: meta-llama/Llama-3.2-3B + max_prompt_tokens: 1024 + max_response_tokens: 2048 + custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" + lora_configs: + - name: lora + lora_rank: 32 + lora_alpha: 32 +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + batch_size: 96 + total_epochs: 1 + explorer_input: + taskset: + name: taskset + storage_type: file + path: openai/gsm8k + split: train + subset_name: main + format: + prompt_key: question + response_key: answer + default_workflow_type: math_workflow + trainer_input: + experience_buffer: + name: experience_buffer + storage_type: queue +explorer: + runner_per_model: 16 + rollout_model: + engine_num: 4 + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + dtype: bfloat16 + seed: 42 +trainer: + trainer_type: verl + save_interval: 100 + grad_clip: 1.0 +monitor: + monitor_type: wandb +synchronizer: + sync_method: checkpoint + sync_style: fixed + sync_interval: 1 + sync_offset: 1 + sync_timeout: 1200 +``` + +
+ +### 结果说明 + +由于 Llama-3.2-3B 是基础(非指令微调)模型,其格式化指令跟随能力有限,且本实验仅训练了**一个 epoch**。因此,两种后端的最终 reward 都略高于 0.1。但训练曲线显示 reward 呈明显上升趋势,表明模型已成功学习。结果可视化如下: + +![GSM8K 训练奖励曲线](../../docs/sphinx_doc/assets/tinker-gsm8k.png) diff --git a/examples/tinker/README.md b/examples/tinker/README.md index 79de2c959f..2c55fdf294 100644 --- a/examples/tinker/README.md +++ b/examples/tinker/README.md @@ -56,7 +56,8 @@ trinity run --config tinker.yaml # Replace with your actual config file path - RLOO (`algorithm.algorithm_type=rloo`) - On-policy distillation (`algorithm.algorithm_type=on_policy_distill`) - Algorithms like `algorithm.algorithm_type=grpo` are supported. We will add support for these algorithms in the future. + Algorithms like `grpo`, `opmd`, `sft` are supported and we will add support for more algorithms in the future. + 3. **Multiple stages training** is not supported currently, we will add support for this in the future. > 💡 A complete example configuration file is available at [`tinker.yaml`](tinker.yaml). @@ -78,14 +79,10 @@ checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} algorithm: algorithm_type: grpo repeat_times: 8 - sample_strategy: default kl_loss_fn_args: kl_coef: 0.0 optimizer: lr: 1.0e-05 - lr_warmup_steps_ratio: 0.0 - warmup_style: constant -data_processor: {} model: model_path: meta-llama/Llama-3.2-3B max_prompt_tokens: 1024 @@ -110,29 +107,19 @@ buffer: format: prompt_key: question response_key: answer - rollout_args: - temperature: 1.0 - logprobs: 0 - eval_tasksets: [] default_workflow_type: math_workflow trainer_input: experience_buffer: name: experience_buffer storage_type: queue - replay_buffer: - enable: false explorer: runner_per_model: 16 rollout_model: engine_num: 4 seed: 42 - auxiliary_models: [] - eval_interval: 1000 trainer: save_interval: 100 - enable_preview: true grad_clip: 1.0 - max_token_len_per_gpu: 16384 monitor: monitor_type: wandb synchronizer: @@ -157,13 +144,10 @@ checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} algorithm: algorithm_type: grpo repeat_times: 8 - sample_strategy: default kl_loss_fn_args: kl_coef: 0.0 optimizer: lr: 1.0e-05 - lr_warmup_steps_ratio: 0.0 - warmup_style: constant data_processor: {} model: model_path: meta-llama/Llama-3.2-3B @@ -190,42 +174,23 @@ buffer: format: prompt_key: question response_key: answer - rollout_args: - temperature: 1.0 - logprobs: 0 - eval_tasksets: [] default_workflow_type: math_workflow trainer_input: experience_buffer: name: experience_buffer storage_type: queue - replay_buffer: - enable: false explorer: runner_per_model: 16 rollout_model: engine_num: 4 tensor_parallel_size: 1 - enforce_eager: false - enable_prefix_caching: false - enable_chunked_prefill: false gpu_memory_utilization: 0.9 dtype: bfloat16 seed: 42 - enable_thinking: false - enable_history: false - enable_openai_api: false - enable_auto_tool_choice: false - tool_call_parser: null - reasoning_parser: null - auxiliary_models: [] - eval_interval: 1000 trainer: trainer_type: verl save_interval: 100 - enable_preview: true grad_clip: 1.0 - max_token_len_per_gpu: 16384 monitor: monitor_type: wandb synchronizer: diff --git a/pyproject.toml b/pyproject.toml index e92f3ba98e..23a872d22b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -78,7 +78,7 @@ megatron = [ "megatron-core[mlm]==0.13.1", # if you found "undefined symbol" error in transformer engine # reinstall it with --no-build-isolation and `--no-cache-dir` flag - "transformer_engine[pytorch]==2.8.0", + # "transformer_engine[pytorch]==2.8.0", 
"mbridge>=0.13.0", ] tinker = [ diff --git a/scripts/docker/Dockerfile.megatron b/scripts/docker/Dockerfile.megatron index 41c4e88168..c5362258a2 100644 --- a/scripts/docker/Dockerfile.megatron +++ b/scripts/docker/Dockerfile.megatron @@ -31,6 +31,7 @@ RUN pip install --upgrade pip \ && pip install -e .[vllm,mm,dev] \ && pip install flash_attn==2.8.1 --no-build-isolation \ && pip install -e .[megatron] \ + && pip install transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir \ && NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 pip install -v \ --disable-pip-version-check --no-cache-dir --no-build-isolation \ --config-settings "--build-option=--cpp_ext" \ diff --git a/scripts/docker/Dockerfile.uv b/scripts/docker/Dockerfile.uv index 01492428a5..4d08ecd60d 100644 --- a/scripts/docker/Dockerfile.uv +++ b/scripts/docker/Dockerfile.uv @@ -39,9 +39,9 @@ RUN . /opt/venv/bin/activate && \ # Install flash_attn and Megatron RUN . /opt/venv/bin/activate && \ - uv pip install flash_attn==2.8.1 --no-build-isolation && \ uv pip install -e .[megatron] && \ - uv pip install --reinstall transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir && \ + uv pip install flash_attn==2.8.1 --no-build-isolation && \ + uv pip install transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir && \ NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 \ uv pip install -v --no-build-isolation \ --config-settings="--build-option=--cpp_ext" \ From 265b997436da117bac1307ca48ad6ce0fd755272 Mon Sep 17 00:00:00 2001 From: pxc Date: Tue, 30 Dec 2025 15:26:58 +0800 Subject: [PATCH 2/7] add news --- README.md | 1 + README_zh.md | 1 + 2 files changed, 2 insertions(+) diff --git a/README.md b/README.md index 797b71ec0f..a2fe016c47 100644 --- a/README.md +++ b/README.md @@ -42,6 +42,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob ## 🚀 News +* [2025-12] [[Release Notes]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 released: added Tinker backend for CPU-only training, add more benchmarks. * [2025-12] Trinity-RFT has supported [tinker](https://thinkingmachines.ai/tinker/) training backend, which enables model training on devices **without GPUs**. * [2025-12] Trinity-RFT powers the medical and health business of "Taobao Shangou", enabling the AI agent to understand vague symptoms, proactively ask follow-up questions, and provide precise recommendations ([News](https://tech.china.com.cn/sx/20251201/411376.shtml)). * [2025-11] [[Release Notes](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 released: bug fixes. 
diff --git a/README_zh.md b/README_zh.md
index 9a2ce35baa..5a5e2e82d5 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -41,6 +41,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能:

 ## 🚀 新闻

+* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增Tinker 后端以支持在无 GPU 的设备上训练,增加更多基准测试。
 * [2025-12] Trinity-RFT 已支持 [tinker](https://thinkingmachines.ai/tinker/) 训练后端,可在**无 GPU 的设备**上进行模型训练。
 * [2025-12] Trinity-RFT 助力淘宝闪购医药健康业务,让 AI 智能体能够理解模糊症状、主动询问后续问题,并提供精准推荐([新闻](https://tech.china.com.cn/sx/20251201/411376.shtml))。
 * [2025-11] [[发布说明](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 发布:修复若干 Bug。

From 3b20dbaf66cf656132617f11a658581b4ceaa452 Mon Sep 17 00:00:00 2001
From: pxc
Date: Tue, 30 Dec 2025 15:32:44 +0800
Subject: [PATCH 3/7] fix comments

---
 docs/sphinx_doc/source/tutorial/example_tinker_backend.md    | 2 +-
 docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sphinx_doc/source/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
index 7efdf060de..a1a3db5061 100644
--- a/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
+++ b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
@@ -64,7 +64,7 @@ trinity run --config tinker.yaml # Replace with your actual config file path
 3. **Multi-stage training** is currently not supported; we will add it in the future.

-> 💡 A complete example configuration file is available at [`tinker.yaml`](tinker.yaml).
+> 💡 A complete example configuration file is available at [`tinker.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/tinker/tinker.yaml).


 ## Results on the Llama-3.2-3B Model
diff --git a/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md
index 609837332f..a56d6eb671 100644
--- a/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md
+++ b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md
@@ -64,7 +64,7 @@ trinity run --config tinker.yaml # 请替换为你的实际配置文件路径
 3. **暂不支持多阶段训练**,后续会添加该功能。

-> 💡 完整的示例配置文件见 [`tinker.yaml`](tinker.yaml)。
+> 💡 完整的示例配置文件见 [`tinker.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/tinker/tinker.yaml)。


 ## Llama-3.2-3B 模型实验结果

From 914646c05f0e074f17a23ccb5bd46d5c44301a1f Mon Sep 17 00:00:00 2001
From: pxc
Date: Tue, 30 Dec 2025 15:43:27 +0800
Subject: [PATCH 4/7] update news

---
 README.md    | 3 +--
 README_zh.md | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index a2fe016c47..93103d3227 100644
--- a/README.md
+++ b/README.md
@@ -42,8 +42,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob

 ## 🚀 News

-* [2025-12] [[Release Notes]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 released: added Tinker backend for CPU-only training, add more benchmarks.
-* [2025-12] Trinity-RFT has supported [tinker](https://thinkingmachines.ai/tinker/) training backend, which enables model training on devices **without GPUs**.
+* [2025-12] [[Release Notes]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 released: added the [Tinker](https://thinkingmachines.ai/tinker/) backend for users **without GPUs**, added more benchmarks, enhanced online RL, and more.
* [2025-12] Trinity-RFT powers the medical and health business of "Taobao Shangou", enabling the AI agent to understand vague symptoms, proactively ask follow-up questions, and provide precise recommendations ([News](https://tech.china.com.cn/sx/20251201/411376.shtml)). * [2025-11] [[Release Notes](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 released: bug fixes. * [2025-11] Introducing [Learn-to-Ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask): a framework for training proactive dialogue agents from offline expert data ([paper](https://arxiv.org/pdf/2510.25441)). diff --git a/README_zh.md b/README_zh.md index 5a5e2e82d5..a88906c097 100644 --- a/README_zh.md +++ b/README_zh.md @@ -41,7 +41,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: ## 🚀 新闻 -* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增Tinker 后端以支持在无 GPU 的设备上训练,增加更多基准测试。 +* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增Tinker((https://thinkingmachines.ai/tinker/)) 后端以支持在 **无 GPU** 的设备上训练,增加更多基准测试,增强在线 RL 等功能。 * [2025-12] Trinity-RFT 已支持 [tinker](https://thinkingmachines.ai/tinker/) 训练后端,可在**无 GPU 的设备**上进行模型训练。 * [2025-12] Trinity-RFT 助力淘宝闪购医药健康业务,让 AI 智能体能够理解模糊症状、主动询问后续问题,并提供精准推荐([新闻](https://tech.china.com.cn/sx/20251201/411376.shtml))。 * [2025-11] [[发布说明](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 发布:修复若干 Bug。 From ff6446bdaeba0fe41e88b0138a2f4a22c7f594ff Mon Sep 17 00:00:00 2001 From: pxc Date: Tue, 30 Dec 2025 15:43:38 +0800 Subject: [PATCH 5/7] update news --- README_zh.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README_zh.md b/README_zh.md index a88906c097..39ab7fe61e 100644 --- a/README_zh.md +++ b/README_zh.md @@ -41,7 +41,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: ## 🚀 新闻 -* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增Tinker((https://thinkingmachines.ai/tinker/)) 后端以支持在 **无 GPU** 的设备上训练,增加更多基准测试,增强在线 RL 等功能。 +* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增[Tinker](https://thinkingmachines.ai/tinker/) 后端以支持在 **无 GPU** 的设备上训练,增加更多基准测试,增强在线 RL 等功能。 * [2025-12] Trinity-RFT 已支持 [tinker](https://thinkingmachines.ai/tinker/) 训练后端,可在**无 GPU 的设备**上进行模型训练。 * [2025-12] Trinity-RFT 助力淘宝闪购医药健康业务,让 AI 智能体能够理解模糊症状、主动询问后续问题,并提供精准推荐([新闻](https://tech.china.com.cn/sx/20251201/411376.shtml))。 * [2025-11] [[发布说明](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 发布:修复若干 Bug。 From b51161f080c51a4d9044f744b69ade959bca761c Mon Sep 17 00:00:00 2001 From: pxc Date: Tue, 30 Dec 2025 16:18:18 +0800 Subject: [PATCH 6/7] update readme --- README.md | 2 +- README_zh.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 93103d3227..9091074e95 100644 --- a/README.md +++ b/README.md @@ -70,7 +70,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob | Category | Tutorial / Guideline | | --- | ----| -| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
+ [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
+ [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html) | +| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
+ [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
+ [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html)
+ [RL without local GPU (Tinker Backend)](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_tinker_backend.html) | | *Multi-step agentic RL* | + [Concatenated multi-turn workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html)
+ [General multi-step workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html)
+ [ReAct workflow with an agent framework](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html)
+ [Example: train a web-search agent](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *Full-lifecycle data pipelines* | + [Rollout task mixing and selection](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_selector.html)
+ [Online task curriculum](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
+ [Research project: learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
+ [Experience replay with prioritization](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [Advanced data processing & human-in-the-loop](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html) | | *Algorithm development* | + [RL algorithm development with Trinity-RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) (📝 [paper](https://arxiv.org/pdf/2508.11408))
+ [Research project: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
+ Non-verifiable domains: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | diff --git a/README_zh.md b/README_zh.md index 39ab7fe61e..c873977628 100644 --- a/README_zh.md +++ b/README_zh.md @@ -71,7 +71,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: | 类别 | 教程 / 指南 | | --- | ----| -| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
+ [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
+ [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html) | +| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
+ [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
+ [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html)
+ [在无GPU环境下运行RL训练(Tinker 后端)](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) | | *多轮智能体强化学习* | + [拼接多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html)
+ [通用多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html)
+ [调用智能体框架中的 ReAct 工作流](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_react.html)
+ [例子:训练一个网络搜索智能体](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *全生命周期的数据流水线* | + [Rollout 任务混合与选取](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_selector.html)
+ [在线任务选择](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
+ [研究项目:learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
+ [经验回放机制](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [高级数据处理能力 & Human-in-the-loop](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html) | | *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) (📝 [论文](https://arxiv.org/pdf/2508.11408))
+ [研究项目: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
+ 不可验证的领域: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | From b1a96cd4b519d836bac49f913ec5023d64b21977 Mon Sep 17 00:00:00 2001 From: pxc Date: Tue, 30 Dec 2025 16:23:12 +0800 Subject: [PATCH 7/7] update readme --- README.md | 2 +- README_zh.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 9091074e95..d0e319603a 100644 --- a/README.md +++ b/README.md @@ -70,7 +70,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob | Category | Tutorial / Guideline | | --- | ----| -| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
+ [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
+ [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html)
+ [RL without local GPU (Tinker Backend)](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_tinker_backend.html) | +| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
+ [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
+ [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html)
+ [RFT without local GPU (Tinker Backend)](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_tinker_backend.html) | | *Multi-step agentic RL* | + [Concatenated multi-turn workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html)
+ [General multi-step workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html)
+ [ReAct workflow with an agent framework](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html)
+ [Example: train a web-search agent](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *Full-lifecycle data pipelines* | + [Rollout task mixing and selection](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_selector.html)
+ [Online task curriculum](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
+ [Research project: learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
+ [Experience replay with prioritization](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [Advanced data processing & human-in-the-loop](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html) | | *Algorithm development* | + [RL algorithm development with Trinity-RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) (📝 [paper](https://arxiv.org/pdf/2508.11408))
+ [Research project: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
+ Non-verifiable domains: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | diff --git a/README_zh.md b/README_zh.md index c873977628..9197031c22 100644 --- a/README_zh.md +++ b/README_zh.md @@ -71,7 +71,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: | 类别 | 教程 / 指南 | | --- | ----| -| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
+ [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
+ [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html)
+ [在无GPU环境下运行RL训练(Tinker 后端)](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) | +| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
+ [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
+ [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html)
+ [在无GPU环境下运行RFT训练(Tinker 后端)](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) | | *多轮智能体强化学习* | + [拼接多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html)
+ [通用多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html)
+ [调用智能体框架中的 ReAct 工作流](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_react.html)
+ [例子:训练一个网络搜索智能体](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *全生命周期的数据流水线* | + [Rollout 任务混合与选取](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_selector.html)
+ [在线任务选择](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
+ [研究项目:learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
+ [经验回放机制](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [高级数据处理能力 & Human-in-the-loop](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html) | | *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) (📝 [论文](https://arxiv.org/pdf/2508.11408))
+ [研究项目: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
+ 不可验证的领域: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) |