diff --git a/README.md b/README.md
index 797b71ec0f..d0e319603a 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob
## 🚀 News
-* [2025-12] Trinity-RFT has supported [tinker](https://thinkingmachines.ai/tinker/) training backend, which enables model training on devices **without GPUs**.
+* [2025-12] [[Release Notes](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0)] Trinity-RFT v0.4.0 released: added the [Tinker](https://thinkingmachines.ai/tinker/) backend for users **without GPUs**, added more benchmarks, enhanced online RL, and more.
* [2025-12] Trinity-RFT powers the medical and health business of "Taobao Shangou", enabling the AI agent to understand vague symptoms, proactively ask follow-up questions, and provide precise recommendations ([News](https://tech.china.com.cn/sx/20251201/411376.shtml)).
* [2025-11] [[Release Notes](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 released: bug fixes.
* [2025-11] Introducing [Learn-to-Ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask): a framework for training proactive dialogue agents from offline expert data ([paper](https://arxiv.org/pdf/2510.25441)).
@@ -70,7 +70,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob
| Category | Tutorial / Guideline |
| --- | ----|
-| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html) <br> + [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html) <br> + [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html) <br> + [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html) |
+| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html) <br> + [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html) <br> + [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html) <br> + [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html) <br> + [RFT without local GPU (Tinker Backend)](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_tinker_backend.html) |
| *Multi-step agentic RL* | + [Concatenated multi-turn workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) <br> + [General multi-step workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html) <br> + [ReAct workflow with an agent framework](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html) <br> + [Example: train a web-search agent](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) |
| *Full-lifecycle data pipelines* | + [Rollout task mixing and selection](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_selector.html) <br> + [Online task curriculum](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374)) <br> + [Research project: learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441)) <br> + [Experience replay with prioritization](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay) <br> + [Advanced data processing & human-in-the-loop](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html) |
| *Algorithm development* | + [RL algorithm development with Trinity-RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) (📝 [paper](https://arxiv.org/pdf/2508.11408)) <br> + [Research project: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203)) <br> + Non-verifiable domains: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) |
diff --git a/README_zh.md b/README_zh.md
index 9a2ce35baa..9197031c22 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -41,7 +41,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能:
## 🚀 新闻
+* [2025-12] [[发布说明](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0)] Trinity-RFT v0.4.0 发布:新增 [Tinker](https://thinkingmachines.ai/tinker/) 后端以支持在**无 GPU**的设备上训练,增加更多基准测试,增强在线 RL 等。
-* [2025-12] Trinity-RFT 已支持 [tinker](https://thinkingmachines.ai/tinker/) 训练后端,可在**无 GPU 的设备**上进行模型训练。
* [2025-12] Trinity-RFT 助力淘宝闪购医药健康业务,让 AI 智能体能够理解模糊症状、主动询问后续问题,并提供精准推荐([新闻](https://tech.china.com.cn/sx/20251201/411376.shtml))。
* [2025-11] [[发布说明](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 发布:修复若干 Bug。
@@ -70,7 +70,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能:
| 类别 | 教程 / 指南 |
| --- | ----|
-| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html) <br> + [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html) <br> + [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html) <br> + [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html) |
+| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html) <br> + [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html) <br> + [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html) <br> + [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html) <br> + [在无 GPU 环境下运行 RFT 训练(Tinker 后端)](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) |
| *多轮智能体强化学习* | + [拼接多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html) <br> + [通用多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html) <br> + [调用智能体框架中的 ReAct 工作流](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_react.html) <br> + [例子:训练一个网络搜索智能体](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) |
| *全生命周期的数据流水线* | + [Rollout 任务混合与选取](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_selector.html) <br> + [在线任务选择](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374)) <br> + [研究项目:learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441)) <br> + [经验回放机制](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay) <br> + [高级数据处理能力 & Human-in-the-loop](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html) |
| *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) (📝 [论文](https://arxiv.org/pdf/2508.11408)) <br> + [研究项目: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203)) <br> + 不可验证的领域: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) |
diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst
index 265b7b59aa..e232fa8155 100644
--- a/docs/sphinx_doc/source/index.rst
+++ b/docs/sphinx_doc/source/index.rst
@@ -42,6 +42,7 @@ Welcome to Trinity-RFT's documentation!
tutorial/example_react.md
tutorial/example_search_email.md
tutorial/example_dpo.md
+ tutorial/example_tinker_backend.md
tutorial/example_megatron.md
tutorial/example_data_functionalities.md
tutorial/example_dataset_perspective.md
diff --git a/docs/sphinx_doc/source/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
new file mode 100644
index 0000000000..a1a3db5061
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
@@ -0,0 +1,214 @@
+# Tinker Backend
+
+```{note}
+This example demonstrates how to use Trinity-RFT with the [Tinker](https://thinkingmachines.ai/tinker/) backend, which enables model training on devices **without GPUs**.
+```
+
+## Setup Instructions
+
+### 1. API Key Configuration
+
+Before starting Ray, you must set the `TRINITY_API_KEY` environment variable to your Tinker API key so that Trinity-RFT can access the Tinker API:
+
+```bash
+export TRINITY_API_KEY=your_tinker_api_key
+ray start --head
+```
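+
+Note that Ray workers inherit environment variables only when Ray starts, so a key exported after `ray start` will not be visible to a cluster that is already running. A minimal sketch (assuming a single-node setup) for restarting Ray with the key in place:
+
+```bash
+# Restart Ray so its workers inherit TRINITY_API_KEY
+ray stop
+export TRINITY_API_KEY=your_tinker_api_key
+ray start --head
+```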
+
+### 2. Configuration File
+
+Configure the Tinker backend in your YAML configuration file by setting the `model.tinker` parameters as shown below:
+
+```yaml
+model:
+  tinker:
+    enable: true
+    base_model: null
+    rank: 32
+    seed: null
+    train_mlp: true
+    train_attn: true
+    train_unembed: true
+```
+
+#### Explanation of Configuration Parameters
+
+- **`tinker`**: Tinker-specific configuration section. **Important**: When Tinker is enabled, any LoRA configuration settings (`model.lora_configs`) will be ignored.
+ - **`enable`**: Whether to activate the Tinker backend. Default: `false`
+ - **`base_model`**: Path to the base model for Tinker. If not specified (`null`), it defaults to the `model_path` defined elsewhere in your config
+ - **`rank`**: The LoRA rank that controls the size of the adaptation matrices. Default: `32`
+ - **`seed`**: Random seed for reproducible Tinker operations. If not specified (`null`), no specific seed is set
+ - **`train_mlp`**: Whether to train the MLP (feed-forward) layers. Default: `true`
+ - **`train_attn`**: Whether to train the attention layers. Default: `true`
+ - **`train_unembed`**: Whether to train the unembedding (output) layer. Default: `true`
+
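+Putting these defaults together, a minimal `model` section that enables Tinker can be quite short. The sketch below assumes `meta-llama/Llama-3.2-3B` as the model; since `base_model` is left unset, it falls back to `model_path`:
+
+```yaml
+model:
+  model_path: meta-llama/Llama-3.2-3B  # also used as the Tinker base model (base_model defaults to it)
+  tinker:
+    enable: true  # rank, seed, and the train_* flags keep their defaults
+```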
+
+## Usage
+
+Once configured, Trinity-RFT works with the Tinker backend just like it does with the standard veRL backend. Start training with:
+
+```bash
+trinity run --config tinker.yaml # Replace with your actual config file path
+```
+
+### Important Limitations of the Tinker Backend
+
+1. **Entropy loss** values are not consistent with those from the veRL backend.
+2. **Algorithms requiring `compute_advantage_in_trainer=true` are currently NOT supported**, including:
+   - PPO (`algorithm.algorithm_type=ppo`)
+   - Reinforce++ (`algorithm.algorithm_type=reinforceplusplus`)
+   - RLOO (`algorithm.algorithm_type=rloo`)
+   - On-policy distillation (`algorithm.algorithm_type=on_policy_distill`)
+
+   Algorithms such as `grpo`, `opmd`, and `sft` are supported, and more will be added in the future.
+
+3. **Multi-stage training** is not supported yet; we will add support for it in the future.
+
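+Under these constraints, a supported `algorithm` section looks like the one used later in this tutorial; `grpo` does not require `compute_advantage_in_trainer=true`, so it is unaffected by the second limitation:
+
+```yaml
+algorithm:
+  algorithm_type: grpo  # supported; ppo / reinforceplusplus / rloo / on_policy_distill are not
+  repeat_times: 8       # rollouts per task, used for group-relative advantage estimation
+```
+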
+> 💡 A complete example configuration file is available at [`tinker.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/tinker/tinker.yaml).
+
+
+## Results on the Llama-3.2-3B Model
+
+We trained the **Llama-3.2-3B** model on the **GSM8K** dataset using both the **Tinker** and **veRL** backends. Below are the full configuration files used in our experiments.
+
+
+<details>
+<summary>Click to expand: Tinker Backend Configuration</summary>
+
+```yaml
+mode: both
+project: Trinity-RFT-gsm8k
+group: alignment-tinker
+name: tinker-llama3.2-3B-off1
+checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
+algorithm:
+  algorithm_type: grpo
+  repeat_times: 8
+  kl_loss_fn_args:
+    kl_coef: 0.0
+  optimizer:
+    lr: 1.0e-05
+model:
+  model_path: meta-llama/Llama-3.2-3B
+  max_prompt_tokens: 1024
+  max_response_tokens: 2048
+  custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n"
+  tinker:
+    enable: true
+    base_model: meta-llama/Llama-3.2-3B
+cluster:
+  node_num: 1
+  gpu_per_node: 8
+buffer:
+  batch_size: 96
+  total_epochs: 1
+  explorer_input:
+    taskset:
+      name: taskset
+      storage_type: file
+      path: openai/gsm8k
+      split: train
+      subset_name: main
+      format:
+        prompt_key: question
+        response_key: answer
+      default_workflow_type: math_workflow
+  trainer_input:
+    experience_buffer:
+      name: experience_buffer
+      storage_type: queue
+explorer:
+  runner_per_model: 16
+  rollout_model:
+    engine_num: 4
+    seed: 42
+trainer:
+  save_interval: 100
+  grad_clip: 1.0
+monitor:
+  monitor_type: wandb
+synchronizer:
+  sync_method: checkpoint
+  sync_style: fixed
+  sync_interval: 1
+  sync_offset: 1
+  sync_timeout: 1200
+```
+
+</details>
+
+
+<details>
+<summary>Click to expand: veRL Backend Configuration (LoRA)</summary>
+
+```yaml
+mode: both
+project: Trinity-RFT-gsm8k
+group: alignment-tinker
+name: verl-llama3.2-3B-lora-off1
+checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
+algorithm:
+  algorithm_type: grpo
+  repeat_times: 8
+  kl_loss_fn_args:
+    kl_coef: 0.0
+  optimizer:
+    lr: 1.0e-05
+data_processor: {}
+model:
+  model_path: meta-llama/Llama-3.2-3B
+  max_prompt_tokens: 1024
+  max_response_tokens: 2048
+  custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n"
+  lora_configs:
+    - name: lora
+      lora_rank: 32
+      lora_alpha: 32
+cluster:
+  node_num: 1
+  gpu_per_node: 8
+buffer:
+  batch_size: 96
+  total_epochs: 1
+  explorer_input:
+    taskset:
+      name: taskset
+      storage_type: file
+      path: openai/gsm8k
+      split: train
+      subset_name: main
+      format:
+        prompt_key: question
+        response_key: answer
+      default_workflow_type: math_workflow
+  trainer_input:
+    experience_buffer:
+      name: experience_buffer
+      storage_type: queue
+explorer:
+  runner_per_model: 16
+  rollout_model:
+    engine_num: 4
+    tensor_parallel_size: 1
+    gpu_memory_utilization: 0.9
+    dtype: bfloat16
+    seed: 42
+trainer:
+  trainer_type: verl
+  save_interval: 100
+  grad_clip: 1.0
+monitor:
+  monitor_type: wandb
+synchronizer:
+  sync_method: checkpoint
+  sync_style: fixed
+  sync_interval: 1
+  sync_offset: 1
+  sync_timeout: 1200
+```
+
+</details>
+
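+The two runs share everything except the fine-tuning backend: the Tinker config carries a `model.tinker` block, while the veRL config declares an explicit LoRA adapter plus local vLLM engine settings. The sections that differ, abbreviated as a sketch:
+
+```yaml
+# Tinker backend: LoRA is managed by the Tinker service
+model:
+  tinker:
+    enable: true
+    base_model: meta-llama/Llama-3.2-3B
+---
+# veRL backend: explicit LoRA adapter and local engine settings
+model:
+  lora_configs:
+    - name: lora
+      lora_rank: 32
+      lora_alpha: 32
+trainer:
+  trainer_type: verl
+```
+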
+### Observations
+
+Since Llama-3.2-3B is a base (non-instruct-tuned) model, it has limited ability to follow formatting instructions. Additionally, we trained for only **one epoch**. As a result, both backends achieved final rewards just slightly above 0.1. Nonetheless, the training curves show a clear upward trend in reward, indicating successful learning. The results are visualized below:
+
+
diff --git a/docs/sphinx_doc/source_zh/index.rst b/docs/sphinx_doc/source_zh/index.rst
index 798ded6181..09105b12bb 100644
--- a/docs/sphinx_doc/source_zh/index.rst
+++ b/docs/sphinx_doc/source_zh/index.rst
@@ -40,6 +40,7 @@
tutorial/example_react.md
tutorial/example_search_email.md
tutorial/example_dpo.md
+ tutorial/example_tinker_backend.md
tutorial/example_megatron.md
tutorial/example_data_functionalities.md
tutorial/example_dataset_perspective.md
diff --git a/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md
new file mode 100644
index 0000000000..a56d6eb671
--- /dev/null
+++ b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md
@@ -0,0 +1,213 @@
+# Tinker 后端
+
+```{note}
+本示例演示了如何在 Trinity-RFT 中使用 [Tinker](https://thinkingmachines.ai/tinker/) 后端,从而在**无 GPU**的设备上进行模型训练。
+```
+
+## 安装与配置
+
+### 1. API Key 配置
+
+在启动 Ray 之前,必须将 `TRINITY_API_KEY` 环境变量设置为你的 Tinker API 密钥,以便正确访问 Tinker 的 API:
+
+```bash
+export TRINITY_API_KEY=your_tinker_api_key
+ray start --head
+```
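+
+注意,Ray 的 worker 进程只在 Ray 启动时继承环境变量,在 `ray start` 之后导出的密钥对已运行的集群不可见。以下为单机环境下重启 Ray 使密钥生效的简单示例(仅供参考):
+
+```bash
+# 重启 Ray,使其 worker 继承 TRINITY_API_KEY
+ray stop
+export TRINITY_API_KEY=your_tinker_api_key
+ray start --head
+```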
+
+### 2. 配置文件
+
+在 YAML 配置文件中通过如下方式设置 `model.tinker` 参数以启用 Tinker 后端:
+
+```yaml
+model:
+  tinker:
+    enable: true
+    base_model: null
+    rank: 32
+    seed: null
+    train_mlp: true
+    train_attn: true
+    train_unembed: true
+```
+
+#### 配置参数说明
+
+- **`tinker`**:Tinker 专用配置部分。**注意**:启用 Tinker 后,所有 LoRA 配置(`model.lora_configs`)将被忽略。
+ - **`enable`**:是否启用 Tinker 后端。默认值:`false`
+ - **`base_model`**:Tinker 的基础模型路径。如果未指定(`null`),则默认为配置中其他位置的 `model_path`
+ - **`rank`**:LoRA 的秩,控制适应矩阵的大小。默认值:`32`
+ - **`seed`**:Tinker 操作的随机种子。未指定(`null`)时不设定特定种子
+ - **`train_mlp`**:是否训练 MLP(前馈)层。默认值:`true`
+ - **`train_attn`**:是否训练注意力层。默认值:`true`
+ - **`train_unembed`**:是否训练输出(unembedding)层。默认值:`true`
+
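+结合上述默认值,启用 Tinker 的最小 `model` 配置可以非常简短。以下示例仅供参考,假设模型为 `meta-llama/Llama-3.2-3B`;`base_model` 未设置时回退到 `model_path`:
+
+```yaml
+model:
+  model_path: meta-llama/Llama-3.2-3B  # 同时作为 Tinker 的基础模型(base_model 默认取该值)
+  tinker:
+    enable: true  # rank、seed 及 train_* 等参数均使用默认值
+```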
+
+## 使用方法
+
+配置完成后,Trinity-RFT 使用 Tinker 后端的方式与标准 veRL 后端一致。启动训练命令如下:
+
+```bash
+trinity run --config tinker.yaml # 请替换为你的实际配置文件路径
+```
+
+### Tinker 后端的功能限制
+
+1. **熵损失(entropy loss)** 与 veRL 后端不完全一致。
+2. **目前不支持需要 `compute_advantage_in_trainer=true` 的算法**,包括:
+ - PPO(`algorithm.algorithm_type=ppo`)
+ - Reinforce++(`algorithm.algorithm_type=reinforceplusplus`)
+ - RLOO(`algorithm.algorithm_type=rloo`)
+ - On-policy distillation(`algorithm.algorithm_type=on_policy_distill`)
+
+   目前支持 `grpo`、`opmd`、`sft` 等算法,未来会支持更多算法。
+
+3. **暂不支持多阶段训练**,后续会添加该功能。
+
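+在上述限制下,可用的 `algorithm` 配置与本教程后文示例一致;`grpo` 不需要 `compute_advantage_in_trainer=true`,因此不受第二条限制影响(示意,仅供参考):
+
+```yaml
+algorithm:
+  algorithm_type: grpo  # 受支持;ppo / reinforceplusplus / rloo / on_policy_distill 暂不支持
+  repeat_times: 8       # 每个任务的 rollout 次数,用于组内相对优势估计
+```
+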
+> 💡 完整的示例配置文件见 [`tinker.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/tinker/tinker.yaml)。
+
+
+## Llama-3.2-3B 模型实验结果
+
+我们在 **GSM8K** 数据集上,分别使用 **Tinker** 和 **veRL** 后端对 **Llama-3.2-3B** 模型进行了训练。以下为实验中使用的完整配置文件。
+
+<details>
+<summary>点击展开:Tinker 后端配置</summary>
+
+```yaml
+mode: both
+project: Trinity-RFT-gsm8k
+group: alignment-tinker
+name: tinker-llama3.2-3B-off1
+checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
+algorithm:
+  algorithm_type: grpo
+  repeat_times: 8
+  kl_loss_fn_args:
+    kl_coef: 0.0
+  optimizer:
+    lr: 1.0e-05
+model:
+  model_path: meta-llama/Llama-3.2-3B
+  max_prompt_tokens: 1024
+  max_response_tokens: 2048
+  custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n"
+  tinker:
+    enable: true
+    base_model: meta-llama/Llama-3.2-3B
+cluster:
+  node_num: 1
+  gpu_per_node: 8
+buffer:
+  batch_size: 96
+  total_epochs: 1
+  explorer_input:
+    taskset:
+      name: taskset
+      storage_type: file
+      path: openai/gsm8k
+      split: train
+      subset_name: main
+      format:
+        prompt_key: question
+        response_key: answer
+      default_workflow_type: math_workflow
+  trainer_input:
+    experience_buffer:
+      name: experience_buffer
+      storage_type: queue
+explorer:
+  runner_per_model: 16
+  rollout_model:
+    engine_num: 4
+    seed: 42
+trainer:
+  save_interval: 100
+  grad_clip: 1.0
+monitor:
+  monitor_type: wandb
+synchronizer:
+  sync_method: checkpoint
+  sync_style: fixed
+  sync_interval: 1
+  sync_offset: 1
+  sync_timeout: 1200
+```
+
+</details>
+
+
+<details>
+<summary>点击展开:veRL 后端配置(LoRA)</summary>
+
+```yaml
+mode: both
+project: Trinity-RFT-gsm8k
+group: alignment-tinker
+name: verl-llama3.2-3B-lora-off1
+checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
+algorithm:
+  algorithm_type: grpo
+  repeat_times: 8
+  kl_loss_fn_args:
+    kl_coef: 0.0
+  optimizer:
+    lr: 1.0e-05
+data_processor: {}
+model:
+  model_path: meta-llama/Llama-3.2-3B
+  max_prompt_tokens: 1024
+  max_response_tokens: 2048
+  custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n"
+  lora_configs:
+    - name: lora
+      lora_rank: 32
+      lora_alpha: 32
+cluster:
+  node_num: 1
+  gpu_per_node: 8
+buffer:
+  batch_size: 96
+  total_epochs: 1
+  explorer_input:
+    taskset:
+      name: taskset
+      storage_type: file
+      path: openai/gsm8k
+      split: train
+      subset_name: main
+      format:
+        prompt_key: question
+        response_key: answer
+      default_workflow_type: math_workflow
+  trainer_input:
+    experience_buffer:
+      name: experience_buffer
+      storage_type: queue
+explorer:
+  runner_per_model: 16
+  rollout_model:
+    engine_num: 4
+    tensor_parallel_size: 1
+    gpu_memory_utilization: 0.9
+    dtype: bfloat16
+    seed: 42
+trainer:
+  trainer_type: verl
+  save_interval: 100
+  grad_clip: 1.0
+monitor:
+  monitor_type: wandb
+synchronizer:
+  sync_method: checkpoint
+  sync_style: fixed
+  sync_interval: 1
+  sync_offset: 1
+  sync_timeout: 1200
+```
+
+</details>
+
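+两份配置仅在微调后端上不同:Tinker 配置包含 `model.tinker` 部分,而 veRL 配置显式声明 LoRA 适配器及本地 vLLM 引擎参数。差异部分摘要如下(示意):
+
+```yaml
+# Tinker 后端:LoRA 由 Tinker 服务托管
+model:
+  tinker:
+    enable: true
+    base_model: meta-llama/Llama-3.2-3B
+---
+# veRL 后端:显式 LoRA 适配器及本地引擎设置
+model:
+  lora_configs:
+    - name: lora
+      lora_rank: 32
+      lora_alpha: 32
+trainer:
+  trainer_type: verl
+```
+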
+### 结果说明
+
+由于 Llama-3.2-3B 是基础(非指令微调)模型,其格式化指令跟随能力有限,且本实验仅训练了**一个 epoch**。因此,两种后端的最终 reward 都略高于 0.1。但训练曲线显示 reward 呈明显上升趋势,表明模型已成功学习。结果可视化如下:
+
+
diff --git a/examples/tinker/README.md b/examples/tinker/README.md
index 79de2c959f..2c55fdf294 100644
--- a/examples/tinker/README.md
+++ b/examples/tinker/README.md
@@ -56,7 +56,8 @@ trinity run --config tinker.yaml # Replace with your actual config file path
- RLOO (`algorithm.algorithm_type=rloo`)
- On-policy distillation (`algorithm.algorithm_type=on_policy_distill`)
- Algorithms like `algorithm.algorithm_type=grpo` are supported. We will add support for these algorithms in the future.
+   Algorithms such as `grpo`, `opmd`, and `sft` are supported, and more will be added in the future.
+
3. **Multiple stages training** is not supported currently, we will add support for this in the future.
> 💡 A complete example configuration file is available at [`tinker.yaml`](tinker.yaml).
@@ -78,14 +79,10 @@ checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
algorithm:
  algorithm_type: grpo
  repeat_times: 8
-  sample_strategy: default
  kl_loss_fn_args:
    kl_coef: 0.0
  optimizer:
    lr: 1.0e-05
-    lr_warmup_steps_ratio: 0.0
-    warmup_style: constant
-data_processor: {}
model:
  model_path: meta-llama/Llama-3.2-3B
  max_prompt_tokens: 1024
@@ -110,29 +107,19 @@ buffer:
      format:
        prompt_key: question
        response_key: answer
-      rollout_args:
-        temperature: 1.0
-        logprobs: 0
-    eval_tasksets: []
      default_workflow_type: math_workflow
  trainer_input:
    experience_buffer:
      name: experience_buffer
      storage_type: queue
-    replay_buffer:
-      enable: false
explorer:
  runner_per_model: 16
  rollout_model:
    engine_num: 4
    seed: 42
-  auxiliary_models: []
-  eval_interval: 1000
trainer:
  save_interval: 100
-  enable_preview: true
  grad_clip: 1.0
-  max_token_len_per_gpu: 16384
monitor:
  monitor_type: wandb
synchronizer:
@@ -157,13 +144,10 @@ checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
algorithm:
  algorithm_type: grpo
  repeat_times: 8
-  sample_strategy: default
  kl_loss_fn_args:
    kl_coef: 0.0
  optimizer:
    lr: 1.0e-05
-    lr_warmup_steps_ratio: 0.0
-    warmup_style: constant
data_processor: {}
model:
  model_path: meta-llama/Llama-3.2-3B
@@ -190,42 +174,23 @@ buffer:
      format:
        prompt_key: question
        response_key: answer
-      rollout_args:
-        temperature: 1.0
-        logprobs: 0
-    eval_tasksets: []
      default_workflow_type: math_workflow
  trainer_input:
    experience_buffer:
      name: experience_buffer
      storage_type: queue
-    replay_buffer:
-      enable: false
explorer:
  runner_per_model: 16
  rollout_model:
    engine_num: 4
    tensor_parallel_size: 1
-    enforce_eager: false
-    enable_prefix_caching: false
-    enable_chunked_prefill: false
    gpu_memory_utilization: 0.9
    dtype: bfloat16
    seed: 42
-    enable_thinking: false
-    enable_history: false
-    enable_openai_api: false
-    enable_auto_tool_choice: false
-    tool_call_parser: null
-    reasoning_parser: null
-  auxiliary_models: []
-  eval_interval: 1000
trainer:
  trainer_type: verl
  save_interval: 100
-  enable_preview: true
  grad_clip: 1.0
-  max_token_len_per_gpu: 16384
monitor:
  monitor_type: wandb
synchronizer:
diff --git a/pyproject.toml b/pyproject.toml
index e92f3ba98e..23a872d22b 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -78,7 +78,7 @@ megatron = [
"megatron-core[mlm]==0.13.1",
# if you found "undefined symbol" error in transformer engine
# reinstall it with --no-build-isolation and `--no-cache-dir` flag
- "transformer_engine[pytorch]==2.8.0",
+ # "transformer_engine[pytorch]==2.8.0",
"mbridge>=0.13.0",
]
tinker = [
diff --git a/scripts/docker/Dockerfile.megatron b/scripts/docker/Dockerfile.megatron
index 41c4e88168..c5362258a2 100644
--- a/scripts/docker/Dockerfile.megatron
+++ b/scripts/docker/Dockerfile.megatron
@@ -31,6 +31,7 @@ RUN pip install --upgrade pip \
&& pip install -e .[vllm,mm,dev] \
&& pip install flash_attn==2.8.1 --no-build-isolation \
&& pip install -e .[megatron] \
+ && pip install transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir \
&& NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 pip install -v \
--disable-pip-version-check --no-cache-dir --no-build-isolation \
--config-settings "--build-option=--cpp_ext" \
diff --git a/scripts/docker/Dockerfile.uv b/scripts/docker/Dockerfile.uv
index 01492428a5..4d08ecd60d 100644
--- a/scripts/docker/Dockerfile.uv
+++ b/scripts/docker/Dockerfile.uv
@@ -39,9 +39,9 @@ RUN . /opt/venv/bin/activate && \
# Install flash_attn and Megatron
RUN . /opt/venv/bin/activate && \
- uv pip install flash_attn==2.8.1 --no-build-isolation && \
uv pip install -e .[megatron] && \
- uv pip install --reinstall transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir && \
+ uv pip install flash_attn==2.8.1 --no-build-isolation && \
+ uv pip install transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir && \
NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 \
uv pip install -v --no-build-isolation \
--config-settings="--build-option=--cpp_ext" \