From 2eae4078ca7771403c39d1d15e3c5d1c715b596a Mon Sep 17 00:00:00 2001
From: pxc
Date: Tue, 30 Dec 2025 15:23:41 +0800
Subject: [PATCH 1/7] add tinker backend doc

---
 docs/sphinx_doc/source/index.rst              |   1 +
 .../source/tutorial/example_tinker_backend.md | 214 ++++++++++++++++++
 docs/sphinx_doc/source_zh/index.rst           |   1 +
 .../tutorial/example_tinker_backend.md        | 213 +++++++++++++++++
 examples/tinker/README.md                     |  39 +---
 pyproject.toml                                |   2 +-
 scripts/docker/Dockerfile.megatron            |   1 +
 scripts/docker/Dockerfile.uv                  |   4 +-
 8 files changed, 435 insertions(+), 40 deletions(-)
 create mode 100644 docs/sphinx_doc/source/tutorial/example_tinker_backend.md
 create mode 100644 docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md

diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst
index 265b7b59aa..e232fa8155 100644
--- a/docs/sphinx_doc/source/index.rst
+++ b/docs/sphinx_doc/source/index.rst
@@ -42,6 +42,7 @@ Welcome to Trinity-RFT's documentation!
    tutorial/example_react.md
    tutorial/example_search_email.md
    tutorial/example_dpo.md
+   tutorial/example_tinker_backend.md
    tutorial/example_megatron.md
    tutorial/example_data_functionalities.md
    tutorial/example_dataset_perspective.md
diff --git a/docs/sphinx_doc/source/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
new file mode 100644
index 0000000000..7efdf060de
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
@@ -0,0 +1,214 @@
# Tinker Backend

```{note}
This example demonstrates how to use Trinity-RFT with the [Tinker](https://thinkingmachines.ai/tinker/) backend, which enables model training on devices **without GPUs**.
```

## Setup Instructions

### 1. API Key Configuration

Before starting Ray, you must set the `TRINITY_API_KEY` environment variable to your Tinker API key so that Trinity-RFT can access Tinker's API:

```bash
export TRINITY_API_KEY=your_tinker_api_key
ray start --head
```

### 2. Configuration File

Configure the Tinker backend in your YAML configuration file by setting the `model.tinker` parameters as shown below:

```yaml
model:
  tinker:
    enable: true
    base_model: null
    rank: 32
    seed: null
    train_mlp: true
    train_attn: true
    train_unembed: true
```

#### Explanation of Configuration Parameters

- **`tinker`**: Tinker-specific configuration section. **Important**: When Tinker is enabled, any LoRA configuration settings (`model.lora_configs`) will be ignored.
  - **`enable`**: Whether to activate the Tinker backend. Default: `false`
  - **`base_model`**: Path to the base model for Tinker. If not specified (`null`), it defaults to the `model_path` defined elsewhere in your config
  - **`rank`**: The LoRA rank, which controls the size of the adaptation matrices. Default: `32`
  - **`seed`**: Random seed for reproducible Tinker operations. If not specified (`null`), no specific seed is set
  - **`train_mlp`**: Whether to train the MLP (feed-forward) layers. Default: `true`
  - **`train_attn`**: Whether to train the attention layers. Default: `true`
  - **`train_unembed`**: Whether to train the unembedding (output) layer. Default: `true`


## Usage

Once configured, Trinity-RFT works with the Tinker backend just like it does with the standard veRL backend. Start training with:

```bash
trinity run --config tinker.yaml # Replace with your actual config file path
```

### Important Limitations of the Tinker Backend

1. The **entropy loss** is computed differently from the veRL backend, so its values are not directly comparable.
2. **Algorithms that require `compute_advantage_in_trainer=true` are currently NOT supported**, including:
   - PPO (`algorithm.algorithm_type=ppo`)
   - Reinforce++ (`algorithm.algorithm_type=reinforceplusplus`)
   - RLOO (`algorithm.algorithm_type=rloo`)
   - On-policy distillation (`algorithm.algorithm_type=on_policy_distill`)

   Algorithms such as `grpo`, `opmd`, and `sft` are supported; more will be added in the future.

3. **Multi-stage training** is currently not supported; we will add it in the future.

> 💡 A complete example configuration file is available at [`tinker.yaml`](tinker.yaml).


## Results on the Llama-3.2-3B Model

We trained the **Llama-3.2-3B** model on the **GSM8K** dataset using both the **Tinker** and **veRL** backends. Below are the full configuration files used in our experiments.

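Both runs share the same algorithm, data, and synchronization settings; only the backend-specific sections differ. For orientation, the settings common to both experiments are condensed below. This is a sketch extracted from the two full files that follow, not a runnable config on its own, and the comments reflect our reading of the fields:

```yaml
# Settings shared by the Tinker and veRL runs (condensed from the full configs below)
algorithm:
  algorithm_type: grpo     # GRPO is within the Tinker backend's supported set
  repeat_times: 8          # 8 rollouts per task for group-relative advantages
  optimizer:
    lr: 1.0e-05
buffer:
  batch_size: 96
  total_epochs: 1          # a single pass over the GSM8K training split
  explorer_input:
    taskset:
      path: openai/gsm8k
      split: train
synchronizer:
  sync_method: checkpoint  # explorer reloads weights from saved checkpoints
  sync_interval: 1
  sync_offset: 1           # the one-step offset echoed in the "off1" run names
```
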
Click to expand: Tinker Backend Configuration + +```yaml +mode: both +project: Trinity-RFT-gsm8k +group: alignment-tinker +name: tinker-llama3.2-3B-off1 +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} +algorithm: + algorithm_type: grpo + repeat_times: 8 + kl_loss_fn_args: + kl_coef: 0.0 + optimizer: + lr: 1.0e-05 +model: + model_path: meta-llama/Llama-3.2-3B + max_prompt_tokens: 1024 + max_response_tokens: 2048 + custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" + tinker: + enable: true + base_model: meta-llama/Llama-3.2-3B +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + batch_size: 96 + total_epochs: 1 + explorer_input: + taskset: + name: taskset + storage_type: file + path: openai/gsm8k + split: train + subset_name: main + format: + prompt_key: question + response_key: answer + default_workflow_type: math_workflow + trainer_input: + experience_buffer: + name: experience_buffer + storage_type: queue +explorer: + runner_per_model: 16 + rollout_model: + engine_num: 4 + seed: 42 +trainer: + save_interval: 100 + grad_clip: 1.0 +monitor: + monitor_type: wandb +synchronizer: + sync_method: checkpoint + sync_style: fixed + sync_interval: 1 + sync_offset: 1 + sync_timeout: 1200 +``` + +
+ + +
Click to expand: veRL Backend Configuration (LoRA) + +```yaml +mode: both +project: Trinity-RFT-gsm8k +group: alignment-tinker +name: verl-llama3.2-3B-lora-off1 +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} +algorithm: + algorithm_type: grpo + repeat_times: 8 + kl_loss_fn_args: + kl_coef: 0.0 + optimizer: + lr: 1.0e-05 +data_processor: {} +model: + model_path: meta-llama/Llama-3.2-3B + max_prompt_tokens: 1024 + max_response_tokens: 2048 + custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" + lora_configs: + - name: lora + lora_rank: 32 + lora_alpha: 32 +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + batch_size: 96 + total_epochs: 1 + explorer_input: + taskset: + name: taskset + storage_type: file + path: openai/gsm8k + split: train + subset_name: main + format: + prompt_key: question + response_key: answer + default_workflow_type: math_workflow + trainer_input: + experience_buffer: + name: experience_buffer + storage_type: queue +explorer: + runner_per_model: 16 + rollout_model: + engine_num: 4 + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + dtype: bfloat16 + seed: 42 +trainer: + trainer_type: verl + save_interval: 100 + grad_clip: 1.0 +monitor: + monitor_type: wandb +synchronizer: + sync_method: checkpoint + sync_style: fixed + sync_interval: 1 + sync_offset: 1 + sync_timeout: 1200 +``` + +
+ +### Observations + +Since Llama-3.2-3B is a base (non-instruct-tuned) model, it has limited ability to follow formatting instructions. Additionally, we trained for only **one epoch**. As a result, both backends achieved final rewards just slightly above 0.1. Nonetheless, the training curves show a clear upward trend in reward, indicating successful learning. The results are visualized below: + +![Training Rewards on GSM8K](../../docs/sphinx_doc/assets/tinker-gsm8k.png) diff --git a/docs/sphinx_doc/source_zh/index.rst b/docs/sphinx_doc/source_zh/index.rst index 798ded6181..09105b12bb 100644 --- a/docs/sphinx_doc/source_zh/index.rst +++ b/docs/sphinx_doc/source_zh/index.rst @@ -40,6 +40,7 @@ tutorial/example_react.md tutorial/example_search_email.md tutorial/example_dpo.md + tutorial/example_tinker_backend.md tutorial/example_megatron.md tutorial/example_data_functionalities.md tutorial/example_dataset_perspective.md diff --git a/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md new file mode 100644 index 0000000000..609837332f --- /dev/null +++ b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md @@ -0,0 +1,213 @@ +# Tinker 后端 + +```{note} +本示例演示了如何在 Trinity-RFT 中使用 [Tinker](https://thinkingmachines.ai/tinker/),从而在**无 GPU**的设备上进行模型训练。 +``` + +## 安装与配置 + +### 1. API Key 配置 + +在启动 Ray 之前,必须将 `TRINITY_API_KEY` 环境变量设置为你的 Tinker API 密钥,以便正确访问 Tinker 的 API: + +```bash +export TRINITY_API_KEY=your_tinker_api_key +ray start --head +``` + +### 2. 配置文件 + +在 YAML 配置文件中通过如下方式设置 `model.tinker` 参数以启用 Tinker 后端: + +```yaml +model: + tinker: + enable: true + base_model: null + rank: 32 + seed: null + train_mlp: true + train_attn: true + train_unembed: true +``` + +#### 配置参数说明 + +- **`tinker`**:Tinker 专用配置部分。**注意**:启用 Tinker 后,所有 LoRA 配置(`model.lora_configs`)将被忽略。 + - **`enable`**:是否启用 Tinker 后端。默认值:`false` + - **`base_model`**:Tinker 的基础模型路径。如果未指定(`null`),则默认为配置中其他位置的 `model_path` + - **`rank`**:LoRA 的秩,控制适应矩阵的大小。默认值:`32` + - **`seed`**:Tinker 操作的随机种子。未指定(`null`)时不设定特定种子 + - **`train_mlp`**:是否训练 MLP(前馈)层。默认值:`true` + - **`train_attn`**:是否训练注意力层。默认值:`true` + - **`train_unembed`**:是否训练输出(unembedding)层。默认值:`true` + + +## 使用方法 + +配置完成后,Trinity-RFT 使用 Tinker 后端的方式与标准 veRL 后端一致。启动训练命令如下: + +```bash +trinity run --config tinker.yaml # 请替换为你的实际配置文件路径 +``` + +### Tinker 后端的功能限制 + +1. **熵损失(entropy loss)** 与 veRL 后端不完全一致。 +2. **不支持 `compute_advantage_in_trainer=true` 的算法**,包括: + - PPO(`algorithm.algorithm_type=ppo`) + - Reinforce++(`algorithm.algorithm_type=reinforceplusplus`) + - RLOO(`algorithm.algorithm_type=rloo`) + - On-policy distillation(`algorithm.algorithm_type=on_policy_distill`) + + 目前支持 `grpo`, `opmd`, `sft` 等算法,未来会支持更多算法。 + +3. **暂不支持多阶段训练**,后续会添加该功能。 + +> 💡 完整的示例配置文件见 [`tinker.yaml`](tinker.yaml)。 + + +## Llama-3.2-3B 模型实验结果 + +我们在 **GSM8K** 数据集上,分别使用 **Tinker** 和 **veRL** 后端对 **Llama-3.2-3B** 模型进行了训练。以下为实验中使用的完整配置文件。 + +
点击展开:Tinker 后端配置 + +```yaml +mode: both +project: Trinity-RFT-gsm8k +group: alignment-tinker +name: tinker-llama3.2-3B-off1 +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} +algorithm: + algorithm_type: grpo + repeat_times: 8 + kl_loss_fn_args: + kl_coef: 0.0 + optimizer: + lr: 1.0e-05 +model: + model_path: meta-llama/Llama-3.2-3B + max_prompt_tokens: 1024 + max_response_tokens: 2048 + custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" + tinker: + enable: true + base_model: meta-llama/Llama-3.2-3B +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + batch_size: 96 + total_epochs: 1 + explorer_input: + taskset: + name: taskset + storage_type: file + path: openai/gsm8k + split: train + subset_name: main + format: + prompt_key: question + response_key: answer + default_workflow_type: math_workflow + trainer_input: + experience_buffer: + name: experience_buffer + storage_type: queue +explorer: + runner_per_model: 16 + rollout_model: + engine_num: 4 + seed: 42 +trainer: + save_interval: 100 + grad_clip: 1.0 +monitor: + monitor_type: wandb +synchronizer: + sync_method: checkpoint + sync_style: fixed + sync_interval: 1 + sync_offset: 1 + sync_timeout: 1200 +``` + +
+ + +
点击展开:veRL 后端配置(LoRA) + +```yaml +mode: both +project: Trinity-RFT-gsm8k +group: alignment-tinker +name: verl-llama3.2-3B-lora-off1 +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} +algorithm: + algorithm_type: grpo + repeat_times: 8 + kl_loss_fn_args: + kl_coef: 0.0 + optimizer: + lr: 1.0e-05 +data_processor: {} +model: + model_path: meta-llama/Llama-3.2-3B + max_prompt_tokens: 1024 + max_response_tokens: 2048 + custom_chat_template: "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" + lora_configs: + - name: lora + lora_rank: 32 + lora_alpha: 32 +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + batch_size: 96 + total_epochs: 1 + explorer_input: + taskset: + name: taskset + storage_type: file + path: openai/gsm8k + split: train + subset_name: main + format: + prompt_key: question + response_key: answer + default_workflow_type: math_workflow + trainer_input: + experience_buffer: + name: experience_buffer + storage_type: queue +explorer: + runner_per_model: 16 + rollout_model: + engine_num: 4 + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + dtype: bfloat16 + seed: 42 +trainer: + trainer_type: verl + save_interval: 100 + grad_clip: 1.0 +monitor: + monitor_type: wandb +synchronizer: + sync_method: checkpoint + sync_style: fixed + sync_interval: 1 + sync_offset: 1 + sync_timeout: 1200 +``` + +
+ +### 结果说明 + +由于 Llama-3.2-3B 是基础(非指令微调)模型,其格式化指令跟随能力有限,且本实验仅训练了**一个 epoch**。因此,两种后端的最终 reward 都略高于 0.1。但训练曲线显示 reward 呈明显上升趋势,表明模型已成功学习。结果可视化如下: + +![GSM8K 训练奖励曲线](../../docs/sphinx_doc/assets/tinker-gsm8k.png) diff --git a/examples/tinker/README.md b/examples/tinker/README.md index 79de2c959f..2c55fdf294 100644 --- a/examples/tinker/README.md +++ b/examples/tinker/README.md @@ -56,7 +56,8 @@ trinity run --config tinker.yaml # Replace with your actual config file path - RLOO (`algorithm.algorithm_type=rloo`) - On-policy distillation (`algorithm.algorithm_type=on_policy_distill`) - Algorithms like `algorithm.algorithm_type=grpo` are supported. We will add support for these algorithms in the future. + Algorithms like `grpo`, `opmd`, `sft` are supported and we will add support for more algorithms in the future. + 3. **Multiple stages training** is not supported currently, we will add support for this in the future. > 💡 A complete example configuration file is available at [`tinker.yaml`](tinker.yaml). @@ -78,14 +79,10 @@ checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} algorithm: algorithm_type: grpo repeat_times: 8 - sample_strategy: default kl_loss_fn_args: kl_coef: 0.0 optimizer: lr: 1.0e-05 - lr_warmup_steps_ratio: 0.0 - warmup_style: constant -data_processor: {} model: model_path: meta-llama/Llama-3.2-3B max_prompt_tokens: 1024 @@ -110,29 +107,19 @@ buffer: format: prompt_key: question response_key: answer - rollout_args: - temperature: 1.0 - logprobs: 0 - eval_tasksets: [] default_workflow_type: math_workflow trainer_input: experience_buffer: name: experience_buffer storage_type: queue - replay_buffer: - enable: false explorer: runner_per_model: 16 rollout_model: engine_num: 4 seed: 42 - auxiliary_models: [] - eval_interval: 1000 trainer: save_interval: 100 - enable_preview: true grad_clip: 1.0 - max_token_len_per_gpu: 16384 monitor: monitor_type: wandb synchronizer: @@ -157,13 +144,10 @@ checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} algorithm: algorithm_type: grpo repeat_times: 8 - sample_strategy: default kl_loss_fn_args: kl_coef: 0.0 optimizer: lr: 1.0e-05 - lr_warmup_steps_ratio: 0.0 - warmup_style: constant data_processor: {} model: model_path: meta-llama/Llama-3.2-3B @@ -190,42 +174,23 @@ buffer: format: prompt_key: question response_key: answer - rollout_args: - temperature: 1.0 - logprobs: 0 - eval_tasksets: [] default_workflow_type: math_workflow trainer_input: experience_buffer: name: experience_buffer storage_type: queue - replay_buffer: - enable: false explorer: runner_per_model: 16 rollout_model: engine_num: 4 tensor_parallel_size: 1 - enforce_eager: false - enable_prefix_caching: false - enable_chunked_prefill: false gpu_memory_utilization: 0.9 dtype: bfloat16 seed: 42 - enable_thinking: false - enable_history: false - enable_openai_api: false - enable_auto_tool_choice: false - tool_call_parser: null - reasoning_parser: null - auxiliary_models: [] - eval_interval: 1000 trainer: trainer_type: verl save_interval: 100 - enable_preview: true grad_clip: 1.0 - max_token_len_per_gpu: 16384 monitor: monitor_type: wandb synchronizer: diff --git a/pyproject.toml b/pyproject.toml index e92f3ba98e..23a872d22b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -78,7 +78,7 @@ megatron = [ "megatron-core[mlm]==0.13.1", # if you found "undefined symbol" error in transformer engine # reinstall it with --no-build-isolation and `--no-cache-dir` flag - "transformer_engine[pytorch]==2.8.0", + # "transformer_engine[pytorch]==2.8.0", 
"mbridge>=0.13.0", ] tinker = [ diff --git a/scripts/docker/Dockerfile.megatron b/scripts/docker/Dockerfile.megatron index 41c4e88168..c5362258a2 100644 --- a/scripts/docker/Dockerfile.megatron +++ b/scripts/docker/Dockerfile.megatron @@ -31,6 +31,7 @@ RUN pip install --upgrade pip \ && pip install -e .[vllm,mm,dev] \ && pip install flash_attn==2.8.1 --no-build-isolation \ && pip install -e .[megatron] \ + && pip install transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir \ && NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 pip install -v \ --disable-pip-version-check --no-cache-dir --no-build-isolation \ --config-settings "--build-option=--cpp_ext" \ diff --git a/scripts/docker/Dockerfile.uv b/scripts/docker/Dockerfile.uv index 01492428a5..4d08ecd60d 100644 --- a/scripts/docker/Dockerfile.uv +++ b/scripts/docker/Dockerfile.uv @@ -39,9 +39,9 @@ RUN . /opt/venv/bin/activate && \ # Install flash_attn and Megatron RUN . /opt/venv/bin/activate && \ - uv pip install flash_attn==2.8.1 --no-build-isolation && \ uv pip install -e .[megatron] && \ - uv pip install --reinstall transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir && \ + uv pip install flash_attn==2.8.1 --no-build-isolation && \ + uv pip install transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir && \ NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 \ uv pip install -v --no-build-isolation \ --config-settings="--build-option=--cpp_ext" \ From 265b997436da117bac1307ca48ad6ce0fd755272 Mon Sep 17 00:00:00 2001 From: pxc Date: Tue, 30 Dec 2025 15:26:58 +0800 Subject: [PATCH 2/7] add news --- README.md | 1 + README_zh.md | 1 + 2 files changed, 2 insertions(+) diff --git a/README.md b/README.md index 797b71ec0f..a2fe016c47 100644 --- a/README.md +++ b/README.md @@ -42,6 +42,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob ## 🚀 News +* [2025-12] [[Release Notes]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 released: added Tinker backend for CPU-only training, add more benchmarks. * [2025-12] Trinity-RFT has supported [tinker](https://thinkingmachines.ai/tinker/) training backend, which enables model training on devices **without GPUs**. * [2025-12] Trinity-RFT powers the medical and health business of "Taobao Shangou", enabling the AI agent to understand vague symptoms, proactively ask follow-up questions, and provide precise recommendations ([News](https://tech.china.com.cn/sx/20251201/411376.shtml)). * [2025-11] [[Release Notes](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 released: bug fixes. 
diff --git a/README_zh.md b/README_zh.md
index 9a2ce35baa..5a5e2e82d5 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -41,6 +41,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能:

 ## 🚀 新闻

+* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增Tinker 后端以支持在无 GPU 的设备上训练,增加更多基准测试。
 * [2025-12] Trinity-RFT 已支持 [tinker](https://thinkingmachines.ai/tinker/) 训练后端,可在**无 GPU 的设备**上进行模型训练。
 * [2025-12] Trinity-RFT 助力淘宝闪购医药健康业务,让 AI 智能体能够理解模糊症状、主动询问后续问题,并提供精准推荐([新闻](https://tech.china.com.cn/sx/20251201/411376.shtml))。
 * [2025-11] [[发布说明](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 发布:修复若干 Bug。

From 3b20dbaf66cf656132617f11a658581b4ceaa452 Mon Sep 17 00:00:00 2001
From: pxc
Date: Tue, 30 Dec 2025 15:32:44 +0800
Subject: [PATCH 3/7] fix comments

---
 docs/sphinx_doc/source/tutorial/example_tinker_backend.md    | 2 +-
 docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sphinx_doc/source/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
index 7efdf060de..a1a3db5061 100644
--- a/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
+++ b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md
@@ -64,7 +64,7 @@ trinity run --config tinker.yaml # Replace with your actual config file path
 3. **Multi-stage training** is currently not supported; we will add it in the future.

-> 💡 A complete example configuration file is available at [`tinker.yaml`](tinker.yaml).
+> 💡 A complete example configuration file is available at [`tinker.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/tinker/tinker.yaml).


 ## Results on the Llama-3.2-3B Model
diff --git a/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md
index 609837332f..a56d6eb671 100644
--- a/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md
+++ b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md
@@ -64,7 +64,7 @@ trinity run --config tinker.yaml # 请替换为你的实际配置文件路径
 3. **暂不支持多阶段训练**,后续会添加该功能。

-> 💡 完整的示例配置文件见 [`tinker.yaml`](tinker.yaml)。
+> 💡 完整的示例配置文件见 [`tinker.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/tinker/tinker.yaml)。


 ## Llama-3.2-3B 模型实验结果

From 914646c05f0e074f17a23ccb5bd46d5c44301a1f Mon Sep 17 00:00:00 2001
From: pxc
Date: Tue, 30 Dec 2025 15:43:27 +0800
Subject: [PATCH 4/7] update news

---
 README.md    | 3 +--
 README_zh.md | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index a2fe016c47..93103d3227 100644
--- a/README.md
+++ b/README.md
@@ -42,8 +42,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob

 ## 🚀 News

-* [2025-12] [[Release Notes]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 released: added Tinker backend for CPU-only training, add more benchmarks.
-* [2025-12] Trinity-RFT has supported [tinker](https://thinkingmachines.ai/tinker/) training backend, which enables model training on devices **without GPUs**.
+* [2025-12] [[Release Notes]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 released: added the [Tinker](https://thinkingmachines.ai/tinker/) backend for users **without GPUs**, added more benchmarks, enhanced online RL, and more.
* [2025-12] Trinity-RFT powers the medical and health business of "Taobao Shangou", enabling the AI agent to understand vague symptoms, proactively ask follow-up questions, and provide precise recommendations ([News](https://tech.china.com.cn/sx/20251201/411376.shtml)). * [2025-11] [[Release Notes](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 released: bug fixes. * [2025-11] Introducing [Learn-to-Ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask): a framework for training proactive dialogue agents from offline expert data ([paper](https://arxiv.org/pdf/2510.25441)). diff --git a/README_zh.md b/README_zh.md index 5a5e2e82d5..a88906c097 100644 --- a/README_zh.md +++ b/README_zh.md @@ -41,7 +41,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: ## 🚀 新闻 -* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增Tinker 后端以支持在无 GPU 的设备上训练,增加更多基准测试。 +* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增Tinker((https://thinkingmachines.ai/tinker/)) 后端以支持在 **无 GPU** 的设备上训练,增加更多基准测试,增强在线 RL 等功能。 * [2025-12] Trinity-RFT 已支持 [tinker](https://thinkingmachines.ai/tinker/) 训练后端,可在**无 GPU 的设备**上进行模型训练。 * [2025-12] Trinity-RFT 助力淘宝闪购医药健康业务,让 AI 智能体能够理解模糊症状、主动询问后续问题,并提供精准推荐([新闻](https://tech.china.com.cn/sx/20251201/411376.shtml))。 * [2025-11] [[发布说明](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 发布:修复若干 Bug。 From ff6446bdaeba0fe41e88b0138a2f4a22c7f594ff Mon Sep 17 00:00:00 2001 From: pxc Date: Tue, 30 Dec 2025 15:43:38 +0800 Subject: [PATCH 5/7] update news --- README_zh.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README_zh.md b/README_zh.md index a88906c097..39ab7fe61e 100644 --- a/README_zh.md +++ b/README_zh.md @@ -41,7 +41,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: ## 🚀 新闻 -* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增Tinker((https://thinkingmachines.ai/tinker/)) 后端以支持在 **无 GPU** 的设备上训练,增加更多基准测试,增强在线 RL 等功能。 +* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增[Tinker](https://thinkingmachines.ai/tinker/) 后端以支持在 **无 GPU** 的设备上训练,增加更多基准测试,增强在线 RL 等功能。 * [2025-12] Trinity-RFT 已支持 [tinker](https://thinkingmachines.ai/tinker/) 训练后端,可在**无 GPU 的设备**上进行模型训练。 * [2025-12] Trinity-RFT 助力淘宝闪购医药健康业务,让 AI 智能体能够理解模糊症状、主动询问后续问题,并提供精准推荐([新闻](https://tech.china.com.cn/sx/20251201/411376.shtml))。 * [2025-11] [[发布说明](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 发布:修复若干 Bug。 From b51161f080c51a4d9044f744b69ade959bca761c Mon Sep 17 00:00:00 2001 From: pxc Date: Tue, 30 Dec 2025 16:18:18 +0800 Subject: [PATCH 6/7] update readme --- README.md | 2 +- README_zh.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 93103d3227..9091074e95 100644 --- a/README.md +++ b/README.md @@ -70,7 +70,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob | Category | Tutorial / Guideline | | --- | ----| -| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
+ [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
+ [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html) | +| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
+ [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
+ [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html)
+ [RL without local GPU (Tinker Backend)](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_tinker_backend.html) | | *Multi-step agentic RL* | + [Concatenated multi-turn workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html)
+ [General multi-step workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html)
+ [ReAct workflow with an agent framework](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html)
+ [Example: train a web-search agent](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *Full-lifecycle data pipelines* | + [Rollout task mixing and selection](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_selector.html)
+ [Online task curriculum](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
+ [Research project: learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
+ [Experience replay with prioritization](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [Advanced data processing & human-in-the-loop](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html) | | *Algorithm development* | + [RL algorithm development with Trinity-RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) (📝 [paper](https://arxiv.org/pdf/2508.11408))
+ [Research project: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
+ Non-verifiable domains: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | diff --git a/README_zh.md b/README_zh.md index 39ab7fe61e..c873977628 100644 --- a/README_zh.md +++ b/README_zh.md @@ -71,7 +71,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: | 类别 | 教程 / 指南 | | --- | ----| -| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
+ [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
+ [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html) | +| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
+ [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
+ [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html)
+ [在无GPU环境下运行RL训练(Tinker 后端)](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) | | *多轮智能体强化学习* | + [拼接多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html)
+ [通用多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html)
+ [调用智能体框架中的 ReAct 工作流](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_react.html)
+ [例子:训练一个网络搜索智能体](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *全生命周期的数据流水线* | + [Rollout 任务混合与选取](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_selector.html)
+ [在线任务选择](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
+ [研究项目:learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
+ [经验回放机制](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [高级数据处理能力 & Human-in-the-loop](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html) | | *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) (📝 [论文](https://arxiv.org/pdf/2508.11408))
+ [研究项目: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
+ 不可验证的领域: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | From b1a96cd4b519d836bac49f913ec5023d64b21977 Mon Sep 17 00:00:00 2001 From: pxc Date: Tue, 30 Dec 2025 16:23:12 +0800 Subject: [PATCH 7/7] update readme --- README.md | 2 +- README_zh.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 9091074e95..d0e319603a 100644 --- a/README.md +++ b/README.md @@ -70,7 +70,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob | Category | Tutorial / Guideline | | --- | ----| -| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
+ [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
+ [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html)
+ [RL without local GPU (Tinker Backend)](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_tinker_backend.html) | +| *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
+ [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
+ [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html)
+ [RFT without local GPU (Tinker Backend)](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_tinker_backend.html) | | *Multi-step agentic RL* | + [Concatenated multi-turn workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html)
+ [General multi-step workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html)
+ [ReAct workflow with an agent framework](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html)
+ [Example: train a web-search agent](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *Full-lifecycle data pipelines* | + [Rollout task mixing and selection](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_selector.html)
+ [Online task curriculum](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
+ [Research project: learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
+ [Experience replay with prioritization](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [Advanced data processing & human-in-the-loop](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html) | | *Algorithm development* | + [RL algorithm development with Trinity-RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) (📝 [paper](https://arxiv.org/pdf/2508.11408))
+ [Research project: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
+ Non-verifiable domains: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | diff --git a/README_zh.md b/README_zh.md index c873977628..9197031c22 100644 --- a/README_zh.md +++ b/README_zh.md @@ -71,7 +71,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: | 类别 | 教程 / 指南 | | --- | ----| -| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
+ [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
+ [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html)
+ [在无GPU环境下运行RL训练(Tinker 后端)](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) | +| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
+ [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
+ [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
+ [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html)
+ [在无GPU环境下运行RFT训练(Tinker 后端)](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) | | *多轮智能体强化学习* | + [拼接多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html)
+ [通用多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html)
+ [调用智能体框架中的 ReAct 工作流](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_react.html)
+ [例子:训练一个网络搜索智能体](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *全生命周期的数据流水线* | + [Rollout 任务混合与选取](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_selector.html)
+ [在线任务选择](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
+ [研究项目:learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
+ [经验回放机制](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [高级数据处理能力 & Human-in-the-loop](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html) | | *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) (📝 [论文](https://arxiv.org/pdf/2508.11408))
+ [研究项目: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
+ 不可验证的领域: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) |