
Multi-Turn Training with VLMs #931

Open
HwVanICI wants to merge 14 commits into inclusionAI:main from HwVanICI:vlm_multiturn

Conversation

@HwVanICI
Collaborator

Description

Currently, AReaL supports multi-turn training only for LLMs. This PR adds comprehensive support for multi-turn agentic training of Vision-Language Models (VLMs). The implementation enables VLMs to learn from their mistakes through automatic error recovery and retry mechanisms, combined with turn-based reward discounting.

Changes

New Core Workflow

File: areal/workflow/vision_multiturn_agentic.py

A new workflow class VisionMultiTurnAgenticWorkflow that:

  • Integrates vision processing with multi-turn conversation handling
  • Implements automatic error recovery through failure feedback injection
  • Supports configurable turn limits and reward discounting
  • Enables tool calling for agentic reasoning
  • Compatible with HuggingFace vision processors (Qwen-VL series, etc.)

Key Features:

  • Error Recovery: When reward < 1.0, automatically appends feedback and retries (up to max_turns)
  • Turn Discounting: Applies exponential discount factor to incentivize correct first-turn answers
  • Vision Integration: Seamlessly processes multi-modal inputs (images + text)
  • Flexible Export: Supports both "concat" and "individual" training data export styles
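
Taken together, these features amount to a retry loop whose final reward is scaled by the number of turns consumed. A minimal sketch in plain Python — the names `run_turn`, `compute_reward`, `max_turns`, and `turn_discount` are illustrative placeholders, not the actual AReaL API:

```python
# Hedged sketch of the error-recovery loop with turn discounting.
# All names here are illustrative, not AReaL's real interfaces.
def multiturn_rollout(prompt, run_turn, compute_reward,
                      max_turns=3, turn_discount=0.9):
    """Retry until the reward reaches 1.0 or max_turns is exhausted,
    then discount the final reward by the number of extra turns used."""
    history = [prompt]
    reward = 0.0
    for turn in range(max_turns):
        answer = run_turn(history)          # one model generation
        history.append(answer)
        reward = compute_reward(answer)
        if reward >= 1.0:                   # correct: stop early
            break
        # reward < 1.0: inject failure feedback and retry
        history.append("Your answer is wrong or unparsable. Try again.")
    return reward * (turn_discount ** turn), history
```

For instance, failing on the first turn and succeeding on the second yields `1.0 * 0.9 = 0.9`, so a correct first-turn answer is worth strictly more than a correct retry.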

New Example

Directory: examples/vlm_multiturn/

Complete training example with:

  • vlm_multiturn_grpo.py - Training script with Geometry3K reward function
  • vlm_multiturn_grpo.yaml - Full GRPO configuration for multi-turn VLM training
  • train_vlm_multiturn.sh - GPU training launcher script
  • train_vlm_multiturn_npu.sh - NPU training launcher script
  • README.md - Comprehensive documentation and usage guide

Related Issue

Fixes #(issue)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

Additional Context


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

@gemini-code-assist
Contributor

Summary of Changes

Hello @HwVanICI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances AReaL's training framework by introducing robust multi-turn agentic capabilities for Vision-Language Models. It enables VLMs to engage in interactive problem-solving, learn from self-correction, and optimize for efficient responses through novel error recovery and reward discounting mechanisms. This expansion allows for more sophisticated and adaptive VLM training paradigms.

Highlights

  • Multi-Turn VLM Training: Added comprehensive support for multi-turn agentic training of Vision-Language Models (VLMs), extending AReaL's capabilities beyond LLMs.
  • Error Recovery and Retry: Implemented automatic error recovery and retry mechanisms, allowing VLMs to learn from mistakes by injecting failure feedback and retrying within a set number of turns.
  • Turn-Based Reward Discounting: Introduced turn-based reward discounting to incentivize models to provide correct answers more efficiently and in fewer turns.
  • New Workflow Class: Created VisionMultiTurnAgenticWorkflow to integrate multi-modal image processing, multi-turn conversation handling, and tool calling for agentic reasoning.
  • Complete Example: Provided a full training example in examples/vlm_multiturn/, including training scripts, GRPO configuration, and detailed documentation for GPU and NPU environments.


Changelog
  • areal/workflow/vision_multiturn_agentic.py
    • Added VisionMultiTurnAgenticWorkflow class for multi-turn VLM training.
    • Implemented multi-modal image processing and integration with vision processors.
    • Incorporated logic for automatic error recovery, feedback injection, and retries.
    • Added support for configurable max_turns and turn_discount parameters.
    • Enabled tool calling capabilities within the multi-turn interaction loop.
  • examples/vlm_multiturn/README.md
    • Added new documentation detailing the multi-turn agentic VLM training workflow.
    • Described key features such as error recovery, reward accumulation, and turn discounting.
    • Provided quick start instructions and configuration examples for training.
  • examples/vlm_multiturn/train_vlm_multiturn.sh
    • Added a new shell script to launch multi-turn VLM training on GPU.
    • Configured environment variables, model paths, and multi-turn specific parameters.
  • examples/vlm_multiturn/train_vlm_multiturn_npu.sh
    • Added a new shell script to launch multi-turn VLM training on NPU.
    • Configured environment variables, model paths, and multi-turn specific parameters for NPU.
  • examples/vlm_multiturn/vlm_multiturn_grpo.py
    • Added a new Python script for multi-turn agentic multi-modal RL training.
    • Defined reward functions (format_reward, acc_reward, geometry3k_reward_fn) for VLM evaluation.
    • Introduced VisionMultiTurnGRPOConfig dataclass to manage multi-turn specific training parameters.
    • Integrated the new VisionMultiTurnAgenticWorkflow into the PPOTrainer.
  • examples/vlm_multiturn/vlm_multiturn_grpo.yaml
    • Added a new YAML configuration file for multi-turn agentic multi-modal GRPO training.
    • Included parameters for max_turns, turn_discount, and export_style.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request introduces multi-turn training support for Vision-Language Models (VLMs) in AReaL, enabling agentic interactions with error recovery and reward discounting. The changes include a new workflow class VisionMultiTurnAgenticWorkflow, along with example scripts and configuration files. The implementation appears robust, but there are a few areas for improvement regarding error handling, clarity, and consistency in the new workflow.

  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn_agentic.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn_agentic.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn_agentic.py (outdated)
  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.py
  • Comment thread: areal/workflow/vision_multiturn_agentic.py (outdated)
  • Comment thread: examples/vlm_multiturn/README.md
  • Comment thread: areal/workflow/vision_multiturn.py
  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.py
Collaborator

@rchardx rchardx left a comment


See inline comments for details.

  • Comment thread: areal/workflow/vision_multiturn.py (outdated)
  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.py (outdated)
  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.yaml (outdated)
  • Comment thread: examples/vlm_multiturn/train_vlm_multiturn.sh (outdated)
  • Comment thread: areal/workflow/vision_multiturn.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn.py
  • Comment thread: areal/workflow/vision_multiturn.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn.py
  • Comment thread: examples/vlm_multiturn/README.md (outdated)
len(seq),
len(resp.input_tokens[:-input_len]),
)
seq += resp.input_tokens[-input_len:] + resp.output_tokens
Collaborator

@rchardx rchardx Feb 26, 2026


Picking up on @rchardx's earlier alignment comment — the input_len / assertion / slice pattern you adopted from MultiTurnWorkflow has a subtle precondition that I think might be broken here.

MultiTurnWorkflow initializes seq = [] (empty), which guarantees input_len > 0 on the first iteration. That makes list[-input_len:] slice off just the "new" tokens. Here, seq = input_ids.copy() (non-empty), so when resp.input_tokens matches exactly what was sent — which is the normal case since the engine does req = req.copy() before mutating (remote_inf_engine.py:690) — we get input_len = 0.

The problem is Python's list[-0:] semantics: it returns the entire list, not an empty slice. So:

# input_len == 0 (normal case for VLM with preprocessed tokens)
seq += resp.input_tokens[-0:]  + resp.output_tokens
#      ^^^^^^^^^^^^^^^^^^^^^^^^
#      == resp.input_tokens[:]  == ALL input tokens (Python quirk)
#
# seq was [input₁..inputₙ], now becomes [input₁..inputₙ, input₁..inputₙ, out₁..outₘ]
#                                         ← original →    ← DUPLICATED →

Meanwhile logprobs, loss_mask, versions only grow by output_len, so len(seq) != len(logprobs) after the first iteration — which should cause a shape mismatch at tensor construction (lines 359-361).

The assertion on line 289 doesn't catch this because resp.input_tokens[:-0] also returns the full list, making the comparison trivially true.

Two possible fixes depending on the intended design:

(A) Match MultiTurnWorkflow exactly — start with empty seq:

seq, logprobs, loss_mask, versions = [], [], [], []
# ... and inside the loop:
input_len = len(resp.input_tokens) - len(seq)
# (rest stays the same, but add padding for input tokens)
logprobs += [0.0] * input_len + resp.output_logprobs
loss_mask += [0] * input_len + [1] * len(resp.output_tokens)
versions += [-1] * input_len + resp.output_versions

(B) Keep pre-populated seq, guard against input_len == 0:

if input_len > 0:
    assert resp.input_tokens[:-input_len] == seq
    seq += resp.input_tokens[-input_len:]
    logprobs += [0.0] * input_len
    loss_mask += [0] * input_len
    versions += [-1] * input_len
seq += resp.output_tokens
logprobs += resp.output_logprobs
loss_mask += [1] * len(resp.output_tokens)
versions += resp.output_versions

Option A is simpler and follows the battle-tested convention. What do you think?
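
A toy run of option (B), with fabricated token lists standing in for the engine response (placeholders, not the real `ModelResponse` fields), confirms the parallel lists stay aligned when `input_len == 0`:

```python
# Toy check: the guarded update (option B) keeps seq / logprobs /
# loss_mask / versions the same length even when the engine echoes
# back exactly the tokens it was sent (input_len == 0).
seq = [1, 2, 3]                       # pre-populated prompt tokens
logprobs = [0.0] * len(seq)           # bookkeeping padded for the prompt
loss_mask = [0] * len(seq)
versions = [-1] * len(seq)

input_tokens = [1, 2, 3]              # engine echoes the prompt unchanged
output_tokens = [7, 8]                # two generated tokens

input_len = len(input_tokens) - len(seq)   # == 0
if input_len > 0:                          # guard avoids the [-0:] trap
    assert input_tokens[:-input_len] == seq
    seq += input_tokens[-input_len:]
    logprobs += [0.0] * input_len
    loss_mask += [0] * input_len
    versions += [-1] * input_len
seq += output_tokens
logprobs += [0.0] * len(output_tokens)
loss_mask += [1] * len(output_tokens)
versions += [0] * len(output_tokens)

assert len(seq) == len(logprobs) == len(loss_mask) == len(versions) == 5
```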

Collaborator Author

@HwVanICI HwVanICI Mar 3, 2026


Thanks a lot for all of the comments. Unfortunately, the current main branch has bugs in VLM training (in serialization and data sharing), so I can no longer test this with the new changes. I will submit this PR for now to the ascend branch, where VLM training still works.

seq.extend(self.feedback_str_ids)

if messages_chat:
    messages_chat = messages_chat + [self.feedback_str]
Collaborator


🐛 Type mismatch: raw string appended to list of message dicts

On line 327, model_output is correctly structured as a message dict:

model_output = {"role": "assistant", "content": [{"type": "text", "text": output_text}]}
messages_chat = messages_chat + [model_output]  # ✅ list of dicts

But here on line 333, self.feedback_str is a raw Python string (set in __init__ at line 119):

self.feedback_str = "Your answer is either wrong or not parsable to the reward function. Try to answer it again..."

So this appends a plain string to what should be a list of message dicts:

messages_chat = messages_chat + [self.feedback_str]  # ❌ ["str"] not [{"role": ..., "content": ...}]

On the next turn, this corrupted messages_chat flows into ModelRequest.vision_msg_vllm (line 274), which reaches vllm_remote.py:62-63:

for msg in parsed_input:
    if isinstance(msg["content"], list):  # 💥 str["content"] → TypeError

This only triggers when: (1) the dataset provides messages_chat (e.g., geometry3k does at line 156-180), (2) max_turns > 1, and (3) the first turn's reward < 1.0 (so the loop continues). The single-turn case (max_turns=1) or correct-on-first-try case never reaches this line, which might explain why it wasn't caught during testing.

Suggested fix — wrap feedback as a proper message dict, matching the assistant message pattern above:

if messages_chat:
    feedback_msg = {
        "role": "user",
        "content": [{"type": "text", "text": self.feedback_str}],
    }
    messages_chat = messages_chat + [feedback_msg]

This is consistent with how self.feedback_str is used in __init__ (line 120-125) when building feedback_str_ids — it's wrapped in {"role": "user", "content": [...]} there too.

What do you think? Am I reading the flow correctly?
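
The failure mode is easy to reproduce standalone (illustrative snippet mirroring the vllm_remote.py loop, not the actual code):

```python
# Repro: a raw string mixed into a list of message dicts makes
# dict-style indexing raise, just like the vllm_remote.py loop would.
messages = [
    {"role": "assistant", "content": [{"type": "text", "text": "x = 4"}]},
    "Your answer is wrong. Try again.",  # raw string, not a message dict
]
offender = None
for msg in messages:
    try:
        isinstance(msg["content"], list)  # str["content"] raises TypeError
    except TypeError:
        offender = msg
assert offender == "Your answer is wrong. Try again."
```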

Collaborator Author


Thanks a lot for all of the comments. Unfortunately, the current main branch has bugs in VLM training (in serialization and data sharing), so I can no longer test this with the new changes. I will submit this PR for now to the ascend branch, where VLM training still works.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days.

Please add a comment or push new commits to keep it active.

Thank you for your contribution!

@github-actions github-actions Bot added stale and removed stale labels Mar 17, 2026

@github-actions github-actions Bot added the stale label Apr 8, 2026
@garrett4wade garrett4wade self-requested a review as a code owner April 23, 2026 08:02
@github-actions github-actions Bot removed the stale label Apr 24, 2026