
Multi-Turn Training with VLMs #931

Open
HwVanICI wants to merge 14 commits into inclusionAI:main from HwVanICI:vlm_multiturn

Conversation

@HwVanICI
Collaborator

Description

Currently, AReaL supports multi-turn training only for LLMs. This PR adds comprehensive support for multi-turn agentic training of Vision-Language Models (VLMs). The implementation enables VLMs to learn from their mistakes through automatic error recovery and retry mechanisms, combined with turn-based reward discounting.

Changes

New Core Workflow

File: areal/workflow/vision_multiturn_agentic.py

A new workflow class VisionMultiTurnAgenticWorkflow that:

  • Integrates vision processing with multi-turn conversation handling
  • Implements automatic error recovery through failure feedback injection
  • Supports configurable turn limits and reward discounting
  • Enables tool calling for agentic reasoning
  • Compatible with HuggingFace vision processors (Qwen-VL series, etc.)

Key Features:

  • Error Recovery: When reward < 1.0, automatically appends feedback and retries (up to max_turns)
  • Turn Discounting: Applies exponential discount factor to incentivize correct first-turn answers
  • Vision Integration: Seamlessly processes multi-modal inputs (images + text)
  • Flexible Export: Supports both "concat" and "individual" training data export styles
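
Taken together, these features amount to a retry loop whose final reward is scaled by the number of turns consumed. A minimal sketch in plain Python — the names `run_turn`, `compute_reward`, `max_turns`, and `turn_discount` are illustrative placeholders, not the actual AReaL API:

```python
# Hedged sketch of the error-recovery loop with turn discounting.
# All names here are illustrative, not AReaL's real interfaces.
def multiturn_rollout(prompt, run_turn, compute_reward,
                      max_turns=3, turn_discount=0.9):
    """Retry until the reward reaches 1.0 or max_turns is exhausted,
    then discount the final reward by the number of extra turns used."""
    history = [prompt]
    reward = 0.0
    for turn in range(max_turns):
        answer = run_turn(history)          # one model generation
        history.append(answer)
        reward = compute_reward(answer)
        if reward >= 1.0:                   # correct: stop early
            break
        # reward < 1.0: inject failure feedback and retry
        history.append("Your answer is wrong or unparsable. Try again.")
    return reward * (turn_discount ** turn), history
```

For instance, failing on the first turn and succeeding on the second yields `1.0 * 0.9 = 0.9`, so a correct first-turn answer is worth strictly more than a correct retry.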

New Example

Directory: examples/vlm_multiturn/

Complete training example with:

  • vlm_multiturn_grpo.py - Training script with Geometry3K reward function
  • vlm_multiturn_grpo.yaml - Full GRPO configuration for multi-turn VLM training
  • train_vlm_multiturn.sh - GPU training launcher script
  • train_vlm_multiturn_npu.sh - NPU training launcher script
  • README.md - Comprehensive documentation and usage guide

Related Issue

Fixes #(issue)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

Additional Context


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

@gemini-code-assist
Contributor

Summary of Changes

Hello @HwVanICI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances AReaL's training framework by introducing robust multi-turn agentic capabilities for Vision-Language Models. It enables VLMs to engage in interactive problem-solving, learn from self-correction, and optimize for efficient responses through novel error recovery and reward discounting mechanisms. This expansion allows for more sophisticated and adaptive VLM training paradigms.

Highlights

  • Multi-Turn VLM Training: Added comprehensive support for multi-turn agentic training of Vision-Language Models (VLMs), extending AReaL's capabilities beyond LLMs.
  • Error Recovery and Retry: Implemented automatic error recovery and retry mechanisms, allowing VLMs to learn from mistakes by injecting failure feedback and retrying within a set number of turns.
  • Turn-Based Reward Discounting: Introduced turn-based reward discounting to incentivize models to provide correct answers more efficiently and in fewer turns.
  • New Workflow Class: Created VisionMultiTurnAgenticWorkflow to integrate multi-modal image processing, multi-turn conversation handling, and tool calling for agentic reasoning.
  • Complete Example: Provided a full training example in examples/vlm_multiturn/, including training scripts, GRPO configuration, and detailed documentation for GPU and NPU environments.


Changelog
  • areal/workflow/vision_multiturn_agentic.py
    • Added VisionMultiTurnAgenticWorkflow class for multi-turn VLM training.
    • Implemented multi-modal image processing and integration with vision processors.
    • Incorporated logic for automatic error recovery, feedback injection, and retries.
    • Added support for configurable max_turns and turn_discount parameters.
    • Enabled tool calling capabilities within the multi-turn interaction loop.
  • examples/vlm_multiturn/README.md
    • Added new documentation detailing the multi-turn agentic VLM training workflow.
    • Described key features such as error recovery, reward accumulation, and turn discounting.
    • Provided quick start instructions and configuration examples for training.
  • examples/vlm_multiturn/train_vlm_multiturn.sh
    • Added a new shell script to launch multi-turn VLM training on GPU.
    • Configured environment variables, model paths, and multi-turn specific parameters.
  • examples/vlm_multiturn/train_vlm_multiturn_npu.sh
    • Added a new shell script to launch multi-turn VLM training on NPU.
    • Configured environment variables, model paths, and multi-turn specific parameters for NPU.
  • examples/vlm_multiturn/vlm_multiturn_grpo.py
    • Added a new Python script for multi-turn agentic multi-modal RL training.
    • Defined reward functions (format_reward, acc_reward, geometry3k_reward_fn) for VLM evaluation.
    • Introduced VisionMultiTurnGRPOConfig dataclass to manage multi-turn specific training parameters.
    • Integrated the new VisionMultiTurnAgenticWorkflow into the PPOTrainer.
  • examples/vlm_multiturn/vlm_multiturn_grpo.yaml
    • Added a new YAML configuration file for multi-turn agentic multi-modal GRPO training.
    • Included parameters for max_turns, turn_discount, and export_style.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request introduces multi-turn training support for Vision-Language Models (VLMs) in AReaL, enabling agentic interactions with error recovery and reward discounting. The changes include a new workflow class VisionMultiTurnAgenticWorkflow, along with example scripts and configuration files. The implementation appears robust, but there are a few areas for improvement regarding error handling, clarity, and consistency in the new workflow.

  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn_agentic.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn_agentic.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn_agentic.py (outdated)
  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.py
  • Comment thread: areal/workflow/vision_multiturn_agentic.py (outdated)
  • Comment thread: examples/vlm_multiturn/README.md
  • Comment thread: areal/workflow/vision_multiturn.py
  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.py
Collaborator

@rchardx rchardx left a comment


See inline comments for details.

  • Comment thread: areal/workflow/vision_multiturn.py (outdated)
  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.py (outdated)
  • Comment thread: examples/vlm_multiturn/vlm_multiturn_grpo.yaml (outdated)
  • Comment thread: examples/vlm_multiturn/train_vlm_multiturn.sh (outdated)
  • Comment thread: areal/workflow/vision_multiturn.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn.py
  • Comment thread: areal/workflow/vision_multiturn.py (outdated)
  • Comment thread: areal/workflow/vision_multiturn.py
  • Comment thread: examples/vlm_multiturn/README.md (outdated)
len(seq),
len(resp.input_tokens[:-input_len]),
)
seq += resp.input_tokens[-input_len:] + resp.output_tokens
Collaborator

@rchardx rchardx Feb 26, 2026


Picking up on @rchardx's earlier alignment comment — the input_len / assertion / slice pattern you adopted from MultiTurnWorkflow has a subtle precondition that I think might be broken here.

MultiTurnWorkflow initializes seq = [] (empty), which guarantees input_len > 0 on the first iteration. That makes list[-input_len:] slice off just the "new" tokens. Here, seq = input_ids.copy() (non-empty), so when resp.input_tokens matches exactly what was sent — which is the normal case since the engine does req = req.copy() before mutating (remote_inf_engine.py:690) — we get input_len = 0.

The problem is Python's list[-0:] semantics: it returns the entire list, not an empty slice. So:

# input_len == 0 (normal case for VLM with preprocessed tokens)
seq += resp.input_tokens[-0:]  + resp.output_tokens
#      ^^^^^^^^^^^^^^^^^^^^^^^^
#      == resp.input_tokens[:]  == ALL input tokens (Python quirk)
#
# seq was [input₁..inputₙ], now becomes [input₁..inputₙ, input₁..inputₙ, out₁..outₘ]
#                                         ← original →    ← DUPLICATED →

Meanwhile logprobs, loss_mask, versions only grow by output_len, so len(seq) != len(logprobs) after the first iteration — which should cause a shape mismatch at tensor construction (lines 359-361).

The assertion on line 289 doesn't catch this because resp.input_tokens[:-0] also returns the full list, making the comparison trivially true.

Two possible fixes depending on the intended design:

(A) Match MultiTurnWorkflow exactly — start with empty seq:

seq, logprobs, loss_mask, versions = [], [], [], []
# ... and inside the loop:
input_len = len(resp.input_tokens) - len(seq)
# (rest stays the same, but add padding for input tokens)
logprobs += [0.0] * input_len + resp.output_logprobs
loss_mask += [0] * input_len + [1] * len(resp.output_tokens)
versions += [-1] * input_len + resp.output_versions

(B) Keep pre-populated seq, guard against input_len == 0:

if input_len > 0:
    assert resp.input_tokens[:-input_len] == seq
    seq += resp.input_tokens[-input_len:]
    logprobs += [0.0] * input_len
    loss_mask += [0] * input_len
    versions += [-1] * input_len
seq += resp.output_tokens
logprobs += resp.output_logprobs
loss_mask += [1] * len(resp.output_tokens)
versions += resp.output_versions

Option A is simpler and follows the battle-tested convention. What do you think?
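
A toy run of option (B), with fabricated token lists standing in for the engine response (placeholders, not the real `ModelResponse` fields), confirms the parallel lists stay aligned when `input_len == 0`:

```python
# Toy check: the guarded update (option B) keeps seq / logprobs /
# loss_mask / versions the same length even when the engine echoes
# back exactly the tokens it was sent (input_len == 0).
seq = [1, 2, 3]                       # pre-populated prompt tokens
logprobs = [0.0] * len(seq)           # bookkeeping padded for the prompt
loss_mask = [0] * len(seq)
versions = [-1] * len(seq)

input_tokens = [1, 2, 3]              # engine echoes the prompt unchanged
output_tokens = [7, 8]                # two generated tokens

input_len = len(input_tokens) - len(seq)   # == 0
if input_len > 0:                          # guard avoids the [-0:] trap
    assert input_tokens[:-input_len] == seq
    seq += input_tokens[-input_len:]
    logprobs += [0.0] * input_len
    loss_mask += [0] * input_len
    versions += [-1] * input_len
seq += output_tokens
logprobs += [0.0] * len(output_tokens)
loss_mask += [1] * len(output_tokens)
versions += [0] * len(output_tokens)

assert len(seq) == len(logprobs) == len(loss_mask) == len(versions) == 5
```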

Collaborator Author

@HwVanICI HwVanICI Mar 3, 2026


Thanks a lot for all of the comments. Unfortunately, the current main branch has bugs in VLM training (in serialization and data sharing), so I can no longer test this with the new changes. I will submit this PR for now to the ascend branch, where VLM training still works.

seq.extend(self.feedback_str_ids)

if messages_chat:
    messages_chat = messages_chat + [self.feedback_str]
Collaborator


🐛 Type mismatch: raw string appended to list of message dicts

On line 327, model_output is correctly structured as a message dict:

model_output = {"role": "assistant", "content": [{"type": "text", "text": output_text}]}
messages_chat = messages_chat + [model_output]  # ✅ list of dicts

But here on line 333, self.feedback_str is a raw Python string (set in __init__ at line 119):

self.feedback_str = "Your answer is either wrong or not parsable to the reward function. Try to answer it again..."

So this appends a plain string to what should be a list of message dicts:

messages_chat = messages_chat + [self.feedback_str]  # ❌ ["str"] not [{"role": ..., "content": ...}]

On the next turn, this corrupted messages_chat flows into ModelRequest.vision_msg_vllm (line 274), which reaches vllm_remote.py:62-63:

for msg in parsed_input:
    if isinstance(msg["content"], list):  # 💥 str["content"] → TypeError

This only triggers when: (1) the dataset provides messages_chat (e.g., geometry3k does at line 156-180), (2) max_turns > 1, and (3) the first turn's reward < 1.0 (so the loop continues). The single-turn case (max_turns=1) or correct-on-first-try case never reaches this line, which might explain why it wasn't caught during testing.

Suggested fix — wrap feedback as a proper message dict, matching the assistant message pattern above:

if messages_chat:
    feedback_msg = {
        "role": "user",
        "content": [{"type": "text", "text": self.feedback_str}],
    }
    messages_chat = messages_chat + [feedback_msg]

This is consistent with how self.feedback_str is used in __init__ (line 120-125) when building feedback_str_ids — it's wrapped in {"role": "user", "content": [...]} there too.

What do you think? Am I reading the flow correctly?
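
The failure mode is easy to reproduce standalone (illustrative snippet mirroring the vllm_remote.py loop, not the actual code):

```python
# Repro: a raw string mixed into a list of message dicts makes
# dict-style indexing raise, just like the vllm_remote.py loop would.
messages = [
    {"role": "assistant", "content": [{"type": "text", "text": "x = 4"}]},
    "Your answer is wrong. Try again.",  # raw string, not a message dict
]
offender = None
for msg in messages:
    try:
        isinstance(msg["content"], list)  # str["content"] raises TypeError
    except TypeError:
        offender = msg
assert offender == "Your answer is wrong. Try again."
```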

Collaborator Author


Thanks a lot for all of the comments. Unfortunately, the current main branch has bugs in VLM training (in serialization and data sharing), so I can no longer test this with the new changes. I will submit this PR for now to the ascend branch, where VLM training still works.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days.

Please add a comment or push new commits to keep it active.

Thank you for your contribution!

@github-actions github-actions Bot added stale and removed stale labels Mar 17, 2026

@github-actions github-actions Bot added the stale label Apr 8, 2026
@garrett4wade garrett4wade self-requested a review as a code owner April 23, 2026 08:02
@github-actions github-actions Bot removed the stale label Apr 24, 2026