Conversation
Summary of Changes

Hello @HwVanICI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances AReaL's training framework by introducing robust multi-turn agentic capabilities for Vision-Language Models. It enables VLMs to engage in interactive problem-solving, learn from self-correction, and optimize for efficient responses through novel error recovery and reward discounting mechanisms. This expansion allows for more sophisticated and adaptive VLM training paradigms.
Activity
Code Review
The pull request introduces multi-turn training support for Vision-Language Models (VLMs) in AReaL, enabling agentic interactions with error recovery and reward discounting. The changes include a new workflow class VisionMultiTurnAgenticWorkflow, along with example scripts and configuration files. The implementation appears robust, but there are a few areas for improvement regarding error handling, clarity, and consistency in the new workflow.
```python
    len(seq),
    len(resp.input_tokens[:-input_len]),
)
seq += resp.input_tokens[-input_len:] + resp.output_tokens
```
Picking up on @rchardx's earlier alignment comment — the input_len / assertion / slice pattern you adopted from MultiTurnWorkflow has a subtle precondition that I think might be broken here.
MultiTurnWorkflow initializes seq = [] (empty), which guarantees input_len > 0 on the first iteration. That makes list[-input_len:] slice off just the "new" tokens. Here, seq = input_ids.copy() (non-empty), so when resp.input_tokens matches exactly what was sent — which is the normal case since the engine does req = req.copy() before mutating (remote_inf_engine.py:690) — we get input_len = 0.
The problem is Python's list[-0:] semantics: it returns the entire list, not an empty slice. So:
```python
# input_len == 0 (normal case for VLM with preprocessed tokens)
seq += resp.input_tokens[-0:] + resp.output_tokens
#      ^^^^^^^^^^^^^^^^^^^^^^
#      == resp.input_tokens[:] == ALL input tokens (Python quirk)
#
# seq was [input₁..inputₙ], now becomes [input₁..inputₙ, input₁..inputₙ, out₁..outₘ]
#                                       ← original →     ← DUPLICATED →
```

Meanwhile logprobs, loss_mask, and versions only grow by output_len, so len(seq) != len(logprobs) after the first iteration, which should cause a shape mismatch at tensor construction (lines 359-361).

The assertion on line 289 doesn't catch this, because resp.input_tokens[:-0] also returns the full list, making the comparison trivially true.
Two possible fixes depending on the intended design:
(A) Match MultiTurnWorkflow exactly, starting with an empty seq:

```python
seq, logprobs, loss_mask, versions = [], [], [], []
# ... and inside the loop:
input_len = len(resp.input_tokens) - len(seq)
# (rest stays the same, but add padding for input tokens)
logprobs += [0.0] * input_len + resp.output_logprobs
loss_mask += [0] * input_len + [1] * len(resp.output_tokens)
versions += [-1] * input_len + resp.output_versions
```

(B) Keep the pre-populated seq, but guard against input_len == 0:

```python
if input_len > 0:
    assert resp.input_tokens[:-input_len] == seq
    seq += resp.input_tokens[-input_len:]
    logprobs += [0.0] * input_len
    loss_mask += [0] * input_len
    versions += [-1] * input_len
seq += resp.output_tokens
logprobs += resp.output_logprobs
loss_mask += [1] * len(resp.output_tokens)
versions += resp.output_versions
```

Option A is simpler and follows the battle-tested convention. What do you think?
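For what it's worth, a quick standalone simulation of the Option-B guard across two turns (hypothetical token IDs and logprobs; `append_turn` is a simplified stand-in for the workflow loop, not the actual AReaL code):

```python
def append_turn(seq, logprobs, input_tokens, output_tokens, output_logprobs):
    """Simplified Option-B turn update: guard the suffix slice when input_len == 0."""
    input_len = len(input_tokens) - len(seq)
    if input_len > 0:
        # Only slice and pad when the engine actually added new input tokens.
        assert input_tokens[:-input_len] == seq
        seq += input_tokens[-input_len:]
        logprobs += [0.0] * input_len
    seq += output_tokens
    logprobs += output_logprobs
    return seq, logprobs

seq, logprobs = [1, 2, 3], []  # pre-populated prompt, as in the VLM workflow
# Turn 1: engine echoes exactly what was sent, so input_len == 0.
seq, logprobs = append_turn(seq, logprobs, [1, 2, 3], [7, 8], [-0.1, -0.2])
assert seq == [1, 2, 3, 7, 8]  # no duplicated prompt tokens
# Turn 2: one feedback token [9] appended server-side, so input_len == 1.
seq, logprobs = append_turn(seq, logprobs, [1, 2, 3, 7, 8, 9], [10], [-0.3])
assert seq == [1, 2, 3, 7, 8, 9, 10]
```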
Thanks a lot for all of the comments. Unfortunately, the current main branch has bugs in VLM training (in serialization and data sharing), so I can no longer test this with the new changes. I will submit this PR for now to the ascend branch, where VLM training still works.
```python
seq.extend(self.feedback_str_ids)

if messages_chat:
    messages_chat = messages_chat + [self.feedback_str]
```
🐛 Type mismatch: raw string appended to list of message dicts
On line 327, model_output is correctly structured as a message dict:
```python
model_output = {"role": "assistant", "content": [{"type": "text", "text": output_text}]}
messages_chat = messages_chat + [model_output]  # ✅ list of dicts
```

But here on line 333, self.feedback_str is a raw Python string (set in __init__ at line 119):
```python
self.feedback_str = "Your answer is either wrong or not parsable to the reward function. Try to answer it again..."
```
```python
messages_chat = messages_chat + [self.feedback_str]  # ❌ ["str"] not [{"role": ..., "content": ...}]
```

On the next turn, this corrupted messages_chat flows into ModelRequest.vision_msg_vllm (line 274), which reaches vllm_remote.py:62-63:
```python
for msg in parsed_input:
    if isinstance(msg["content"], list):  # 💥 str["content"] → TypeError
```

This only triggers when: (1) the dataset provides messages_chat (e.g., geometry3k does at lines 156-180), (2) max_turns > 1, and (3) the first turn's reward < 1.0 (so the loop continues). The single-turn case (max_turns=1) or the correct-on-first-try case never reaches this line, which might explain why it wasn't caught during testing.
Suggested fix — wrap feedback as a proper message dict, matching the assistant message pattern above:
```python
if messages_chat:
    feedback_msg = {
        "role": "user",
        "content": [{"type": "text", "text": self.feedback_str}],
    }
    messages_chat = messages_chat + [feedback_msg]
```

This is consistent with how self.feedback_str is used in __init__ (lines 120-125) when building feedback_str_ids; it is wrapped in {"role": "user", "content": [...]} there too.
What do you think? Am I reading the flow correctly?
Thanks a lot for all of the comments. Unfortunately, the current main branch has bugs in VLM training (in serialization and data sharing), so I can no longer test this with the new changes. I will submit this PR for now to the ascend branch, where VLM training still works.
This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days. Please add a comment or push new commits to keep it active. Thank you for your contribution!
Description
Currently, AReaL supports multi-turn training only for LLMs. This PR adds comprehensive support for multi-turn agentic training of Vision-Language Models (VLMs) to AReaL. The implementation enables VLMs to learn from their mistakes through automatic error recovery and retry mechanisms, combined with turn-based reward discounting.
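To make the reward-discounting idea concrete, here is a minimal sketch (the function name and the discount factor of 0.9 are illustrative assumptions, not AReaL's actual API): an answer that becomes correct only after later retry turns earns a geometrically discounted reward, encouraging efficient, early-turn responses.

```python
def discounted_reward(base_reward, turn_index, discount=0.9):
    """Reward for solving on turn `turn_index` (0-based), discounted per extra turn.

    Hypothetical helper: discount factor 0.9 is an illustrative choice.
    """
    return base_reward * (discount ** turn_index)

assert discounted_reward(1.0, 0) == 1.0                 # correct on first try: full reward
assert abs(discounted_reward(1.0, 2) - 0.81) < 1e-9     # two retries: 0.9^2 of the reward
```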
Changes
New Core Workflow
File: areal/workflow/vision_multiturn_agentic.py

A new workflow class VisionMultiTurnAgenticWorkflow.

Key Features:

- Configurable maximum number of turns (max_turns)

New Example

Directory: examples/vlm_multiturn/

Complete training example with:

- vlm_multiturn_grpo.py - Training script with Geometry3K reward function
- vlm_multiturn_grpo.yaml - Full GRPO configuration for multi-turn VLM training
- train_vlm_multiturn.sh - GPU training launcher script
- train_vlm_multiturn_npu.sh - NPU training launcher script
- README.md - Comprehensive documentation and usage guide

Related Issue
Fixes #(issue)
Type of Change

- Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

- Docs build (jb build docs)
- Gemini review (/gemini review)

Breaking Change Details (if applicable):
Additional Context
Need help? Check the Contributing Guide or ask in GitHub Discussions!