feat(ppo): add optional IcePop-style importance-ratio token filtering #1061
RedDreamer wants to merge 3 commits into inclusionAI:main from
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant enhancement to the PPO trainer by incorporating IcePop-style importance-ratio token filtering. The feature aims to improve the stability and performance of reinforcement-learning training by addressing the common issue of training-inference mismatch, ensuring that only tokens within a specified importance-ratio range contribute to the loss calculation. The change adds new configuration options for fine-tuning this behavior. Highlights
Code Review
This pull request introduces an optional IcePop-style importance-ratio token filtering mechanism. The changes include adding configuration parameters in cli_args.py, integrating these parameters into the PPO actor's logging and loss function (actor.py), and implementing the core filtering logic in a new helper function. The new feature enhances the PPO training process by mitigating training-inference mismatch. Overall, the implementation is clear and follows the described approach, but there are a couple of areas for improvement related to code clarity and potential unintended gradient flow.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Thanks for your PR, however,
Thank you for the contribution! While the IcePop feature is great, there are too many filtering approaches implemented (or about to be) within AReaL, and things are getting unwieldy. We'd better devise a unified and principled approach rather than adding several new options for each type of filtering. For more details, see #1052
This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days. Please add a comment or push new commits to keep it active. Thank you for your contribution!
This feature should have been integrated in #1088 |
Description
Implements IcePop masking based on the method described in "Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model" (https://arxiv.org/pdf/2510.18855).
This change adds token-level discrepancy masking/clipping to mitigate training–inference mismatch during RL training, following the paper's IcePop approach.
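As a minimal sketch of the mechanism (the function names, config parameters, and threshold values below are illustrative assumptions, not the PR's actual API): IcePop-style filtering computes the per-token importance ratio between the training policy and the inference policy, and excludes tokens whose ratio falls outside a configured band so they contribute neither loss nor gradient.

```python
import torch

def icepop_mask(train_logprobs: torch.Tensor,
                infer_logprobs: torch.Tensor,
                alpha: float = 0.5,
                beta: float = 2.0) -> torch.Tensor:
    """Boolean mask keeping tokens whose importance ratio lies in [alpha, beta].

    The ratio pi_train / pi_infer is computed in log space for stability.
    alpha/beta are hypothetical defaults for illustration.
    """
    ratio = torch.exp(train_logprobs - infer_logprobs)
    return (ratio >= alpha) & (ratio <= beta)

def masked_pg_loss(train_logprobs: torch.Tensor,
                   infer_logprobs: torch.Tensor,
                   advantages: torch.Tensor,
                   alpha: float = 0.5,
                   beta: float = 2.0) -> torch.Tensor:
    """Policy-gradient loss averaged only over tokens that pass the filter.

    Multiplying by a constant (non-differentiable) mask zeroes out both the
    loss contribution and the gradient of filtered tokens.
    """
    mask = icepop_mask(train_logprobs, infer_logprobs, alpha, beta).float()
    per_token_loss = -(train_logprobs * advantages)
    # clamp avoids division by zero if every token in the batch is filtered
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```

In practice the band [alpha, beta] would map to the new CLI options mentioned in the review, and the ratio would be computed between the trainer's recomputed log-probs and the ones logged by the inference engine during rollout.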
Type of Change
Checklist
- [ ] `pre-commit run --all-files` passes
- [ ] Documentation builds (`./docs/build_all.sh`)
- [ ] Branch is up to date with `main`
- [ ] `/review-pr` command
- [ ] `/create-pr` command

Breaking Change Details (if applicable):
Additional Context
Need help? Check the Contributing Guide or ask in GitHub Discussions!