
Add MaxRL mean normalization over advantages #1126

Merged
SumanthRH merged 13 commits into NovaSky-AI:main from tamoghnokandar:main on Mar 29, 2026
Conversation

tamoghnokandar (Contributor) commented Feb 15, 2026

Fixes #1030



devin-ai-integration[bot] (Contributor) left a comment (marked as resolved)

Devin Review found 2 new potential issues (6 additional findings in Devin Review).

        raise ValueError(f"no score in prompt index: {idx}")
    for i in range(bsz):
        if len(id2score[index[i]]) > 1:
            scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)
devin-ai-integration[bot] (Contributor) commented Feb 15, 2026

🔴 MAXRL advantage sign is flipped when group mean reward is negative

In compute_maxrl_advantage, the normalization divides by (id2mean[index[i]] + epsilon) at line 1220. When the group's mean reward is negative, this denominator is negative, which flips the sign of the advantage. For example, with group scores [-10, -5] (mean = -7.5): the worse response (-10) gets advantage (-10 - (-7.5)) / (-7.5 + 1e-6) ≈ +0.333 (positive!) and the better response (-5) gets advantage (-5 - (-7.5)) / (-7.5 + 1e-6) ≈ -0.333 (negative!). This causes the RL algorithm to reinforce bad responses and penalize good ones whenever the group mean is negative. The test only uses positive rewards so it doesn't catch this. The fix should use abs(id2mean) in the denominator.
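
A quick standalone sketch of the reported behavior (this is not the repo's compute_maxrl_advantage; the scalar loop and names are illustrative, with epsilon = 1e-6 taken from the arithmetic above):

# Sketch: MAXRL-style normalization on a group whose mean reward is negative.
epsilon = 1e-6
group = [-10.0, -5.0]
mean = sum(group) / len(group)  # -7.5

for score in group:
    adv = (score - mean) / (mean + epsilon)           # current formulation
    adv_abs = (score - mean) / (abs(mean) + epsilon)  # suggested abs() fix
    print(f"score={score}: adv={adv:+.3f}, adv_abs={adv_abs:+.3f}")

# Prints (approximately):
# score=-10.0: adv=+0.333, adv_abs=-0.333  -> current form rewards the worse response
# score=-5.0:  adv=-0.333, adv_abs=+0.333  -> current form penalizes the better response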


Member
This formulation is as per the original MaxRL paper.

tamoghnokandar (Contributor Author)
Should I make the denominator absolute then? I don't think people use negative rewards nowadays anyway.

Comment thread: skyrl/skyrl/backends/skyrl_train/utils/ppo_utils.py (outdated)
SumanthRH (Member) commented Mar 9, 2026

Hi @tamoghnokandar. Thank you for the PR! We recently finished the migration from skyrl-train + skyrl-tx -> skyrl and are just now getting to open PRs!

Is this PR ready for review? Could you merge the latest changes from main if so?

SumanthRH self-assigned this on Mar 9, 2026
tamoghnokandar (Contributor Author) commented Mar 9, 2026

Hi @SumanthRH, I merged the latest changes from main. Ready for review now.

Member

Why was this file modified? Can you revert these changes?


tamoghnokandar (Contributor Author) commented Mar 29, 2026

I think it is already restored? I have made it the same as the current main branch.

Member

Yeah no issues, I reverted the change in this commit: 328943a

    index=index,
)

expected = torch.tensor([1.5 / 4.5, -1.5 / 4.5, -1.5 / 10.5, 1.5 / 10.5]).unsqueeze(-1) * response_mask
Member

Let's make sure expected includes 1e-6 in the denominator.
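
For reference, a minimal sketch of the epsilon-adjusted expectation (the raw group scores [6, 3] and [9, 12] are inferred from the test's means and deviations, not taken from the actual test; epsilon = 1e-6 is assumed to match the implementation):

import torch

epsilon = 1e-6
# Inferred groups: group 1 = [6, 3] (mean 4.5), group 2 = [9, 12] (mean 10.5)
expected = torch.tensor(
    [
        (6.0 - 4.5) / (4.5 + epsilon),
        (3.0 - 4.5) / (4.5 + epsilon),
        (9.0 - 10.5) / (10.5 + epsilon),
        (12.0 - 10.5) / (10.5 + epsilon),
    ]
).unsqueeze(-1)
# expected would then be multiplied by response_mask, as in the original assertion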

SumanthRH (Member) left a comment

Looking good, left a couple minor comments.

Could you add an example script in examples/train/algorithms/maxrl for training on GSM8K with the maxrl advantage estimator?


# Conflicts:
#	tests/backends/skyrl_train/utils/test_ppo_utils.py
#	tests/tx/models/test_models_common.py
tamoghnokandar (Contributor Author)

Done!

Comment thread: examples/train/algorithms/maxrl/run_maxrl_gsm8k.sh
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
SumanthRH merged commit e804071 into NovaSky-AI:main on Mar 29, 2026
5 of 7 checks passed

Development

Successfully merging this pull request may close these issues:

[algorithm] Add MaxRL mean normalization over advantages