
Add MaxRL mean normalization over advantages #1126

Merged
SumanthRH merged 13 commits into NovaSky-AI:main from tamoghnokandar:main on Mar 29, 2026
Conversation

tamoghnokandar (Contributor) commented Feb 15, 2026

Fixes #1030



devin-ai-integration[bot] (Contributor) left a comment (marked as resolved)

Devin Review found 2 new potential issues (6 additional findings in Devin Review).

        raise ValueError(f"no score in prompt index: {idx}")
    for i in range(bsz):
        if len(id2score[index[i]]) > 1:
            scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)
devin-ai-integration[bot] (Contributor) commented Feb 15, 2026

🔴 MAXRL advantage sign is flipped when group mean reward is negative

In compute_maxrl_advantage, the normalization divides by (id2mean[index[i]] + epsilon) at line 1220. When the group's mean reward is negative, this denominator is negative, which flips the sign of the advantage. For example, with group scores [-10, -5] (mean = -7.5): the worse response (-10) gets advantage (-10 - (-7.5)) / (-7.5 + 1e-6) ≈ +0.333 (positive!) and the better response (-5) gets advantage (-5 - (-7.5)) / (-7.5 + 1e-6) ≈ -0.333 (negative!). This causes the RL algorithm to reinforce bad responses and penalize good ones whenever the group mean is negative. The test only uses positive rewards so it doesn't catch this. The fix should use abs(id2mean) in the denominator.
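
A quick standalone sketch of the reported behavior (this is not the repo's compute_maxrl_advantage; the scalar loop and names are illustrative, with epsilon = 1e-6 taken from the arithmetic above):

# Sketch: MAXRL-style normalization on a group whose mean reward is negative.
epsilon = 1e-6
group = [-10.0, -5.0]
mean = sum(group) / len(group)  # -7.5

for score in group:
    adv = (score - mean) / (mean + epsilon)           # current formulation
    adv_abs = (score - mean) / (abs(mean) + epsilon)  # suggested abs() fix
    print(f"score={score}: adv={adv:+.3f}, adv_abs={adv_abs:+.3f}")

# Prints (approximately):
# score=-10.0: adv=+0.333, adv_abs=-0.333  -> current form rewards the worse response
# score=-5.0:  adv=-0.333, adv_abs=+0.333  -> current form penalizes the better response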


Member
This formulation is as per the original MaxRL paper.

tamoghnokandar (Contributor Author)
Should I make the denominator absolute then? I don't think people use negative rewards nowadays anyway.

Comment thread: skyrl/skyrl/backends/skyrl_train/utils/ppo_utils.py (outdated)
SumanthRH (Member) commented Mar 9, 2026

Hi @tamoghnokandar. Thank you for the PR! We recently finished the migration from skyrl-train + skyrl-tx -> skyrl and are just now getting to open PRs!

Is this PR ready for review? Could you merge the latest changes from main if so?

SumanthRH self-assigned this on Mar 9, 2026
tamoghnokandar (Contributor Author) commented Mar 9, 2026

Hi @SumanthRH, I merged the latest changes from main. Ready for review now.

Member

Why was this file modified? Can you revert these changes?


tamoghnokandar (Contributor Author) commented Mar 29, 2026

I think it is already restored? I have made it the same as the current main branch.

Member

Yeah no issues, I reverted the change in this commit: 328943a

    index=index,
)

expected = torch.tensor([1.5 / 4.5, -1.5 / 4.5, -1.5 / 10.5, 1.5 / 10.5]).unsqueeze(-1) * response_mask
Member

Let's make sure expected includes 1e-6 in the denominator.
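
For reference, a minimal sketch of the epsilon-adjusted expectation (the raw group scores [6, 3] and [9, 12] are inferred from the test's means and deviations, not taken from the actual test; epsilon = 1e-6 is assumed to match the implementation):

import torch

epsilon = 1e-6
# Inferred groups: group 1 = [6, 3] (mean 4.5), group 2 = [9, 12] (mean 10.5)
expected = torch.tensor(
    [
        (6.0 - 4.5) / (4.5 + epsilon),
        (3.0 - 4.5) / (4.5 + epsilon),
        (9.0 - 10.5) / (10.5 + epsilon),
        (12.0 - 10.5) / (10.5 + epsilon),
    ]
).unsqueeze(-1)
# expected would then be multiplied by response_mask, as in the original assertion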

SumanthRH (Member) left a comment

Looking good, left a couple minor comments.

Could you add an example script in examples/train/algorithms/maxrl for training on GSM8K with the maxrl advantage estimator?


# Conflicts:
#	tests/backends/skyrl_train/utils/test_ppo_utils.py
#	tests/tx/models/test_models_common.py
tamoghnokandar (Contributor Author)

Done!

Comment thread: examples/train/algorithms/maxrl/run_maxrl_gsm8k.sh
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
SumanthRH merged commit e804071 into NovaSky-AI:main on Mar 29, 2026
5 of 7 checks passed

Development

Successfully merging this pull request may close these issues:

[algorithm] Add MaxRL mean normalization over advantages