Surface Drive reset boundaries as terminated/truncated signals #208

base: 3.0
Conversation
Greptile Summary
Important Files Changed
Confidence score: 4/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User
    participant RL as PuffeRL
    participant VE as VecEnv
    participant CE as C Environment
    participant CK as C Kernel
    User->>RL: step(actions)
    RL->>VE: send(actions)
    VE->>CE: c_step()
    CE->>CE: move_dynamics()
    CE->>CE: compute_agent_metrics()
    CE->>CE: compute rewards
    alt Horizon reached
        CE->>CE: set truncations=1
    end
    alt Early termination
        CE->>CE: set terminals=1
    end
    alt Reset boundary
        CE->>CE: add_log()
        CE->>CE: c_reset()
    end
    VE->>CE: recv()
    CE-->>VE: obs, rewards, terminals, truncations
    VE-->>RL: observations
    RL->>RL: policy.forward_eval(obs)
    RL->>RL: sample_logits()
    RL->>RL: store experience
    alt Batch full
        RL->>CK: compute_puff_advantage()
        CK-->>RL: advantages
        RL->>RL: train policy
    end
    RL-->>User: statistics
```
Greptile found no issues!
Thanks for opening this! This is correct and good. One thing you can do to handle truncation correctly is what Stable Baselines does: add the truncated state's value estimate to the final reward, so you don't have to modify the advantage computation.
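For reference, the Stable Baselines-style trick suggested above can be sketched as follows. This is a minimal sketch; the array names and shapes (`rewards`, `values_next`, `truncations`) are illustrative and are not pufferlib's actual buffer layout:

```python
import numpy as np

def bootstrap_truncated_rewards(rewards, values_next, truncations, gamma=0.99):
    """Fold the bootstrap value into the reward at truncation boundaries.

    At a truncated step the episode was cut off, not finished, so the
    return should still include gamma * V(s_{t+1}). Adding that estimate
    directly to the reward lets the advantage computation treat the step
    as a plain terminal (done=1) without a separate truncation code path.
    """
    rewards = rewards.copy()
    trunc = truncations.astype(bool)
    rewards[trunc] += gamma * values_next[trunc]
    return rewards

# Toy example: one env, 4 steps, truncated at the last step
rewards = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
values_next = np.array([1.0, 1.0, 1.0, 2.0], dtype=np.float32)
truncations = np.array([0, 0, 0, 1], dtype=np.int8)

out = bootstrap_truncated_rewards(rewards, values_next, truncations, gamma=0.99)
# last reward becomes 0.4 + 0.99 * 2.0 = 2.38; earlier steps are unchanged
```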
Quick comment: we have a rule in the repo that LLM-assisted PRs are marked as such, as I've found it helps us understand how to review them. Is that the case for this one?
Thanks for the quick feedback and the great suggestion about handling truncation! I'll look into the Stable Baselines approach. Regarding the LLM assistance: the code changes in this PR were written manually by me. However, I did use Gemini to help draft and polish the PR description to make it clearer. So yes, it falls under the "LLM-assisted" category for the textual part. Thanks for the reminder!
Thanks for the transparency!
btw, it's also fairly common to just re-use the last state as the observation when computing the value function at a truncation. It's obviously not correct, but it's usually fine
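The approximation mentioned above can be sketched like this. `value_fn` and `last_obs` are illustrative stand-ins (not pufferlib's API) for the policy's value head and the final pre-reset observation:

```python
import numpy as np

def truncation_bootstrap_value(value_fn, last_obs):
    """Approximate V(terminal state) with V(last pre-reset observation).

    As noted above, this is not strictly correct -- the value of the
    final observation is not the value of the unseen post-step state --
    but it is a common, usually harmless approximation when the env
    auto-resets and the true terminal observation is unavailable.
    """
    return float(value_fn(last_obs))

# Toy value function standing in for the policy's value head
value_fn = lambda obs: obs.sum() * 0.1
last_obs = np.array([1.0, 2.0, 3.0])
v = truncation_bootstrap_value(value_fn, last_obs)  # approximately 0.6
```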
Updated:
I am currently running local tests to verify these fixes. I will update the PR status once validation is complete.
I have completed a validation. Experiment Details:
Key Observations:
Ready for review!
Additional Comments (1)

- pufferlib/pufferl.py, lines 309-311: style: The truncation bootstrap hack uses `values[..., l-1]` as a proxy for the terminal state value, but this may not be accurate since `l-1` holds the previous step's value, not the value of the actual terminal state before reset. Consider whether this heuristic provides meaningful improvement over no bootstrapping. Have you validated that using the previous step's value as a proxy for the terminal state value actually improves learning compared to no truncation bootstrapping?

4 files reviewed, 1 comment
Amazing! We've assigned two reviewers and are on it
Thanks for the approval! It seems the GitHub Actions workflows are still awaiting approval to run. Could you please trigger them?
We are just trying to figure out whether this makes it into the 2.0 release or goes into 3.0, so we are running experiments on some additional maps
Understood. Thanks for the update!
By the way, is there a rough timeline for the 3.0 release? I'm looking forward to using it to validate my algorithms.
2.0 is out in the next few days and then we'll start planning the timing of 3.0!
We are getting on this now, sorry for the delay |
```c
int agent_idx = env->active_agent_indices[i];
int reached_goal = env->entities[agent_idx].metrics_array[REACHED_GOAL_IDX];
if (reached_goal) {
    env->terminals[i] = 1;
```
This is the one thing we need to fix, I think: in the multi-goal setting the environment does not terminate upon goal reaching.
I'm wrong! This is already handled.
This is a draft PR and has not been validated (no training runs or regression checks yet). It is shared for discussion and as a reference implementation.
Relates to: #207
Background
In `puffer_drive`, the C core can reset episodes at the configured horizon (`episode_length=91`) and can also reset early when `termination_mode=1`. These resets were not propagated to the RL loop as `terminated`/`truncated`, making episode boundaries "invisible" and potentially corrupting bootstrapping/GAE.

Separately, the Python wrapper periodically resamples by recreating envs and previously forced `terminals=1`, which can create synchronized artificial terminal boundaries and amplify PPO diagnostic spikes (KL/clipfrac).

What this PR changes

- C core (`c_step`/`c_reset`):
  - Horizon reached → set `truncations=1`
  - Early termination (`termination_mode=1`) → set `terminals=1`
  - Wire the `truncations` buffer into `env->truncations` so the C core can write it.
- Python wrapper (`Drive`):
  - Expose `truncations` each step/reset.
  - On periodic env recreation, set `truncations=1` (instead of forcing `terminals=1`).
- RL loop (`PuffeRL`):
  - Use `(terminated OR truncated)` as the `done` mask for state resets and for storing `terminals` into the rollout buffer.
  - Store `truncations` separately for potential future use.

Current limitations / Notes
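The learner-side bookkeeping described above can be sketched as follows. This is a minimal illustration with made-up variable names, not the actual `pufferl.py` code:

```python
import numpy as np

def combine_boundaries(terminals, truncations):
    """Build the masks the learner needs from the two env signals.

    - `done` (terminated OR truncated) drives recurrent-state resets and
      is what currently gets stored as `terminals` in the rollout buffer.
    - `truncations` is kept separately so a future advantage kernel can
      bootstrap across truncation boundaries.
    """
    terminals = np.asarray(terminals, dtype=bool)
    truncations = np.asarray(truncations, dtype=bool)
    done = terminals | truncations
    return done, truncations

# step 1: early termination; step 3: horizon truncation
done, trunc = combine_boundaries([0, 1, 0, 0], [0, 0, 0, 1])
# done marks both kinds of boundary; trunc marks only the horizon reset
```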
TODO (truncation distinction & why it's hard)

- Distinguish `terminated` vs `truncated` in advantage/return logic: the advantage kernel (`compute_puff_advantage`) only accepts a single `dones` tensor, so it cannot express "truncate but bootstrap".
- Two-phase `step` (or expose terminal observation/value): the C core calls `c_reset` inside `c_step`, so the observation returned after a truncation boundary is already the next episode's initial state. A fix would expose `terminal_observation`/`terminal_value` (or equivalent) so the learner can bootstrap from the pre-reset terminal state, not the post-reset initial state.
- Separate `done_mask` (cuts the GAE recursion) from `terminated_mask` (controls whether `gamma*V_{t+1}` is zeroed).
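To make the TODO concrete, here is a reference (non-vectorized) GAE loop with the two-mask semantics. This is a sketch of the proposed behavior, not the existing `compute_puff_advantage` kernel, and it assumes a value estimate for the pre-reset state is available at truncation boundaries:

```python
import numpy as np

def gae_with_truncation(rewards, values, terminated, truncated,
                        gamma=0.99, lam=0.95, last_value=0.0):
    """GAE where any boundary cuts the recursion but only `terminated`
    zeroes the bootstrap term gamma * V_{t+1}.

    terminated[t]=1 : true terminal -> V_{t+1} is treated as 0
    truncated[t]=1  : episode cut off -> still bootstrap from V_{t+1},
                      but do not let the advantage recursion cross over
    """
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        v_next = last_value if t == T - 1 else values[t + 1]
        nonterminal = 1.0 - terminated[t]                    # zero bootstrap on terminals
        boundary = 1.0 - max(terminated[t], truncated[t])    # cut recursion on any reset
        delta = rewards[t] + gamma * v_next * nonterminal - values[t]
        gae = delta + gamma * lam * boundary * gae
        adv[t] = gae
    return adv

# Demo: two steps, true terminal at t=1 vs truncation at t=1 (gamma=lam=1)
r = np.array([1.0, 1.0])
v = np.array([0.5, 0.5])
adv_term = gae_with_truncation(r, v, terminated=np.array([0, 1]),
                               truncated=np.array([0, 0]),
                               gamma=1.0, lam=1.0, last_value=5.0)
adv_trunc = gae_with_truncation(r, v, terminated=np.array([0, 0]),
                                truncated=np.array([0, 1]),
                                gamma=1.0, lam=1.0, last_value=5.0)
# The terminal case ignores last_value; the truncated case bootstraps from it.
```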