Conversation

@tlt18 tlt18 commented Dec 25, 2025

This is a Draft PR and has not been validated (no training runs / regression checks yet). It is shared for discussion and as a reference implementation.

Relates to: #207

Background

In puffer_drive, the C core can reset episodes at the configured horizon (episode_length=91) and can also reset early when termination_mode=1. These resets were not propagated to the RL loop as terminated/truncated, making episode boundaries “invisible” and potentially corrupting bootstrapping/GAE.

Separately, the Python wrapper periodically resamples by recreating envs and previously forced terminals=1, which can create synchronized artificial terminal boundaries and amplify PPO diagnostic spikes (KL/clipfrac).

What this PR changes

  • C core: On reset boundary, emit signals before calling c_reset:
    • Horizon/time-limit reset → set truncations=1
    • Early termination (e.g., termination_mode=1) → set terminals=1
  • Binding: Wire the NumPy truncations buffer into env->truncations so the C core can write it.
  • Python wrapper (Drive):
    • Clear truncations each step/reset.
    • Treat map resampling boundary as truncations=1 (instead of forcing terminals=1).
  • RL loop (PuffeRL):
    • Use (terminated OR truncated) as the done mask for state resets and for storing terminals into the rollout buffer.
    • Also store raw truncations separately for potential future use (a minimal sketch of this handling follows the list).
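
A minimal sketch of the intended RL-loop handling (illustrative only; terminals_np, truncations_np, and the buffer/LSTM names below are placeholders, not the exact pufferl.py attributes):

    import numpy as np

    def merge_done_signals(terminals_np, truncations_np):
        # Combine terminated/truncated into the mask used for recurrent-state
        # resets and for the rollout buffer's "dones" slot.
        terminated = np.asarray(terminals_np, dtype=bool)
        truncated = np.asarray(truncations_np, dtype=bool)
        return terminated | truncated, terminated, truncated

    # Usage inside the rollout loop (buffer/lstm names are hypothetical):
    # done, terminated, truncated = merge_done_signals(terminals_np, truncations_np)
    # buffer.dones[step] = done              # what the current GAE kernel consumes
    # buffer.truncations[step] = truncated   # kept separately for a future two-mask kernel
    # lstm_h[:, done] = 0.0                  # reset hidden state at every episode boundary
    # lstm_c[:, done] = 0.0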

Current limitations / Notes

  • This PR improves correctness by making reset boundaries visible to the learner and preventing advantages from bootstrapping across hidden resets.
  • However, truncation is currently treated the same as done in advantage computation (no time-limit bootstrap distinction yet).

TODO (truncation distinction & why it’s hard)

  • Separate terminated vs truncated in advantage/return logic:
    • Desired: on truncation, cut the trajectory, but still allow value bootstrap using the correct terminal state value.
    • Current kernel (compute_puff_advantage) only accepts a single dones tensor, so it cannot express “truncate but bootstrap”.
  • Avoid auto-reset inside step (or expose terminal observation/value):
    • Drive currently calls c_reset inside c_step, so the observation returned after a truncation boundary is already the next episode’s initial state.
    • This makes “correct truncation bootstrap” ambiguous/incorrect unless we:
      • stop auto-reset in C and let the wrapper call reset, OR
      • provide terminal_observation / terminal_value (or equivalent) so the learner can bootstrap from the pre-reset terminal state, not the post-reset initial state.
  • Update PPO/GAE implementation to consume two masks:
    • E.g., done_mask (cuts GAE recursion) vs terminated_mask (controls whether gamma*V_{t+1} is zeroed).
    • Would require updating the CPU/CUDA kernels and the Python plumbing (a rough sketch of the two-mask pass follows this list).
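
For concreteness, a rough single-environment NumPy sketch of such a two-mask GAE pass (illustrative only, not the existing compute_puff_advantage kernel; a real version would also need the pre-reset terminal value at truncations, per the auto-reset caveat above):

    import numpy as np

    def gae_two_masks(rewards, values, next_value, done, terminated, gamma=0.99, lam=0.95):
        # done cuts the recursion at any boundary (terminated OR truncated);
        # terminated alone zeroes the bootstrap term gamma * V(s_{t+1}),
        # so a pure truncation still bootstraps from the next value.
        # Caveat: with auto-reset, values[t+1] at a truncation is the value of the
        # post-reset state; a correct version needs the pre-reset terminal value.
        T = len(rewards)
        adv = np.zeros(T, dtype=np.float32)
        last_gae = 0.0
        for t in reversed(range(T)):
            v_next = next_value if t == T - 1 else values[t + 1]
            delta = rewards[t] + gamma * v_next * (1.0 - terminated[t]) - values[t]
            last_gae = delta + gamma * lam * (1.0 - done[t]) * last_gae
            adv[t] = last_gae
        return adv, adv + values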

greptile-apps bot commented Dec 25, 2025

Greptile Summary

  • Implements proper episode boundary signaling in PufferDrive by surfacing C core reset signals as terminated/truncated flags to the RL training loop, addressing an issue where episode boundaries were "invisible" and could corrupt GAE/bootstrapping calculations
  • Modifies the C environment binding to wire through truncation buffers, updates the drive environment to clear truncation arrays and treat map resampling as truncations, and changes the RL loop to use combined done masks while storing separate terminated/truncated signals
  • Adds a truncation bootstrap hack in the RL loop that attempts to preserve value bootstrapping on truncation steps by adding a heuristic gamma*V term to rewards when auto-reset occurs

Important Files Changed

  • pufferlib/ocean/env_binding.h: uncommented two critical lines that wire the NumPy truncations buffer to the C environment structure, enabling the C core to write truncation signals
  • pufferlib/ocean/drive/drive.h: added a truncations buffer to the Drive struct and implemented episode boundary detection logic that signals time-limit resets as truncations
  • pufferlib/ocean/drive/drive.py: added truncation array clearing on each step/reset and changed map resampling to signal truncations instead of artificial terminals
  • pufferlib/pufferl.py: modified the RL loop to create a combined done mask from terminated+truncated flags and added a truncation bootstrap hack for auto-reset environments

Confidence score: 4/5

  • This PR addresses a fundamental RL correctness issue with proper episode boundary signaling but includes a workaround hack that indicates incomplete solution
  • Score reflects the complexity of changes across C/Python boundary and the acknowledged limitations in the PR description regarding truncation bootstrap handling
  • Pay close attention to the truncation bootstrap hack in pufferlib/pufferl.py which uses heuristic values that may not represent true terminal state values

Sequence Diagram

sequenceDiagram
    participant User as User
    participant RL as PuffeRL
    participant VE as VecEnv
    participant CE as C Environment
    participant CK as C Kernel

    User->>RL: "step(actions)"
    RL->>VE: "send(actions)"
    VE->>CE: "c_step()"
    
    CE->>CE: "move_dynamics()"
    CE->>CE: "compute_agent_metrics()"
    CE->>CE: "compute rewards"
    
    alt "Horizon reached"
        CE->>CE: "set truncations=1"
    end
    
    alt "Early termination"
        CE->>CE: "set terminals=1"  
    end
    
    alt "Reset boundary"
        CE->>CE: "add_log()"
        CE->>CE: "c_reset()"
    end
    
    VE->>CE: "recv()"
    CE-->>VE: "obs, rewards, terminals, truncations"
    VE-->>RL: "observations"
    
    RL->>RL: "policy.forward_eval(obs)"
    RL->>RL: "sample_logits()"
    RL->>RL: "store experience"
    
    alt "Batch full"
        RL->>CK: "compute_puff_advantage()"
        CK-->>RL: "advantages"
        RL->>RL: "train policy"
    end
    
    RL-->>User: "statistics"

greptile-apps bot commented Dec 25, 2025

Greptile found no issues!


eugenevinitsky commented Dec 25, 2025

Thanks for opening this! This is correct and good. One thing you can do to get truncation handled correctly is what Stable Baselines does: just add the value at truncation to the final reward, so that you don't have to mess with the advantage computation.
Are you intending to complete the open issues in the PR? Just so we know how to review this.
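
For reference, Stable Baselines3 does this in its on-policy rollout collection roughly as sketched below (paraphrased, not the pufferlib API; it relies on the Gym TimeLimit wrapper putting terminal_observation and TimeLimit.truncated into info, which Drive would have to emulate):

    import torch

    def bootstrap_truncated_rewards(rewards, dones, infos, policy, gamma):
        # Paraphrased from SB3's OnPolicyAlgorithm.collect_rollouts: fold
        # gamma * V(terminal_observation) into the last reward of a truncated
        # episode so the advantage kernel never needs a separate truncation mask.
        for idx, done in enumerate(dones):
            if done and infos[idx].get("TimeLimit.truncated", False):
                terminal_obs, _ = policy.obs_to_tensor(infos[idx]["terminal_observation"])
                with torch.no_grad():
                    terminal_value = policy.predict_values(terminal_obs)
                rewards[idx] += gamma * float(terminal_value)
        return rewards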

@eugenevinitsky

Quick comment, we have a rule in the repo that LLM-assisted PRs are marked as LLM-assisted PRs as I've found it helps us understand how to review it. Is that the case for this one?

tlt18 commented Dec 25, 2025

Quick comment, we have a rule in the repo that LLM-assisted PRs are marked as LLM-assisted PRs as I've found it helps us understand how to review it. Is that the case for this one?

Thanks for the quick feedback and the great suggestion about handling truncation! I'll look into the Stable Baselines approach.

Regarding the LLM assistance: The code changes in this PR were written manually by me. However, I did use Gemini to help draft and polish the PR description to make it clearer. So yes, it falls under the "LLM-assisted" category for the textual part. Thanks for the reminder!

@eugenevinitsky

Thanks for the transparency!

@eugenevinitsky

btw, it's also fairly common to just re-use the last state as the observation when computing the value at truncation. It's obviously not correct, but it's usuuuuually fine

WaelDLZ and others added 4 commits December 25, 2025 23:08
* Added WOSAC results on the 10k validation dataset

* Code to evaluate SMART + associated doc

* Edits.

* Add link to docs.

---------

Co-authored-by: Wael Boumediene Doulazmi <wbd2016@gl001.hpc.nyu.edu>
Co-authored-by: Daphne Cornelisse <cor.daphne@gmail.com>
@tlt18 tlt18 marked this pull request as draft December 26, 2025 05:48
tlt18 commented Dec 26, 2025

Updated:

  • Individual respawns are now terminal.

  • Global early-termination is truncated.

  • Implemented the reward hack using values[l-1] (the previous frame) to approximate the pre-reset state value, as discussed (see the sketch below).

  • Moved C-reset to the end of step to catch the final reward.
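
A rough sketch of that reward hack, assuming a rollout buffer shaped [num_envs, horizon], per-step value predictions in values, and l indexing the current step (names are illustrative, not the exact pufferl.py code):

    import numpy as np

    def truncation_bootstrap_hack(rewards, values, truncations, l, gamma):
        # Drive auto-resets inside c_step, so the observation stored at step l
        # after a truncation already belongs to the next episode and the true
        # pre-reset terminal value is unavailable. Approximate it with the value
        # predicted one frame earlier, values[:, l-1].
        if l == 0:
            return rewards
        trunc = truncations[:, l].astype(bool)
        rewards[:, l] = np.where(trunc, rewards[:, l] + gamma * values[:, l - 1], rewards[:, l])
        return rewards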

I am currently running local tests to verify these fixes. I will update the PR status once validation is complete.

tlt18 commented Dec 26, 2025

I have completed a validation run.

Experiment Details:

wandb reports

Key Observations:

  • Performance Improvement: The PR demonstrates significantly faster convergence and higher final performance across key metrics, including environment/score, completion_rate, and speed_at_goal. This suggests the agent is now correctly learning from the final transition data that was previously lost or miscalculated.

  • Training Stability: The training dynamics are noticeably more stable. The spikes (noise) in losses/value_loss and losses/policy_loss have been reduced compared to master, likely due to the corrected bootstrapping logic.

  • Note: losses/entropy still exhibits some high-frequency noise/spikes, which is consistent with the baseline behavior.

Ready for review!

@tlt18 tlt18 changed the title from "Draft PR: Surface Drive reset boundaries as terminated/truncated signals (unverified)" to "Surface Drive reset boundaries as terminated/truncated signals" Dec 26, 2025
@tlt18 tlt18 marked this pull request as ready for review December 26, 2025 12:03

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. pufferlib/pufferl.py, lines 309-311

    style: The truncation bootstrap hack uses values[..., l-1] as a proxy for terminal state value, but this may not be accurate since l-1 represents the previous step's value, not the actual terminal state value before reset. Consider whether this heuristic provides meaningful improvement over no bootstrapping. Have you validated that using the previous step's value as a proxy for terminal state value actually improves learning compared to no truncation bootstrapping?


4 files reviewed, 1 comment


@eugenevinitsky

Amazing! We've assigned two reviewers and are on it

tlt18 commented Dec 28, 2025

Thanks for the approval! It seems like the GitHub Actions workflows are still awaiting approval to run. Could you please help to trigger them?

@eugenevinitsky

We are just trying to figure out if this makes it into the 2.0 release right now or goes into 3.0 so we are running experiments on some additional maps

tlt18 commented Dec 28, 2025

Understood. Thanks for the update!

tlt18 commented Dec 28, 2025

By the way, is there a rough timeline for the 3.0 release? I'm looking forward to using it to validate my algorithms.

@eugenevinitsky

2.0 is out in the next few days and then we'll start planning the timing of 3.0!

@eugenevinitsky

We are getting on this now, sorry for the delay

@eugenevinitsky eugenevinitsky changed the base branch from main to 3.0-temp January 3, 2026 22:56
int agent_idx = env->active_agent_indices[i];
int reached_goal = env->entities[agent_idx].metrics_array[REACHED_GOAL_IDX];
if (reached_goal) {
    env->terminals[i] = 1;


This is the one thing we need to fix, I think: in the multi-goal setting the environment does not terminate upon goal reaching.


I'm wrong! This is already handled.
