Conversation

@tlt18 tlt18 commented Dec 25, 2025

This is a Draft PR and has not been validated (no training runs / regression checks yet). It is shared for discussion and as a reference implementation.

Relates to: #207

Background

In puffer_drive, the C core can reset episodes at the configured horizon (episode_length=91) and can also reset early when termination_mode=1. These resets were not propagated to the RL loop as terminated/truncated, making episode boundaries “invisible” and potentially corrupting bootstrapping/GAE.

Separately, the Python wrapper periodically resamples by recreating envs and previously forced terminals=1, which can create synchronized artificial terminal boundaries and amplify PPO diagnostic spikes (KL/clipfrac).

What this PR changes

  • C core: On reset boundary, emit signals before calling c_reset:
    • Horizon/time-limit reset → set truncations=1
    • Early termination (e.g., termination_mode=1) → set terminals=1
  • Binding: Wire the NumPy truncations buffer into env->truncations so the C core can write it.
  • Python wrapper (Drive):
    • Clear truncations each step/reset.
    • Treat map resampling boundary as truncations=1 (instead of forcing terminals=1).
  • RL loop (PuffeRL):
    • Use (terminated OR truncated) as the done mask for state resets and for storing terminals into the rollout buffer.
    • Also store raw truncations separately for potential future use (a minimal sketch of this handling follows the list).
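
A minimal sketch of the intended RL-loop handling (illustrative only; terminals_np, truncations_np, and the buffer/LSTM names below are placeholders, not the exact pufferl.py attributes):

    import numpy as np

    def merge_done_signals(terminals_np, truncations_np):
        # Combine terminated/truncated into the mask used for recurrent-state
        # resets and for the rollout buffer's "dones" slot.
        terminated = np.asarray(terminals_np, dtype=bool)
        truncated = np.asarray(truncations_np, dtype=bool)
        return terminated | truncated, terminated, truncated

    # Usage inside the rollout loop (buffer/lstm names are hypothetical):
    # done, terminated, truncated = merge_done_signals(terminals_np, truncations_np)
    # buffer.dones[step] = done              # what the current GAE kernel consumes
    # buffer.truncations[step] = truncated   # kept separately for a future two-mask kernel
    # lstm_h[:, done] = 0.0                  # reset hidden state at every episode boundary
    # lstm_c[:, done] = 0.0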

Current limitations / Notes

  • This PR improves correctness by making reset boundaries visible to the learner and preventing advantages from bootstrapping across hidden resets.
  • However, truncation is currently treated the same as done in advantage computation (no time-limit bootstrap distinction yet).

TODO (truncation distinction & why it’s hard)

  • Separate terminated vs truncated in advantage/return logic:
    • Desired: on truncation, cut the trajectory, but still allow value bootstrap using the correct terminal state value.
    • Current kernel (compute_puff_advantage) only accepts a single dones tensor, so it cannot express “truncate but bootstrap”.
  • Avoid auto-reset inside step (or expose terminal observation/value):
    • Drive currently calls c_reset inside c_step, so the observation returned after a truncation boundary is already the next episode’s initial state.
    • This makes “correct truncation bootstrap” ambiguous/incorrect unless we:
      • stop auto-reset in C and let the wrapper call reset, OR
      • provide terminal_observation / terminal_value (or equivalent) so the learner can bootstrap from the pre-reset terminal state, not the post-reset initial state.
  • Update PPO/GAE implementation to consume two masks:
    • E.g., done_mask (cuts GAE recursion) vs terminated_mask (controls whether gamma*V_{t+1} is zeroed).
    • Would require updating the CPU/CUDA kernels and the Python plumbing (a rough sketch of the two-mask pass follows this list).
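
For concreteness, a rough single-environment NumPy sketch of such a two-mask GAE pass (illustrative only, not the existing compute_puff_advantage kernel; a real version would also need the pre-reset terminal value at truncations, per the auto-reset caveat above):

    import numpy as np

    def gae_two_masks(rewards, values, next_value, done, terminated, gamma=0.99, lam=0.95):
        # done cuts the recursion at any boundary (terminated OR truncated);
        # terminated alone zeroes the bootstrap term gamma * V(s_{t+1}),
        # so a pure truncation still bootstraps from the next value.
        # Caveat: with auto-reset, values[t+1] at a truncation is the value of the
        # post-reset state; a correct version needs the pre-reset terminal value.
        T = len(rewards)
        adv = np.zeros(T, dtype=np.float32)
        last_gae = 0.0
        for t in reversed(range(T)):
            v_next = next_value if t == T - 1 else values[t + 1]
            delta = rewards[t] + gamma * v_next * (1.0 - terminated[t]) - values[t]
            last_gae = delta + gamma * lam * (1.0 - done[t]) * last_gae
            adv[t] = last_gae
        return adv, adv + values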

greptile-apps bot commented Dec 25, 2025

Greptile Summary

  • Implements proper episode boundary signaling in PufferDrive by surfacing C core reset signals as terminated/truncated flags to the RL training loop, addressing an issue where episode boundaries were "invisible" and could corrupt GAE/bootstrapping calculations
  • Modifies the C environment binding to wire through truncation buffers, updates the drive environment to clear truncation arrays and treat map resampling as truncations, and changes the RL loop to use combined done masks while storing separate terminated/truncated signals
  • Adds a truncation bootstrap hack in the RL loop that attempts to preserve value bootstrapping on truncation steps by adding a heuristic gamma*V term to rewards when auto-reset occurs

Important Files Changed

  • pufferlib/ocean/env_binding.h: uncommented two critical lines that wire the NumPy truncations buffer to the C environment structure, enabling the C core to write truncation signals
  • pufferlib/ocean/drive/drive.h: added a truncations buffer to the Drive struct and implemented episode boundary detection logic that signals time-limit resets as truncations
  • pufferlib/ocean/drive/drive.py: added truncation array clearing on each step/reset and changed map resampling to signal truncations instead of artificial terminals
  • pufferlib/pufferl.py: modified the RL loop to create a combined done mask from terminated+truncated flags and added a truncation bootstrap hack for auto-reset environments

Confidence score: 4/5

  • This PR addresses a fundamental RL correctness issue with proper episode boundary signaling but includes a workaround hack that indicates incomplete solution
  • Score reflects the complexity of changes across C/Python boundary and the acknowledged limitations in the PR description regarding truncation bootstrap handling
  • Pay close attention to the truncation bootstrap hack in pufferlib/pufferl.py which uses heuristic values that may not represent true terminal state values

Sequence Diagram

sequenceDiagram
    participant User as User
    participant RL as PuffeRL
    participant VE as VecEnv
    participant CE as C Environment
    participant CK as C Kernel

    User->>RL: "step(actions)"
    RL->>VE: "send(actions)"
    VE->>CE: "c_step()"
    
    CE->>CE: "move_dynamics()"
    CE->>CE: "compute_agent_metrics()"
    CE->>CE: "compute rewards"
    
    alt "Horizon reached"
        CE->>CE: "set truncations=1"
    end
    
    alt "Early termination"
        CE->>CE: "set terminals=1"  
    end
    
    alt "Reset boundary"
        CE->>CE: "add_log()"
        CE->>CE: "c_reset()"
    end
    
    VE->>CE: "recv()"
    CE-->>VE: "obs, rewards, terminals, truncations"
    VE-->>RL: "observations"
    
    RL->>RL: "policy.forward_eval(obs)"
    RL->>RL: "sample_logits()"
    RL->>RL: "store experience"
    
    alt "Batch full"
        RL->>CK: "compute_puff_advantage()"
        CK-->>RL: "advantages"
        RL->>RL: "train policy"
    end
    
    RL-->>User: "statistics"

greptile-apps bot commented Dec 25, 2025

Greptile found no issues!


eugenevinitsky commented Dec 25, 2025

Thanks for opening this! This is correct and good. One thing you can do to get truncation handled correctly is what Stable Baselines does: just add the value at truncation to the final reward, so that you don't have to mess with the advantage computation.
Are you intending to complete the open issues in the PR? Just so we know how to review this.
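
For reference, Stable Baselines3 does this in its on-policy rollout collection roughly as sketched below (paraphrased, not the pufferlib API; it relies on the Gym TimeLimit wrapper putting terminal_observation and TimeLimit.truncated into info, which Drive would have to emulate):

    import torch

    def bootstrap_truncated_rewards(rewards, dones, infos, policy, gamma):
        # Paraphrased from SB3's OnPolicyAlgorithm.collect_rollouts: fold
        # gamma * V(terminal_observation) into the last reward of a truncated
        # episode so the advantage kernel never needs a separate truncation mask.
        for idx, done in enumerate(dones):
            if done and infos[idx].get("TimeLimit.truncated", False):
                terminal_obs, _ = policy.obs_to_tensor(infos[idx]["terminal_observation"])
                with torch.no_grad():
                    terminal_value = policy.predict_values(terminal_obs)
                rewards[idx] += gamma * float(terminal_value)
        return rewards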

@eugenevinitsky

Quick comment, we have a rule in the repo that LLM-assisted PRs are marked as LLM-assisted PRs as I've found it helps us understand how to review it. Is that the case for this one?

tlt18 commented Dec 25, 2025

Quick comment, we have a rule in the repo that LLM-assisted PRs are marked as LLM-assisted PRs as I've found it helps us understand how to review it. Is that the case for this one?

Thanks for the quick feedback and the great suggestion about handling truncation! I'll look into the Stable Baselines approach.

Regarding the LLM assistance: The code changes in this PR were written manually by me. However, I did use Gemini to help draft and polish the PR description to make it clearer. So yes, it falls under the "LLM-assisted" category for the textual part. Thanks for the reminder!

@eugenevinitsky

Thanks for the transparency!

@eugenevinitsky

btw, it's also fairly common to just re-use the last state as the observation when computing the value at truncation. It's obviously not correct, but it's usuuuuually fine

WaelDLZ and others added 4 commits December 25, 2025 23:08
* Added WOSAC results on the 10k validation dataset

* Code to evaluate SMART + associated doc

* Edits.

* Add link to docs.

---------

Co-authored-by: Wael Boumediene Doulazmi <wbd2016@gl001.hpc.nyu.edu>
Co-authored-by: Daphne Cornelisse <cor.daphne@gmail.com>
@tlt18 tlt18 marked this pull request as draft December 26, 2025 05:48
tlt18 commented Dec 26, 2025

Updated:

  • Individual respawns are now terminal.

  • Global early-termination is truncated.

  • Implemented the reward hack using values[l-1] (the previous frame) to approximate the pre-reset state value, as discussed (see the sketch below).

  • Moved C-reset to the end of step to catch the final reward.
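
A rough sketch of that reward hack, assuming a rollout buffer shaped [num_envs, horizon], per-step value predictions in values, and l indexing the current step (names are illustrative, not the exact pufferl.py code):

    import numpy as np

    def truncation_bootstrap_hack(rewards, values, truncations, l, gamma):
        # Drive auto-resets inside c_step, so the observation stored at step l
        # after a truncation already belongs to the next episode and the true
        # pre-reset terminal value is unavailable. Approximate it with the value
        # predicted one frame earlier, values[:, l-1].
        if l == 0:
            return rewards
        trunc = truncations[:, l].astype(bool)
        rewards[:, l] = np.where(trunc, rewards[:, l] + gamma * values[:, l - 1], rewards[:, l])
        return rewards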

I am currently running local tests to verify these fixes. I will update the PR status once validation is complete.

tlt18 commented Dec 26, 2025

I have completed a validation run.

Experiment Details:

wandb reports

Key Observations:

  • Performance Improvement: The PR demonstrates significantly faster convergence and higher final performance across key metrics, including environment/score, completion_rate, and speed_at_goal. This suggests the agent is now correctly learning from the final transition data that was previously lost or miscalculated.

  • Training Stability: The training dynamics are noticeably more stable. The spikes (noise) in losses/value_loss and losses/policy_loss have been reduced compared to master, likely due to the corrected bootstrapping logic.

  • Note: losses/entropy still exhibits some high-frequency noise/spikes, which is consistent with the baseline behavior.

Ready for review!

@tlt18 tlt18 changed the title from "Draft PR: Surface Drive reset boundaries as terminated/truncated signals (unverified)" to "Surface Drive reset boundaries as terminated/truncated signals" Dec 26, 2025
@tlt18 tlt18 marked this pull request as ready for review December 26, 2025 12:03

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. pufferlib/pufferl.py, lines 309-311

    style: The truncation bootstrap hack uses values[..., l-1] as a proxy for terminal state value, but this may not be accurate since l-1 represents the previous step's value, not the actual terminal state value before reset. Consider whether this heuristic provides meaningful improvement over no bootstrapping. Have you validated that using the previous step's value as a proxy for terminal state value actually improves learning compared to no truncation bootstrapping?


4 files reviewed, 1 comment


@eugenevinitsky

Amazing! We've assigned two reviewers and are on it

tlt18 commented Dec 28, 2025

Thanks for the approval! It seems like the GitHub Actions workflows are still awaiting approval to run. Could you please help to trigger them?

@eugenevinitsky

We are just trying to figure out if this makes it into the 2.0 release right now or goes into 3.0 so we are running experiments on some additional maps

tlt18 commented Dec 28, 2025

Understood. Thanks for the update!

tlt18 commented Dec 28, 2025

By the way, is there a rough timeline for the 3.0 release? I'm looking forward to using it to validate my algorithms.

@eugenevinitsky

2.0 is out in the next few days and then we'll start planning the timing of 3.0!

@eugenevinitsky

We are getting on this now, sorry for the delay

@eugenevinitsky eugenevinitsky changed the base branch from main to 3.0-temp January 3, 2026 22:56
int agent_idx = env->active_agent_indices[i];
int reached_goal = env->entities[agent_idx].metrics_array[REACHED_GOAL_IDX];
if (reached_goal) {
    env->terminals[i] = 1;


This is the one thing we need to fix, I think: in the multi-goal setting the environment does not terminate upon goal reaching.


I'm wrong! This is already handled.
