A flexible, modular framework based on Stable Baselines3 for training reinforcement learning (RL) and imitation learning (IL) agents across diverse robotic simulation environments and demonstration datasets.
This framework provides a unified interface for training policies using either simulation environments or demonstration datasets. The architecture ensures complete independence between data sources (environments vs datasets) and policy implementations, allowing seamless switching between training paradigms while maintaining consistent data processing pipelines.
Researchers working on:
- Robot learning algorithms
- Multi-task policy learning
- Sim-to-real transfer
- Comparative studies across environments/datasets
Practitioners needing:
- Rapid prototyping of robot policies
- Flexible experimentation with different architectures
- Unified training pipeline across multiple simulators
- Environment-agnostic: Works with Isaac Lab, ManiSkill, Aloha and other custom Gym environments
- Dataset-agnostic: Compatible with LeRobot datasets
- Shared data format ensures policies work seamlessly across sources
- Supervised Learning: Behavior cloning from demonstrations
- On-policy RL: PPO, Recurrent PPO, Transformer PPO (with an efficient rollout buffer implementation)
- Off-policy RL: SAC
- Distributed training with Accelerate (see the sketch after this list)
- Mixed precision training (FP16/BF16)
- Gradient accumulation
- Comprehensive logging (TensorBoard, Weights & Biases)
- Automatic checkpointing and resumption
- Extensive test suite
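
The distributed training listed above goes through Hugging Face Accelerate; mixed precision and gradient accumulation typically ride on the same machinery. The snippet below is a minimal, self-contained sketch of how those pieces usually combine, not this repository's actual training loop; the linear model and MSE loss are placeholders for a real policy and objective:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model/data standing in for a policy and a demonstration dataset.
model = nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
dataloader = DataLoader(TensorDataset(torch.randn(256, 8), torch.randn(256, 2)), batch_size=32)

# bf16 mixed precision + gradients accumulated over 4 batches, distributed-aware.
accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for obs, target_actions in dataloader:
    with accelerator.accumulate(model):   # optimizer steps only every 4th batch
        loss = nn.functional.mse_loss(model(obs), target_actions)
        accelerator.backward(loss)        # handles scaling under mixed precision
        optimizer.step()
        optimizer.zero_grad()
```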
The framework follows a layered architecture that separates concerns:
- Data Sources
  - Environments: Isaac Lab, ManiSkill, Aloha (via SB3 Wrapper), or contribute your own implementation!
  - Datasets: LeRobot demonstrations (via DS Wrapper)
  - Both produce standardized observation/action dictionaries
- Preprocessors
  - Transform source-specific data formats into policy inputs
  - Handle normalization, image processing, sequence formatting
  - Examples: `Gym_2_Mlp`, `Gym_2_Lstm`, `Gym_2_Sac`, `Aloha_2_Lstm`
- Agents
  - Manage training loops (rollout collection, batch optimization)
  - Interface with policies through preprocessors
  - Handle loss computation and gradient updates
  - Examples: `PPO`, `RecurrentPPO`, `TransformerPPO`, `SAC`, `SL`
- Policies
  - Neural network architectures (actor-critic or standalone)
  - Independent of data source or training algorithm
  - Examples: `MlpPolicy`, `LSTMPolicy`, `TransformerPolicy`, `SACPolicy`, `TCNPolicy`
- Entry Points
  - `train.py`: Online RL from simulation
  - `train_off.py`: Offline IL from demonstrations
  - `predict.py`: Policy evaluation and deployment
```
Environment/Dataset → Wrapper → Preprocessor → Policy → Agent → Optimization
                         ↓            ↓           ↓        ↓
                    Standardized  Normalized   Actions    Loss
                       Format       Inputs
```
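
To make the data flow concrete, here is a minimal, self-contained sketch of one pass through these layers. The class names, observation keys, and shapes are illustrative placeholders, not the framework's actual interfaces:

```python
import numpy as np

class DummyPreprocessor:
    """Stand-in for a preprocessor such as Gym_2_Mlp (hypothetical logic)."""

    def preprocess(self, obs):
        # Standardized wrapper output -> normalized policy inputs
        return {k: np.asarray(v, dtype=np.float32) for k, v in obs.items()}

    def postprocess(self, action):
        # Raw policy output -> action range expected by the environment
        return np.clip(action, -1.0, 1.0)

def dummy_policy(inputs):
    # An actor-critic forward pass would go here; return a random action instead.
    return np.random.uniform(-2.0, 2.0, size=(7,))

# One conceptual step through the pipeline diagrammed above:
obs = {"state": np.zeros(7), "image": np.zeros((3, 64, 64))}  # standardized format
pre = DummyPreprocessor()
inputs = pre.preprocess(obs)           # normalized inputs
action = dummy_policy(inputs)          # raw policy output
env_action = pre.postprocess(action)   # what the agent sends to the environment
```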
- Python 3.11+
- CUDA 11.8+ (for GPU training)
```bash
# Clone repository
git clone https://github.com/johnMinelli/stable-baselines3-devkit
cd stable-baselines3-devkit/src

# Install dependencies
pip install -r requirements.txt
```

Train a PPO agent with an MLP policy on Isaac Lab:
```bash
python train.py \
    --task Isaac-Lift-Cube-Franka-v0 \
    --envsim isaaclab \
    --agent custom_ppo_mlp \
    --num_envs 4096 \
    --device cuda \
    --headless
```

Train a Recurrent PPO agent with an LSTM policy:
```bash
python train.py \
    --task Isaac-Velocity-Flat-Anymal-D-v0 \
    --envsim isaaclab \
    --agent custom_ppo_lstm \
    --num_envs 2048 \
    --device cuda \
    --headless
```

Offline IL Training (demonstrations required, e.g. ManiSkill_StackCube-v1)
Train an LSTM policy via behavior cloning:
```bash
python train_off.py \
    --task SL \
    --agent Lerobot/StackCube/lerobot_sl_lstm_cfg \
    --device cuda \
    --n_epochs 200 \
    --batch_size 64
```

Evaluate a trained policy:
```bash
python predict.py \
    --task Isaac-Velocity-Flat-Anymal-D-v0 \
    --envsim isaaclab \
    --agent custom_ppo_mlp \
    --num_envs 1 \
    --val_episodes 100 \
    --device cuda \
    --resume
    # optional: --checkpoint path/to/best_model.zip
```

Policies must inherit from `BasePolicy` and implement the required methods:
```python
from stable_baselines3.common.policies import BasePolicy


class CustomPolicy(BasePolicy):
    def __init__(self, observation_space, action_space, lr_schedule, **kwargs):
        super().__init__(observation_space, action_space, **kwargs)
        # Define networks

    def forward(self, obs):
        # Compute actions, values, log_probs
        return actions, values, log_probs

    def predict_values(self, obs):
        # Compute state values
        return values
```

Register it in the agent's `policy_aliases`:
```python
class PPO(OnPolicyAlgorithm):
    policy_aliases = {
        "MlpPolicy": MlpPolicy,
        "CustomPolicy": CustomPolicy,
    }
```
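
Once registered, the policy can be selected by its alias string. A minimal sketch, assuming the framework's `PPO` keeps the Stable Baselines3-style constructor, with a plain Gymnasium environment used purely for illustration:

```python
import gymnasium as gym

# Illustrative only: the "CustomPolicy" alias resolves through policy_aliases.
env = gym.make("Pendulum-v1")            # stand-in environment
model = PPO("CustomPolicy", env, learning_rate=3e-4)
model.learn(total_timesteps=10_000)
```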
Preprocessors bridge data sources to policies:

```python
from common.preprocessor import Preprocessor


class CustomPreprocessor(Preprocessor):
    def preprocess(self, obs):
        # Transform observations for the policy
        processed = self.normalize_observations(obs)
        # Additional transformations
        return processed

    def postprocess(self, actions):
        # Transform policy outputs for the environment
        return self.unnormalize_actions(actions)
```

To add support for a new environment:

- Create an environment-specific wrapper:
```python
import gymnasium as gym


class NewEnvWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        # Set up standardized observation/action spaces

    def reset(self):
        # Return standardized observations
        ...

    def step(self, action):
        # Return standardized obs, reward, done, info
        ...
```

- Create a preprocessor for the environment:
```python
class NewEnv_2_Policy(Preprocessor):
    # Implement preprocessing logic
    ...
```

- Add a YAML configuration file in `configs/agents/NewEnv/`
- Update the `train.py` imports:

```python
if args_cli.envsim == "newenv":
    import newenv_package
```

Yes, you can! :)
If you use this framework in your research, please cite:
```bibtex
@misc{stable-baselines3-devkit,
  title  = {Stable Baselines3 DevKit},
  author = {Giovanni Minelli},
  year   = {2026},
  url    = {https://github.com/johnMinelli/stable-baselines3-devkit}
}
```

This framework builds upon:

- Stable Baselines3
