This repository provides the official implementation of Behaviour Policy Optimization (BPO), an off-policy extension of the classic Proximal Policy Optimization (PPO) algorithm with provable variance reduction. Full details can be found in:
Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning
Alexander W. Goodall, Edwin Hamel-De le Court, Francesco Belardinelli
arXiv: https://arxiv.org/abs/2511.10843
Disclaimer: This implementation is simplified from the full version used to collect the results provided in the paper.
Python 3.8+ is required, but we recommend Python 3.10 (later Python versions may not be supported).
- Install conda, e.g., via anaconda.
- Clone the repo:
git clone https://github.com/sacktock/BPO.git
cd BPO
- Create a conda virtual environment:
conda env create --name bpo --file conda-environment.yaml
conda activate bpo
- Install dependencies:
pip install -r requirements.txt
- Run a minimal PPO example (CartPole preset):
python run.py --configs cartpole_ppo
Our implementation relies on JAX for GPU acceleration, which can be enabled on Linux or on Windows via WSL.
- Linux x86_64/aarch64: jax and jaxlib 0.4.30 should already be installed via the requirements.txt. You need to reinstall JAX based on your CUDA driver compatibility. Do not use the -U option here!
pip install "jax[cuda12]"
For CUDA 13+ versions you may need to upgrade the jax and jaxlib installation.
- Windows: GPU acceleration is also supported (experimentally) on Windows WSL x86_64. We strongly recommend using Ubuntu 22.04 or similar. Follow the Linux x86_64/aarch64 instructions above.
- macOS: we recommend JAX with CPU. No further action is required if you correctly followed the earlier steps.
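To confirm which backend JAX picked up after installation, a quick check along the following lines may help. This is a minimal sketch and not part of the repository: the helper name `describe_jax_backend` is hypothetical, and it relies only on `jax.default_backend()` and `jax.devices()`.

```python
def describe_jax_backend():
    """Return (backend, devices) as reported by JAX, or None if JAX is not importable."""
    try:
        import jax
    except ImportError:
        return None
    # default_backend() returns a string such as "cpu" or "gpu".
    backend = jax.default_backend()
    # devices() lists the devices available to the default backend.
    devices = [str(device) for device in jax.devices()]
    return backend, devices


if __name__ == "__main__":
    info = describe_jax_backend()
    if info is None:
        print("JAX is not installed in this environment.")
    else:
        backend, devices = info
        print(f"backend: {backend}")
        print(f"devices: {devices}")
```

If the reported backend is "cpu" on a machine with a CUDA-capable GPU, the CUDA-enabled JAX wheel is probably not installed.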
All experiments are launched from the command line via run.py. Under the hood, run.py loads named configuration blocks from configs.yaml and merges them left -> right in the order you pass them to --configs. Later configs override earlier ones.
The basic pattern is:
python run.py --configs <base_config> [<override_config> ...] [flag overrides...]
- --configs can take one or more names, each corresponding to a top-level key in configs.yaml (e.g., cartpole_ppo, mujoco_ppo_gsde, ppo_bpo_zero, ...).
- After config merging, any remaining CLI arguments are parsed as overrides (e.g., --env.env_id ant, --run.seed 0, etc.).
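The merge-then-override behaviour described above can be illustrated with plain dictionaries. The sketch below is an assumption about the semantics, not the code in run.py: the helper names (`deep_merge`, `merge_configs`, `apply_override`) and the block names and values are hypothetical stand-ins for the contents of configs.yaml.

```python
import copy


def deep_merge(base, override):
    """Return a new dict in which values from `override` win over `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that sibling keys in nested sections are preserved.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def merge_configs(blocks, names):
    """Merge the named blocks left to right; later names override earlier ones."""
    result = {}
    for name in names:
        result = deep_merge(result, blocks[name])
    return result


def apply_override(config, dotted_key, value):
    """Set a nested field such as 'run.seed' in place."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


# Hypothetical stand-in for configs.yaml; names and values are illustrative only.
blocks = {
    "base": {"bpo": False, "ppo": {"clip_rho": 1.5, "lr": 3e-4}},
    "addon": {"bpo": True, "ppo": {"clip_rho": 1.0}},
}
config = merge_configs(blocks, ["base", "addon"])
apply_override(config, "run.seed", 0)
print(config)
```

Here the second block changes `bpo` and `ppo.clip_rho` but leaves `ppo.lr` untouched, and the dotted override is applied last, mirroring the order described for --configs and the flag overrides.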
A minimal PPO run (CartPole preset):
python run.py --configs cartpole_ppo
This uses the cartpole_ppo block (timesteps, env, PPO hyperparameters, etc.).
A MuJoCo PPO run (Ant baseline preset):
python run.py --configs mujoco_ppo --env.env_id ant --run.seed 0 --run.logdir runs/mujoco/ant/ppo_seed_0
mujoco_ppo sets MuJoCo-style PPO defaults; --env.env_id can be set to any of ant, half_cheetah, hopper, walker_2d.
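When running the same preset over several environments and seeds, it can be convenient to generate the command lines programmatically. The following is a small sketch that only builds and prints the commands; the helper `build_command`, the choice of three seeds, and the extension of the log directory pattern to other environments are assumptions for illustration.

```python
import shlex

# Environment ids listed for the MuJoCo preset.
ENV_IDS = ["ant", "half_cheetah", "hopper", "walker_2d"]


def build_command(env_id, seed, config="mujoco_ppo"):
    """Build one run.py invocation as an argument list."""
    # Log directory pattern copied from the Ant example; assumed for other envs.
    logdir = f"runs/mujoco/{env_id}/ppo_seed_{seed}"
    return [
        "python", "run.py",
        "--configs", config,
        "--env.env_id", env_id,
        "--run.seed", str(seed),
        "--run.logdir", logdir,
    ]


# Three seeds per environment; the number of seeds is arbitrary here.
commands = [build_command(env_id, seed) for env_id in ENV_IDS for seed in range(3)]
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.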
Option A -> flip the flag directly
python run.py --configs mujoco_ppo_gsde --bpo True --env.env_id ant --run.seed 0
Option B -> include a BPO config block
For example, ppo_bpo_zero is a small "add-on" config that sets bpo: True and applies BPO-specific settings (e.g., symlog_targets, polyak_tau, and a zero-norm-final Q-head).
python run.py --configs mujoco_ppo_gsde ppo_bpo_zero --env.env_id ant --run.seed 0
The repo includes several ready-made BPO variants that mainly differ in importance-weight clipping (clip_rho, clip_c) and whether trajectory clipping is enabled (clip_traj).
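For readers unfamiliar with importance-weight clipping, the sketch below shows the common convention used in V-trace-style off-policy estimators, where the ratio between target and behaviour policy probabilities is truncated from above by two thresholds. This README does not spell out how clip_rho and clip_c enter the BPO estimator, so treating them as upper truncation levels is an assumption, and the function `truncated_weights` is hypothetical rather than taken from the repository.

```python
import math


def truncated_weights(target_logps, behaviour_logps, clip_rho=1.5, clip_c=1.5):
    """Compute truncated importance weights from per-step log-probabilities."""
    rhos, cs = [], []
    for logp_target, logp_behaviour in zip(target_logps, behaviour_logps):
        # Importance ratio pi(a|s) / mu(a|s), computed in log space.
        ratio = math.exp(logp_target - logp_behaviour)
        # Assumed convention: truncate the ratio from above at each threshold.
        rhos.append(min(clip_rho, ratio))
        cs.append(min(clip_c, ratio))
    return rhos, cs


# Illustrative probabilities only; ratios are roughly 2.0, 0.5 and 1.0.
target = [math.log(0.6), math.log(0.2), math.log(0.9)]
behaviour = [math.log(0.3), math.log(0.4), math.log(0.9)]
rhos, cs = truncated_weights(target, behaviour, clip_rho=1.5, clip_c=1.0)
print(rhos, cs)
```

Under this convention, lower thresholds bound the variance contributed by large ratios at the cost of additional bias, which is the trade-off the presets vary.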
Common presets:
- ppo_bpo_zero -> clip_rho=1.5, clip_c=1.5, plus "zero final norm" Q-head
- ppo_bpo_zero1.0 -> clip_rho=1.0, clip_c=1.0
- ppo_bpo_zero1.0_1.5 -> clip_rho=1.5, clip_c=1.0
- ppo_bpo_zero1.0_1.4 -> clip_rho=1.4, clip_c=1.0
- ppo_bpo_zero1.0_traj -> clip_rho=1.0, clip_c=1.0, clip_traj=True
- ppo_bpo_zero1.5_traj -> clip_rho=1.5, clip_c=1.5, clip_traj=True
So, to run the "rho/c = 1.0/1.0" setting you can do:
python run.py --configs mujoco_ppo_gsde ppo_bpo_zero1.0 --env.env_id ant --run.seed 0
TensorBoard logging is supported by our implementation and is enabled in the following example:
python run.py \
--configs mujoco_ppo_gsde ppo_bpo_zero \
--env.env_id ant \
--tensorboard True \
--logdir "runs/mujoco/ant/ppo_gsde_fqe_zero_seed_0" \
--seed 0
The TensorBoard logs can be accessed via the command line:
tensorboard --logdir runs/mujoco/ant/
The verbose flag controls how much diagnostic information is logged during training. This affects both console output and the set of metrics written to TensorBoard / logs. There are three verbosity levels, with the default set to 0.
The verbosity level can be changed via a command-line override:
--run.verbose <level>
Similarly, you can override any nested field directly from the command line (after configs are merged). For example, to keep ppo_bpo_zero but change the clipping thresholds:
python run.py --configs mujoco_ppo_gsde ppo_bpo_zero \
--env.env_id ant \
--ppo.clip_rho 1.0 --ppo.clip_c 1.0 \
--run.seed 0
Our implementation of BPO is released under the MIT License.