This is the official implementation of our work PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations, accepted at IJCAI 2022. A preliminary version was also presented at the ICLR 2022 Workshop on Generalizable Policy Learning (GPL).
Deep Reinforcement Learning (DRL) has been a promising solution to many complex decision-making problems. Nevertheless, its notorious weakness in generalization across environments prevents widespread application of DRL agents in real-world scenarios. Although advances have been made recently, most prior works assume sufficient online interaction with the training environments, which can be costly in practice. To this end, we focus on an offline-training-online-adaptation setting, in which the agent first learns from offline experiences collected in environments with different dynamics and then performs online policy adaptation in environments with new dynamics.
In this paper, we propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation.
- In the offline training phase, the environment representation and the policy representation are learned through contrastive learning and policy recovery, respectively. The representations are further refined by mutual information optimization to make them more decoupled and complete. With the learned representations, a Policy-Dynamics Value Function (PDVF) (Raileanu et al., 2020) network is trained to approximate the values of different combinations of policies and environments.
- In the online adaptation phase, with the environment context inferred from a few experiences collected in the new environment, the policy is optimized by gradient ascent with respect to the PDVF (a minimal sketch of this step is given below).
A conceptual illustration is shown below.
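To make the adaptation step concrete, here is a minimal, hypothetical sketch in PyTorch. The names `pdvf`, `env_embedding`, and `init_policy_embedding` are illustrative assumptions and do not correspond to identifiers in this repository; see main_train.py for the actual implementation.

```python
import torch

def adapt_policy_embedding(pdvf, env_embedding, init_policy_embedding,
                           lr=0.01, num_steps=50):
    """Gradient ascent on the PDVF value w.r.t. the policy embedding,
    keeping the environment embedding inferred from the new environment fixed."""
    for p in pdvf.parameters():            # freeze the value network
        p.requires_grad_(False)
    z_pi = init_policy_embedding.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z_pi], lr=lr)
    for _ in range(num_steps):
        value = pdvf(env_embedding, z_pi)  # predicted return V(z_env, z_pi)
        loss = -value.mean()               # negate so the optimizer ascends
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z_pi.detach()
```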
- ppo: implementation of PPO [Schulman et al., 2017], used to learn the policies of the training environments.
- myant, myswimmer, myspaceship: environments used in the experiments, from PDVF [Raileanu et al., 2020].
- env_utils.py: utilities for interacting with the environments and collecting data.
- pandr_storage.py: data storage.
- pdvf_networks.py: network architectures.
- train_utils.py: model optimizers, saving, and loading.
- pandr_arguments.py: parameter settings.
- main_train.py: training of the encoder networks, training of the value function network, policy optimization, and testing.
We recommend installing Anaconda or venv for convenient management of different Python environments.
- Python
- numpy
- gym
- mujoco-py
- baselines 0.1.6
- torch 1.7.1
cd myant
pip install -e .
cd ../myswimmer
pip install -e .
cd ../myspaceship
pip install -e .
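After installing the three packages, a quick sanity check such as the following (illustrative only, not part of the repository) can confirm that the core dependencies are importable:

```python
# Illustrative dependency check; versions other than those listed above may also work.
import numpy
import gym
import mujoco_py  # requires a working MuJoCo installation
import torch

print("numpy", numpy.__version__)
print("gym", gym.__version__)
print("torch", torch.__version__)  # 1.7.1 is the version listed above
```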
For offline experiences, we first train a PPO policy for each training environment. Each of the commands below needs to be run for --default-ind in [0, ..., 14] (15 training environments). The obtained policies are the training policies that serve as collectors of the offline data.
For each domain, 50 episodes of interaction are collected for each combination of training environment and training policy, e.g., 15 × 15 × 50 episodes of experience are collected in total for Ant-Wind.
python ppo/ppo_main.py \
--env-name myant-v0 --default-ind 0 --seed 0
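Since the command has to be repeated for every value of --default-ind, a small driver script along the following lines (hypothetical, not included in the repository) can launch all 15 runs in sequence:

```python
# Hypothetical driver: train one PPO policy per training environment.
import subprocess

for ind in range(15):  # --default-ind in [0, ..., 14]
    subprocess.run(
        ["python", "ppo/ppo_main.py",
         "--env-name", "myant-v0",
         "--default-ind", str(ind),
         "--seed", "0"],
        check=True,  # abort if any run fails
    )
```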
After the training policies are obtained, run the main training script:

python main_train.py \
    --env-name myant-v0 --op-lr 0.01 \
    --num-t-policy-embed 50 --num-t-env-embed 1 --gd-iter 50 \
    --norm-reward --min-reward -200 --max-reward 1000 \
    --club-lambda 1000 --mi-lambda 1
We refer the user to our paper for complete details of hyperparameter settings and design choices.
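As a rough illustration of how the --club-lambda and --mi-lambda flags weight the mutual-information terms, consider the sketch below. This is only an assumed, schematic form, not the exact objective implemented in main_train.py.

```python
# Assumed sketch of weighting the representation-learning objective;
# the actual losses and MI estimators are defined in the paper and main_train.py.
def combined_representation_loss(contrastive_loss, recovery_loss,
                                 mi_lower_bound, club_upper_bound,
                                 mi_lambda=1.0, club_lambda=1000.0):
    """Maximize an MI lower bound so each representation stays complete, and
    minimize a CLUB-style MI upper bound between the two representations so
    they stay decoupled."""
    return (contrastive_loss + recovery_loss
            - mi_lambda * mi_lower_bound
            + club_lambda * club_upper_bound)
```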
- Tidy up redundant code
If this repository has helped your research, please cite the following:
@inproceedings{sang2022pandr,
author = {Tong Sang and
Hongyao Tang and
Yi Ma and
Jianye Hao and
Yan Zheng and
Zhaopeng Meng and
Boyan Li and
Zhen Wang},
title = {PAnDR: Fast Adaptation to New Environments from Offline Experiences
via Decoupling Policy and Environment Representations},
booktitle = {Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, {IJCAI} 2022},
pages = {TBD},
year = {2022},
url = {https://arxiv.org/abs/2204.02877}
}