This is the official implementation of our work PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations, accepted at IJCAI 2022. A preliminary version was also presented at the ICLR 2022 Workshop on Generalizable Policy Learning (GPL).
Deep Reinforcement Learning (DRL) has been a promising solution to many complex decision-making problems. Nevertheless, its notorious weakness in generalization across environments prevents widespread application of DRL agents in real-world scenarios. Although advances have been made recently, most prior works assume sufficient online interaction with the training environments, which can be costly in practice. To this end, we focus on an offline-training-online-adaptation setting, in which the agent first learns from offline experiences collected in environments with different dynamics and then performs online policy adaptation in environments with new dynamics.
In this paper, we propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation.
- In the offline training phase, the environment representation and the policy representation are learned through contrastive learning and policy recovery, respectively. The representations are further refined by mutual information optimization to make them more decoupled and complete. With the learned representations, a Policy-Dynamics Value Function (PDVF) (Raileanu et al., 2020) network is trained to approximate the values of different combinations of policies and environments.
- In the online adaptation phase, with the environment context inferred from a few experiences collected in the new environment, the policy is optimized by gradient ascent with respect to the PDVF (a minimal sketch of this step is given below).
A conceptual illustration is shown below.
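To make the adaptation step concrete, here is a minimal, hypothetical sketch in PyTorch. The names `pdvf`, `env_embedding`, and `init_policy_embedding` are illustrative assumptions and do not correspond to identifiers in this repository; see main_train.py for the actual implementation.

```python
import torch

def adapt_policy_embedding(pdvf, env_embedding, init_policy_embedding,
                           lr=0.01, num_steps=50):
    """Gradient ascent on the PDVF value w.r.t. the policy embedding,
    keeping the environment embedding inferred from the new environment fixed."""
    for p in pdvf.parameters():            # freeze the value network
        p.requires_grad_(False)
    z_pi = init_policy_embedding.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z_pi], lr=lr)
    for _ in range(num_steps):
        value = pdvf(env_embedding, z_pi)  # predicted return V(z_env, z_pi)
        loss = -value.mean()               # negate so the optimizer ascends
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z_pi.detach()
```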
- ppo: implementation of PPO [Schulman et al., 2017], used to learn the policies of the training environments.
- myant, myswimmer, myspaceship: environments used in the experiments, from PDVF [Raileanu et al., 2020].
- env_utils.py: utilities for interacting with the environments and collecting data.
- pandr_storage.py: data storage.
- pdvf_networks.py: network architectures.
- train_utils.py: model optimizers, saving, and loading.
- pandr_arguments.py: parameter settings.
- main_train.py: training of the encoder networks, training of the value function network, policy optimization, and testing.
We recommend installing Anaconda or venv for convenient management of different Python environments.
- Python
- numpy
- gym
- mujoco-py
- baselines 0.1.6
- torch 1.7.1
cd myant
pip install -e .
cd ../myswimmer
pip install -e .
cd ../myspaceship
pip install -e .
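After installing the three packages, a quick sanity check such as the following (illustrative only, not part of the repository) can confirm that the core dependencies are importable:

```python
# Illustrative dependency check; versions other than those listed above may also work.
import numpy
import gym
import mujoco_py  # requires a working MuJoCo installation
import torch

print("numpy", numpy.__version__)
print("gym", gym.__version__)
print("torch", torch.__version__)  # 1.7.1 is the version listed above
```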
For offline experiences, we first train a PPO policy for each training environment. Each of the commands below needs to be run for --default-ind in [0, ..., 14] (15 training environments). The obtained policies are the training policies that serve as collectors of the offline data.
For each domain, 50 episodes of interaction are collected for each combination of training environment and training policy, e.g., 15 × 15 × 50 episodes of experience are collected in total for Ant-Wind.
python ppo/ppo_main.py \
--env-name myant-v0 --default-ind 0 --seed 0
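Since the command has to be repeated for every value of --default-ind, a small driver script along the following lines (hypothetical, not included in the repository) can launch all 15 runs in sequence:

```python
# Hypothetical driver: train one PPO policy per training environment.
import subprocess

for ind in range(15):  # --default-ind in [0, ..., 14]
    subprocess.run(
        ["python", "ppo/ppo_main.py",
         "--env-name", "myant-v0",
         "--default-ind", str(ind),
         "--seed", "0"],
        check=True,  # abort if any run fails
    )
```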
After the training policies are obtained, run the main training script:

python main_train.py \
    --env-name myant-v0 --op-lr 0.01 \
    --num-t-policy-embed 50 --num-t-env-embed 1 --gd-iter 50 \
    --norm-reward --min-reward -200 --max-reward 1000 \
    --club-lambda 1000 --mi-lambda 1
We refer the user to our paper for complete details of hyperparameter settings and design choices.
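As a rough illustration of how the --club-lambda and --mi-lambda flags weight the mutual-information terms, consider the sketch below. This is only an assumed, schematic form, not the exact objective implemented in main_train.py.

```python
# Assumed sketch of weighting the representation-learning objective;
# the actual losses and MI estimators are defined in the paper and main_train.py.
def combined_representation_loss(contrastive_loss, recovery_loss,
                                 mi_lower_bound, club_upper_bound,
                                 mi_lambda=1.0, club_lambda=1000.0):
    """Maximize an MI lower bound so each representation stays complete, and
    minimize a CLUB-style MI upper bound between the two representations so
    they stay decoupled."""
    return (contrastive_loss + recovery_loss
            - mi_lambda * mi_lower_bound
            + club_lambda * club_upper_bound)
```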
- Tidy up redundant code
If this repository has helped your research, please cite the following:
@inproceedings{sang2022pandr,
author = {Tong Sang and
Hongyao Tang and
Yi Ma and
Jianye Hao and
Yan Zheng and
Zhaopeng Meng and
Boyan Li and
Zhen Wang},
title = {PAnDR: Fast Adaptation to New Environments from Offline Experiences
via Decoupling Policy and Environment Representations},
booktitle = {Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, {IJCAI} 2022},
pages = {TBD},
year = {2022},
url = {https://arxiv.org/abs/2204.02877}
}