Video-Language in, Actions out! No Odometry, Depth, or Map required! This project contains the evaluation code of our RSS 2024 paper:
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation.
Contributors: Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, He Wang
[Paper & Appendices] [Project Page]
Demo video: r2r_results.mp4
To begin, you'll need to set up the Habitat VLN-CE environment. The first step is to install Habitat-sim (version 0.1.7) with Python 3.8. We recommend using Miniconda or Anaconda for easy installation:
conda create -n vlnce_navid python=3.8
conda activate vlnce_navid
conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless
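If you want to confirm the simulator installed correctly, a quick import check (a minimal sketch, not part of the original instructions) is:

# optional: verify habitat-sim can be imported inside the vlnce_navid environment
python -c "import habitat_sim; print('habitat-sim imported successfully')"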
Next, install Habitat-Lab 0.1.7:
mkdir navid_ws && cd navid_ws
git clone --branch v0.1.7 [email protected]:facebookresearch/habitat-lab.git
or
git clone --branch v0.1.7 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
# installs both habitat and habitat_baselines
python -m pip install -r requirements.txt
python -m pip install -r habitat_baselines/rl/requirements.txt
python -m pip install -r habitat_baselines/rl/ddppo/requirements.txt
python setup.py develop --all
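You can optionally verify the Habitat-Lab installation with a quick import check (assuming the editable install above succeeded):

# optional: both packages should be importable after `python setup.py develop --all`
python -c "import habitat, habitat_baselines; print('habitat-lab OK')"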
Finally, install NaVid:
cd ..
git clone [email protected]:jzhzhang/NaVid-VLN-CE.git
or
git clone https://github.com/jzhzhang/NaVid-VLN-CE.git
cd NaVid-VLN-CE
pip install -r requirements.txt
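Optionally, confirm that PyTorch (pulled in via requirements.txt) can see your GPUs; this is a generic check, not an official NaVid script:

# optional: check that PyTorch is installed and CUDA is visible (NaVid inference expects a GPU)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"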
Follow the instructions in the VLN-CE Data Section to set up the scene dataset and the episode datasets. (If the RxR episodes cannot be accessed, you can download them here.) After completing the data preparation, update the data locations in the R2R config file and the RxR config file. An example configuration is shown below; please modify the task config files to match your data layout:
NDTW:
  GT_PATH: data/datasets/R2R_VLNCE_v1-3_preprocessed/{split}/{split}_gt.json.gz

DATASET:
  TYPE: VLN-CE-v1 # for R2R
  SPLIT: val_unseen
  DATA_PATH: data/datasets/R2R_VLNCE_v1-3_preprocessed/{split}/{split}.json.gz # episodes dataset
  SCENES_DIR: data/scene_datasets/ # scene datasets
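A quick way to double-check that the config points at real data (paths below follow the example configuration above; adjust them if your layout differs):

# optional: confirm the scene and episode datasets are where the config expects them
ls data/scene_datasets/
ls data/datasets/R2R_VLNCE_v1-3_preprocessed/val_unseen/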
First, download the pretrained weights of the vision encoder EVA-ViT-G. Then, download the finetuned NaVid model and place both files under the model_zoo directory (see the recommended layout below). The model has been trained on extensive samples from the training splits of the VLN-CE R2R and RxR datasets, following the training strategy of Uni-NaVid.
| Evaluation Benchmark | TL | NE | OS | SR | SPL |
|---|---|---|---|---|---|
| VLN-CE R2R Val. | 10.7 | 5.65 | 49.2 | 41.9 | 36.5 |
| VLN-CE R2R Test | 11.3 | 5.39 | 52 | 45 | 39 |
| VLN-CE RxR Val. | 15.4 | 5.72 | 55.6 | 45.7 | 38.2 |
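Here TL is trajectory length (m), NE is navigation error (m), OS is oracle success rate (%), SR is success rate (%), and SPL is success weighted by path length (%), following the standard VLN-CE metrics.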
We recommend organizing your project directory as follows:
navid_ws
├── habitat-lab
├── NaVid-VLN-CE
│ ├── navid
│ ├── VLN_CE
│ ├── model_zoo
│ │ ├── eva_vit_g.pth
│ │ ├── <navid_weights>
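For example, after downloading the weights you might place them like this (file names are placeholders matching the tree above; use the actual names of the files you downloaded):

# from inside navid_ws: create model_zoo and move the downloaded checkpoints into it
mkdir -p NaVid-VLN-CE/model_zoo
mv eva_vit_g.pth NaVid-VLN-CE/model_zoo/
mv <navid_weights> NaVid-VLN-CE/model_zoo/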
To evaluate the model on multiple GPUs, use the provided eval_navid_vlnce.sh script. Each GPU processes one chunk of the episodes. Before running the script, set the environment variables as follows:
CHUNKS=8 # number of GPUs
MODEL_PATH="" # path to the NaVid model weights
CONFIG_PATH="" # path to the task configuration file, see the script for an example
SAVE_PATH="" # directory for results
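For instance, an 8-GPU R2R evaluation might be configured as follows (all values are illustrative placeholders, not files shipped under these exact names):

CHUNKS=8                                      # one chunk of episodes per GPU
MODEL_PATH="model_zoo/<navid_weights>"        # placeholder: downloaded NaVid checkpoint
CONFIG_PATH="<path_to_r2r_task_config>.yaml"  # placeholder: the R2R config edited above
SAVE_PATH="results/r2r_val_unseen"            # placeholder: any writable output directory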
Run the script with:
bash eval_navid_vlnce.sh
Results will be saved in the specified SAVE_PATH, which will include a log directory and a video directory. To monitor the results during the evaluation process, run:
watch -n 1 python analyze_results.py --path YOUR_RESULTS_PATH
To stop the evaluation, use:
bash kill_navid_eval.sh
If you find this work useful for your research, please consider citing:
@article{zhang2024navid,
  title={NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation},
  author={Zhang, Jiazhao and Wang, Kunyu and Xu, Rongtao and Zhou, Gengze and Hong, Yicong and Fang, Xiaomeng and Wu, Qi and Zhang, Zhizheng and Wang, He},
  journal={Robotics: Science and Systems},
  year={2024}
}
Our code is based on LLaMA-VID and VLN-CE.
This is an open-source version of NaVid; some functions have been rewritten to avoid certain license restrictions.
If you have any questions, feel free to email Jiazhao Zhang at [email protected].