Set-up

Dependencies

This codebase has been tested with Python 3.8 and the package versions specified in requirements.txt.

We recommend creating a new conda virtual environment:

conda create -n multimae python=3.8 -y
conda activate multimae

Then, install PyTorch 1.10.0+ and torchvision 0.11.1+. For example:

conda install pytorch=1.10.0 torchvision=0.11.1 -c pytorch -y

Finally, install all other required packages:

pip install timm==0.4.12 einops==0.3.2 pandas==1.3.4 albumentations==1.1.0 wandb==0.12.11
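
To confirm that everything was installed into the active environment, you can run a quick version check from Python (a small sanity check; the printed versions should match the ones pinned above):

import torch, torchvision, timm

# The versions printed here should match the pinned versions above.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("timm:", timm.__version__)
print("CUDA available:", torch.cuda.is_available())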

ℹ️ If data loading and image transforms are the bottleneck, consider replacing Pillow with Pillow-SIMD and compiling it with libjpeg-turbo. You can find a detailed guide on how to do this here or use the provided script:

sh tools/install_pillow_simd.sh
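
After installation, you can check which Pillow build is active; Pillow-SIMD releases typically append a .postN suffix to the version string:

import PIL

# A Pillow-SIMD build typically reports a version like "7.0.0.post3".
print(PIL.__version__)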

Dataset Preparation

Dataset structure

For simplicity and uniformity, all our datasets are structured in the following way:

/path/to/data/
├── train/
│   ├── modality1/
│   │   └── subfolder1/
│   │       ├── img1.ext1
│   │       └── img2.ext1
│   └── modality2/
│       └── subfolder1/
│           ├── img1.ext2
│           └── img2.ext2
└── val/
    ├── modality1/
    │   └── subfolder2/
    │       ├── img3.ext1
    │       └── img4.ext1
    └── modality2/
        └── subfolder2/
            ├── img3.ext2
            └── img4.ext2

The folder structure and filenames should match across modalities. If a dataset does not have specific subfolders, a generic subfolder name can be used instead (e.g., all/).

For most experiments, we use RGB (rgb), depth (depth), and semantic segmentation (semseg) as our modalities.

RGB images are stored as either PNG or JPEG images. Depth maps are stored as either single-channel JPX or single-channel PNG images. Semantic segmentation maps are stored as single-channel PNG images.
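
Since samples are paired by matching subfolders and filenames, it can be useful to verify that the (extension-stripped) filenames line up across modalities before training. Below is a minimal sketch of such a check; the data path and modality names are placeholders for your own setup:

from pathlib import Path

def list_samples(root, modality):
    """Return the set of subfolder/filename identifiers (without extensions) for one modality."""
    base = Path(root) / modality
    return {p.relative_to(base).with_suffix("") for p in base.rglob("*") if p.is_file()}

root = "/path/to/data/train"             # placeholder path
modalities = ["rgb", "depth", "semseg"]  # modalities used in most experiments

reference = list_samples(root, modalities[0])
for m in modalities[1:]:
    missing = reference - list_samples(root, m)
    if missing:
        print(f"{m}: {len(missing)} samples missing, e.g. {sorted(missing)[:3]}")
    else:
        print(f"{m}: all {len(reference)} samples present")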

Datasets

We use the following datasets in our experiments:

To download these datasets, please follow the instructions on their respective pages. To extract semantic classes from NYUv2, follow the data preparation instructions from ShapeConv.

Pseudo labeling networks

We use two off-the-shelf networks to pseudo label the ImageNet-1K dataset.

  • Depth estimation: We use a DPT with a ViT-B-Hybrid backbone pre-trained on the Omnidata dataset. You can find installation instructions and pre-trained weights for this model here.
  • Semantic segmentation: We use a Mask2Former with a Swin-S backbone pre-trained on the COCO dataset. You can find installation instructions and pre-trained weights for this model here.

For an example of how to use these networks for pseudo labeling, please take a look at our Colab notebook.
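
The general recipe is to run each off-the-shelf network over the ImageNet-1K images and save the predictions using the same folder structure and filenames as the RGB inputs. The sketch below illustrates the loop structure only: predict_depth is a hypothetical placeholder rather than part of the DPT or Mask2Former APIs, the paths are examples, and the 16-bit normalization is just one possible convention. The actual model loading and pre-/post-processing follow the respective repositories and our Colab notebook.

from pathlib import Path
import numpy as np
from PIL import Image

def predict_depth(img):
    """Placeholder for the pseudo labeling network (e.g., the Omnidata DPT).
    Should return an HxW float array of depth values."""
    raise NotImplementedError

rgb_dir = Path("/path/to/data/train/rgb")      # placeholder paths
depth_dir = Path("/path/to/data/train/depth")

for rgb_path in rgb_dir.rglob("*.jpg"):        # adjust the pattern to your RGB file extension
    img = Image.open(rgb_path).convert("RGB")
    depth = predict_depth(img)                 # HxW float array
    # Scale to 16-bit and save as a single-channel PNG, mirroring the RGB folder layout.
    depth_u16 = (depth / depth.max() * 65535).astype(np.uint16)
    out_path = (depth_dir / rgb_path.relative_to(rgb_dir)).with_suffix(".png")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    Image.fromarray(depth_u16).save(out_path)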

ℹ️ The MultiMAE pre-training strategy is flexible and can benefit from higher-quality pseudo labels and ground-truth data, so feel free to use different pseudo labeling networks and datasets from the ones we used!