To pre-train ViT-Large (recommended default) with multi-node distributed training, run the following on 8 nodes with 8 GPUs each:
python submitit_pretrain.py \
--job_dir ${JOB_DIR} \
--nodes 8 \
--batch_size 64 \
--model mae_vit_large_patch16 \
--norm_pix_loss \
--mask_ratio 0.75 \
--epochs 800 \
--warmup_epochs 40 \
--blr 1.5e-4 --weight_decay 0.05 \
--data_path ${IMAGENET_DIR}
- Here the effective batch size is 64 (
batch_size
per gpu) * 8 (nodes
) * 8 (gpus per node) = 4096. If memory or # gpus is limited, use--accum_iter
to maintain the effective batch size, which isbatch_size
(per gpu) *nodes
* 8 (gpus per node) *accum_iter
. blr
is the base learning rate. The actuallr
is computed by the linear scaling rule:lr
=blr
* effective batch size / 256.- Here we use
--norm_pix_loss
as the target for better representation learning. To train a baseline model (e.g., for visualization), use pixel-based construction and turn off--norm_pix_loss
. - The exact same hyper-parameters and configs (initialization, augmentation, etc.) are used as our TF/TPU implementation. In our sanity checks, this PT/GPU re-implementation can reproduce the TF/TPU results within reasonable random variation. We get 85.5% fine-tuning accuracy by pre-training ViT-Large for 800 epochs (85.4% in paper Table 1d with TF/TPU).
- Training time is ~42h in 64 V100 GPUs (800 epochs).
To train ViT-Base or ViT-Huge, set --model vit_base_patch16
or --model vit_huge_patch14
.