Releases: huggingface/pytorch-image-models

More 3rd party ViT / ViT-hybrid weights

17 Aug 18:45

More weights for 3rd party ViT / ViT-CNN hybrids that needed remapping / re-hosting

EfficientFormer

Rehosted and remapped checkpoints from https://github.com/snap-research/EfficientFormer (originals in Google Drive)

GCViT

Heavily remapped from originals at https://github.com/NVlabs/GCVit due to a from-scratch rewrite of the model code

NOTE: these checkpoints have a non-commercial CC-BY-NC-SA-4.0 license.

v0.6.7 Release

27 Jul 21:12

Minor bug fixes and a few more weights since 0.6.5

  • A few more weights & model defs added:
    • darknetaa53 - 79.8 @ 256, 80.5 @ 288
    • convnext_nano - 80.8 @ 224, 81.5 @ 288
    • cs3sedarknet_l - 81.2 @ 256, 81.8 @ 288
    • cs3darknet_x - 81.8 @ 256, 82.2 @ 288
    • cs3sedarknet_x - 82.2 @ 256, 82.7 @ 288
    • cs3edgenet_x - 82.2 @ 256, 82.7 @ 288
    • cs3se_edgenet_x - 82.8 @ 256, 83.5 @ 320
  • cs3* weights above all trained on TPU w/ bits_and_tpu branch. Thanks to TRC program!
  • Add output_stride=8 and 16 support to ConvNeXt (dilation)
  • Fixed deit3 models not being able to resize pos_emb
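
The output_stride option in the list above trades stride for dilation in the later stages. A minimal sketch of that bookkeeping, assuming a ConvNeXt-style layout with a stride-4 stem followed by three stride-2 downsampling stages (overall stride 32). The function name `plan_strides` and its defaults are hypothetical, and this is not timm's actual code:

```python
def plan_strides(output_stride, stem_stride=4, stage_strides=(1, 2, 2, 2)):
    """Return per-stage (stride, dilation) so the total stride equals output_stride."""
    assert output_stride in (8, 16, 32)
    current = stem_stride  # stride accumulated so far
    dilation = 1
    plan = []
    for stride in stage_strides:
        if current >= output_stride and stride > 1:
            # Stride budget used up: keep resolution, grow the dilation instead.
            dilation *= stride
            stride = 1
        current *= stride
        plan.append((stride, dilation))
    return plan


for os_ in (32, 16, 8):
    print(os_, plan_strides(os_))
```

With output_stride=32 no stage is dilated; with 16 only the last stage is, and with 8 the last two are.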

v0.6.5 Release

10 Jul 23:53

First official release in a long while (since 0.5.4). The full change log since 0.5.4 is below.

July 8, 2022

More models, more fixes

  • Official research models (w/ weights) added:
  • My own models:
    • Small ResNet defs added by request with 1 block repeats for both basic and bottleneck (resnet10 and resnet14)
    • CspNet refactored with dataclass config, simplified CrossStage3 (cs3) option. These are closer to YOLO-v5+ backbone defs.
    • More relative position vit fiddling. Two srelpos (shared relative position) models trained, and a medium w/ class token.
    • Add an alternate downsample mode to EdgeNeXt and train a small model. Better than original small, but not their new USI trained weights.
  • My own model weight results (all ImageNet-1k training)
    • resnet10t - 66.5 @ 176, 68.3 @ 224
    • resnet14t - 71.3 @ 176, 72.3 @ 224
    • resnetaa50 - 80.6 @ 224 , 81.6 @ 288
    • darknet53 - 80.0 @ 256, 80.5 @ 288
    • cs3darknet_m - 77.0 @ 256, 77.6 @ 288
    • cs3darknet_focus_m - 76.7 @ 256, 77.3 @ 288
    • cs3darknet_l - 80.4 @ 256, 80.9 @ 288
    • cs3darknet_focus_l - 80.3 @ 256, 80.9 @ 288
    • vit_srelpos_small_patch16_224 - 81.1 @ 224, 82.1 @ 320
    • vit_srelpos_medium_patch16_224 - 82.3 @ 224, 83.1 @ 320
    • vit_relpos_small_patch16_cls_224 - 82.6 @ 224, 83.6 @ 320
    • edgenext_small_rw - 79.6 @ 224, 80.4 @ 320
  • cs3, darknet, and vit_*relpos weights above all trained on TPU thanks to TRC program! Rest trained on overheating GPUs.
  • Hugging Face Hub support fixes verified, demo notebook TBA
  • Pretrained weights / configs can be loaded externally (ie from local disk) w/ support for head adaptation.
  • Add support to change image extensions scanned by timm datasets/parsers. See (#1274 (comment))
  • Default ConvNeXt LayerNorm impl to use F.layer_norm(x.permute(0, 2, 3, 1), ...).permute(0, 3, 1, 2) via LayerNorm2d in all cases.
    • a bit slower than previous custom impl on some hardware (ie Ampere w/ CL), but overall fewer regressions across wider HW / PyTorch version ranges.
    • previous impl exists as LayerNormExp2d in models/layers/norm.py
  • Numerous bug fixes
  • Currently testing for imminent PyPi 0.6.x release
  • LeViT pretraining of larger models still a WIP, they don't train well / easily without distillation. Time to add distill support (finally)?
  • ImageNet-22k weight training + finetune ongoing, work on multi-weight support (slowly) chugging along (there are a LOT of weights, sigh) ...
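
The LayerNorm2d change in the list above normalizes an NCHW tensor over its channel dimension by permuting to NHWC, applying layer norm over the last dimension, and permuting back. A dependency-free sketch of the same arithmetic for a single image stored as nested lists `[C][H][W]`, leaving out the learnable weight and bias. The helper name `layer_norm_2d` and the `eps` default are assumptions for illustration:

```python
import math


def layer_norm_2d(x, eps=1e-6):
    """Normalize over channels independently at each (h, w) position."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for h in range(H):
        for w in range(W):
            vals = [x[c][h][w] for c in range(C)]
            mean = sum(vals) / C
            # Biased variance (divide by C), as layer norm uses.
            var = sum((v - mean) ** 2 for v in vals) / C
            inv_std = 1.0 / math.sqrt(var + eps)
            for c in range(C):
                out[c][h][w] = (vals[c] - mean) * inv_std
    return out


x = [[[1.0, 2.0]], [[3.0, 2.0]], [[5.0, 2.0]]]  # C=3, H=1, W=2
y = layer_norm_2d(x)
print([round(y[c][0][0], 3) for c in range(3)])
```

Each spatial position ends up with zero mean across channels, which is what the permute-based call computes in one vectorized step.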

May 13, 2022

  • Official Swin-V2 models and weights added from (https://github.com/microsoft/Swin-Transformer). Cleaned up to support torchscript.
  • Some refactoring for existing timm Swin-V2-CR impl, will likely do a bit more to bring parts closer to official and decide whether to merge some aspects.
  • More Vision Transformer relative position / residual post-norm experiments (all trained on TPU thanks to TRC program)
    • vit_relpos_small_patch16_224 - 81.5 @ 224, 82.5 @ 320 -- rel pos, layer scale, no class token, avg pool
    • vit_relpos_medium_patch16_rpn_224 - 82.3 @ 224, 83.1 @ 320 -- rel pos + res-post-norm, no class token, avg pool
    • vit_relpos_medium_patch16_224 - 82.5 @ 224, 83.3 @ 320 -- rel pos, layer scale, no class token, avg pool
    • vit_relpos_base_patch16_gapcls_224 - 82.8 @ 224, 83.9 @ 320 -- rel pos, layer scale, class token, avg pool (by mistake)
  • Bring 512 dim, 8-head 'medium' ViT model variant back to life (after using in a pre DeiT 'small' model for first ViT impl back in 2020)
  • Add ViT relative position support for switching btw existing impl and some additions in official Swin-V2 impl for future trials
  • Sequencer2D impl (https://arxiv.org/abs/2205.01972), added via PR from author (https://github.com/okojoalg)

May 2, 2022

  • Vision Transformer experiments adding Relative Position (Swin-V2 log-coord) (vision_transformer_relpos.py) and Residual Post-Norm branches (from Swin-V2) (vision_transformer*.py)
    • vit_relpos_base_patch32_plus_rpn_256 - 79.5 @ 256, 80.6 @ 320 -- rel pos + extended width + res-post-norm, no class token, avg pool
    • vit_relpos_base_patch16_224 - 82.5 @ 224, 83.6 @ 320 -- rel pos, layer scale, no class token, avg pool
    • vit_base_patch16_rpn_224 - 82.3 @ 224 -- rel pos + res-post-norm, no class token, avg pool
  • Vision Transformer refactor to remove representation layer that was only used in initial vit and rarely used since with newer pretrain (ie How to Train Your ViT)
  • vit_* models support removal of class token, use of global average pool, use of fc_norm (ala beit, mae).
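
The class-token and pooling options in the list above differ only in how the token sequence is reduced to one feature vector. A toy sketch with plain lists, assuming the class token (when present) sits at index 0 and that average pooling skips it. `pool_tokens` and its arguments are hypothetical names, not the timm API:

```python
def pool_tokens(tokens, global_pool="avg", num_prefix_tokens=0):
    """Reduce a [num_tokens][dim] list to a single [dim] feature vector."""
    if global_pool == "token":
        # Classic ViT: use the class token as the image feature.
        return list(tokens[0])
    # Global average pool over the patch tokens only.
    patches = tokens[num_prefix_tokens:]
    dim = len(patches[0])
    return [sum(t[d] for t in patches) / len(patches) for d in range(dim)]


tokens = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]  # class token + 2 patch tokens
print(pool_tokens(tokens, "token"))
print(pool_tokens(tokens, "avg", num_prefix_tokens=1))
print(pool_tokens(tokens[1:], "avg"))  # model with no class token at all
```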

April 22, 2022

  • timm models are now officially supported in fast.ai! Just in time for the new Practical Deep Learning course. timmdocs documentation link updated to timm.fast.ai.
  • Two more model weights added in the TPU trained series. Some In22k pretrain still in progress.
    • seresnext101d_32x8d - 83.69 @ 224, 84.35 @ 288
    • seresnextaa101d_32x8d (anti-aliased w/ AvgPool2d) - 83.85 @ 224, 84.57 @ 288

March 23, 2022

  • Add ParallelBlock and LayerScale option to base vit models to support model configs in Three things everyone should know about ViT
  • convnext_tiny_hnf (head norm first) weights trained with (close to) A2 recipe, 82.2% top-1, could do better with more epochs.

March 21, 2022

  • Merge norm_norm_norm. IMPORTANT this update for a coming 0.6.x release will likely de-stabilize the master branch for a while. Branch 0.5.x or a previous 0.5.x release can be used if stability is required.
  • Significant weights update (all TPU trained) as described in this release
    • regnety_040 - 82.3 @ 224, 82.96 @ 288
    • regnety_064 - 83.0 @ 224, 83.65 @ 288
    • regnety_080 - 83.17 @ 224, 83.86 @ 288
    • regnetv_040 - 82.44 @ 224, 83.18 @ 288 (timm pre-act)
    • regnetv_064 - 83.1 @ 224, 83.71 @ 288 (timm pre-act)
    • regnetz_040 - 83.67 @ 256, 84.25 @ 320
    • regnetz_040h - 83.77 @ 256, 84.5 @ 320 (w/ extra fc in head)
    • resnetv2_50d_gn - 80.8 @ 224, 81.96 @ 288 (pre-act GroupNorm)
    • resnetv2_50d_evos - 80.77 @ 224, 82.04 @ 288 (pre-act EvoNormS)
    • regnetz_c16_evos - 81.9 @ 256, 82.64 @ 320 (EvoNormS)
    • regnetz_d8_evos - 83.42 @ 256, 84.04 @ 320 (EvoNormS)
    • xception41p - 82 @ 299 (timm pre-act)
    • xception65 - 83.17 @ 299
    • xception65p - 83.14 @ 299 (timm pre-act)
    • resnext101_64x4d - 82.46 @ 224, 83.16 @ 288
    • seresnext101_32x8d - 83.57 @ 224, 84.27 @ 288
    • resnetrs200 - 83.85 @ 256, 84.44 @ 320
  • HuggingFace hub support fixed w/ initial groundwork for allowing alternative 'config sources' for pretrained model definitions and weights (generic local file / remote url support soon)
  • SwinTransformer-V2 implementation added. Submitted by Christoph Reich. Training experiments and model changes by myself are ongoing so expect compat breaks.
  • Swin-S3 (AutoFormerV2) models / weights added from https://github.com/microsoft/Cream/tree/main/AutoFormerV2
  • MobileViT models w/ weights adapted from https://github.com/apple/ml-cvnets
  • PoolFormer models w/ weights adapted from https://github.com/sail-sg/poolformer
  • VOLO models w/ weights adapted from https://github.com/sail-sg/volo
  • Significant work experimenting with non-BatchNorm norm layers such as EvoNorm, FilterResponseNorm, GroupNorm, etc
  • Enhance support for alternate norm + act ('NormAct') layers added to a number of models, esp EfficientNet/MobileNetV3, RegNet, and aligned Xception
  • Grouped conv support added to EfficientNet family
  • Add 'group matching' API to all models to allow grouping model parameters for application of 'layer-wise' LR decay, lr scale added to LR scheduler
  • Gradient checkpointing support added to many models
  • forward_head(x, pre_logits=False) fn added to all models to allow separate calls of forward_features + forward_head
  • All vision transformer and vision MLP models updated to return non-pooled / non-token selected features from forward_features, for consistency with CNN models. Token selection or pooling is now applied in forward_head
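
The 'group matching' item in the list above is what makes layer-wise LR decay practical: once parameters are grouped by depth, each group gets a learning-rate scale that shrinks toward the input. A minimal sketch of that scale computation, assuming the common `decay ** (num_layers - layer_id)` rule. The function name and the example parameter names are hypothetical, and timm's own helpers may differ:

```python
def layer_lr_scales(num_layers, decay=0.75):
    """Scale for layer_id in 0..num_layers; the last group (head) keeps 1.0."""
    return [decay ** (num_layers - i) for i in range(num_layers + 1)]


# Hypothetical mapping from parameter name to layer group for a 3-block model.
param_to_layer = {
    "stem.weight": 0,
    "blocks.0.weight": 1,
    "blocks.1.weight": 2,
    "blocks.2.weight": 3,
    "head.weight": 4,
}

base_lr = 1e-3
scales = layer_lr_scales(num_layers=4, decay=0.5)
for name, layer_id in param_to_layer.items():
    print(f"{name}: lr={base_lr * scales[layer_id]:.6f}")
```

In practice each scaled rate would be attached to its own optimizer parameter group, so early layers move less than the head during fine-tuning.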

Feb 2, 2022

  • Chris Hughes posted an exhaustive run through of timm on his blog yesterday. Well worth a read. Getting Started with PyTorch Image Models (timm): A Practitioner’s Guide
  • I'm currently prepping to merge the norm_norm_norm branch back to master (ver 0.6.x) in next week or so.
    • The changes are more extensive than usual and may destabilize and break some model API use (aiming for full backwards compat). So, beware pip install git+https://github.com/rwightman/pytorch-image-models installs!
    • 0.5.x releases and a 0.5.x branch will remain stable with a cherry pick or two until dust clears. Recommend sticking to pypi install for a bit if you want stable.

Swin Transformer V2 (CR) weights and experiments

03 Apr 22:14

This release holds weights for timm's variant of Swin V2 (from @ChristophReich1996 impl, https://github.com/ChristophReich1996/Swin-Transformer-V2)

NOTE: the ns variants of the models have extra norms on the main branch at the end of each stage, which seems to help training. The current small model does not use this, but one that does is currently training. A non-ns tiny will follow soon, along with a comparison. in21k and 1k base models are also in the works...

small checkpoints trained on TPU-VM instances via the TPU-Research Cloud (https://sites.research.google/trc/about/)

  • swin_v2_tiny_ns_224 - 81.80 top-1
  • swin_v2_small_224 - 83.13 top-1
  • swin_v2_small_ns_224 - 83.5 top-1

TPU VM trained weight release w/ PyTorch XLA

18 Mar 22:50
7c67d6a

A wide range of mid-large sized models trained in PyTorch XLA on TPU VM instances. Demonstrating viability of the TPU + PyTorch combo for excellent image model results. All models trained w/ the bits_and_tpu branch of this codebase.

A big thanks to the TPU Research Cloud (https://sites.research.google/trc/about/) for the compute used in these experiments.

This set includes several novel weights, including EvoNorm-S RegNetZ (C/D timm variants) and ResNet-V2 model experiments, as well as custom pre-activation model variants of RegNet-Y (called RegNet-V) and Xception (Xception-P) models.

Many if not all of the included RegNet weights surpass original paper results by a wide margin and remain above other known results (e.g. recent torchvision updates) in ImageNet-1k validation and especially OOD test set / robustness performance and scaling to higher resolutions.

RegNets

  • regnety_040 - 82.3 @ 224, 82.96 @ 288
  • regnety_064 - 83.0 @ 224, 83.65 @ 288
  • regnety_080 - 83.17 @ 224, 83.86 @ 288
  • regnetv_040 - 82.44 @ 224, 83.18 @ 288 (timm pre-act)
  • regnetv_064 - 83.1 @ 224, 83.71 @ 288 (timm pre-act)
  • regnetz_040 - 83.67 @ 256, 84.25 @ 320
  • regnetz_040h - 83.77 @ 256, 84.5 @ 320 (w/ extra fc in head)

Alternative norm layers (no BN!)

  • resnetv2_50d_gn - 80.8 @ 224, 81.96 @ 288 (pre-act GroupNorm)
  • resnetv2_50d_evos - 80.77 @ 224, 82.04 @ 288 (pre-act EvoNormS)
  • regnetz_c16_evos - 81.9 @ 256, 82.64 @ 320 (EvoNormS)
  • regnetz_d8_evos - 83.42 @ 256, 84.04 @ 320 (EvoNormS)

Xception redux

  • xception41p - 82 @ 299 (timm pre-act)
  • xception65 - 83.17 @ 299
  • xception65p - 83.14 @ 299 (timm pre-act)

ResNets (w/ SE and/or NeXT)

  • resnext101_64x4d - 82.46 @ 224, 83.16 @ 288
  • seresnext101_32x8d - 83.57 @ 224, 84.27 @ 288
  • seresnext101d_32x8d - 83.69 @ 224, 84.35 @ 288
  • seresnextaa101d_32x8d - 83.85 @ 224, 84.57 @ 288
  • resnetrs200 - 83.85 @ 256, 84.44 @ 320

Vision transformer experiments -- relpos, residual-post-norm, layer-scale, fc-norm, and GAP

  • vit_relpos_base_patch32_plus_rpn_256 - 79.5 @ 256, 80.6 @ 320 -- rel pos + extended width + res-post-norm, no class token, avg pool
  • vit_relpos_small_patch16_224 - 81.5 @ 224, 82.5 @ 320 -- rel pos, layer scale, no class token, avg pool
  • vit_relpos_medium_patch16_rpn_224 - 82.3 @ 224, 83.1 @ 320 -- rel pos + res-post-norm, no class token, avg pool
  • vit_base_patch16_rpn_224 - 82.3 @ 224 -- rel pos + res-post-norm, no class token, avg pool
  • vit_relpos_medium_patch16_224 - 82.5 @ 224, 83.3 @ 320 -- rel pos, layer scale, no class token, avg pool
  • vit_relpos_base_patch16_224 - 82.5 @ 224, 83.6 @ 320 -- rel pos, layer scale, no class token, avg pool
  • vit_relpos_base_patch16_gapcls_224 - 82.8 @ 224, 83.9 @ 320 -- rel pos, layer scale, class token, avg pool (by mistake)

MobileViT weights

31 Jan 23:25

Pretrained weights for MobileViT and MobileViT-V2 adapted from Apple impl at https://github.com/apple/ml-cvnets

Checkpoints remapped to timm impl of the model with BGR corrected to RGB (for V1).

v0.5.4 - More weights, models. ResNet strikes back, self-attn - convnet hybrids, optimizers and more

17 Jan 05:03
Default conv_mlp to False across the board for ConvNeXt, causing issu…

v0.1-rsb-weights

04 Oct 00:02
b5bf4dc

Weights for ResNet Strikes Back

Paper: https://arxiv.org/abs/2110.00476

More details on weights and hparams to come...

v0.1-attn-weights

04 Sep 00:25

A collection of weights I've trained comparing various types of SE-like (SE, ECA, GC, etc), self-attention (bottleneck, halo, lambda) blocks, and related non-attn baselines.

ResNet-26-T series

  • [2, 2, 2, 2] repeat Bottleneck block ResNet architecture
  • ReLU activations
  • 3 layer stem with 24, 32, 64 chs, max-pool
  • avg pool in shortcut downsample
  • self-attn blocks replace 3x3 in both blocks for last stage, and second block of penultimate stage

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| botnet26t_256 | 79.246 | 20.754 | 94.53 | 5.47 | 12.49 | 256 | 0.95 | bicubic |
| halonet26t | 79.13 | 20.87 | 94.314 | 5.686 | 12.48 | 256 | 0.95 | bicubic |
| lambda_resnet26t | 79.112 | 20.888 | 94.59 | 5.41 | 10.96 | 256 | 0.94 | bicubic |
| lambda_resnet26rpt_256 | 78.964 | 21.036 | 94.428 | 5.572 | 10.99 | 256 | 0.94 | bicubic |
| resnet26t | 77.872 | 22.128 | 93.834 | 6.166 | 16.01 | 256 | 0.94 | bicubic |

Details:

  • HaloNet - 8 pixel block size, 2 pixel halo (overlap), relative position embedding
  • BotNet - relative position embedding
  • Lambda-ResNet-26-T - 3d lambda conv, kernel = 9
  • Lambda-ResNet-26-RPT - relative position embedding
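
For the HaloNet setting above, each block of queries attends to its own block plus a halo of neighbouring pixels on every side. A small arithmetic sketch, assuming a square feature map whose side is divisible by the block size. The 16x16 feature-map size in the example is an assumption chosen for illustration, and `halo_attention_shape` is a hypothetical helper:

```python
def halo_attention_shape(feat_size, block_size=8, halo_size=2):
    """Return (num_blocks, queries_per_block, keys_per_block) for a square map."""
    assert feat_size % block_size == 0
    blocks_per_side = feat_size // block_size
    window = block_size + 2 * halo_size  # halo extends on both sides
    return blocks_per_side ** 2, block_size ** 2, window ** 2


print(halo_attention_shape(16))
```

With an 8 pixel block and a 2 pixel halo, every 8x8 query block sees a 12x12 key/value window.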

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 2967.55 | 86.252 | 256 | 256 | 857.62 | 297.984 | 256 | 256 | 16.01 |
| botnet26t_256 | 2642.08 | 96.879 | 256 | 256 | 809.41 | 315.706 | 256 | 256 | 12.49 |
| halonet26t | 2601.91 | 98.375 | 256 | 256 | 783.92 | 325.976 | 256 | 256 | 12.48 |
| lambda_resnet26t | 2354.1 | 108.732 | 256 | 256 | 697.28 | 366.521 | 256 | 256 | 10.96 |
| lambda_resnet26rpt_256 | 1847.34 | 138.563 | 256 | 256 | 644.84 | 197.892 | 128 | 256 | 10.99 |
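
The throughput and step-time columns in the benchmark above are tied together by the batch size: samples per second is roughly the batch size divided by the step time, which appears to be reported in milliseconds. A quick check against two rows, treating that unit as an assumption:

```python
def samples_per_sec(batch_size, step_time_ms):
    """Throughput implied by a per-step latency given in milliseconds."""
    return batch_size / step_time_ms * 1000.0


# (reported samples/sec, step time, batch size) copied from the table above
rows = {
    "resnet26t infer": (2967.55, 86.252, 256),
    "lambda_resnet26rpt_256 train": (644.84, 197.892, 128),
}
for name, (reported, step_ms, batch) in rows.items():
    implied = samples_per_sec(batch, step_ms)
    print(f"{name}: reported={reported:.2f} implied={implied:.2f}")
```

This also explains why lambda_resnet26rpt_256 shows a shorter training step time than the other models despite lower throughput: it was run with a train batch size of 128 rather than 256.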

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 3691.94 | 69.327 | 256 | 256 | 1188.17 | 214.96 | 256 | 256 | 16.01 |
| botnet26t_256 | 3291.63 | 77.76 | 256 | 256 | 1126.68 | 226.653 | 256 | 256 | 12.49 |
| halonet26t | 3230.5 | 79.232 | 256 | 256 | 1077.82 | 236.934 | 256 | 256 | 12.48 |
| lambda_resnet26rpt_256 | 2324.15 | 110.133 | 256 | 256 | 864.42 | 147.485 | 128 | 256 | 10.99 |
| lambda_resnet26t | Not Supported | | | | | | | | |

ResNeXT-26-T series

  • [2, 2, 2, 2] repeat Bottleneck block ResNeXt architectures
  • SiLU activations
  • grouped 3x3 convolutions in bottleneck, 32 channels per group
  • 3 layer stem with 24, 32, 64 chs, max-pool
  • avg pool in shortcut downsample
  • channel attn (active in non self-attn blocks) between 3x3 and last 1x1 conv
  • when active, self-attn blocks replace 3x3 conv in both blocks for last stage, and second block of penultimate stage

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| eca_halonext26ts | 79.484 | 20.516 | 94.600 | 5.400 | 10.76 | 256 | 0.94 | bicubic |
| eca_botnext26ts_256 | 79.270 | 20.730 | 94.594 | 5.406 | 10.59 | 256 | 0.95 | bicubic |
| bat_resnext26ts | 78.268 | 21.732 | 94.1 | 5.9 | 10.73 | 256 | 0.9 | bicubic |
| seresnext26ts | 77.852 | 22.148 | 93.784 | 6.216 | 10.39 | 256 | 0.9 | bicubic |
| gcresnext26ts | 77.804 | 22.196 | 93.824 | 6.176 | 10.48 | 256 | 0.9 | bicubic |
| eca_resnext26ts | 77.446 | 22.554 | 93.57 | 6.43 | 10.3 | 256 | 0.9 | bicubic |
| resnext26ts | 76.764 | 23.236 | 93.136 | 6.864 | 10.3 | 256 | 0.9 | bicubic |

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3006.57 | 85.134 | 256 | 256 | 864.4 | 295.646 | 256 | 256 | 10.3 |
| seresnext26ts | 2931.27 | 87.321 | 256 | 256 | 836.92 | 305.193 | 256 | 256 | 10.39 |
| eca_resnext26ts | 2925.47 | 87.495 | 256 | 256 | 837.78 | 305.003 | 256 | 256 | 10.3 |
| gcresnext26ts | 2870.01 | 89.186 | 256 | 256 | 818.35 | 311.97 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 2652.03 | 96.513 | 256 | 256 | 790.43 | 323.257 | 256 | 256 | 10.59 |
| eca_halonext26ts | 2593.03 | 98.705 | 256 | 256 | 766.07 | 333.541 | 256 | 256 | 10.76 |
| bat_resnext26ts | 2469.78 | 103.64 | 256 | 256 | 697.21 | 365.964 | 256 | 256 | 10.73 |

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

NOTE: there are performance issues with certain grouped conv configs with channels last layout, backwards pass in particular is really slow. Also causing issues for RegNet and NFNet networks.

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3952.37 | 64.755 | 256 | 256 | 608.67 | 420.049 | 256 | 256 | 10.3 |
| eca_resnext26ts | 3815.77 | 67.074 | 256 | 256 | 594.35 | 430.146 | 256 | 256 | 10.3 |
| seresnext26ts | 3802.75 | 67.304 | 256 | 256 | 592.82 | 431.14 | 256 | 256 | 10.39 |
| gcresnext26ts | 3626.97 | 70.57 | 256 | 256 | 581.83 | 439.119 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 3515.84 | 72.8 | 256 | 256 | 611.71 | 417.862 | 256 | 256 | 10.59 |
| eca_halonext26ts | 3410.12 | 75.057 | 256 | 256 | 597.52 | 427.789 | 256 | 256 | 10.76 |
| bat_resnext26ts | 3053.83 | 83.811 | 256 | 256 | 533.23 | 478.839 | 256 | 256 | 10.73 |
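
The note above about grouped convs with channels-last layout can be seen directly in the numbers: for these models NHWC speeds up inference but slows down training relative to NCHW. A short sketch computing the ratios for a few rows copied from the two ResNeXt-26-T benchmark tables:

```python
# model: (infer NCHW, train NCHW, infer NHWC, train NHWC), all in samples/sec
bench = {
    "resnext26ts": (3006.57, 864.4, 3952.37, 608.67),
    "seresnext26ts": (2931.27, 836.92, 3802.75, 592.82),
    "eca_halonext26ts": (2593.03, 766.07, 3410.12, 597.52),
}


def nhwc_speedup(row):
    """Return (inference ratio, training ratio) of NHWC over NCHW throughput."""
    infer_nchw, train_nchw, infer_nhwc, train_nhwc = row
    return infer_nhwc / infer_nchw, train_nhwc / train_nchw


for model, row in bench.items():
    infer_ratio, train_ratio = nhwc_speedup(row)
    print(f"{model}: infer x{infer_ratio:.2f}, train x{train_ratio:.2f}")
```

A training ratio below 1.0 alongside an inference ratio above 1.0 is consistent with the backward pass being the slow part.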

ResNet-33-T series

  • [2, 3, 3, 2] repeat Bottleneck block ResNet architecture
  • SiLU activations
  • 3 layer stem with 24, 32, 64 chs, no max-pool, 1st and 3rd conv stride 2
  • avg pool in shortcut downsample
  • channel attn (active in non self-attn blocks) between 3x3 and last 1x1 conv
  • when active, self-attn blocks replace 3x3 conv last block of stage 2 and 3, and both blocks of final stage
  • FC 1x1 conv between last block and classifier

The 33-layer models have an extra 1x1 FC layer between the last conv block and the classifier. There is both a non-attention 33-layer baseline and a 32-layer model without the extra FC.
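
The 26, 32 and 33 in these model names follow the usual ResNet depth count. A sketch of that arithmetic, assuming the conventional tally of three convs per bottleneck block plus one layer for the stem and one for the classifier. Counting the 3-layer stem only once is an assumption about the naming, and `resnet_depth` is a hypothetical helper:

```python
def resnet_depth(block_repeats, convs_per_block=3, extra_fc=False):
    """Nominal depth: block convs + stem + classifier (+ optional extra FC)."""
    depth = convs_per_block * sum(block_repeats) + 2
    return depth + 1 if extra_fc else depth


print(resnet_depth([2, 2, 2, 2]))                 # ResNet-26-T layout
print(resnet_depth([2, 3, 3, 2]))                 # 32-layer variant
print(resnet_depth([2, 3, 3, 2], extra_fc=True))  # 33-layer variant
```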

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| sehalonet33ts | 80.986 | 19.014 | 95.272 | 4.728 | 13.69 | 256 | 0.94 | bicubic |
| seresnet33ts | 80.388 | 19.612 | 95.108 | 4.892 | 19.78 | 256 | 0.94 | bicubic |
| eca_resnet33ts | 80.132 | 19.868 | 95.054 | 4.946 | 19.68 | 256 | 0.94 | bicubic |
| gcresnet33ts | 79.99 | 20.01 | 94.988 | 5.012 | 19.88 | 256 | 0.94 | bicubic |
| resnet33ts | 79.352 | 20.648 | 94.596 | 5.404 | 19.68 | 256 | 0.94 | bicubic |
| resnet32ts | 79.028... | | | | | | | |

v0.4.12. Vision Transformer AugReg support and more

30 Jun 16:35
  • Vision Transformer AugReg weights and model defs (https://arxiv.org/abs/2106.10270)
  • ResMLP official weights
  • ECA-NFNet-L2 weights
  • gMLP-S weights
  • ResNet51-Q
  • Visformer, LeViT, ConViT, Twins
  • Many fixes, improvements, better test coverage