
MAWS

[Paper] [Colab] [BibTex] [Website]

Repository for the strong foundational MAWS + MAE models at all sizes, ranging from <100M parameters to >6.5B parameters, from the paper The effectiveness of MAE pre-pretraining for billion-scale pretraining (https://arxiv.org/abs/2303.13496). Models are available for both MAE pre-pretraining and the follow-up WSP pretraining, MAE→WSP, a.k.a. MAWS (Masked Autoencoding → Weakly Supervised pretraining).


Getting started

To start playing with our models immediately, we have a notebook you can run on Colab or locally, which demonstrates our models in zero-shot mode.

To build any of our models, select which model type you would like. We have models available for:

  1. model_type="maws": MAWS (MAE→WSP) pretraining, i.e. MAE pre-pretraining followed by WSP pretraining. We also have ImageNet-1k finetuned weights for MAWS models using the same model type.
  2. model_type="maws_clip": MAWS pretrained models along with LiT-aligned text encoders for CLIP-style zero-shot classification
  3. model_type="mae": MAE pretrained models
  4. model_type="mae_in1k": MAE models pretrained on ImageNet-1k

To access a model, specify the model architecture and the model type:

from maws.model_builder import build_model

# build a MAWS model with CLIP capabilities (via an aligned text encoder)
clip_model = build_model("vit_b16_xlmr_b", "maws_clip")

# build a MAWS model
maws_model = build_model("vit_b16", "maws")

# build a MAWS model finetuned on IN1k
maws_in1k_model = build_model("vit_b16_ft_in1k", "maws")

# build an MAE model
mae_model = build_model("vit_b16", "mae")

The models are also available via torch.hub:

import torch

# build a MAWS model with CLIP capabilities (via an aligned text encoder)
clip_model = torch.hub.load("facebookresearch/maws", model="vit_b16_xlmr_b_maws_clip")

# build a MAWS model
maws_model = torch.hub.load("facebookresearch/maws", model="vit_b16_maws")

# build a MAWS model finetuned on IN1k
maws_in1k_model = torch.hub.load("facebookresearch/maws", model="vit_b16_ft_in1k_maws")

# build an MAE model
mae_model = torch.hub.load("facebookresearch/maws", model="vit_b16_mae")
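
As a quick sanity check, here is a minimal sketch (an illustrative example, not part of the repository) that builds a MAWS image encoder and runs it on a random batch; it assumes the encoder accepts a standard NCHW float tensor at the 224px pretraining resolution.

import torch

from maws.model_builder import build_model

# build the MAWS ViT-B/16 image encoder and switch to eval mode
model = build_model("vit_b16", "maws").eval()

# run a random 224px batch through the encoder (assumption: standard NCHW
# float input; use a real, normalized image to get meaningful features)
with torch.no_grad():
    features = model(torch.randn(1, 3, 224, 224))

print(features.shape)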

All the available models and direct download links are listed in the following section.

Installation instructions

conda create --name maws python=3.10
conda activate maws
pip install torch torchvision torchtext
pip install timm==0.9.7
# for demo
pip install jupyter ipywidgets matplotlib

WARNING

torchtext has been deprecated, which breaks CLIP model support. If you run without torchtext, all models which aren't CLIP based will still work fine.
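
If you are unsure whether your environment still has a working torchtext, a small guard like the one below (an illustrative sketch, not part of the repository) falls back to the image-only model type when the import fails:

from maws.model_builder import build_model

# torchtext is only needed for the text encoder used by the *_clip model types
try:
    import torchtext  # noqa: F401
    has_torchtext = True
except ImportError:
    has_torchtext = False

if has_torchtext:
    # CLIP-style model with the aligned XLM-R text encoder
    model = build_model("vit_b16_xlmr_b", "maws_clip")
else:
    # image encoder only; works without torchtext
    model = build_model("vit_b16", "maws")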

Available models

MAWS pretrained models

Model | Pretrained name + weights | IN1k 224px linear top-1 | IN1k 512/518px finetuned name + weights | IN1k 512/518px finetuned top-1 | Text encoder | 0-shot name + weights | IN1k 224px 0-shot top-1
--- | --- | --- | --- | --- | --- | --- | ---
ViT-B | vit_b16 | 83.3 | vit_b16_ft_in1k | 86.8 | XLMR-B | vit_b16_xlmr_b | 74.9
ViT-L | vit_l16 | 86.1 | vit_l16_ft_in1k | 88.8 | XLMR-L | vit_l16_xlmr_l | 79.7
ViT-H | vit_h14 | 87.5 | vit_h14_ft_in1k | 89.5 | XLMR-L | vit_h14_xlmr_l | 81.1
ViT-2B | vit_2b14 | 88.1 | vit_2b14_ft_in1k | 89.8 | XLMR-L | vit_2b14_xlmr_l | 82.1
ViT-6.5B | vit_6.5b14 | 88.6 | vit_6.5b14_ft_in1k | 90.1 | – | – | –

MAE pretrained models

Model | Pretrained name + weights | IN1k 224px finetuned top-1
--- | --- | ---
ViT-B | vit_b16 | 83.5
ViT-L | vit_l16 | 86.1
ViT-H | vit_h14 | 87.4
ViT-2B | vit_2b14 | 87.8
ViT-6.5B | vit_6.5b14 | 88.3

MAE pretrained on ImageNet-1k

Model | Pretrained name + weights | IN1k 224px finetuned top-1
--- | --- | ---
ViT-2B | vit_2b14 | 87.4

MAE pretrained on ImageNet-21k

Model | Model name + weights | IN1k 512px finetuned top-1
--- | --- | ---
ViT-L | vit_l16 | 86.9

Evaluation on ImageNet-1k

Finetuned

We share weights for the MAWS models finetuned on ImageNet-1k at high resolution (512px for ViT-B and ViT-L, and 518px for ViT-H, ViT-2B, and ViT-6.5B). $IN1K_VAL_PATH should be the path to the ImageNet-1k val root folder.

python eval_finetuned.py -m vit_b16_ft_in1k -i 512 -b 25 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 86.832

python eval_finetuned.py -m vit_l16_ft_in1k -i 512 -b 10 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 88.796

python eval_finetuned.py -m vit_h14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 89.502

python eval_finetuned.py -m vit_2b14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 89.752

python eval_finetuned.py -m vit_6.5b14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 90.064

Zero-shot

Please refer to the MAWS pretrained models section for all the available model names. $IN1K_VAL_PATH should be the path to the ImageNet-1k val root folder.

python eval_zeroshot.py -m vit_b16_xlmr_b -b 25 -p $IN1K_VAL_PATH
# Zero shot ImageNet-1k top-1 accuracy: 74.888

# Trying French instead, with a larger model on a 32GB V100
python eval_zeroshot.py -m vit_2b14_xlmr_l --language french -b 5 -p $IN1K_VAL_PATH
# Zero shot ImageNet-1k top-1 accuracy: 62.622

Citation

If you use our models or if the work is useful in your research, please give us a star and cite:

@inproceedings{singh2023effectiveness,
    title={The effectiveness of MAE pre-pretraining for billion-scale pretraining},
    author={Singh, Mannat and Duval, Quentin and Alwala, Kalyan Vasudev and Fan, Haoqi and Aggarwal, Vaibhav and Adcock, Aaron and Joulin, Armand and Doll{\'a}r, Piotr and Feichtenhofer, Christoph and Girshick, Ross and Girdhar, Rohit and Misra, Ishan},
    booktitle={ICCV},
    year={2023}
}

License

Our models are released under the CC-BY-NC 4.0 license. See LICENSE for additional details.
