SimCLR ViT implementation

This repo implements the SimCLR algorithm on Vision Transformers (ViT) for both GPUs and TPUs, with hyperparameters following "An Empirical Study of Training Self-Supervised Vision Transformers".
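For readers new to SimCLR, the contrastive objective it is known for (NT-Xent) can be sketched in a few lines. This is a dependency-free illustration in plain Python, not the code in this repo: the function names and the default temperature of 0.1 are placeholders chosen for the example, and a real training loop would compute this with batched tensor operations.

```python
import math


def _l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """NT-Xent loss over N positive pairs (view_a[i], view_b[i]).

    Each of the 2N embeddings acts as an anchor: its positive is the other
    view of the same image, and the remaining 2N - 2 embeddings are negatives.
    The temperature default is an illustrative placeholder.
    """
    n = len(view_a)
    z = [_l2_normalize(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        logits = {
            k: sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(2 * n)
            if k != i  # an anchor is never compared with itself
        }
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits.values())
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += log_denominator - logits[pos]
    return total / (2 * n)
```

The loss is lower when the two views of each image are closer to each other than to the views of other images, which is the signal the pre-training step optimizes.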

Installation

Install PyTorch (and its dependencies). If running on TPUs, also install PyTorch/XLA.

Finally, install timm for vision transformers: pip3 install timm.

Download ImageNet-1k to a shared directory that can be accessed from all nodes (e.g. /checkpoint/ronghanghu/megavlt_paths/imagenet-1k). The directory should have the following structure:

/checkpoint/ronghanghu/megavlt_paths/imagenet-1k
|_ train
|  |_ <n0......>
|  |  |_<im-1-name>.JPEG
|  |  |_...
|  |  |_<im-N-name>.JPEG
|  |_ ...
|  |_ <n1......>
|  |  |_<im-1-name>.JPEG
|  |  |_...
|  |  |_<im-M-name>.JPEG
|  |  |_...
|  |  |_...
|_ val
|  |_ <n0......>
|  |  |_<im-1-name>.JPEG
|  |  |_...
|  |  |_<im-N-name>.JPEG
|  |_ ...
|  |_ <n1......>
|  |  |_<im-1-name>.JPEG
|  |  |_...
|  |  |_<im-M-name>.JPEG
|  |  |_...
|  |  |_...
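
Before launching a long job, it can help to confirm that the dataset directory matches the layout above. The following is a minimal sketch using only the standard library; `summarize_imagefolder` is a hypothetical helper (it is not part of this repo) that counts class folders and `.JPEG` files under `train` and `val`.

```python
from pathlib import Path


def summarize_imagefolder(root):
    """Return {split: (num_class_dirs, num_jpeg_files)} for train and val.

    Raises FileNotFoundError if either split directory is missing.
    Only files with the exact suffix ".JPEG" (as in the tree above) are counted.
    """
    root = Path(root)
    summary = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        class_dirs = sorted(p for p in split_dir.iterdir() if p.is_dir())
        num_images = sum(
            1 for d in class_dirs for f in d.iterdir() if f.suffix == ".JPEG"
        )
        summary[split] = (len(class_dirs), num_images)
    return summary
```

For example, `summarize_imagefolder("/checkpoint/ronghanghu/megavlt_paths/imagenet-1k")` returns one `(classes, images)` pair per split, which can be compared against the counts you expect for your copy of the dataset.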

Running SimCLR ViT training on ImageNet-1k

Launch the training on GPUs or TPUs as follows.

Make sure SAVE_DIR is a shared directory that can be accessed from all nodes. For TPUs, one can use an NFS directory on GCP.

On GPUs (e.g. using 64 V100 GPUs):

SAVE_DIR="/private/home/ronghanghu/workspace/simclr_vit_release/save_gpu64"

srun \
  --mem=300g --nodes=8 --gres=gpu:8 --partition=learnlab,learnfair \
  --time=4300 --constraint=volta32gb --cpus-per-task=40 \
python3 run_simclr_vit.py \
  world_size=64 \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k

To use automatic mixed precision, append use_pytorch_amp=True to the command above.

On TPUs (e.g. using a v3-256 TPU pod):

SAVE_DIR="/checkpoint/ronghanghu/workspace/simclr_vit_release/save_tpu_v3-256"

TPU_NAME=megavlt-256  # change to your TPU name
# use absolute paths with torch_xla.distributed.xla_dist
sudo mkdir -p $SAVE_DIR && sudo chmod -R 777 $SAVE_DIR  # workaround for permission issue
python3 -m torch_xla.distributed.xla_dist \
  --tpu=${TPU_NAME} --restart-tpuvm-pod \
  --env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
  -- \
python3 $(realpath run_simclr_vit.py) \
  device=xla \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k

Running linear evaluation on the trained model

Suppose the final checkpoint from the previous step is /checkpoint/ronghanghu/workspace/simclr_vit_release/save_tpu_v3-256/simclr_vit_epoch_300.ckpt. Evaluate it as follows. The expected linear evaluation accuracy is around 0.739 on both GPUs and TPUs.
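
If you are unsure which checkpoint is the final one, a small helper can pick the file with the highest epoch number. This sketch assumes that every checkpoint in the directory follows the `simclr_vit_epoch_<N>.ckpt` naming seen above, which is an assumption about intermediate checkpoints; `latest_checkpoint` is a hypothetical helper, not part of this repo.

```python
import re
from pathlib import Path

# Assumed naming scheme, taken from the example path simclr_vit_epoch_300.ckpt
CKPT_PATTERN = re.compile(r"simclr_vit_epoch_(\d+)\.ckpt")


def latest_checkpoint(ckpt_dir):
    """Return the checkpoint path with the largest epoch number, or None."""
    best_epoch, best_path = -1, None
    for path in Path(ckpt_dir).iterdir():
        match = CKPT_PATTERN.fullmatch(path.name)
        if match is None:
            continue  # skip files that are not SimCLR checkpoints
        epoch = int(match.group(1))  # compare numerically, not as strings
        if epoch > best_epoch:
            best_epoch, best_path = epoch, path
    return best_path
```

The returned path can then be used as the value of PRETRAINED_MODEL in the commands below. Comparing epochs as integers matters here: a plain string sort would rank epoch 9 after epoch 300.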

Make sure SAVE_DIR is a shared directory that can be accessed from all nodes. For TPUs, one can use an NFS directory on GCP.

On GPUs (e.g. using 64 V100 GPUs):

PRETRAINED_MODEL=/private/home/ronghanghu/workspace/simclr_vit_release/save_gpu64/simclr_vit_epoch_300.ckpt
# SAVE_DIR can be the same or a different directory from SSL training
SAVE_DIR="/private/home/ronghanghu/workspace/simclr_vit_release/save_gpu64"

srun \
  --mem=300g --nodes=8 --gres=gpu:8 --partition=learnlab,learnfair \
  --time=4300 --constraint=volta32gb --cpus-per-task=40 \
python3 $(realpath run_linear_eval_vit.py) \
  world_size=64 \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k \
  linear_eval.pretrained_ckpt_path=$PRETRAINED_MODEL

On TPUs (e.g. using a v3-256 TPU pod):

PRETRAINED_MODEL=/checkpoint/ronghanghu/workspace/simclr_vit_release/save_tpu_v3-256/simclr_vit_epoch_300.ckpt
# SAVE_DIR can be the same or a different directory from SSL training
SAVE_DIR="/checkpoint/ronghanghu/workspace/simclr_vit_release/save_tpu_v3-256"

TPU_NAME=megavlt-256  # change to your TPU name
# use absolute paths with torch_xla.distributed.xla_dist
sudo mkdir -p $SAVE_DIR && sudo chmod -R 777 $SAVE_DIR  # workaround for permission issue
python3 -m torch_xla.distributed.xla_dist \
  --tpu=${TPU_NAME} --restart-tpuvm-pod \
  --env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
  -- \
python3 $(realpath run_linear_eval_vit.py) \
  device=xla \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k \
  linear_eval.pretrained_ckpt_path=$PRETRAINED_MODEL

TPU profiling with XLA profiler

Following the PyTorch XLA performance profiling guide, on a TPU VM node, first start a TensorBoard session with tensorboard --logdir . and then launch one of the training scripts below. Once training has been running for a while (e.g. after 100 steps, when the speed becomes stable), capture the profile from localhost:3294 in the Profile tab of TensorBoard.
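
A capture can only succeed once something is listening on the profiler port, so a quick reachability check can save a failed attempt. The sketch below uses only the standard library; `profiler_port_open` is a hypothetical helper, and it assumes the profiling scripts expose the profiler on port 3294 as described above.

```python
import socket


def profiler_port_open(host="localhost", port=3294, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable
        return False
```

If `profiler_port_open()` returns False, the training script has most likely not started its profiler server yet (or is running on a different node), and a capture from TensorBoard will not work.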

Run profiling with fake data (no actual data loading) on a single VM node with 8 TPU cores:

export PT_XLA_DEBUG=1
export XLA_HLO_DEBUG=1

python3 run_simclr_vit_profiler.py \
  device=xla \
  fake_data=True \
  batch_size=128 lr=0.0  # zero lr to avoid divergence

Run profiling with real data on a single VM node with 8 TPU cores:

export PT_XLA_DEBUG=1
export XLA_HLO_DEBUG=1

python3 run_simclr_vit_profiler.py \
  device=xla \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k \
  batch_size=128 lr=0.0  # zero lr to avoid divergence

Run profiling with fake data, but loaded through the PyTorch dataloader, on a single VM node with 8 TPU cores:

export PT_XLA_DEBUG=1
export XLA_HLO_DEBUG=1

python3 run_simclr_vit_profiler_fakewithdataloader.py \
  device=xla \
  fake_data=True \
  batch_size=128 lr=0.0  # zero lr to avoid divergence
