Welcome to the official code repository for "TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation".
Your star means a lot to us in developing this project! ⭐⭐⭐
- [2025/08/05] 🔥 We release the training code!
- [2025/06/05] 🔥 We release the code and models!
- [2025/05/09] 🚀 Our paper is available on arXiv!
- We introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics, while enabling end-to-end multimodal autoregressive training with standard VQ tokens.
- TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics.
- Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles the training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations.
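For intuition, here is a minimal conceptual sketch of that two-stream design; the class and argument names are illustrative only and do not correspond to the actual modules in this repository:

```python
import torch
import torch.nn as nn

class TokLIPSketch(nn.Module):
    """Conceptual sketch only: a low-level discrete VQ tokenizer feeds a
    ViT-based token encoder that produces high-level continuous semantics."""

    def __init__(self, vq_tokenizer: nn.Module, token_encoder: nn.Module):
        super().__init__()
        self.vq_tokenizer = vq_tokenizer    # off-the-shelf VQ tokenizer (kept as-is for generation)
        self.token_encoder = token_encoder  # ViT-based encoder that semanticizes token embeddings

    def forward(self, images: torch.Tensor):
        # 1) Quantize images into discrete VQ tokens; these standard VQ tokens
        #    are what the autoregressive generation objective operates on.
        vq_tokens, vq_embeddings = self.vq_tokenizer(images)
        # 2) Semanticize the token embeddings into CLIP-level continuous features
        #    used for comprehension (e.g., contrastive alignment with text).
        semantic_features = self.token_encoder(vq_embeddings)
        return vq_tokens, semantic_features
```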
```bash
conda create -n toklip python=3.10 -y
conda activate toklip
git clone https://github.com/TencentARC/TokLIP
cd TokLIP
pip install --upgrade pip
pip install -r requirements.txt
```
| Model | Resolution | ImageNet Top-1 | COCO TR@1 | COCO IR@1 | Weight |
|---|---|---|---|---|---|
| TokLIP-S | 256 | 76.4 | 64.06 | 48.46 | 🤗 TokLIP_S_256 |
| TokLIP-L | 384 | 80.0 | 68.00 | 52.87 | 🤗 TokLIP_L_384 |
TokLIP-XL with 512x512 resolution will be released soon!
- Please refer to img2dataset to prepare the WebDataset required for training. You may choose datasets such as CC3M, CC12M, or LAION.
- Prepare the teacher models using `src/covert.py`:

  ```bash
  cd src
  TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-256' --save_path './model/siglip2-so400m-vit-l16-256.pt'
  TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-384' --save_path './model/siglip2-so400m-vit-l16-384.pt'
  ```
- Train TokLIP using the scripts `src/train_toklip_256.sh` and `src/train_toklip_384.sh`. You need to set the `--train-data` and `--train-num-samples` arguments according to your dataset.
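Before launching training, you can sanity-check the prepared shards with the `webdataset` package; the shard pattern below is a placeholder for whatever you will pass to `--train-data`, and `jpg`/`txt` are the default keys written by img2dataset:

```python
import webdataset as wds

# Placeholder shard pattern — use the same brace-expanded path you pass to --train-data.
shards = "/path/to/webdataset/{00000..00331}.tar"

dataset = (
    wds.WebDataset(shards)
    .decode("pil")           # decode images to PIL
    .to_tuple("jpg", "txt")  # image / caption keys as written by img2dataset
)

# Peek at the first few image-caption pairs to confirm the shards are readable.
for i, (image, caption) in enumerate(dataset):
    print(image.size, caption[:60])
    if i >= 4:
        break

# --train-num-samples should be set to the total number of samples across all shards.
```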
Please first download the TokLIP model weights.
We provide the evaluation scripts for ImageNet classification and MSCOCO retrieval in `src/test_toklip_256.sh` and `src/test_toklip_384.sh`.
Please set the `--pretrained`, `--imagenet-val`, and `--coco-dir` arguments to your specific paths.
We provide an inference example in `src/inference.py`:

```bash
cd src
python inference.py --model-config 'ViT-SO400M-16-SigLIP2-384-toklip' --pretrained 'YOUR_TOKLIP_PATH'
```
We also provide the `build_toklip_encoder` function in `src/create_toklip.py`, which lets you load TokLIP directly with the `model`, `image_size`, and `model_path` parameters.
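A minimal usage sketch (run from the `src` directory); the argument values below are placeholders, and the exact return value of `build_toklip_encoder` may differ, so please check `src/create_toklip.py`:

```python
# Minimal sketch: loading TokLIP via build_toklip_encoder (argument values are placeholders).
from create_toklip import build_toklip_encoder

toklip = build_toklip_encoder(
    model='ViT-SO400M-16-SigLIP2-384-toklip',  # model config name (same as the inference example)
    image_size=384,                            # input image resolution
    model_path='YOUR_TOKLIP_PATH',             # path to the downloaded TokLIP checkpoint
)
```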
- [x] Release the training code.
- [ ] Release TokLIP-XL with 512 resolution.
If you have further questions, please open an issue or contact [email protected].
Discussions and potential collaborations are also welcome.
This repo is built upon the following projects:
We thank the authors for their codes.
Please cite our work if you use our code or discuss our findings in your own research:
@article{lin2025toklip,
title={TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation},
author={Lin, Haokun and Wang, Teng and Ge, Yixiao and Ge, Yuying and Lu, Zhichao and Wei, Ying and Zhang, Qingfu and Sun, Zhenan and Shan, Ying},
journal={arXiv preprint arXiv:2505.05422},
year={2025}
}