Welcome to the official code repository for "TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation".
Your star means a lot to us in developing this project! ⭐⭐⭐
- [2025/08/05] 🔥 We release the training code!
- [2025/06/05] 🔥 We release the code and models!
- [2025/05/09] 🚀 Our paper is available on arXiv!
- We introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics, while enabling end-to-end multimodal autoregressive training with standard VQ tokens.
- TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics.
- Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles the training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations.
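For intuition, here is a minimal conceptual sketch of that two-stream design; the class and argument names are illustrative only and do not correspond to the actual modules in this repository:

```python
import torch
import torch.nn as nn

class TokLIPSketch(nn.Module):
    """Conceptual sketch only: a low-level discrete VQ tokenizer feeds a
    ViT-based token encoder that produces high-level continuous semantics."""

    def __init__(self, vq_tokenizer: nn.Module, token_encoder: nn.Module):
        super().__init__()
        self.vq_tokenizer = vq_tokenizer    # off-the-shelf VQ tokenizer (kept as-is for generation)
        self.token_encoder = token_encoder  # ViT-based encoder that semanticizes token embeddings

    def forward(self, images: torch.Tensor):
        # 1) Quantize images into discrete VQ tokens; these standard VQ tokens
        #    are what the autoregressive generation objective operates on.
        vq_tokens, vq_embeddings = self.vq_tokenizer(images)
        # 2) Semanticize the token embeddings into CLIP-level continuous features
        #    used for comprehension (e.g., contrastive alignment with text).
        semantic_features = self.token_encoder(vq_embeddings)
        return vq_tokens, semantic_features
```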
```bash
conda create -n toklip python=3.10 -y
conda activate toklip
git clone https://github.com/TencentARC/TokLIP
cd TokLIP
pip install --upgrade pip
pip install -r requirements.txt
```
| Model | Resolution | ImageNet Top-1 | COCO TR@1 | COCO IR@1 | Weight |
|---|---|---|---|---|---|
| TokLIP-S | 256 | 76.4 | 64.06 | 48.46 | 🤗 TokLIP_S_256 |
| TokLIP-L | 384 | 80.0 | 68.00 | 52.87 | 🤗 TokLIP_L_384 |
TokLIP-XL with 512x512 resolution will be released soon!
- Please refer to img2dataset to prepare the WebDataset required for training. You may choose datasets such as CC3M, CC12M, or LAION.
- Prepare the teacher models using `src/covert.py`:

  ```bash
  cd src
  TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-256' --save_path './model/siglip2-so400m-vit-l16-256.pt'
  TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-384' --save_path './model/siglip2-so400m-vit-l16-384.pt'
  ```
- Train TokLIP using the scripts `src/train_toklip_256.sh` and `src/train_toklip_384.sh`. You need to set the `--train-data` and `--train-num-samples` arguments according to your dataset.
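Before launching training, you can sanity-check the prepared shards with the `webdataset` package; the shard pattern below is a placeholder for whatever you will pass to `--train-data`, and `jpg`/`txt` are the default keys written by img2dataset:

```python
import webdataset as wds

# Placeholder shard pattern — use the same brace-expanded path you pass to --train-data.
shards = "/path/to/webdataset/{00000..00331}.tar"

dataset = (
    wds.WebDataset(shards)
    .decode("pil")           # decode images to PIL
    .to_tuple("jpg", "txt")  # image / caption keys as written by img2dataset
)

# Peek at the first few image-caption pairs to confirm the shards are readable.
for i, (image, caption) in enumerate(dataset):
    print(image.size, caption[:60])
    if i >= 4:
        break

# --train-num-samples should be set to the total number of samples across all shards.
```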
Please first download the TokLIP model weights.
We provide the evaluation scripts for ImageNet classification and MSCOCO retrieval in `src/test_toklip_256.sh` and `src/test_toklip_384.sh`.
Please set the `--pretrained`, `--imagenet-val`, and `--coco-dir` arguments to your specific paths.
We provide an inference example in `src/inference.py`:

```bash
cd src
python inference.py --model-config 'ViT-SO400M-16-SigLIP2-384-toklip' --pretrained 'YOUR_TOKLIP_PATH'
```
We also provide the `build_toklip_encoder` function in `src/create_toklip.py`, which lets you load TokLIP directly with the `model`, `image_size`, and `model_path` parameters.
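A minimal usage sketch (run from the `src` directory); the argument values below are placeholders, and the exact return value of `build_toklip_encoder` may differ, so please check `src/create_toklip.py`:

```python
# Minimal sketch: loading TokLIP via build_toklip_encoder (argument values are placeholders).
from create_toklip import build_toklip_encoder

toklip = build_toklip_encoder(
    model='ViT-SO400M-16-SigLIP2-384-toklip',  # model config name (same as the inference example)
    image_size=384,                            # input image resolution
    model_path='YOUR_TOKLIP_PATH',             # path to the downloaded TokLIP checkpoint
)
```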
- [x] Release the training code.
- [ ] Release TokLIP-XL with 512 resolution.
If you have further questions, please open an issue or contact [email protected].
Discussions and potential collaborations are also welcome.
This repo is built upon the following projects:
We thank the authors for their codes.
Please cite our work if you use our code or discuss our findings in your own research:
@article{lin2025toklip,
title={TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation},
author={Lin, Haokun and Wang, Teng and Ge, Yixiao and Ge, Yuying and Lu, Zhichao and Wei, Ying and Zhang, Qingfu and Sun, Zhenan and Shan, Ying},
journal={arXiv preprint arXiv:2505.05422},
year={2025}
}