
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation


Welcome to the official code repository for "TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation".

Your star means a lot to us and helps this project grow! ⭐⭐⭐

📰 News

  • [2025/08/05] 🔥 We release the training code!
  • [2025/06/05] 🔥 We release the code and models!
  • [2025/05/09] 🚀 Our paper is available on arXiv!

👀 Introduction


  • We introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens.

  • TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics.

  • Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations.

🔧 Installation

conda create -n toklip python=3.10 -y
conda activate toklip
git clone https://github.com/TencentARC/TokLIP
pip install --upgrade pip 
pip install -r requirements.txt

⚙️ Usage

Model Weights

| Model | Resolution | IN Top1 | COCO TR@1 | COCO IR@1 | Weight |
|---|---|---|---|---|---|
| TokLIP-S | 256 | 76.4 | 64.06 | 48.46 | 🤗 TokLIP_S_256 |
| TokLIP-L | 384 | 80.0 | 68.00 | 52.87 | 🤗 TokLIP_L_384 |

TokLIP-XL with 512x512 resolution will be released soon!

Training

  1. Please refer to img2dataset to prepare the WebDataset required for training. You may choose datasets such as CC3M, CC12M, or LAION; a minimal data-preparation sketch is given after this list.

  2. Prepare the teacher models using src/covert.py:

    cd src
    TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-256' --save_path './model/siglip2-so400m-vit-l16-256.pt'
    TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-384' --save_path './model/siglip2-so400m-vit-l16-384.pt'
  3. Train TokLIP using the scripts src/train_toklip_256.sh and src/train_toklip_384.sh. Set the --train-data and --train-num-samples arguments according to your prepared dataset.
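
As a reference for step 1, below is a minimal data-preparation sketch using img2dataset's Python API to build WebDataset shards. The metadata file, column names, and output folder are placeholders; adjust them to your dataset and match image_size to the TokLIP variant you train (256 or 384).

# Minimal sketch (assumption: a CC3M-style TSV with `url` and `caption` columns).
from img2dataset import download

download(
    url_list="cc3m_train.tsv",         # placeholder metadata file
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",        # TokLIP training expects WebDataset shards
    output_folder="./data/cc3m-wds",   # point --train-data at the resulting shards
    image_size=256,                    # match the TokLIP resolution (256 or 384)
    processes_count=16,
    thread_count=64,
)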

Evaluation

Please first download the TokLIP model weights.
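
If you prefer to download the weights programmatically, below is a minimal sketch using huggingface_hub; the repo id and filename are placeholders, so please take the exact values from the model links in the table above.

from huggingface_hub import hf_hub_download

# Placeholders: use the exact repo id and filename from the TokLIP model cards linked above.
ckpt_path = hf_hub_download(repo_id="TencentARC/TokLIP", filename="TokLIP_L_384.pt")
print(ckpt_path)  # pass this path to --pretrained in the evaluation scripts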

We provide evaluation scripts for ImageNet classification and MSCOCO retrieval in src/test_toklip_256.sh and src/test_toklip_384.sh.

Please set --pretrained, --imagenet-val, and --coco-dir to your specific paths.

Inference

We provide an inference example in src/inference.py.

cd src
python inference.py --model-config 'ViT-SO400M-16-SigLIP2-384-toklip' --pretrained 'YOUR_TOKLIP_PATH'

Model Usage

We provide the build_toklip_encoder function in src/create_toklip.py; you can load TokLIP directly by specifying the model, image_size, and model_path parameters.
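
Below is a minimal usage sketch, run from the src directory; the forward call and the returned features are assumptions, so please check src/create_toklip.py and src/inference.py for the exact interface.

import torch
from create_toklip import build_toklip_encoder  # run from the src/ directory

# Load TokLIP from a downloaded checkpoint (config name as in the inference example above).
toklip = build_toklip_encoder(
    model='ViT-SO400M-16-SigLIP2-384-toklip',
    image_size=384,
    model_path='YOUR_TOKLIP_PATH',
)

images = torch.randn(1, 3, 384, 384)  # placeholder batch of preprocessed images
with torch.no_grad():
    features = toklip(images)  # assumed forward call; see src/inference.py for actual usage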

🔜 TODOs

  • Release training code.
  • Release TokLIP-XL with 512 resolution.

📂 Contact

If you have further questions, please open an issue or contact [email protected].

Discussions and potential collaborations are also welcome.

🙏 Acknowledgement

This repo is built upon the following projects:

We thank the authors for their code.

📝 Citation

Please cite our work if you use our code or discuss our findings in your own research:

@article{lin2025toklip,
  title={TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation},
  author={Lin, Haokun and Wang, Teng and Ge, Yixiao and Ge, Yuying and Lu, Zhichao and Wei, Ying and Zhang, Qingfu and Sun, Zhenan and Shan, Ying},
  journal={arXiv preprint arXiv:2505.05422},
  year={2025}
}
