Understanding Fine-tuning CLIP for Open-vocabulary Semantic Segmentation in Hyperbolic Space (CVPR 2025)
HyperCLIP is a lightweight and effective fine-tuning framework built on CLIP for open-vocabulary semantic segmentation. Motivated by the observation that segmentation requires vision-language alignment at pixel-level granularity, this work fine-tunes CLIP in hyperbolic space, shifting the hierarchical granularity of CLIP's embeddings from the image level to the pixel level and thereby equipping the model with segmentation capability.
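For intuition about the geometry involved, here is a minimal PyTorch sketch (not the paper's implementation) of lifting Euclidean CLIP embeddings onto the Poincaré ball via the exponential map at the origin and reading off their hyperbolic radius, i.e. their geodesic distance from the origin; the function names and the curvature value are illustrative assumptions.

```python
import torch

def expmap0(x: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Exponential map at the origin of a Poincare ball with curvature -c.

    Lifts Euclidean vectors (e.g., CLIP embeddings) onto the ball. The
    curvature value c is an illustrative assumption, not the paper's setting.
    """
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.tanh(sqrt_c * norm) * x / (sqrt_c * norm)

def hyperbolic_radius(p: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Geodesic distance of ball points from the origin (the hyperbolic radius)."""
    sqrt_c = c ** 0.5
    norm = p.norm(dim=-1, keepdim=True).clamp(max=(1.0 - 1e-5) / sqrt_c)
    return (2.0 / sqrt_c) * torch.atanh(sqrt_c * norm)

# Example: measure the radius of a batch of unit-normalized text embeddings.
text_emb = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
radius = hyperbolic_radius(expmap0(text_emb))
print(radius.squeeze(-1))  # one radius per embedding
```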
- Hyperbolic radius shift via fine-tuning: The hyperbolic radius of CLIP's text embeddings decreases during fine-tuning, indicating that the text encoder shifts from image-to-text to pixel-to-text alignment.
- Hyperbolic radius adjustment: HyperCLIP explicitly adjusts the hyperbolic radius of CLIP's embeddings to better align vision and language representations in hyperbolic space (see the sketch after this list).
- Parameter efficiency: Only ~4% of CLIP's parameters are fine-tuned, yet HyperCLIP attains state-of-the-art performance across three open-vocabulary segmentation benchmarks.
- Characteristic hyperbolic level: After fine-tuning, text embeddings converge to a stable hyperbolic radius across different datasets, suggesting that segmentation tasks correspond to a characteristic hierarchy level in hyperbolic geometry.
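To make the radius-adjustment bullet concrete, below is a hedged sketch of one way such an adjustment could work: rescaling ball points so their geodesic radius matches a (possibly learnable) target. This is an illustrative reading of the idea, not the paper's actual module; `adjust_radius` and `target_r` are assumed names. Shrinking the target radius moves embeddings toward the origin, matching the image-level-to-pixel-level shift illustrated below.

```python
import torch

def adjust_radius(p: torch.Tensor, target_r: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Rescale Poincare-ball points so their geodesic radius equals target_r.

    Since the radius of q is (2 / sqrt(c)) * atanh(sqrt(c) * ||q||), the
    Euclidean norm realizing target_r is tanh(sqrt(c) * target_r / 2) / sqrt(c).
    Illustrative sketch only; not HyperCLIP's actual adjustment module.
    """
    sqrt_c = c ** 0.5
    new_norm = torch.tanh(sqrt_c * target_r / 2.0) / sqrt_c
    direction = p / p.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return direction * new_norm

# Example: pull a batch of ball points to a shared, smaller target radius.
points = torch.randn(4, 512) * 0.01          # points near the ball's origin
target = torch.full((4, 1), 0.5)             # hypothetical per-point target radius
adjusted = adjust_radius(points, target)
```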
The figure below illustrates how CLIP embeddings evolve during HyperCLIP fine-tuning:
- Image-level semantics (large radius) → pixel-level semantics (smaller radius).
Please refer to the CAT-Seg repository for guidance on:
- Environment setup (Python version, dependencies, etc.)
- Dataset preparation (e.g., COCO, ADE20K, Pascal VOC)
You can launch the entire training and evaluation pipeline using:
`bash run_train_test.sh`
Thanks to the authors of CAT-Seg for their excellent work and codebase.
If the code is helpful in your research and development, please consider citing our paper:
@inproceedings{peng2025understanding,
  title={Understanding Fine-tuning CLIP for Open-vocabulary Semantic Segmentation in Hyperbolic Space},
  author={Peng, Zelin and Xu, Zhengqin and Zeng, Zhilin and Wen, Changsong and Huang, Yu and Yang, Menglin and Tang, Feilong and Shen, Wei},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={4562--4572},
  year={2025}
}
