VisAlign: Aligning Visual Representations with Textual Semantics for Image Similarity and Retrieval


This project presents a multi-modal training framework in which a trainable visual encoder (based on either ResNet or a Swin Transformer) is aligned with a frozen, lightweight SentenceTransformer textual encoder. The objective is to learn a robust visual embedding space by mapping image features to match those of the pre-trained textual encoder, which serves as a semantic reference.
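
As a rough illustration, the sketch below shows one way such an alignment setup can be wired together in PyTorch. The class name VisualEncoder, the ResNet-50 backbone choice, and the projection dimensions are illustrative assumptions, not the repository's actual code:

import torch.nn as nn
import torchvision.models as models
from sentence_transformers import SentenceTransformer

class VisualEncoder(nn.Module):
    """Trainable image encoder projected into the text embedding space."""
    def __init__(self, embed_dim=384):  # all-MiniLM-L12-v2 produces 384-d vectors
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V2")
        backbone.fc = nn.Identity()             # drop the classification head
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)  # map 2048-d ResNet features to 384-d

    def forward(self, images):
        feats = self.backbone(images)
        return nn.functional.normalize(self.proj(feats), dim=-1)

# Frozen textual encoder, used only to produce semantic target embeddings.
text_encoder = SentenceTransformer("all-MiniLM-L12-v2")
for p in text_encoder.parameters():
    p.requires_grad = False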

During training, both matching and non-matching image-caption pairs are used with a contrastive loss to optimize the visual representations. Only the visual encoder is used at inference time, enabling caption-free image similarity detection and content-based image retrieval. The framework benefits significantly from stronger textual encoders, as they provide high-quality semantic targets that enhance the learned visual features.
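
For concreteness, here is a minimal sketch of a contrastive objective of this kind (an InfoNCE-style formulation, assuming L2-normalized embeddings; the repository's exact loss may differ):

import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    # Matching image-caption pairs sit on the diagonal of the similarity
    # matrix; every other pair in the batch acts as a negative.
    logits = img_emb @ txt_emb.t() / temperature  # scaled cosine similarities
    targets = torch.arange(len(img_emb), device=img_emb.device)
    return F.cross_entropy(logits, targets)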

📝 Table of Contents

🏁 Getting Started
📦 Installation
🚀 Training and Visualization
✍️ Author

🏁 Getting Started

These instructions will help you set up and run the project on your local machine for training and evaluation purposes.

Prerequisites

Ensure you have Git installed, then clone the repository:

git clone https://github.com/Chaouki-AI/VisAlign
cd VisAlign/

📦 Installation

Ensure you have Anaconda installed on your machine. Then, run the following command to set up the environment:

conda create -n VisAlign python=3.9 -y
conda activate VisAlign

chmod +x ./installEnv.sh
./installEnv.sh

🚀 Training and Visualization

Training

To train the model, update the args.txt file. Then, start training with:

python main.py @args.txt
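
The @args.txt syntax is standard Python argparse behavior when the parser is built with fromfile_prefix_chars='@'. A minimal sketch of how it works (the flags shown are hypothetical, not the project's actual options):

import argparse

# With fromfile_prefix_chars, any @<file> argument is expanded by reading the
# file, one token per line (e.g. `--epochs` and `10` on separate lines).
parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
parser.add_argument("--epochs", type=int, default=10)  # hypothetical flag
parser.add_argument("--encoder", default="resnet")     # hypothetical flag
args = parser.parse_args()  # invoked as: python main.py @args.txt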

You can monitor the progress of training via TensorBoard; the log directory is created in the project folder.
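
For example (replace <log_dir> with the directory created for your run):

tensorboard --logdir ./<log_dir>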

(Figures: training loss evolution per iteration, and evaluation loss averaged over each epoch.)

After training, you can use the saved model (stored in the generated checkpoint folder) to visualize and evaluate the similarity between images and text. The cosine similarity is computed from the outputs of the visual encoder, while the text embeddings are generated by the pretrained textual encoder, all-MiniLM-L12-v2 in our case.
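
A rough sketch of this inference step (the VisualEncoder class from the earlier sketch, the checkpoint path, and the images tensor are all assumptions; adapt them to your run):

import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Hypothetical checkpoint path; use the file from the generated checkpoint folder.
visual_encoder = VisualEncoder()  # class from the earlier sketch
visual_encoder.load_state_dict(torch.load("checkpoints/best.pth", map_location="cpu"))
visual_encoder.eval()

text_encoder = SentenceTransformer("all-MiniLM-L12-v2")  # frozen textual reference

with torch.no_grad():
    img_emb = visual_encoder(images)  # images: preprocessed (B, 3, H, W) tensor
    txt_emb = torch.as_tensor(text_encoder.encode(["a dog playing in a park"]))
    scores = F.cosine_similarity(img_emb, txt_emb)  # one similarity score per image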

✍️ Author

M. Chaouki ZIARA is affiliated with the RCAM Laboratory, Department of Electronics, Djillali Liabes University, Sidi Bel Abbes, Algeria (Email: [email protected], [email protected]) – concept development, algorithm design, implementation, and manuscript writing.
