VisAlign: Aligning Visual Representations with Textual Semantics for Image Similarity and Retrieval


This project presents a multi-modal training framework in which a trainable visual encoder (based on either ResNet or a Swin Transformer) is aligned with a frozen, lightweight SentenceTransformer textual encoder. The objective is to learn a robust visual embedding space by mapping image features to match those of the pre-trained textual encoder, which serves as a semantic reference.
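
As a rough illustration, the sketch below shows one way such an alignment setup can be wired together in PyTorch. The class name VisualEncoder, the ResNet-50 backbone choice, and the projection dimensions are illustrative assumptions, not the repository's actual code:

import torch.nn as nn
import torchvision.models as models
from sentence_transformers import SentenceTransformer

class VisualEncoder(nn.Module):
    """Trainable image encoder projected into the text embedding space."""
    def __init__(self, embed_dim=384):  # all-MiniLM-L12-v2 produces 384-d vectors
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V2")
        backbone.fc = nn.Identity()             # drop the classification head
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)  # map 2048-d ResNet features to 384-d

    def forward(self, images):
        feats = self.backbone(images)
        return nn.functional.normalize(self.proj(feats), dim=-1)

# Frozen textual encoder, used only to produce semantic target embeddings.
text_encoder = SentenceTransformer("all-MiniLM-L12-v2")
for p in text_encoder.parameters():
    p.requires_grad = False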

During training, both matching and non-matching image-caption pairs are used with a contrastive loss to optimize the visual representations. Only the visual encoder is used at inference time, enabling caption-free image similarity detection and content-based image retrieval. The framework benefits significantly from stronger textual encoders, as they provide high-quality semantic targets that enhance the learned visual features.
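
For concreteness, here is a minimal sketch of a contrastive objective of this kind (an InfoNCE-style formulation, assuming L2-normalized embeddings; the repository's exact loss may differ):

import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    # Matching image-caption pairs sit on the diagonal of the similarity
    # matrix; every other pair in the batch acts as a negative.
    logits = img_emb @ txt_emb.t() / temperature  # scaled cosine similarities
    targets = torch.arange(len(img_emb), device=img_emb.device)
    return F.cross_entropy(logits, targets)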

📝 Table of Contents

🏁 Getting Started
📦 Installation
🚀 Training and Visualization
✍️ Author

🏁 Getting Started

These instructions will help you set up and run the project on your local machine for training and evaluation purposes.

Prerequisites

Ensure you have Git installed, then clone the repository:

git clone https://github.com/Chaouki-AI/VisAlign
cd VisAlign/

📦 Installation

Ensure you have Anaconda installed on your machine. Then, run the following command to set up the environment:

conda create -n VisAlign python=3.9 -y
conda activate VisAlign

chmod +x ./installEnv.sh
./installEnv.sh

🚀 Training and Visualization

Training

To train the model, update the args.txt file. Then, start training with:

python main.py @args.txt
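
The @args.txt syntax is standard Python argparse behavior when the parser is built with fromfile_prefix_chars='@'. A minimal sketch of how it works (the flags shown are hypothetical, not the project's actual options):

import argparse

# With fromfile_prefix_chars, any @<file> argument is expanded by reading the
# file, one token per line (e.g. `--epochs` and `10` on separate lines).
parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
parser.add_argument("--epochs", type=int, default=10)  # hypothetical flag
parser.add_argument("--encoder", default="resnet")     # hypothetical flag
args = parser.parse_args()  # invoked as: python main.py @args.txt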

You can monitor the progress of training via TensorBoard; the log directory is created in the project folder.
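
For example (replace <log_dir> with the directory created for your run):

tensorboard --logdir ./<log_dir>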

(Figures: training loss evolution per iteration, and evaluation loss averaged over each epoch.)

After training, you can use the saved model (stored in the generated checkpoint folder) to visualize and evaluate the similarity between images and text. The cosine similarity is computed from the outputs of the visual encoder, while the text embeddings are generated by the pretrained textual encoder, all-MiniLM-L12-v2 in our case.
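
A rough sketch of this inference step (the VisualEncoder class from the earlier sketch, the checkpoint path, and the images tensor are all assumptions; adapt them to your run):

import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Hypothetical checkpoint path; use the file from the generated checkpoint folder.
visual_encoder = VisualEncoder()  # class from the earlier sketch
visual_encoder.load_state_dict(torch.load("checkpoints/best.pth", map_location="cpu"))
visual_encoder.eval()

text_encoder = SentenceTransformer("all-MiniLM-L12-v2")  # frozen textual reference

with torch.no_grad():
    img_emb = visual_encoder(images)  # images: preprocessed (B, 3, H, W) tensor
    txt_emb = torch.as_tensor(text_encoder.encode(["a dog playing in a park"]))
    scores = F.cosine_similarity(img_emb, txt_emb)  # one similarity score per image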

✍️ Author

M. Chaouki ZIARA is affiliated with the RCAM Laboratory, Department of Electronics, Djillali Liabes University, Sidi Bel Abbes, Algeria (Email: [email protected], [email protected]) – concept development, algorithm design, implementation, and manuscript writing.
