This project presents a multi-modal training framework in which a trainable visual encoder (based on either ResNet or Swin Transformer) is aligned with a frozen, lightweight SentenceTransformer textual encoder. The objective is to learn a robust visual embedding space by mapping image features to match those of the pre-trained textual encoder, which serves as a semantic reference.
During training, both matching and non-matching image-caption pairs are used with a contrastive loss to optimize visual representations. Only the visual encoder is used during inference, enabling caption-free image similarity detection and content-based image retrieval. The framework benefits significantly from stronger textual encoders as they provide high-quality semantic targets that enhance the learned visual features.
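To make the training objective concrete, here is a minimal, framework-free sketch of a pairwise contrastive loss on cosine similarity. It is an illustration only: the exact loss used in this repository may differ, and the function names (`cosine_similarity`, `pair_contrastive_loss`, `batch_loss`) and the `margin` parameter are assumptions introduced for this example. The form shown follows the same idea as PyTorch's `CosineEmbeddingLoss`: matching pairs are pulled together, non-matching pairs are pushed below a margin.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_contrastive_loss(image_emb, text_emb, is_match, margin=0.0):
    """Loss for one image-caption pair (hypothetical helper).

    Matching pair:     1 - cos(image, text), zero when perfectly aligned.
    Non-matching pair: max(0, cos(image, text) - margin), zero once the
                       similarity drops to the margin or below.
    """
    sim = cosine_similarity(image_emb, text_emb)
    if is_match:
        return 1.0 - sim
    return max(0.0, sim - margin)


def batch_loss(pairs, margin=0.0):
    """Mean loss over (image_emb, text_emb, is_match) triples."""
    losses = [pair_contrastive_loss(i, t, m, margin) for i, t, m in pairs]
    return sum(losses) / len(losses)


# Toy 2-D embeddings; real ones would come from the visual encoder
# (trainable) and the frozen textual encoder.
pairs = [
    ([1.0, 0.0], [1.0, 0.0], True),   # aligned match
    ([1.0, 0.0], [0.0, 1.0], False),  # orthogonal non-match
    ([1.0, 0.0], [1.0, 0.0], False),  # non-match that is wrongly aligned
]
loss = batch_loss(pairs)
```

In this toy batch only the third pair contributes to the loss, since it is a non-matching pair whose embeddings coincide. Because the textual encoder is frozen, only the image side would receive gradients in an actual training loop.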
These instructions will help you set up and run the project on your local machine for training and evaluation purposes.
Ensure you have Git installed, then clone the repository:
git clone https://github.com/Chaouki-AI/VisAlign
cd VisAlign/
Ensure you have Anaconda installed on your machine. Then run the following commands to set up the environment:
conda create -n VisAlign python=3.9 -y
conda activate VisAlign
chmod +x ./installEnv.sh
./installEnv.sh
To train the model, update the args.txt file. Then start training with:
python main.py @args.txt
You can monitor training progress with TensorBoard. The log directory is created in the project folder.
Below are the images showing the training loss evolution (per iteration) and the evaluation loss (mean over each epoch).
After training, you can use the saved model (stored in the generated checkpoint folder) to visualize and evaluate the similarity between images and text. Cosine similarity is computed between the outputs of the visual encoder and the text embeddings generated by the pretrained textual encoder, which is all-MiniLM-L12-v2 in our case.
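The similarity evaluation described above can be sketched as follows. This is a minimal illustration with placeholder vectors, not the repository's evaluation script: `rank_captions` is a hypothetical helper, and in practice `image_emb` would be produced by the trained visual encoder and `caption_embs` by the textual encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_captions(image_emb, caption_embs):
    """Return (caption, score) pairs sorted from most to least similar."""
    scores = {
        caption: cosine_similarity(image_emb, emb)
        for caption, emb in caption_embs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder embeddings, chosen by hand for illustration only.
image_emb = [0.9, 0.1, 0.0]
caption_embs = {
    "a dog on grass": [1.0, 0.0, 0.0],
    "a city at night": [0.0, 1.0, 0.0],
}

ranking = rank_captions(image_emb, caption_embs)
best_caption, best_score = ranking[0]
```

The same function works for image-to-image retrieval if the dictionary holds embeddings of other images instead of captions, which is the caption-free use case at inference time.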
M. Chaouki ZIARA is affiliated with the RCAM Laboratory, Department of Electronics, Djillali Liabes University, Sidi Bel Abbes, Algeria. (Email: [email protected], [email protected]) – concept creator, algorithm development, implementation, and manuscript writing.