Improving visual similarity search accuracy - model recommendations? #166577
-
I've built similar systems. Here's what actually moves the needle:

**Models That Work**

Best performers:
Product-specific:

**Key Techniques**

1. Multi-scale extraction
2. Background removal
3. Region features
4. Fine-tuning

**Architecture Tips**

Two-stage search:
Hybrid scoring:
Query expansion:

(a rough sketch of the two-stage search with hybrid scoring is at the end of this comment)

**What Gave Us Best Results**

Pro tip: A/B test everything. What works varies by product type.
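Since the tips above are just labels, here's a minimal NumPy-only sketch of the two-stage search with hybrid scoring; `coarse_db`, `fine_db`, `text_scores`, the weights, and the cutoffs are illustrative assumptions, and a real deployment would use an ANN index (e.g., Qdrant) for stage one rather than brute-force similarity.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def two_stage_search(query_coarse, query_fine, coarse_db, fine_db,
                     text_scores=None, k1=200, k2=20, alpha=0.7):
    """Stage 1: cheap retrieval on compact embeddings.
    Stage 2: re-rank the candidates with the full embeddings,
    optionally blended with a text/metadata score (hybrid scoring)."""
    # Stage 1: cosine similarity over the small embeddings
    sims_coarse = l2_normalize(coarse_db) @ l2_normalize(query_coarse)
    candidates = np.argsort(-sims_coarse)[:k1]

    # Stage 2: re-score only the candidates with the large embeddings
    scores = l2_normalize(fine_db[candidates]) @ l2_normalize(query_fine)
    if text_scores is not None:
        # Hybrid score: weighted blend of visual and text similarity
        scores = alpha * scores + (1 - alpha) * text_scores[candidates]

    order = np.argsort(-scores)[:k2]
    return candidates[order], scores[order]
```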
-
Anytime @matthiaskaminski! Happy to dive into the implementation details:

**1. Embedding Combination**

We use weighted concatenation for best results:

```python
import numpy as np

def combine_embeddings(clip_emb, dino_emb, convnext_emb):
    # Normalize each embedding separately first
    clip_norm = clip_emb / np.linalg.norm(clip_emb)
    dino_norm = dino_emb / np.linalg.norm(dino_emb)
    convnext_norm = convnext_emb / np.linalg.norm(convnext_emb)

    # Concatenate with weights
    combined = np.concatenate([
        clip_norm * 0.4,
        dino_norm * 0.3,
        convnext_norm * 0.3,
    ])

    # Final normalization
    return combined / np.linalg.norm(combined)
```

We tried averaging and separate searches; concatenation won by ~15%. Just make sure your vector DB handles the larger dimensions well.

**2. Background Removal Strategy**

Database images: yes, backgrounds are removed offline during preprocessing. At query time:
```python
# Production setup:
if latency_critical:
    # Just center crop + padding (~30 ms)
    query_img = center_pad_square(query_img)
else:
    # Full background removal (200-300 ms with rembg)
    query_img = remove_background(query_img)
```

We cache removed backgrounds for frequent queries. Also, lightweight U²-Net models can do decent removal in ~100 ms.
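If it helps, here's a minimal sketch of the caching idea: key the cache by a content hash of the uploaded image so repeated queries skip the expensive removal step. The helper name and the use of rembg's `remove` function are assumptions for illustration, not the exact setup described above.

```python
# Minimal sketch (assumed setup): cache background-removed images by content hash.
import hashlib

from rembg import remove  # full removal is the expensive step (~200-300 ms)

_bg_cache: dict[str, bytes] = {}

def remove_background_cached(image_bytes: bytes) -> bytes:
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _bg_cache:
        _bg_cache[key] = remove(image_bytes)  # only pay the cost on a cache miss
    return _bg_cache[key]
```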
**3. Region Features**

We use a single combined vector per product:

```python
def create_product_embedding(image):
    features = []

    # Full image
    features.append(model(image))

    # Center crop (80% of the image)
    center = center_crop(image, 0.8)
    features.append(model(center))

    # 2x2 grid patches
    for patch in extract_grid(image, 2, 2):
        features.append(model(patch))

    # Average-pool all the feature vectors
    combined = np.mean(features, axis=0)
    return normalize(combined)
```

Multiple vectors per product works too, but it complicates retrieval logic and storage.

**4. Custom Training Experience**

Yes, with mixed results. What worked:

```python
# Our best custom training setup
model = timm.create_model('vit_base_patch16_224', pretrained=True)
# Replace classifier with ArcFace head
model.head = ArcFaceHead(embed_dim=768, num_classes=num_products)
# Loss combines:
loss = 0.7 * arcface_loss + 0.3 * triplet_loss
```

Results:

The effort to beat a CLIP+DINOv2 ensemble is usually not worth it unless you have very specific domain requirements (like medical devices or specialized fashion). If you go custom, PyTorch Metric Learning's …
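For anyone going the custom route, here's a rough sketch of combining an ArcFace loss with a triplet loss using PyTorch Metric Learning; the backbone, miner, hyperparameters, `num_products`, and the data loader are illustrative assumptions rather than the exact recipe above.

```python
# Hedged sketch (assumed setup): ArcFace + triplet loss with pytorch-metric-learning.
import timm
import torch
from pytorch_metric_learning import losses, miners

num_products = 10_000  # placeholder catalog size

# Backbone that outputs embeddings only (no classifier head)
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)

arcface = losses.ArcFaceLoss(num_classes=num_products, embedding_size=768)
triplet = losses.TripletMarginLoss(margin=0.1)
miner = miners.MultiSimilarityMiner()

# ArcFaceLoss owns a learnable class-weight matrix, so optimize its parameters too
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(arcface.parameters()), lr=1e-5
)

for images, labels in dataloader:  # assumed DataLoader of (images, product-id labels)
    embeddings = model(images)
    hard_pairs = miner(embeddings, labels)
    # Weighted combination, mirroring the 0.7 / 0.3 split above
    loss = 0.7 * arcface(embeddings, labels) + 0.3 * triplet(embeddings, labels, hard_pairs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```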
-
I've worked on similar image retrieval tasks, and you're on the right track using DINOv2, OpenCLIP, and Qdrant. Here are some ideas to help improve accuracy:

**Model Architectures**

- Fine-tuning OpenCLIP on your own product images (even with a small dataset) can improve domain alignment significantly.
- Multimodal hybrids: combine image embeddings (e.g., from DINOv2) with text embeddings using a projection layer or late fusion.

**Improving Embedding Quality**

- Apply hard negative mining to teach the model to better distinguish subtle differences (a small sketch follows at the end of this comment).
- Consider dimensionality reduction after normalization to improve clustering and separation.

**Best Practices**

- Use cosine similarity for CLIP-style embeddings.
- For better precision, re-rank the top-N results with a small classifier or a similarity model trained on actual search relevance data.
- Periodically retrain the model or refresh the vector index as your product catalog evolves.
- If you have weak labels (e.g., category, brand), you can also incorporate them into the loss function to guide the embedding space.

Let me know if you'd like to compare pipeline structures.
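As a concrete illustration of the hard negative mining point, here's a small PyTorch sketch of in-batch mining for a triplet-style objective; the margin, batch construction, and labels are assumptions rather than a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def hard_negative_triplet_loss(embeddings, labels, margin=0.2):
    """For each anchor, take the least similar same-label item as the positive
    and the most similar different-label item as the (hard) negative."""
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                                   # cosine similarity matrix
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-label mask
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=emb.device)

    # Hardest positive: same label (excluding self), lowest similarity
    pos_sim = sim.masked_fill(~same | self_mask, float('inf')).min(dim=1).values
    # Hardest negative: different label, highest similarity
    neg_sim = sim.masked_fill(same, float('-inf')).max(dim=1).values

    # Standard triplet hinge, expressed on similarities
    return F.relu(neg_sim - pos_sim + margin).mean()
```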
-
Question
Working on a visual similarity search system where users upload images to find similar items in a product database.

What I've tried:
- OpenAI text embeddings on product descriptions
- DINOv2 for visual features
- OpenCLIP multimodal approach
- Vector search using Qdrant

Results are decent but not great, so I'm looking to improve accuracy. Has anyone worked on similar image retrieval challenges? Specifically interested in:
- Model architectures that work well for product similarity
- Techniques to improve embedding quality
- Best practices for this type of search

Any insights appreciated!