Improving visual similarity search accuracy - model recommendations? #166577
-
I've built similar systems. Here's what actually moves the needle:

**Models That Work**

Best performers:
Product-specific:

**Key Techniques**

1. Multi-scale extraction
2. Background removal
3. Region features
4. Fine-tuning

**Architecture Tips**

Two-stage search:
Hybrid scoring:
Query expansion:

(a rough sketch of the two-stage search with hybrid scoring is at the end of this comment)

**What Gave Us Best Results**

Pro tip: A/B test everything. What works varies by product type.
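Since the tips above are just labels, here's a minimal NumPy-only sketch of the two-stage search with hybrid scoring; `coarse_db`, `fine_db`, `text_scores`, the weights, and the cutoffs are illustrative assumptions, and a real deployment would use an ANN index (e.g., Qdrant) for stage one rather than brute-force similarity.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def two_stage_search(query_coarse, query_fine, coarse_db, fine_db,
                     text_scores=None, k1=200, k2=20, alpha=0.7):
    """Stage 1: cheap retrieval on compact embeddings.
    Stage 2: re-rank the candidates with the full embeddings,
    optionally blended with a text/metadata score (hybrid scoring)."""
    # Stage 1: cosine similarity over the small embeddings
    sims_coarse = l2_normalize(coarse_db) @ l2_normalize(query_coarse)
    candidates = np.argsort(-sims_coarse)[:k1]

    # Stage 2: re-score only the candidates with the large embeddings
    scores = l2_normalize(fine_db[candidates]) @ l2_normalize(query_fine)
    if text_scores is not None:
        # Hybrid score: weighted blend of visual and text similarity
        scores = alpha * scores + (1 - alpha) * text_scores[candidates]

    order = np.argsort(-scores)[:k2]
    return candidates[order], scores[order]
```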
-
Anytime @matthiaskaminski! Happy to dive into the implementation details:

**1. Embedding Combination**

We use weighted concatenation for best results:

```python
import numpy as np

def combine_embeddings(clip_emb, dino_emb, convnext_emb):
    # Normalize each embedding separately first
    clip_norm = clip_emb / np.linalg.norm(clip_emb)
    dino_norm = dino_emb / np.linalg.norm(dino_emb)
    convnext_norm = convnext_emb / np.linalg.norm(convnext_emb)

    # Concatenate with weights
    combined = np.concatenate([
        clip_norm * 0.4,
        dino_norm * 0.3,
        convnext_norm * 0.3,
    ])

    # Final normalization
    return combined / np.linalg.norm(combined)
```

We tried averaging and separate searches; concatenation won by ~15%. Just make sure your vector DB handles the larger dimensions well.

**2. Background Removal Strategy**

Database images: yes, backgrounds are removed offline during preprocessing. At query time:
```python
# Production setup:
if latency_critical:
    # Just center crop + padding (~30 ms)
    query_img = center_pad_square(query_img)
else:
    # Full background removal (200-300 ms with rembg)
    query_img = remove_background(query_img)
```

We cache removed backgrounds for frequent queries. Also, lightweight U²-Net models can do decent removal in ~100 ms.
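If it helps, here's a minimal sketch of the caching idea: key the cache by a content hash of the uploaded image so repeated queries skip the expensive removal step. The helper name and the use of rembg's `remove` function are assumptions for illustration, not the exact setup described above.

```python
# Minimal sketch (assumed setup): cache background-removed images by content hash.
import hashlib

from rembg import remove  # full removal is the expensive step (~200-300 ms)

_bg_cache: dict[str, bytes] = {}

def remove_background_cached(image_bytes: bytes) -> bytes:
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _bg_cache:
        _bg_cache[key] = remove(image_bytes)  # only pay the cost on a cache miss
    return _bg_cache[key]
```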
**3. Region Features**

We use a single combined vector per product:

```python
def create_product_embedding(image):
    features = []

    # Full image
    features.append(model(image))

    # Center crop (80% of the image)
    center = center_crop(image, 0.8)
    features.append(model(center))

    # 2x2 grid patches
    for patch in extract_grid(image, 2, 2):
        features.append(model(patch))

    # Average-pool all the feature vectors
    combined = np.mean(features, axis=0)
    return normalize(combined)
```

Multiple vectors per product works too, but it complicates retrieval logic and storage.

**4. Custom Training Experience**

Yes, with mixed results. What worked:

```python
# Our best custom training setup
model = timm.create_model('vit_base_patch16_224', pretrained=True)
# Replace classifier with ArcFace head
model.head = ArcFaceHead(embed_dim=768, num_classes=num_products)
# Loss combines:
loss = 0.7 * arcface_loss + 0.3 * triplet_loss
```

Results:

The effort to beat a CLIP+DINOv2 ensemble is usually not worth it unless you have very specific domain requirements (like medical devices or specialized fashion). If you go custom, PyTorch Metric Learning's …
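For anyone going the custom route, here's a rough sketch of combining an ArcFace loss with a triplet loss using PyTorch Metric Learning; the backbone, miner, hyperparameters, `num_products`, and the data loader are illustrative assumptions rather than the exact recipe above.

```python
# Hedged sketch (assumed setup): ArcFace + triplet loss with pytorch-metric-learning.
import timm
import torch
from pytorch_metric_learning import losses, miners

num_products = 10_000  # placeholder catalog size

# Backbone that outputs embeddings only (no classifier head)
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)

arcface = losses.ArcFaceLoss(num_classes=num_products, embedding_size=768)
triplet = losses.TripletMarginLoss(margin=0.1)
miner = miners.MultiSimilarityMiner()

# ArcFaceLoss owns a learnable class-weight matrix, so optimize its parameters too
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(arcface.parameters()), lr=1e-5
)

for images, labels in dataloader:  # assumed DataLoader of (images, product-id labels)
    embeddings = model(images)
    hard_pairs = miner(embeddings, labels)
    # Weighted combination, mirroring the 0.7 / 0.3 split above
    loss = 0.7 * arcface(embeddings, labels) + 0.3 * triplet(embeddings, labels, hard_pairs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```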
-
I've worked on similar image retrieval tasks, and you're on the right track using DINOv2, OpenCLIP, and Qdrant. Here are some ideas to help improve accuracy:

**Model Architectures**

- Fine-tuning OpenCLIP on your own product images (even with a small dataset) can improve domain alignment significantly.
- Multimodal hybrids: combine image embeddings (e.g., from DINOv2) with text embeddings using a projection layer or late fusion.

**Improving Embedding Quality**

- Apply hard negative mining to teach the model to better distinguish subtle differences (a small sketch follows at the end of this comment).
- Consider dimensionality reduction after normalization to improve clustering and separation.

**Best Practices**

- Use cosine similarity for CLIP-style embeddings.
- For better precision, re-rank the top-N results with a small classifier or a similarity model trained on actual search relevance data.
- Periodically retrain the model or refresh the vector index as your product catalog evolves.
- If you have weak labels (e.g., category, brand), you can also incorporate them into the loss function to guide the embedding space.

Let me know if you'd like to compare pipeline structures.
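As a concrete illustration of the hard negative mining point, here's a small PyTorch sketch of in-batch mining for a triplet-style objective; the margin, batch construction, and labels are assumptions rather than a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def hard_negative_triplet_loss(embeddings, labels, margin=0.2):
    """For each anchor, take the least similar same-label item as the positive
    and the most similar different-label item as the (hard) negative."""
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                                   # cosine similarity matrix
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-label mask
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=emb.device)

    # Hardest positive: same label (excluding self), lowest similarity
    pos_sim = sim.masked_fill(~same | self_mask, float('inf')).min(dim=1).values
    # Hardest negative: different label, highest similarity
    neg_sim = sim.masked_fill(same, float('-inf')).max(dim=1).values

    # Standard triplet hinge, expressed on similarities
    return F.relu(neg_sim - pos_sim + margin).mean()
```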
-
Question
Working on a visual similarity search system where users upload images to find similar items in a product database.

What I've tried:
- OpenAI text embeddings on product descriptions
- DINOv2 for visual features
- OpenCLIP multimodal approach
- Vector search using Qdrant

Results are decent but not great, so I'm looking to improve accuracy. Has anyone worked on similar image retrieval challenges? Specifically interested in:
- Model architectures that work well for product similarity
- Techniques to improve embedding quality
- Best practices for this type of search

Any insights appreciated!