[Model] Siglip Embedding Support #27324
Conversation
Documentation preview: https://vllm--27324.org.readthedocs.build/en/27324/
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds support for SigLIP text and image embedding models, which is a great extension of vLLM's multimodal capabilities. The implementation follows the existing architecture for CLIP, including separate handling of text and image inputs. The changes are well-structured, with new tests, examples, and updates to the model registry. I have one suggestion regarding the pooling mechanism to improve performance and memory efficiency.
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/v1/core/sched/scheduler.py (outdated)
    kv_cache_config=kv_cache_config,
    max_model_len=self.max_model_len,
-   enable_caching=self.cache_config.enable_prefix_caching,
+   enable_caching=enable_caching,
@noooop shouldn't enable prefix caching be disabled for encoder-only models already? Why do we still need this?
noooop left a comment:
Thanks for your contribution
Fixed `get_num_image_tokens` and added detailed documentation. All other issues have been addressed. Ready for re-review.
/gemini review
Code Review
This pull request adds support for SigLIP text and image embedding models. The changes include adding the model to the registry, providing example usage scripts, and implementing the model logic in vllm/model_executor/models/siglip.py. The implementation correctly handles separate text and image inputs and reuses encoder components for both modalities. The tests are comprehensive. I have two main concerns:
- A critical performance issue in the vision tower's pooling head, which uses a non-optimized `torch.nn.MultiheadAttention` instead of a vLLM-optimized attention backend.
- A bug where the model will crash if `pooling_type='LAST'` is used, due to a missing entry in the pooling strategy map. This contradicts the behavior described in the PR description.
These issues should be addressed before merging.
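To illustrate the second concern, here is a minimal, hypothetical sketch of a pooling-strategy dispatch table; the actual names and structure in vllm/model_executor/models/siglip.py may differ.

```python
import torch

def _pool_mean(hidden: torch.Tensor) -> torch.Tensor:
    # Average the hidden states over the sequence dimension.
    return hidden.mean(dim=0)

def _pool_last(hidden: torch.Tensor) -> torch.Tensor:
    # Take the hidden state of the final token.
    return hidden[-1]

# Hypothetical strategy map: if the "LAST" entry is missing here, requesting
# pooling_type='LAST' fails with a KeyError, which is the crash the review
# describes.
_POOLING_STRATEGIES = {
    "MEAN": _pool_mean,
    "LAST": _pool_last,
}

def pool(hidden: torch.Tensor, pooling_type: str) -> torch.Tensor:
    return _POOLING_STRATEGIES[pooling_type](hidden)
```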
Signed-off-by: piood <[email protected]>
Signed-off-by: piood <[email protected]>
Signed-off-by: piood <[email protected]>
Signed-off-by: piood <[email protected]>
Force-pushed from 3b1e7ea to 940ce7d (Compare)
Signed-off-by: piood <[email protected]>
Signed-off-by: piood <[email protected]>
Signed-off-by: piood <[email protected]>
Head branch was pushed to by a user without write access
Signed-off-by: piood <[email protected]>
Signed-off-by: piood <[email protected]> Signed-off-by: 0xrushi <[email protected]>
Signed-off-by: piood <[email protected]> Signed-off-by: 0xrushi <[email protected]>
Signed-off-by: piood <[email protected]>
Signed-off-by: piood <[email protected]>
Purpose
Support SigLIP text and image embedding in the same model, following the same architecture as CLIP embedding support.
- For text inputs, only `token_embedding` is applied when calling `get_input_embeddings`; the rest of the text embedding and the encoder logic are applied when calling `forward` on the model.
- For image inputs, the image embeddings are computed in `get_input_embeddings`. Since the model doesn't have a decoder, we directly return the embeddings inside the `forward` method.

This PR extends the multimodal embedding capabilities to support SigLIP models, which are widely used for vision-language tasks.
Related to #13663: this PR adds SigLIP (v1) embedding support; SigLIP 2 support will follow.
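For orientation, the snippet below is a rough sketch (not the exact example added in this PR) of how text and image embeddings could be requested offline, modeled on the existing CLIP embedding examples; the model ID, the task argument, and the empty-prompt convention for images are assumptions.

```python
from PIL import Image
from vllm import LLM

# Hypothetical checkpoint; the supported SigLIP model IDs are listed in the
# docs/examples added by this PR.
llm = LLM(model="google/siglip-base-patch16-224", task="embed")

# Text embedding: the prompt goes through token_embedding and the text encoder.
text_out = llm.embed("a photo of a cat")
print(len(text_out[0].outputs.embedding))

# Image embedding: the image is encoded by the vision tower; since there is
# no decoder, the embedding is returned directly.
image = Image.open("cat.jpg")
image_out = llm.embed({"prompt": "", "multi_modal_data": {"image": image}})
print(len(image_out[0].outputs.embedding))
```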
Test Plan
tests/models/multimodal/pooling/test_siglip.py
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.