
Conversation

@piood piood (Contributor) commented Oct 22, 2025

Purpose

Support SigLIP text and image embedding in the same model, following the same architecture as CLIP embedding support.

  • For text inputs, we apply only token_embedding when calling get_input_embeddings; the rest of the text embedding and encoder logic runs when forward is called on the model (see the sketch after this list).
  • For image inputs, we apply the vision embeddings when calling get_input_embeddings. Since the model has no decoder, we return the embeddings directly from the forward method.
  • Unlike CLIP, SigLIP uses an encoder-only architecture, so prefix caching is disabled for SigLIP, as discussed in [Model] Siglip Embedding Support #27324 (comment), and CLS is used as the default pooling type.
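
A conceptual sketch of this split, with simplified, hypothetical signatures rather than vLLM's actual interfaces:

    import torch
    import torch.nn as nn

    class SiglipEmbeddingSketch(nn.Module):
        """Illustrative only: mirrors the text/image split described above."""

        def __init__(self, token_embedding: nn.Embedding,
                     text_encoder: nn.Module, vision_tower: nn.Module):
            super().__init__()
            self.token_embedding = token_embedding  # text token lookup only
            self.text_encoder = text_encoder        # rest of the text stack
            self.vision_tower = vision_tower        # full vision embedding path

        def get_input_embeddings(self, input_ids, pixel_values=None):
            if pixel_values is not None:
                # Image inputs: apply the vision embeddings here.
                return self.vision_tower(pixel_values)
            # Text inputs: only the token-embedding lookup happens here.
            return self.token_embedding(input_ids)

        def forward(self, inputs_embeds, is_image=False):
            if is_image:
                # Encoder-only model, no decoder: image embeddings pass through.
                return inputs_embeds
            # Text inputs: the remaining embedding and encoder logic runs here.
            return self.text_encoder(inputs_embeds)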

This PR extends the multimodal embedding capabilities to support SigLIP models, which are widely used for vision-language tasks.

Related to #13663. This PR adds SigLIP (v1) embedding support; SigLIP2 support will follow in a subsequent PR.

Test Plan

  • Added dedicated tests in tests/models/multimodal/pooling/test_siglip.py
  • Updated model registry to include SigLIP embedding support
  • Added examples for both offline inference and online serving (a minimal sketch follows this list)
  • Verified with local test runs
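
For reference, a minimal offline-inference sketch in the spirit of the added example; the checkpoint name and the runner flag are assumptions, and the exact API may differ across vLLM versions:

    from PIL import Image
    from vllm import LLM

    # Hypothetical checkpoint; any SigLIP embedding model should look similar.
    llm = LLM(model="google/siglip-base-patch16-224", runner="pooling")

    # Text embedding: a plain string prompt.
    text_out = llm.embed("a photo of two cats")

    # Image embedding: multimodal input with an empty text prompt.
    image = Image.open("cats.png")
    image_out = llm.embed({"prompt": "", "multi_modal_data": {"image": image}})

    print(len(text_out[0].outputs.embedding))
    print(len(image_out[0].outputs.embedding))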

Test Result

  • All new SigLIP-specific tests pass
  • Model registry correctly recognizes SigLIP embedding models
  • Both text and image embedding generation work as expected

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Oct 22, 2025

Documentation preview: https://vllm--27324.org.readthedocs.build/en/27324/

@mergify mergify bot added the documentation, multi-modality (#4194), new-model, and v1 labels Oct 22, 2025
mergify bot commented Oct 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @piood.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 22, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for SigLIP text and image embedding models, which is a great extension of vLLM's multimodal capabilities. The implementation follows the existing architecture for CLIP, including separate handling of text and image inputs. The changes are well-structured, with new tests, examples, and updates to the model registry. I have one suggestion regarding the pooling mechanism to improve performance and memory efficiency.

@piood piood mentioned this pull request Oct 22, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

This review thread is anchored on the following diff:

     kv_cache_config=kv_cache_config,
     max_model_len=self.max_model_len,
-    enable_caching=self.cache_config.enable_prefix_caching,
+    enable_caching=enable_caching,

@DarkLight1337 DarkLight1337 (Member) commented Oct 22, 2025

@noooop shouldn't enable prefix caching be disabled for encoder-only models already? Why do we still need this?

@noooop noooop (Collaborator) left a comment

Thanks for your contribution

@piood piood (Contributor, Author) left a comment

Fixed get_num_image_tokens with detailed documentation. All other issues have been addressed. Ready for re-review.

@DarkLight1337 DarkLight1337 (Member) commented

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for SigLIP text and image embedding models. The changes include adding the model to the registry, providing example usage scripts, and implementing the model logic in vllm/model_executor/models/siglip.py. The implementation correctly handles separate text and image inputs and reuses encoder components for both modalities. The tests are comprehensive. I have two main concerns:

  1. A critical performance issue in the vision tower's pooling head, which uses the non-optimized torch.nn.MultiheadAttention instead of a vLLM-optimized attention backend.
  2. A bug where the model will crash if pooling_type='LAST' is used, due to a missing entry in the pooling strategy map (sketched below); this contradicts the behavior described in the PR description.

These issues should be addressed before merging.
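
To make the second point concrete, the kind of pooling dispatch the review refers to might look roughly like this (names are illustrative, not vLLM's actual code):

    import torch

    # Hypothetical pooling dispatch; omitting "LAST" reproduces the kind of
    # crash described above (a lookup failure at pooling time).
    _POOLERS = {
        "CLS": lambda h: h[:, 0],        # first-token pooling (the default here)
        "MEAN": lambda h: h.mean(dim=1),
        "LAST": lambda h: h[:, -1],      # the entry the review says was missing
    }

    def pool(hidden: torch.Tensor, pooling_type: str = "CLS") -> torch.Tensor:
        if pooling_type not in _POOLERS:
            raise ValueError(f"Unsupported pooling_type: {pooling_type}")
        return _POOLERS[pooling_type](hidden)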

@piood piood force-pushed the support-siglip-emb branch from 3b1e7ea to 940ce7d on October 23, 2025 at 16:31
@mergify mergify bot removed the needs-rebase label Oct 23, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 23, 2025 17:04
auto-merge was automatically disabled October 23, 2025 17:14

Head branch was pushed to by a user without write access

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 23, 2025 17:17
@DarkLight1337 DarkLight1337 merged commit 0552cfb into vllm-project:main Oct 23, 2025
55 checks passed
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Oct 25, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
@piood piood mentioned this pull request Oct 27, 2025
@piood piood changed the title [Model] Siglip Embedding Support [Model] Siglip2 Embedding Support Oct 27, 2025
@piood piood changed the title [Model] Siglip2 Embedding Support [Model] Siglip Embedding Support Oct 27, 2025
@noooop noooop mentioned this pull request Oct 31, 2025
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

Labels

  • documentation: Improvements or additions to documentation
  • multi-modality: Related to multi-modality (#4194)
  • new-model: Requests to new models
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • v1
