
Conversation

@qthequartermasterman (Contributor) commented on Sep 4, 2025

Purpose

Fixes #22124. Fixes #19746.

Prompt embedding inputs are a niche but frequently requested feature in vLLM. #15428 introduced them in the v0 engine, but they have not yet been ported to the v1 engine. Prompt embedding users will be stuck on older versions of vLLM unless the feature is also brought to the v1 engine.

The original RFC is #22124. The design differs from that RFC in three ways:

  1. Mixed batches (containing both prompt_embeds and prompt_token_ids) are handled inside GPUModelRunner.execute_model itself: tokens that arrive as IDs rather than as prompt embeddings are first transformed into embeddings and then sent through the model. This is similar in spirit to how multi-modal embeddings are mixed with input_ids for multi-modal models. Since the model outputs token IDs anyway, it was significantly cleaner to handle the mixing there than in the scheduler, as the RFC and the v0 engine do (see the sketch after this list).
  2. The "double compilation" of the CUDA graph (once with input_ids and once with inputs_embeds), as in the RFC and the v0 engine, is eschewed. Instead, when prompt embeddings are enabled, all token IDs are transformed into embeddings outside the compiled graph, and only inputs_embeds are passed in. This has a performance cost, but it mirrors how multimodal models are treated today, and it only happens when --enable-prompt-embeds is on (it is off by default). The double-compilation approach would require significant work and grew large enough during prototyping that I decided to land just the v1 + prompt embeds pieces first; this PR is already large enough.
  3. This goes further than the RFC by disabling prefix_caching when it is enabled alongside prompt_embeds. Prefix caching with prompt embeddings did not work in v0 and does not yet work in v1; future work can enable this support. Since prefix caching is now on by default, it must be disabled whenever --enable-prompt-embeds is on.
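
For illustration, here is a rough Python sketch of how points 1 and 2 fit together. The names build_inputs_embeds, is_token_ids, and embed_tokens are placeholders for this sketch, not the exact internals of GPUModelRunner:

import torch

def build_inputs_embeds(token_ids: torch.Tensor,
                        prompt_embeds: torch.Tensor,
                        is_token_ids: torch.Tensor,
                        embed_tokens: torch.nn.Module) -> torch.Tensor:
    """Merge token-ID rows and prompt-embedding rows into one inputs_embeds batch.

    token_ids:     [num_tokens]         token IDs (only meaningful where is_token_ids is True)
    prompt_embeds: [num_tokens, hidden] user-supplied embeddings (meaningful elsewhere)
    is_token_ids:  [num_tokens]         bool mask, True where the input arrived as a token ID
    """
    inputs_embeds = prompt_embeds.clone()
    id_positions = is_token_ids.nonzero(as_tuple=True)[0]
    # Embed the token-ID positions outside the compiled graph, so the graph only
    # ever sees a single inputs_embeds tensor and no second CUDA-graph capture
    # with input_ids is needed.
    inputs_embeds[id_positions] = embed_tokens(token_ids[id_positions])
    return inputs_embeds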

Test Plan

Several existing unit tests already cover prompt embeds, but they were previously disabled on the v1 engine; I enabled them. I also added more scenarios to the basic correctness tests to catch regressions related to tensor_parallel + prompt_embeds.

I also ran a local script based on https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/prompt_embed_inference.py against a wide variety of combinations of inputs (prompts, prompt_embeds, and prompts + prompt_embeds) at many different seq_lens (ranging from very short to very long) within the same batch, across a variety of settings (including eager mode on/off, chunked prefill on/off, and various tensor-parallel sizes).
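
A minimal sketch of the kind of offline check described above, loosely based on the linked example. The model name is an arbitrary choice, and the enable_prompt_embeds argument and the {"prompt_embeds": ...} input form are assumed from this PR's flag and the example script, so exact names may differ in your version:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # arbitrary model choice for this sketch

# Compute prompt embeddings offline with the HF embedding table.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL)
token_ids = tokenizer("Tell me a joke about GPUs.", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)

# Mixed batch: one request passed as embeddings, one as plain text.
llm = LLM(model=MODEL, enable_prompt_embeds=True)
outputs = llm.generate(
    [{"prompt_embeds": prompt_embeds}, {"prompt": "Tell me a joke about GPUs."}],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)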

Test Result

All the new tests are passing. My local script suite is also passing, and the generations look as expected in every configuration I've checked on my Linux machine with two NVIDIA GPUs.

Pending CI test results. With any luck I didn't break anything else. 🤞


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Andrew Sansom <[email protected]>
The mergify bot added the frontend, v1, and tpu (Related to Google TPUs) labels on Sep 4, 2025.
@WoosukKwon (Collaborator) commented:

Can we merge after #25025? It seems the tests are related.

@qthequartermasterman (Contributor, PR author) commented:

@DarkLight1337 @WoosukKwon Looks like the CI issue was resolved. Thanks!

@DarkLight1337 (Member) commented:

Thanks for your patience!

DarkLight1337 merged commit 9a4600e into vllm-project:main on Sep 19, 2025, with 47 of 48 checks passed.
DarkLight1337 added this PR to the v0.10.3 milestone on Sep 19, 2025.
ywang96 pushed a commit to ywang96/vllm that referenced this pull request on Sep 19, 2025.
kyuyeunk added a commit to vllm-project/tpu-inference that referenced this pull request on Sep 19, 2025.
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request on Sep 19, 2025.
qthequartermasterman added a commit to protopia-ai/vllm that referenced this pull request Sep 19, 2025
Comment on lines +815 to +816
self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)
Member:

Shouldn't we make these conditional on self.enable_prompt_embeds?

Contributor Author:

Probably! I'll investigate and open a follow-up PR. Thanks!

Contributor Author:

#25739

Thanks for the suggestion.
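
For context, the guard suggested above would presumably look something like the following sketch; it reuses the buffer names from the snippet under review, with the surrounding GPUModelRunner state assumed rather than shown:

# Only pay for the extra host-to-device copies when prompt embeddings are in use.
if self.enable_prompt_embeds:
    self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
    self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)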

  assert self.aux_buffers is not None
  # view the tensor as a contiguous 1D array of bytes
- arr = obj.flatten().contiguous().view(torch.uint8).numpy()
+ arr = obj.flatten().contiguous().cpu().view(torch.uint8).numpy()
Member:

What is the reason for adding this?

I think this has the potential to introduce unnoticed performance regressions. It's probably better to require the tensors to already be on the CPU.

Contributor Author:

This change was originally added before I wrote #22962, which puts all tensors from users onto the CPU, so I think it may be vestigial. I'll investigate and open a follow-up PR reverting this change if it is now unneeded. 100% agree that it could introduce unnoticed performance regressions in the future.

Contributor Author:

See #25738. Thanks for the suggestion.
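
For illustration, the reviewer's preferred alternative could look roughly like the following sketch (this is not the actual change in #25738): keep the serializer free of implicit device-to-host copies and require CPU tensors up front.

assert self.aux_buffers is not None
# Refuse GPU tensors instead of silently copying them to the host here;
# callers should move tensors to the CPU explicitly, where the cost is visible.
assert not obj.is_cuda, "expected a CPU tensor; move it off the GPU before serializing"
# view the tensor as a contiguous 1D array of bytes
arr = obj.flatten().contiguous().view(torch.uint8).numpy()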

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request on Sep 25, 2025.
charlifu pushed a commit to ROCm/vllm that referenced this pull request on Sep 25, 2025.
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request on Oct 10, 2025.
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request on Oct 11, 2025.
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request on Oct 20, 2025.
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request on Oct 24, 2025.

Labels

frontend, ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1


Development

Successfully merging this pull request may close these issues: [RFC]: Prompt Embeddings Support in v1 Engine; [Usage]: embed prompts.
