
Conversation

@qthequartermasterman (Contributor) commented on Sep 4, 2025

Purpose

Fixes #22124. Fixes #19746.

Prompt embedding inputs are a niche but frequently requested feature in vLLM. #15428 introduced them in the v0 engine, but they have not yet been ported to the v1 engine. Prompt embedding users will be stuck on older versions of vLLM unless the feature is also brought to the v1 engine.

The original RFC is #22124. The design differs from that RFC in three ways:

  1. Mixed batches (containing both prompt_embeds and prompt_token_ids) are handled inside GPUModelRunner.execute_model itself: tokens that arrive as IDs rather than as prompt embeddings are first transformed into embeddings and then sent through the model. This is similar in spirit to how multi-modal embeddings are mixed with input_ids for multi-modal models. Since the model outputs token IDs anyway, it was significantly cleaner to handle the mixing there than in the scheduler, as the RFC and the v0 engine do (see the sketch after this list).
  2. The "double compilation" of the CUDA graph (once with input_ids and once with inputs_embeds), as in the RFC and the v0 engine, is eschewed. Instead, when prompt embeddings are enabled, all token IDs are transformed into embeddings outside the compiled graph, and only inputs_embeds are passed in. This has a performance cost, but it mirrors how multimodal models are treated today, and it only happens when --enable-prompt-embeds is on (it is off by default). The double-compilation approach would require significant work and grew large enough during prototyping that I decided to land just the v1 + prompt embeds pieces first; this PR is already large enough.
  3. This goes further than the RFC by disabling prefix_caching when it is enabled alongside prompt_embeds. Prefix caching with prompt embeddings did not work in v0 and does not yet work in v1; future work can enable this support. Since prefix caching is now on by default, it must be disabled whenever --enable-prompt-embeds is on.
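
For illustration, here is a rough Python sketch of how points 1 and 2 fit together. The names build_inputs_embeds, is_token_ids, and embed_tokens are placeholders for this sketch, not the exact internals of GPUModelRunner:

import torch

def build_inputs_embeds(token_ids: torch.Tensor,
                        prompt_embeds: torch.Tensor,
                        is_token_ids: torch.Tensor,
                        embed_tokens: torch.nn.Module) -> torch.Tensor:
    """Merge token-ID rows and prompt-embedding rows into one inputs_embeds batch.

    token_ids:     [num_tokens]         token IDs (only meaningful where is_token_ids is True)
    prompt_embeds: [num_tokens, hidden] user-supplied embeddings (meaningful elsewhere)
    is_token_ids:  [num_tokens]         bool mask, True where the input arrived as a token ID
    """
    inputs_embeds = prompt_embeds.clone()
    id_positions = is_token_ids.nonzero(as_tuple=True)[0]
    # Embed the token-ID positions outside the compiled graph, so the graph only
    # ever sees a single inputs_embeds tensor and no second CUDA-graph capture
    # with input_ids is needed.
    inputs_embeds[id_positions] = embed_tokens(token_ids[id_positions])
    return inputs_embeds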

Test Plan

Several existing unit tests already cover prompt embeds, but they were previously disabled on the v1 engine; I enabled them. I also added more scenarios to the basic correctness tests to catch regressions related to tensor_parallel + prompt_embeds.

I also ran a local script based on https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/prompt_embed_inference.py against a wide variety of combinations of inputs (prompts, prompt_embeds, and prompts + prompt_embeds) at many different seq_lens (ranging from very short to very long) within the same batch, across a variety of settings (including eager mode on/off, chunked prefill on/off, and various tensor-parallel sizes).
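
A minimal sketch of the kind of offline check described above, loosely based on the linked example. The model name is an arbitrary choice, and the enable_prompt_embeds argument and the {"prompt_embeds": ...} input form are assumed from this PR's flag and the example script, so exact names may differ in your version:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # arbitrary model choice for this sketch

# Compute prompt embeddings offline with the HF embedding table.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL)
token_ids = tokenizer("Tell me a joke about GPUs.", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)

# Mixed batch: one request passed as embeddings, one as plain text.
llm = LLM(model=MODEL, enable_prompt_embeds=True)
outputs = llm.generate(
    [{"prompt_embeds": prompt_embeds}, {"prompt": "Tell me a joke about GPUs."}],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)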

Test Result

All the new tests are passing. My local script suite is also passing, and the generations look as expected in every configuration I've checked on my Linux machine with two NVIDIA GPUs.

Pending CI test results. With any luck I didn't break anything else. 🤞


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Andrew Sansom <[email protected]>
The mergify bot added the frontend, v1, and tpu (Related to Google TPUs) labels on Sep 4, 2025.
@WoosukKwon (Collaborator) commented:

Can we merge after #25025? It seems the tests are related.

@qthequartermasterman (Contributor, PR author) commented:

@DarkLight1337 @WoosukKwon Looks like the CI issue was resolved. Thanks!

@DarkLight1337 (Member) commented:

Thanks for your patience!

DarkLight1337 merged commit 9a4600e into vllm-project:main on Sep 19, 2025, with 47 of 48 checks passed.
DarkLight1337 added this PR to the v0.10.3 milestone on Sep 19, 2025.
ywang96 pushed a commit to ywang96/vllm that referenced this pull request on Sep 19, 2025.
kyuyeunk added a commit to vllm-project/tpu-inference that referenced this pull request on Sep 19, 2025.
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request on Sep 19, 2025.
qthequartermasterman added a commit to protopia-ai/vllm that referenced this pull request Sep 19, 2025
Comment on lines +815 to +816
self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)
Member:

Shouldn't we make these conditional on self.enable_prompt_embeds?

Contributor Author:

Probably! I'll investigate and open a follow-up PR. Thanks!

Contributor Author:

#25739

Thanks for the suggestion.
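
For context, the guard suggested above would presumably look something like the following sketch; it reuses the buffer names from the snippet under review, with the surrounding GPUModelRunner state assumed rather than shown:

# Only pay for the extra host-to-device copies when prompt embeddings are in use.
if self.enable_prompt_embeds:
    self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
    self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)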

  assert self.aux_buffers is not None
  # view the tensor as a contiguous 1D array of bytes
- arr = obj.flatten().contiguous().view(torch.uint8).numpy()
+ arr = obj.flatten().contiguous().cpu().view(torch.uint8).numpy()
Member:

What is the reason for adding this?

I think this has the potential to introduce unnoticed performance regressions. It's probably better to require the tensors to already be on the CPU.

Contributor Author:

This change was originally added before I wrote #22962, which puts all tensors from users onto the CPU, so I think it may be vestigial. I'll investigate and open a follow-up PR reverting this change if it is now unneeded. 100% agree that it could introduce unnoticed performance regressions in the future.

Contributor Author:

See #25738. Thanks for the suggestion.
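
For illustration, the reviewer's preferred alternative could look roughly like the following sketch (this is not the actual change in #25738): keep the serializer free of implicit device-to-host copies and require CPU tensors up front.

assert self.aux_buffers is not None
# Refuse GPU tensors instead of silently copying them to the host here;
# callers should move tensors to the CPU explicitly, where the cost is visible.
assert not obj.is_cuda, "expected a CPU tensor; move it off the GPU before serializing"
# view the tensor as a contiguous 1D array of bytes
arr = obj.flatten().contiguous().view(torch.uint8).numpy()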

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request on Sep 25, 2025.
charlifu pushed a commit to ROCm/vllm that referenced this pull request on Sep 25, 2025.
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request on Oct 10, 2025.
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request on Oct 11, 2025.
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request on Oct 20, 2025.
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request on Oct 24, 2025.

Labels

frontend, ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1


Development

Successfully merging this pull request may close these issues: [RFC]: Prompt Embeddings Support in v1 Engine; [Usage]: embed prompts.
