Feature/vllm/input embedding completion api #17590

Nan2018 · 2025-05-02T14:49:16Z

adds support for passing prompt_embeds as b64 encoded bytes to the completions api.

Start the server with

VLLM_USE_V1=0 vllm serve HuggingFaceH4/zephyr-7b-beta

query example:

url = "http://localhost:8000/v1/completions"

prompt_embeds = []
for input_embeds in inputs_embeds: # inputs_embeds is list of 2d tensors of shape (seq_len, embed_dim)
    buff = io.BytesIO()
    torch.save(input_embeds.detach().cpu(), buff)
    prompt_embeds.append(b64encode(buff.getvalue()).decode("utf-8"))

resps = requests.post(
    url,
    json={
        "model": request_model,
        "prompt_embeds": prompt_embeds,
    },
)

# or with openai client

completions = client.completions.create(
    model=request_model,
    prompt="",
    extra_body={"prompt_embeds": prompt_embeds},
)

Note

this does not work with lora or prompt adapters

Signed-off-by: Andrew Sansom <[email protected]>

Co-authored-by: Nan2018 <[email protected]> Signed-off-by: Andrew Sansom <[email protected]>

Signed-off-by: Andrew Sansom <[email protected]>

…mpty tensors instead of none Signed-off-by: Andrew Sansom <[email protected]>

Signed-off-by: Andrew Sansom <[email protected]>

Signed-off-by: Andrew Sansom <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

Signed-off-by: Andrew Sansom <[email protected]>

…oid having two vLLM instances in memory at once Signed-off-by: Andrew Sansom <[email protected]>

Signed-off-by: Andrew Sansom <[email protected]>

…ion endpoint while remaining type safe for non-completions endpoints Signed-off-by: Andrew Sansom <[email protected]>

Signed-off-by: Andrew Sansom <[email protected]>

Nan2018 · 2025-05-15T20:33:51Z

I spent the better part of the afternoon trying to pin point exactly where the failure is occuring. It seems to be only affecting vLLM instances launched via subprocess (in particular via tests.utils.RemoteOpenAIServer in the tests). Launching a vLLM instance normally with the exact same environment variables and arguments works perfectly...

@DarkLight1337 any ideas about this? Do you think it is a blocker for this pr?

DarkLight1337 · 2025-05-16T09:40:07Z

I'm fine with not supporting LoRA for now, unless LoRA is a very important use case for this.

DarkLight1337 · 2025-05-16T09:43:43Z

Can you add an example script to the documentation for both offline and online inference?

Signed-off-by: Andrew Sansom <[email protected]>

… engine is chosen implicitly Signed-off-by: Andrew Sansom <[email protected]>

qthequartermasterman · 2025-05-16T16:17:30Z

I'm fine with not supporting LoRA for now, unless LoRA is a very important use case for this.

I don't think this is an important use case at this time. I think it only came up because the existing completion tests checked for LoRA compatibility and @Nan2018 tried to use both of them together.

Can you add an example script to the documentation for both offline and online inference?

I added the docs/source/serving/prompt_embeds.md file. I don't need to add anything to add the page to the sphinx site, correct? It'll automatically find it? I couldn't find anything in the sphinx documentation explicitly mentioning the other files in that directory.

DarkLight1337 · 2025-05-16T16:21:19Z

Yeah they should be added automatically

Signed-off-by: Andrew Sansom <[email protected]>

qthequartermasterman · 2025-05-17T02:37:24Z

@DarkLight1337 It looks like docs build timed out. All of the fast checks are passing. I do think this PR is ready for review.

Thanks for your help with this!

DarkLight1337 · 2025-05-19T03:10:17Z

Regarding the subprocess issue, it may be related to #18308 (comment)

DarkLight1337

Let's merge this first though

CandiedCode · 2025-05-20T16:41:19Z

@DarkLight1337 will this make it into the v0.9.0 release?

DarkLight1337 · 2025-05-20T16:44:51Z

Yes

Potabk · 2025-05-23T03:43:41Z

tests/entrypoints/openai/test_completion_with_prompt_embeds.py

+
+
+@pytest.fixture(scope="module")
+def zephyr_lora_added_tokens_files(zephyr_lora_files):


What is this lora module used for?

Signed-off-by: Andrew Sansom <[email protected]> Signed-off-by: Nan2018 <[email protected]> Co-authored-by: 临景 <[email protected]> Co-authored-by: Bryce1010 <[email protected]> Co-authored-by: Andrew Sansom <[email protected]> Co-authored-by: Andrew Sansom <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]>

临景 and others added 30 commits April 2, 2025 14:37

(vllm) add input embedding

cef6894

Signed-off-by: Andrew Sansom <[email protected]>

improve embedding input

c51d8fb

Signed-off-by: Andrew Sansom <[email protected]>

(vllm) fix import error

9564b40

Signed-off-by: Andrew Sansom <[email protected]>

(vllm) fix pre commit error

c60298a

Signed-off-by: Andrew Sansom <[email protected]>

apply ruff and isort fixes

0c24a82

Signed-off-by: Andrew Sansom <[email protected]>

apply ruff and isort fixes

403a165

Signed-off-by: Andrew Sansom <[email protected]>

styling

b1ac072

Signed-off-by: Andrew Sansom <[email protected]>

fix missing imports from rebase

0390c33

Signed-off-by: Andrew Sansom <[email protected]>

typing fixes

0ca4dae

Signed-off-by: Andrew Sansom <[email protected]>

type fix

35320fe

Signed-off-by: Andrew Sansom <[email protected]>

type fix

0a77630

Signed-off-by: Andrew Sansom <[email protected]>

remove unnecessary changes

11b6c02

Signed-off-by: Andrew Sansom <[email protected]>

remove unnecessary changes

cb92a3d

Signed-off-by: Andrew Sansom <[email protected]>

re-add deleted whitespace

375bd5b

Signed-off-by: Andrew Sansom <[email protected]>

Include unit tests from vllm-project#6869.

c9d8024

Co-authored-by: Nan2018 <[email protected]> Signed-off-by: Andrew Sansom <[email protected]>

remove unrelated qwen2 changes

a64e627

Signed-off-by: Andrew Sansom <[email protected]>

guard clause around fully consumed prompt embeds to avoid returning e…

6ab349e

…mpty tensors instead of none Signed-off-by: Andrew Sansom <[email protected]>

use v0 for prompt embeds model runner tests

26c8784

Signed-off-by: Andrew Sansom <[email protected]>

fix batching of input embeddings

b71a13c

Signed-off-by: Andrew Sansom <[email protected]>

style formatting

4aa9ade

Signed-off-by: Andrew Sansom <[email protected]>

remove incorrect overload

e2c4c26

Signed-off-by: Andrew Sansom <[email protected]>

remove incorrect overload

26d108a

Signed-off-by: Andrew Sansom <[email protected]>

Update representations

af20435

Signed-off-by: Andrew Sansom <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

remove unrelated changes to docs

25aaf3f

Signed-off-by: Andrew Sansom <[email protected]>

remove unrelated typing change

bc05860

Signed-off-by: Andrew Sansom <[email protected]>

fix missing syntax

b55800d

Signed-off-by: Andrew Sansom <[email protected]>

do not schedule prompt embeds and non-prompt embeds in the same batch

be42a17

Signed-off-by: Andrew Sansom <[email protected]>

fix style linelength

c8fcfe4

Signed-off-by: Andrew Sansom <[email protected]>

Merge branch 'main' into feature/vllm/add-input-embedding

b21688f

propogate embeddings for sampled output tokens for decoding

1e359ae

Signed-off-by: Andrew Sansom <[email protected]>

qthequartermasterman added 7 commits May 13, 2025 20:38

enable lora with prompt embeds

56f10df

Signed-off-by: Andrew Sansom <[email protected]>

disable chunked prefill in openai + prompt embeds checks

92b336a

Signed-off-by: Andrew Sansom <[email protected]>

move prompt embeds completions endpoint tests to their own file to av…

72674e0

…oid having two vLLM instances in memory at once Signed-off-by: Andrew Sansom <[email protected]>

allow mixed embeds/text prompts to completions endpoint

7134fe1

Signed-off-by: Andrew Sansom <[email protected]>

refactor serving engine to allow mixed embeds/text prompts to complet…

38c366d

…ion endpoint while remaining type safe for non-completions endpoints Signed-off-by: Andrew Sansom <[email protected]>

remove vestigial comments

a56b7f4

Signed-off-by: Andrew Sansom <[email protected]>

Merge branch 'main' into feature/vllm/input-embedding-completion-api

db62b8c

Nan2018 requested a review from DarkLight1337 May 15, 2025 20:31

qthequartermasterman added 3 commits May 16, 2025 11:11

add documentation for serving prompt embeddings

8c1dde9

Signed-off-by: Andrew Sansom <[email protected]>

remove explicit dependence on v0 for prompt embeddings test since the…

204952c

… engine is chosen implicitly Signed-off-by: Andrew Sansom <[email protected]>

Merge branch 'main' into feature/vllm/input-embedding-completion-api

9396f8a

mergify bot added the documentation Improvements or additions to documentation label May 16, 2025

add prompt embeds docs to toctree

1351bdd

Signed-off-by: Andrew Sansom <[email protected]>

DarkLight1337 approved these changes May 19, 2025

View reviewed changes

vllm-bot merged commit 221cfc2 into vllm-project:main May 19, 2025
12 of 13 checks passed

Potabk mentioned this pull request May 19, 2025

[Doc] Fix prompt embedding examples #18350

Merged

Potabk reviewed May 23, 2025

View reviewed changes

qthequartermasterman mentioned this pull request Aug 2, 2025

[RFC]: Prompt Embeddings Support in v1 Engine #22124

Closed

1 task



		@pytest.fixture(scope="module")
		def zephyr_lora_added_tokens_files(zephyr_lora_files):

Uh oh!

Feature/vllm/input embedding completion api #17590

Feature/vllm/input embedding completion api #17590

Uh oh!

Conversation

Nan2018 commented May 2, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Nan2018 commented May 15, 2025

Uh oh!

DarkLight1337 commented May 16, 2025

Uh oh!

DarkLight1337 commented May 16, 2025

Uh oh!

qthequartermasterman commented May 16, 2025

Uh oh!

DarkLight1337 commented May 16, 2025

Uh oh!

qthequartermasterman commented May 17, 2025

Uh oh!

DarkLight1337 commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DarkLight1337 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

CandiedCode commented May 20, 2025

Uh oh!

DarkLight1337 commented May 20, 2025

Uh oh!

Potabk May 23, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Nan2018 commented May 2, 2025 •

edited by github-actions bot

Loading

DarkLight1337 commented May 19, 2025 •

edited

Loading