[CORE] Prompt Embeddings Support for v1 Engine #24278
Conversation
Can we merge after #25025? It seems the failing tests are related.
@DarkLight1337 @WoosukKwon Looks like the CI issue was resolved. Thanks!
Thanks for your patience!
self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)
Shouldn't we make these conditional on self.enable_prompt_embeds?
Probably! I'll investigate and open a follow-up PR. Thanks!
Thanks for the suggestion.
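(For reference, a minimal sketch of the guard discussed above. This is hypothetical and not the code merged in this PR; it assumes skipping both copies is safe whenever prompt embeddings are disabled.)

if self.enable_prompt_embeds:
    # Only stage the prompt-embeds buffers on the GPU when the feature is
    # enabled, so the default path avoids two extra host-to-device copies.
    self.inputs_embeds.copy_to_gpu(total_num_scheduled_tokens)
    self.is_token_ids.copy_to_gpu(total_num_scheduled_tokens)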
assert self.aux_buffers is not None
# view the tensor as a contiguous 1D array of bytes
arr = obj.flatten().contiguous().view(torch.uint8).numpy()        # before
arr = obj.flatten().contiguous().cpu().view(torch.uint8).numpy()  # after (this PR)
What is the reason for adding this?
I think this has the potential to introduce unnoticed performance regressions. It's probably better to require the tensors to already be on the CPU.
This change was originally added before I wrote #22962, which puts all tensors from users onto the CPU, so I think it may be vestigial. I'll investigate and open a follow-up PR reverting this change if it is now unneeded. I 100% agree that it could introduce unnoticed performance regressions in the future.
See #25738. Thanks for the suggestion.
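(As an illustration of the alternative suggested above, a minimal sketch of a CPU-only serializer path; hypothetical, not what this PR merged.)

# Hypothetical alternative: require the tensor to already live on the CPU so a
# device-to-host copy can never sneak into the serialization hot path.
assert obj.device.type == "cpu", "tensors must be moved to CPU before encoding"
arr = obj.flatten().contiguous().view(torch.uint8).numpy()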
Purpose
Fixes #22124. Fixes #19746.
Prompt embedding inputs are a niche but frequently requested feature in vLLM. #15428 introduced them in the v0 engine, but they have not yet been ported to the v1 engine. Prompt embedding users will be stuck on older versions of vLLM unless the feature is also brought to the v1 engine.
The original RFC is #22124. The design differs from that RFC in three ways:
- … --enable-prompt-embeds is on (it's off by default). The double-compilation proposal would require significant work and was already large enough while I was prototyping that I figured it would be better to do just the v1 + prompt embeds pieces first, because this PR is already large enough.
- … --enable-prompt-embeds is on.

Test Plan
Several existing unit tests already cover prompt embeds, but they were previously disabled on the v1 engine. I enabled those. I also added more scenarios to the basic correctness tests to catch regressions related to tensor_parallel + prompt_embeds.
I'm also running a local script based on https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/prompt_embed_inference.py against a large variety of combinations of prompts, prompt_embeds, and prompt + prompt_embeds inputs, at many different seq_lens (ranging from very short to very long) within the same batch, across a variety of settings (including eager mode on/off, chunked_prefill on/off, and various tensor parallel sizes). A rough sketch of such a sweep is shown below.
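(The sketch below is illustrative rather than the exact local script. It assumes the enable_prompt_embeds engine flag and the prompt_embeds prompt field used by the linked example; the model name and parameter grid are placeholders.)

# Illustrative sweep, not the exact local script: mix a token prompt with short
# and long prompt-embedding prompts in one batch, repeated across engine configs.
import itertools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
embedding_layer = AutoModelForCausalLM.from_pretrained(MODEL).get_input_embeddings()

def embeds_for(text: str) -> torch.Tensor:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return embedding_layer(ids).squeeze(0)  # (seq_len, hidden_size), on CPU

for eager, chunked, tp in itertools.product([True, False], [True, False], [1, 2]):
    llm = LLM(model=MODEL, enable_prompt_embeds=True, enforce_eager=eager,
              enable_chunked_prefill=chunked, tensor_parallel_size=tp)
    batch = [
        "Hello, my name is",                                        # token prompt
        {"prompt_embeds": embeds_for("The capital of France is")},  # short embeds
        {"prompt_embeds": embeds_for("Tell me a story. " * 200)},   # long embeds
    ]
    for output in llm.generate(batch, SamplingParams(max_tokens=32)):
        print(output.outputs[0].text)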
Test Result
All the new tests are passing. My local script suite is also passing, and the generations look as expected in every configuration I've checked on my Linux machine with two NVIDIA GPUs.
Pending CI test results. With any luck I didn't break anything else. 🤞