feat: spec decode with draft models #24322
Open
tomasruizt wants to merge 95 commits into vllm-project:main from tomasruizt:feature/spec-decode-draft-model
+772
−84
Commits
f27261a
Speculative Decoding with Draft Model
tomasruizt 3b06a7c
Undo change to 'vllm bench throughput'
tomasruizt e41b0a3
Don't return too early
tomasruizt 10366b9
Undo change to bind_kv_cache()
tomasruizt 92af339
Undo changes to pyproject.toml
tomasruizt 5b8b1c6
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt f2f9876
Simplify test array
tomasruizt 824ba10
Ensure EAGLE loads correctly
tomasruizt 5e248c1
Pass input_embeds when model is multimodal
tomasruizt 1669ea7
Raise NotImplementedError on Mrope or Multimodal models
tomasruizt 6040697
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt 4b77a83
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt 5a6cc82
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt 54e107d
Speculative decoding with draft model separate from EAGLE
tomasruizt 134b841
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt 36fb940
Pass last_token_indices
tomasruizt b018560
Undo unnecessary changes
tomasruizt 17e9fe5
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt daee8ec
Move more methods to base class
tomasruizt b45f7af
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt 07d1b97
Fix call to model.compute_logits()
tomasruizt 86d8040
Move .propose() to superclass
tomasruizt a696797
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt 1afbe14
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt d37d780
Minimize git diffs in EAGLE
tomasruizt 5967e09
Fix missing input
tomasruizt 7b03a45
fix next_token_ids issue
benchislett 35fa5a9
Merge pull request #3 from CentML/spec-decode-draft-model
tomasruizt ef5da86
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt c7d2fd5
Test also acceptance-len
tomasruizt ac90311
Pass missing argument in test_eagle.py
tomasruizt 857415b
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt b477e10
CKPT: Remove extra forward
tomasruizt 309d827
Prevent illegal access to hidden_states
tomasruizt 2e97fab
Remove forward. single prompt works. Batch fails
tomasruizt 794c3cf
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt 89b9c1d
Remove unnecessary if-else statement
tomasruizt c767118
Merge branch 'feature/spec-decode-draft-model' into featury/remove-ex…
tomasruizt e74c71e
Minimize changes
tomasruizt 994e9cc
Commit unit test success
tomasruizt 26ab913
Remove unnecessary variables
tomasruizt 01dd981
Minimize changes
tomasruizt 09a0bb3
Remove token logging
tomasruizt 42faf1c
Relocate utility method
tomasruizt 044e45c
Simplify extend_flat_seqs()
tomasruizt 7a1949d
Document test
tomasruizt 316a6b8
Document funcs
tomasruizt 0e75db7
Merge pull request #5 from tomasruizt/featury/remove-extra-forward
tomasruizt af06030
Update BatchDescriptor with correct num_tokens
tomasruizt a791d2e
Make sure AL benchmark can run
tomasruizt 1de5ef4
Extend drafter max_num_tokens
tomasruizt 4371d47
CKPT: Find bug affecting acceptance length
tomasruizt 1718892
Fix AL for default drafter padding
tomasruizt ac56891
Remove logging
tomasruizt 4b43999
use non-blocking cpu move, document and test helper fns
tomasruizt 10eb718
Minimize changes
tomasruizt 4c7eb11
Reduce changes footprint
tomasruizt d123018
Reduce changes
tomasruizt 02872ad
Minimize changes
tomasruizt 50ae07f
Merge commit '17edd8a' into feature/spec-decode-draft-model
tomasruizt 33bcc08
ruff
tomasruizt fa99c05
Merge commit 'd6953be' into feature/spec-decode-draft-model
tomasruizt eac09d2
Get AL high again
tomasruizt ccac6cb
Minimize changes
tomasruizt 2ba8c5a
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt c094f5f
Add flag for disable_padded_drafter_batch
tomasruizt a6f8484
Correct typo
tomasruizt 4e77a80
Ensure draft model uses CUDA graph
tomasruizt a1e899c
Remove unnecessary cudagraph inputs
tomasruizt 50dcbc4
Minimize changes
tomasruizt c01e43b
Minimize changes
tomasruizt cf99760
Remove unused fn
tomasruizt c73929d
Minimize changes
tomasruizt 66d4f2b
Avoid OOB error on large batches
tomasruizt c27b6a7
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt de86231
Simplify away passing the CUDA graph args
tomasruizt f8321d2
add option --max-num-seqs to spec_decode.py (useful for small GPUs)
tomasruizt e9560ef
Prevent different tokenizer vocab sizes
tomasruizt 694faf8
Limit cudagraph capture time in test
tomasruizt fa6294f
Minimize changes related to CUDA graph
tomasruizt c9ff19a
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt f49a5ea
Replace Optional[T] with T | None
tomasruizt 37f013e
Add tests for quantized target / draft model
tomasruizt 58f8496
Add test for draft model + tensor parallelism
tomasruizt 4bd9a46
Log why endpoint is not ready
tomasruizt ff92d85
Test tensor parallelism more thoroughly
tomasruizt c135ae1
Reject draft TP > 1
tomasruizt 7c011c0
Enforce same TP for draft & target
tomasruizt 02d9d86
Explicitly set rank for draft TP
tomasruizt 14946cd
Document why we enforce equal TP
tomasruizt e1dbab1
Simplify changes. Improve docs
tomasruizt f346cfa
Merge pull request #6 from tomasruizt/feature/correct-tensor-parallel…
tomasruizt 4641ec6
Simplify tests
tomasruizt ea3bb0a
Reject draft models with multiple kv-cache groups
tomasruizt 6ca55ab
Merge branch 'main' into feature/spec-decode-draft-model
tomasruizt
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,3 +1,6 @@ |
| | | # Scripts for development |
| | | scripts/ |
| | | |
| | | # version file generated by setuptools-scm |
| | | /vllm/_version.py |