Fix verify tokens with the correct bonus token #8320
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Hey, thanks for the interest! I want to align some definitions here. Please let me know if there is any confusion! Sorry for the different terms here; we might simplify them in the future.
On second thought, I feel this is a good optimization. We can actually skip rejection sampling in the greedy decoding case, which explains the speedup you get.
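To illustrate why greedy decoding allows this (a minimal standalone sketch, not vLLM's actual RejectionSampler; the function names are hypothetical): the paper accepts a draft token x with probability min(1, p_target(x) / p_draft(x)); with temperature=0 both distributions are one-hot, so the rule reduces to an exact match against the target argmax and no random draw or resampling is needed.

```python
# Hypothetical sketch contrasting the paper's acceptance rule with the
# greedy (temperature=0) shortcut; not vLLM's actual implementation.
import torch

def accept_draft_token(target_prob: torch.Tensor,
                       draft_prob: torch.Tensor,
                       draft_token: int) -> bool:
    """Accept with probability min(1, p_target(x) / p_draft(x))."""
    p = target_prob[draft_token]
    q = draft_prob[draft_token]
    return bool(torch.rand(()) < torch.clamp(p / q, max=1.0))

def accept_draft_token_greedy(target_token: int, draft_token: int) -> bool:
    """With temperature=0 both distributions are one-hot, so the rule above
    degenerates to an exact match: no random draw, no resampling."""
    return target_token == draft_token
```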
Thanks Xiaoxuan, so do you think this optimization is qualified to be merged into vLLM? We treat this as a platform-independent optimization (orthogonal to backend optimizations like FlashInfer), which can benefit other device backends like CPU/XPU, where we see similar performance issues.
This change aligns the verify-tokens function with the transformers speculative sampling algorithm; it always selects the next sampled token from the target model.
From reading the code, it seems it adjusts the distribution and resamples (lines 4195-4203)?
Could you double check the correctness here? If the optimization can pass the rejection sampling tests, I'm happy to review and get it in.
Yes, but the point is that the output will always contain the first unmatched token.
The error comes from sampling. We cannot guarantee the output will be all matched even if the target model is the same as the draft model, because sampling introduces random factors. So I disabled sampling by setting temperature=0.
This is the intention, because you cannot force users to set temperature=0. That's why @LiuXiaoxuanPKU suggested we could bypass rejection sampling when temperature=0, but we cannot remove rejection sampling.
We cannot expect the output length of speculative decoding to be a fixed number if sampling is applied; it can land anywhere in a range that depends on how many draft tokens are accepted.
OK, I removed recovered_token.
I'm confused. It seems this PR removes rejection sampling. Then how do you do speculative decoding with temperature != 0? |
I didn't remove rejection sampling, just removed recovered_token since it is not needed. The newly defined bonus_token covers the recovered_token case. Besides, I kept the temperature at its default value in the test.
The recovered token should not be from the target token ids if temperature != 0. It should be sampled from a new distribution. Correct me if I misunderstand anything. |
I see your point. I know adjusting the distribution of the recovered token deviates from the original algorithm. Please let me know if you want me to revert it and keep the original behavior.
I don't think changing the original paper's algorithm is a good idea without data to prove it, and I don't think changing behavior is the target of this PR. This PR's target is performance optimization. @jiqing-feng, please only optimize performance for temperature == 0, and don't change the logic for other cases.
I have reverted the unnecessary changes. Now the rejection sampling exactly follows the paper and matches the transformers integration.
Yes, you were right. I have fixed the recovered token by selecting it from the new distribution, torch.clamp(target_prob - draft_prob, min=0). Please take a look. Thanks.
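For reference, a small standalone sketch of that recovered-token step (an illustration of the paper's residual-distribution rule, not vLLM's actual rejection_sampler code; sample_recovered_token is a hypothetical name): when a draft token is rejected, the replacement is drawn from the normalized residual max(0, p_target - p_draft), which is what the clamp above computes.

```python
# Standalone sketch of recovered-token sampling; not vLLM's actual code.
# target_prob and draft_prob are 1-D probability tensors over the vocabulary.
import torch

def sample_recovered_token(target_prob: torch.Tensor,
                           draft_prob: torch.Tensor) -> int:
    # Residual distribution from the speculative decoding paper:
    # r(x) proportional to max(0, p_target(x) - p_draft(x)).
    residual = torch.clamp(target_prob - draft_prob, min=0)
    residual = residual / residual.sum()  # normalize to a valid distribution
    return int(torch.multinomial(residual, num_samples=1))
```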
Hi @LiuXiaoxuanPKU. I checked the rejection sampler code in detail and found there is no need to change it, because it already yields the correct recovered token ids. One thing: I am okay with the small difference between vLLM and the original paper, because I cannot get a speed-up on the rejection sampler with this PR, but it can get a significant speed-up on the typical acceptance sampler. So I opened another PR that only changes the typical acceptance sampler, see #8562. Please take a look, thanks.

The real bonus token should be the first unmatched token. For example, if draft_token_ids is [1, 2, 3, 5, 7] and target_token_ids is [1, 2, 3, 4, 6, 8], then the matched tokens are [1, 2, 3] and the bonus token should be [4], because the target model would output [4] given the input [1, 2, 3] in the next round of generation. So we can select the bonus token [4] in this round without any precision regression.
This increases throughput from 89 tokens/s to 110 tokens/s for the typical_acceptance_sampler on a single A100 card with (num_speculative_tokens=2, max_num_seqs=1, model="meta-llama/Llama-2-7b-chat-hf", speculative_model="Felladrin/Llama-68M-Chat-v1"). The outputs are exactly the same before and after my changes. Do you mind reviewing this PR? @cadedaniel
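To make the selection concrete, here is a minimal sketch of the prefix matching described above, using the example IDs from this description (select_bonus_token is a hypothetical name, not a vLLM API):

```python
# Hypothetical illustration of the bonus-token selection described above;
# not the actual typical_acceptance_sampler code.

def select_bonus_token(draft_token_ids, target_token_ids):
    """Return (matched_tokens, bonus_token).

    The target model scores every draft position plus one extra position,
    so target_token_ids is one element longer than draft_token_ids. The
    bonus token is the first target token that differs from the draft
    (or the final extra target token if everything matched).
    """
    matched = []
    for draft_tok, target_tok in zip(draft_token_ids, target_token_ids):
        if draft_tok != target_tok:
            return matched, target_tok
        matched.append(draft_tok)
    # All draft tokens matched: the bonus token is the extra target token.
    return matched, target_token_ids[len(draft_token_ids)]

# Example from this PR description:
matched, bonus = select_bonus_token([1, 2, 3, 5, 7], [1, 2, 3, 4, 6, 8])
assert matched == [1, 2, 3] and bonus == 4
```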
cc @LiuXiaoxuanPKU