
Conversation

@ekagra-ranjan (Contributor) commented Sep 30, 2025

Fixes #20780

E3 has 20% better AL (acceptance length) than E1, but the end-to-end speedup was only about 4%. The expectation was to see at least 20% better end-to-end gains.
The issue was traced to a very small difference in the benchmark data that led to this large gap. Offline inference gave an AL of 2.79 for E3 on MTBench. However, the AL reported online (after patching it to report the overall AL rather than a snapshot, by bypassing the reset of the Prometheus metric) was ~2.2. This happened because offline and online inference share the same dataset, but offline sets add_special_tokens to False whereas online was setting it to True for the Llama 3.1 model based on the model config.

This meant the prompt used in online serving had <|begin_of_text|> twice at the beginning: once from the chat template and once from tokenizer.encode with add_special_tokens=True. This very small difference was enough to throw E3 off balance and cause the sharp drop in AL. The drop is not seen in E1, which is why this discrepancy in the data between online and offline inference was never discovered during the E1 ablations.
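
To illustrate the doubled token (a minimal sketch, not part of this PR; it assumes the transformers library and access to the Llama 3.1 tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# The chat template already prepends <|begin_of_text|>.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Encoding the templated string with add_special_tokens=True (the old online
# path) prepends a second BOS; the offline path passed False.
online_ids = tok.encode(prompt, add_special_tokens=True)
offline_ids = tok.encode(prompt, add_special_tokens=False)
print(online_ids[:2])   # [bos_id, bos_id]  -> doubled <|begin_of_text|>
print(offline_ids[:1])  # [bos_id]          -> single <|begin_of_text|>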

The right way is to skip the chat template in the dataset builder and use the /v1/chat/completions endpoint, as done below.

cmd:
server:
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 9001 --speculative_config '{"method": "eagle3","model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}'

client:
vllm bench serve --port 9001 --save-result --save-detailed --model meta-llama/Llama-3.1-8B-Instruct --temperature=0.0 --top-p=1.0 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --max-concurrency 4 --result-dir "./throwaway" --endpoint "/v1/chat/completions" --backend openai-chat --skip-chat-template

TPOT on MTBench BS4 on H100

  • E1: 4.32ms
  • E3 before this PR: 4.09ms (5.6% faster than E1)
  • E3 with this PR: 3.25ms (32.9% faster than E1)
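
(For reference, the percentages come from the TPOT ratios: 4.32 / 4.09 - 1 ≈ 5.6% and 4.32 / 3.25 - 1 ≈ 32.9%.)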

@mergify mergify bot added the performance Performance-related issues label Sep 30, 2025
@gemini-code-assist (gemini-code-assist bot) left a comment


Code Review

This pull request introduces a --trim-special-tokens flag to handle cases where special tokens are added twice, which was impacting benchmark performance. The core logic is in the new normalize function. While the intent is correct, the implementation has critical flaws: it includes an incorrect assertion that will crash with some tokenizers, and it doesn't fully implement the described suffix trimming. I've provided a refactored implementation to address these correctness issues and make the solution more robust.
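
For illustration, a trimming helper along the lines the review describes might look like the sketch below (the function name and signature are assumptions, not the PR's actual normalize function):

def trim_duplicate_special_tokens(token_ids, bos_id, eos_id):
    """Drop a doubled BOS at the start and a doubled EOS at the end, if any."""
    ids = list(token_ids)
    # Prefix: collapse <BOS><BOS>... into <BOS>... (chat template and encode()
    # each added one). Skip when the tokenizer has no BOS token.
    if bos_id is not None and len(ids) >= 2 and ids[0] == bos_id and ids[1] == bos_id:
        ids = ids[1:]
    # Suffix: collapse ...<EOS><EOS> into ...<EOS> the same way.
    if eos_id is not None and len(ids) >= 2 and ids[-1] == eos_id and ids[-2] == eos_id:
        ids = ids[:-1]
    return ids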

@ekagra-ranjan ekagra-ranjan changed the title fix data EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench Sep 30, 2025
@benchislett (Collaborator) left a comment


Would a simpler solution here be to simply use the chat completions API for backends that have a conversation input? Is this possible in the vLLM benchmarking framework?
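
(For context, a request against the server's /v1/chat/completions endpoint, where the server applies its own chat template, looks roughly like this minimal sketch; the openai client usage and dummy api_key are illustrative assumptions.)

from openai import OpenAI

# Point the client at the vLLM server started above on port 9001.
client = OpenAI(base_url="http://localhost:9001/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.0,
)
print(resp.choices[0].message.content)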

@ekagra-ranjan (Contributor, Author) commented Sep 30, 2025

vllm benchmark only supports /v1/completions as of now. I can take a look at whether it can be extended to /v1/chat/completions to avoid these changes.

Signed-off-by: Ekagra Ranjan <[email protected]>

@ekagra-ranjan (Contributor, Author) commented:
@benchislett - updated to use the /v1/chat/completions API correctly.

@benchislett (Collaborator) left a comment


I much prefer this feature. I am in favor of going even further and making the chat completions endpoint the default behaviour, but that might have some consequences. Anyone else have thoughts on this?

Signed-off-by: Ekagra Ranjan <[email protected]>

@ekagra-ranjan (Contributor, Author) commented Sep 30, 2025

@benchislett - can you add the ready tag and enable auto-merge on the PR so that CI can run?

@benchislett benchislett added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 30, 2025
@benchislett benchislett enabled auto-merge (squash) September 30, 2025 23:53
@ekagra-ranjan (Contributor, Author) commented Oct 1, 2025

v1/spec_decode/test_eagle.py:373 is failing here with
FAILED v1/spec_decode/test_eagle.py::test_load_model[True-1-FLASH_ATTN-eagle] - RuntimeError: generator raised StopIteration. This PR doesn't touch that code, so it is probably not related to this PR.

@simon-mo simon-mo disabled auto-merge October 2, 2025 18:29
@simon-mo simon-mo merged commit 1cab2f9 into vllm-project:main Oct 2, 2025
47 of 49 checks passed
Commits referencing this pull request (vllm-project#25916) were later pushed by pdasigi (Oct 2), yewentao256 (Oct 3), tomeras91 (Oct 6), karan (Oct 6), southfreebird (Oct 7), xuebwang-amd (Oct 10), sducouedic (Oct 16), lywa1998 (Oct 20), alhridoy (Oct 24), and xuebwang-amd (Oct 24).


Linked issue (may be closed by merging this PR): [Bug]: Eagle 3 has 20-30% higher AL but only ~5% faster than Eagle 1
