Conversation

@zhuohan123 (Member) commented on Feb 28, 2023

TODOs:

  • Parallel embedding and softmax (see the sketch after this list).
  • Merge with the main branch.
  • Modify the README.
  • Remove unused code.
  • Fix the bug that downloads the weights twice.
  • Test with larger models.
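
For the parallel-embedding item, a minimal Megatron-style vocab-parallel embedding sketch is shown below (purely illustrative, not the code in this PR; it assumes `torch.distributed` is already initialized and the vocabulary divides evenly across ranks):

```python
# Illustrative sketch only: each rank holds a contiguous shard of the embedding
# table, zeroes out tokens it does not own, and an all-reduce sums the partial
# embeddings into the full result.
import torch
import torch.nn as nn
import torch.distributed as dist

class VocabParallelEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.world_size = dist.get_world_size()
        self.shard = vocab_size // self.world_size      # rows held by this rank
        self.start = dist.get_rank() * self.shard
        self.weight = nn.Parameter(torch.empty(self.shard, hidden_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        in_shard = (token_ids >= self.start) & (token_ids < self.start + self.shard)
        local_ids = (token_ids - self.start).clamp(0, self.shard - 1)
        out = nn.functional.embedding(local_ids, self.weight)
        out[~in_shard] = 0.0            # tokens owned by other ranks contribute zeros
        dist.all_reduce(out)            # sum the per-rank partial embeddings
        return out
```

The parallel softmax / output projection follows the same idea in reverse: each rank computes logits for its vocab shard, and the shards are combined before the softmax.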

In another PR:

  • Merge QKV into one (see the sketch below).
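
A rough sketch of what "merge QKV into one" can look like (illustrative names, not this PR's code): the three attention input projections become a single matmul whose output is split back into q, k, and v. Under tensor parallelism the fused weight would be column-partitioned, so each rank computes its own shard of heads locally.

```python
# Illustrative sketch only: fuse the q/k/v projections into one linear layer so
# the projection is a single GEMM per layer, then split the result.
import torch
import torch.nn as nn

class MergedQKV(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # One [3 * hidden, hidden] weight replaces three separate projections.
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)

    def forward(self, x: torch.Tensor):
        qkv = self.qkv_proj(x)             # [batch, seq, 3 * hidden]
        q, k, v = qkv.chunk(3, dim=-1)     # split back into q, k, v
        return q, k, v

x = torch.randn(2, 16, 1024)
q, k, v = MergedQKV(1024)(x)               # each of shape [2, 16, 1024]
```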

@zhuohan123 changed the title from "[WIP] Support tensor parallel" to "Support tensor parallel" on Mar 9, 2023
@zhuohan123 requested a review from WoosukKwon on March 19, 2023 at 02:51
@WoosukKwon (Collaborator) left a comment

Fantastic! Left minor comments.

BTW, the sampling results were different when using TP:

  • Current master (`python server.py --model facebook/opt-13b`)
# GPU blocks: 1826, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would look at it."
Seq 5: 'UC Berkeley is about to get some more tree-hugging support from the University of Washington'
Seq 6: "UC Berkeley is the university of utah\nNot even close\nYeah I'd say it's"
Seq 7: 'The future of cloud computing is React\n\n6 Avril, 2016 | By Maxime Boklan\n\n'
  • 4-way TP (`python server.py --model facebook/opt-13b --tensor-parallel-size 4`)
# GPU blocks: 4970, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would've been too much"
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the seats, as the school�'
Seq 6: 'UC Berkeley is the university of weed.\n*school of vape\nNot everyone who vapes'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"
  • 8-way TP (`python server.py --model facebook/opt-13b --tensor-parallel-size 8`)
# GPU blocks: 5464, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would put a limit."
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the stands, as the school is'
Seq 6: 'UC Berkeley is the university of weed.\n*school of anarchy\nAll respect to the academics'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

@zhuohan123 (Member, Author)

@WoosukKwon Thanks again for the review! All comments resolved. Regarding the different sampling results: it is very hard to make sampling identical across different tensor-parallel configurations. Adding more GPUs changes how the model is partitioned and how each GPU executes, which perturbs the numerics and the random process here and there. I can't keep the sampling results identical across configurations, and I don't think it's necessary.
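
As a toy illustration of the point (not vLLM code): splitting a reduction across ranks changes the order of floating-point additions, so the logits differ by a tiny epsilon and near-tied tokens can be sampled differently.

```python
# Toy example: the same partial results reduced in a different order (as a
# tensor-parallel all-reduce effectively does) give slightly different sums.
import torch

torch.manual_seed(0)
partials = torch.randn(8, 4096, dtype=torch.float16)

# "Single GPU": one reduction over all partial results at once.
logits_single = partials.sum(dim=0)

# "4-way TP": each rank reduces its shard, then the shards are combined,
# accumulating rounding error in a different order.
logits_tp = sum(chunk.sum(dim=0) for chunk in partials.chunk(4, dim=0))

print((logits_single - logits_tp).abs().max())  # typically small but nonzero
```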

@WoosukKwon (Collaborator) left a comment

Thanks a lot @zhuohan123 for your huge effort! This is fantastic!

yuz207 referenced this pull request in IluvatarLabs/vllm Sep 30, 2025
dcmaddix referenced this pull request in dcmaddix/vllm Oct 5, 2025
vllm-bot pushed a commit that referenced this pull request Oct 9, 2025
zhangsicheng5 pushed a commit to zhangsicheng5/vllm that referenced this pull request Oct 9, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
IwakuraRein pushed a commit to IwakuraRein/vllm that referenced this pull request Oct 21, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
wangln19 pushed a commit to wangln19/vllm that referenced this pull request Oct 27, 2025
Bounty-hunter pushed a commit to Bounty-hunter/vllm that referenced this pull request Nov 4, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
access2rohit pushed a commit to access2rohit/vllm that referenced this pull request Nov 11, 2025
jscaldwell55 added a commit to jscaldwell55/vllm that referenced this pull request Nov 12, 2025
jscaldwell55 added a commit to jscaldwell55/vllm that referenced this pull request Nov 12, 2025
jscaldwell55 added a commit to jscaldwell55/vllm that referenced this pull request Nov 12, 2025
yma11 pushed a commit to yma11/vllm that referenced this pull request Nov 14, 2025