Conversation

@zhuohan123 (Member) commented on Feb 28, 2023

TODOs:

  • Parallel embedding and softmax (see the sketch after this list).
  • Merge with the main branch.
  • Modify the README.
  • Remove unused code.
  • Fix the bug that downloads the weights twice.
  • Test with larger models.
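
For the parallel-embedding item, a minimal Megatron-style vocab-parallel embedding sketch is shown below (purely illustrative, not the code in this PR; it assumes `torch.distributed` is already initialized and the vocabulary divides evenly across ranks):

```python
# Illustrative sketch only: each rank holds a contiguous shard of the embedding
# table, zeroes out tokens it does not own, and an all-reduce sums the partial
# embeddings into the full result.
import torch
import torch.nn as nn
import torch.distributed as dist

class VocabParallelEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.world_size = dist.get_world_size()
        self.shard = vocab_size // self.world_size      # rows held by this rank
        self.start = dist.get_rank() * self.shard
        self.weight = nn.Parameter(torch.empty(self.shard, hidden_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        in_shard = (token_ids >= self.start) & (token_ids < self.start + self.shard)
        local_ids = (token_ids - self.start).clamp(0, self.shard - 1)
        out = nn.functional.embedding(local_ids, self.weight)
        out[~in_shard] = 0.0            # tokens owned by other ranks contribute zeros
        dist.all_reduce(out)            # sum the per-rank partial embeddings
        return out
```

The parallel softmax / output projection follows the same idea in reverse: each rank computes logits for its vocab shard, and the shards are combined before the softmax.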

In another PR:

  • Merge QKV into one (see the sketch below).
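
A rough sketch of what "merge QKV into one" can look like (illustrative names, not this PR's code): the three attention input projections become a single matmul whose output is split back into q, k, and v. Under tensor parallelism the fused weight would be column-partitioned, so each rank computes its own shard of heads locally.

```python
# Illustrative sketch only: fuse the q/k/v projections into one linear layer so
# the projection is a single GEMM per layer, then split the result.
import torch
import torch.nn as nn

class MergedQKV(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # One [3 * hidden, hidden] weight replaces three separate projections.
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)

    def forward(self, x: torch.Tensor):
        qkv = self.qkv_proj(x)             # [batch, seq, 3 * hidden]
        q, k, v = qkv.chunk(3, dim=-1)     # split back into q, k, v
        return q, k, v

x = torch.randn(2, 16, 1024)
q, k, v = MergedQKV(1024)(x)               # each of shape [2, 16, 1024]
```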

@zhuohan123 changed the title from "[WIP] Support tensor parallel" to "Support tensor parallel" on Mar 9, 2023
@zhuohan123 requested a review from WoosukKwon on March 19, 2023 at 02:51
@WoosukKwon (Collaborator) left a comment

Fantastic! Left minor comments.

BTW, the sampling results were different when using TP:

  • Current master (`python server.py --model facebook/opt-13b`)
# GPU blocks: 1826, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would look at it."
Seq 5: 'UC Berkeley is about to get some more tree-hugging support from the University of Washington'
Seq 6: "UC Berkeley is the university of utah\nNot even close\nYeah I'd say it's"
Seq 7: 'The future of cloud computing is React\n\n6 Avril, 2016 | By Maxime Boklan\n\n'
  • 4-way TP (`python server.py --model facebook/opt-13b --tensor-parallel-size 4`)
# GPU blocks: 4970, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would've been too much"
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the seats, as the school�'
Seq 6: 'UC Berkeley is the university of weed.\n*school of vape\nNot everyone who vapes'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"
  • 8-way TP (`python server.py --model facebook/opt-13b --tensor-parallel-size 8`)
# GPU blocks: 5464, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would put a limit."
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the stands, as the school is'
Seq 6: 'UC Berkeley is the university of weed.\n*school of anarchy\nAll respect to the academics'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

@zhuohan123 (Member, Author)

@WoosukKwon Thanks again for the review! All comments resolved. Regarding the different sampling results: it is very hard to make sampling identical across different tensor-parallel configurations. Adding more GPUs changes how the model is partitioned and how each GPU executes, which perturbs the numerics and the random process here and there. I can't keep the sampling results identical across configurations, and I don't think it's necessary.
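
As a toy illustration of the point (not vLLM code): splitting a reduction across ranks changes the order of floating-point additions, so the logits differ by a tiny epsilon and near-tied tokens can be sampled differently.

```python
# Toy example: the same partial results reduced in a different order (as a
# tensor-parallel all-reduce effectively does) give slightly different sums.
import torch

torch.manual_seed(0)
partials = torch.randn(8, 4096, dtype=torch.float16)

# "Single GPU": one reduction over all partial results at once.
logits_single = partials.sum(dim=0)

# "4-way TP": each rank reduces its shard, then the shards are combined,
# accumulating rounding error in a different order.
logits_tp = sum(chunk.sum(dim=0) for chunk in partials.chunk(4, dim=0))

print((logits_single - logits_tp).abs().max())  # typically small but nonzero
```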

@WoosukKwon (Collaborator) left a comment

Thanks a lot @zhuohan123 for your huge effort! This is fantastic!

yuz207 referenced this pull request in IluvatarLabs/vllm Sep 30, 2025
dcmaddix referenced this pull request in dcmaddix/vllm Oct 5, 2025
vllm-bot pushed a commit that referenced this pull request Oct 9, 2025
zhangsicheng5 pushed a commit to zhangsicheng5/vllm that referenced this pull request Oct 9, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
IwakuraRein pushed a commit to IwakuraRein/vllm that referenced this pull request Oct 21, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
wangln19 pushed a commit to wangln19/vllm that referenced this pull request Oct 27, 2025
Bounty-hunter pushed a commit to Bounty-hunter/vllm that referenced this pull request Nov 4, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
access2rohit pushed a commit to access2rohit/vllm that referenced this pull request Nov 11, 2025
jscaldwell55 added a commit to jscaldwell55/vllm that referenced this pull request Nov 12, 2025
jscaldwell55 added a commit to jscaldwell55/vllm that referenced this pull request Nov 12, 2025
jscaldwell55 added a commit to jscaldwell55/vllm that referenced this pull request Nov 12, 2025
yma11 pushed a commit to yma11/vllm that referenced this pull request Nov 14, 2025