Support tensor parallel #2
WoosukKwon left a comment:
Fantastic! Left minor comments.
BTW, the sampling results were different when using TP:
- Current master (`python server.py --model facebook/opt-13b`)
# GPU blocks: 1826, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would look at it."
Seq 5: 'UC Berkeley is about to get some more tree-hugging support from the University of Washington'
Seq 6: "UC Berkeley is the university of utah\nNot even close\nYeah I'd say it's"
Seq 7: 'The future of cloud computing is React\n\n6 Avril, 2016 | By Maxime Boklan\n\n'
- 4-way TP (`python server.py --model facebook/opt-13b --tensor-parallel-size 4`)
# GPU blocks: 4970, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would've been too much"
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the seats, as the school�'
Seq 6: 'UC Berkeley is the university of weed.\n*school of vape\nNot everyone who vapes'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"
- 8-way TP (`python server.py --model facebook/opt-13b --tensor-parallel-size 8`)
# GPU blocks: 5464, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would put a limit."
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the stands, as the school is'
Seq 6: 'UC Berkeley is the university of weed.\n*school of anarchy\nAll respect to the academics'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"
@WoosukKwon Thanks again for the review! All comments are resolved. Regarding the different sampling results: I think it's very hard to get identical sampling results across different tensor-parallel configurations. Adding more GPUs changes how the model is partitioned and the execution flow on each GPU, which can perturb the random process here and there. I don't think it's possible, or necessary, to keep the sampling results identical.
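The point about adding GPUs changing the random process can be illustrated with a small standalone sketch (not vLLM code): tensor parallelism splits a reduction across ranks, and because floating-point addition is not associative, a different reduction order yields slightly different logits, which is enough to flip a borderline sampled token.

```python
import random

# Hypothetical illustration: partial sums reduced in a different order,
# as happens when a matmul is sharded across tensor-parallel ranks.
random.seed(0)
parts = [random.uniform(-1, 1) * 10 ** random.randint(-8, 8) for _ in range(1024)]

# "1-GPU" order: one sequential accumulation
sum_1way = 0.0
for x in parts:
    sum_1way += x

# "4-way TP" order: each rank sums its shard, then the partials are reduced
shards = [parts[i::4] for i in range(4)]
partials = [sum(shard) for shard in shards]
sum_4way = sum(partials)

# The two sums typically differ in the low-order bits; a logit difference
# of this size can change which token survives top-p sampling.
print(sum_1way, sum_4way, abs(sum_1way - sum_4way))
```

The discrepancy is tiny relative to the sum, but sampling is a discontinuous function of the logits, so "numerically close" does not imply "same token".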
WoosukKwon left a comment:
Thanks a lot @zhuohan123 for your huge effort! This is fantastic!
ROOT CAUSE: draft_q_soft_temp=0.50 was SHARPENING the distribution instead of softening it (dividing by tau < 1.0 doubles logit magnitudes). This caused nucleus to collapse to 1-2 survivors → q ≈ 1.0 → acceptance stuck at ~0.7038 (the average p_target).

FIXES:
1. Config defaults (config.py, arg_utils.py):
   - draft_q_temp_offset: 0.15 → 0.25 (better dynamic range)
   - draft_q_soft_temp: 0.50 → 2.0 (softens instead of sharpens)
   At draft_temp=0.05:
   - Before: tau_q = max(0.05 + 0.15, 0.50) = 0.50 (2x sharper!)
   - After: tau_q = max(0.05 + 0.25, 2.0) = 2.0 (2x softer)
2. Force min_keep=2 in nucleus (eagle.py line 271):
   - Added keep_sorted[..., :2] = True
   - Prevents survivors=1 by construction (defensive programming)
3. Fix smoothing to uniform over kept set (eagle.py lines 275-287):
   - Before: mixed with untempered baseline (wrong approach)
   - After: uniform distribution over survivors only (correct)
   - Prevents q from reaching exactly 1.0 in corner cases
4. Remove dead code (eagle.py line 322):
   - Deleted unused self._current_sampling_metadata assignment
   - No longer needed with the draft-anchored approach (bug #2 fix)

Expected results:
- tau_q ≥ 2.0 at ultracold temps → softer distribution
- NUC_DEBUG: survivors = hundreds/thousands (not 1-2)
- Q_DEBUG: q ∈ [0.5, 0.8] (not 0.98-1.0)
- Accept rate: dynamic range restored across the temp sweep
Signed-off-by: Nick Hill <[email protected]>
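The temperature arithmetic and the min_keep guard described above can be sketched in isolation. This is a minimal illustration, not the actual eagle.py code: the function names are hypothetical, and only the `max(draft_temp + offset, soft_temp)` clamp and the "keep at least two survivors" rule are taken from the commit message.

```python
def effective_q_temperature(draft_temp, q_temp_offset=0.25, q_soft_temp=2.0):
    """Clamp the draft q-distribution temperature from below.

    With the old defaults (offset=0.15, soft_temp=0.50), an ultracold
    draft_temp=0.05 gave tau_q = max(0.05 + 0.15, 0.50) = 0.50, so logits
    were divided by 0.5 and therefore doubled (sharpened). The new
    defaults give tau_q = max(0.05 + 0.25, 2.0) = 2.0, which softens.
    """
    return max(draft_temp + q_temp_offset, q_soft_temp)

def nucleus_keep(probs, top_p=0.9, min_keep=2):
    """Top-p filtering that always keeps at least `min_keep` tokens,
    preventing the survivors=1 collapse described above."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p and len(kept) >= min_keep:
            break
    return kept

# Old defaults sharpen (tau < 1), new defaults soften (tau > 1)
assert effective_q_temperature(0.05, 0.15, 0.50) == 0.50
assert effective_q_temperature(0.05) == 2.0

# Even a sharply peaked distribution keeps at least 2 tokens
assert len(nucleus_keep([0.97, 0.02, 0.01])) >= 2
```

With at least two survivors, the acceptance ratio q stays strictly below 1.0, restoring dynamic range to the acceptance rate.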
support cp for flashinfer-GQA (vllm-project#26445)

Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
Fixes for support_materials/2-tilelang/
fix mtp config and padding
* mooncake store connector (squash of 6 commits, all titled "mooncake store connector")

  Signed-off-by: CHEN <[email protected]>

* fix comments

* Update vllm/distributed/ec_transfer/utils/tensor_memory_pool.py

  Co-authored-by: Copilot <[email protected]>

* Update vllm/distributed/ec_transfer/ec_lookup_buffer/mooncake_store.py

  Co-authored-by: Copilot <[email protected]>

* Update vllm/distributed/ec_transfer/ec_connector/mooncake_storage_connector.py

  Co-authored-by: Copilot <[email protected]>

* Apply suggestion from @wuhang2014: line length format

* Apply suggestion from @wuhang2014: remove extra empty line

---------

Signed-off-by: CHEN <[email protected]>
Co-authored-by: wuhang <[email protected]>
Co-authored-by: Copilot <[email protected]>
enable early exit for fused_moe_lora
- Add section-level state machine (in_tool_section flag)
- Implement rolling buffer for split marker detection (1 KB cap)
- Suppress content between section_begin and tool_call_begin
- Support marker variants (plural/singular)
- Add error recovery for malformed sections (8 KB limit)
- Preserve function contract (always return DeltaMessage)
- Fix critical bug #1: handle both begin/end markers in the same chunk (changed elif to if on line 237 to prevent state corruption)
- Fix critical bug #2: defer section exit when tool_call_end is present (prevents dropping final tool arguments and token leakage)
- Include 12 comprehensive tests (3 new tests for edge cases)

Fixes a bug where text between <|tool_calls_section_begin|> and <|tool_call_begin|> leaks into reasoning_delta during streaming mode. Also fixes two critical edge cases:
1. Section begin and end markers appearing in the same chunk would leave the parser stuck in in_tool_section=True, causing subsequent content to be incorrectly suppressed.
2. tool_call_end and section_end in the same chunk would cause an early return before tool parsing, dropping the final tool arguments and leaking special tokens into the reasoning channel.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Jscaldwell55 <[email protected]>
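The "both markers in the same chunk" edge case above can be sketched with a minimal section tracker. This is a hypothetical standalone illustration, not the real vLLM parser: it generalizes the elif-to-if fix by scanning the whole chunk in a loop, so a chunk containing both the begin and end markers enters and exits the section in one call instead of leaving the parser stuck with in_tool_section=True.

```python
SECTION_BEGIN = "<|tool_calls_section_begin|>"
SECTION_END = "<|tool_calls_section_end|>"

class SectionTracker:
    """Minimal sketch of the section-level state machine."""

    def __init__(self):
        self.in_tool_section = False

    def feed(self, chunk: str) -> str:
        """Return the text that may reach the reasoning channel;
        text inside a tool section is suppressed."""
        out = []
        while chunk:
            if not self.in_tool_section:
                # Emit text up to the next begin marker, if any
                pre, sep, chunk = chunk.partition(SECTION_BEGIN)
                out.append(pre)
                if sep:
                    self.in_tool_section = True
            else:
                # Drop suppressed text up to the next end marker, if any
                _, sep, chunk = chunk.partition(SECTION_END)
                if sep:
                    self.in_tool_section = False
        return "".join(out)

tracker = SectionTracker()
# Both markers arrive in a single chunk: the state must end up False,
# and only the text outside the section is surfaced.
out = tracker.feed(f"before {SECTION_BEGIN}hidden{SECTION_END} after")
assert tracker.in_tool_section is False
assert "hidden" not in out
```

A per-marker `if`/`elif` chain processes at most one transition per chunk; the loop form handles any number of markers per chunk, which is what streaming delivery requires.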
* remove contiguous
* remove comment

Signed-off-by: mayuyuace <[email protected]>
TODOs:
In another PR: