[Perf] Support stream interval for reducing host overhead #27869
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Code Review
This pull request introduces a --stream-interval feature to buffer streaming responses, which shows promising performance gains. However, the implementation has a critical flaw that causes data loss in streamed responses by dropping intermediate chunks that don't meet the buffering threshold. Additionally, the new stream_interval parameter lacks validation, allowing non-positive values that can lead to unexpected behavior. I have provided detailed comments on these critical and high-severity issues with suggestions for how to address them.
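As a minimal sketch of the bounds check the review asks for, assuming argparse-style parsing; the actual vLLM argument plumbing and defaults may differ:

```python
# Hypothetical validation sketch, not vLLM's actual code: reject
# non-positive values for --stream-interval at parse time.
import argparse


def positive_int(value: str) -> int:
    ivalue = int(value)
    if ivalue < 1:
        raise argparse.ArgumentTypeError(
            f"--stream-interval must be a positive integer, got {ivalue}"
        )
    return ivalue


parser = argparse.ArgumentParser()
parser.add_argument(
    "--stream-interval",
    type=positive_int,
    default=1,
    help="Number of tokens to buffer per request before streaming them back.",
)
args = parser.parse_args(["--stream-interval", "10"])
print(args.stream_interval)  # 10
```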
Force-pushed from a06a73c to 21694ae.
@codex review
💡 Codex Review
vllm/vllm/entrypoints/openai/serving_chat.py, lines 1294 to 1298 in 21694ae:

```python
except Exception as e:
    # TODO: Use a vllm-specific Validation Error
    logger.exception("Error in chat completion stream generator.")
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"
```
The new stream batching stores JSON chunks in chunks_buffered and only flushes when the first token is sent, the buffer reaches stream_interval, or a finish reason arrives. If the underlying result_generator raises before hitting any of those conditions (e.g., model error or client disconnect after a few tokens), the buffered chunks are never emitted because the except branch returns an error response without draining the buffers. Prior to this change each chunk was sent immediately, so partial generations were still delivered before the error. With buffering enabled the client now silently loses every token after the last flush whenever an exception occurs mid-stream. Consider flushing any remaining buffered chunks before sending the error so streamed responses remain as complete as possible even when the request fails.
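A simplified, self-contained sketch of the suggested fix; chunks_buffered and the flush thresholds mirror the names in the review, while the surrounding generator structure is assumed rather than taken from serving_chat.py:

```python
# Simplified illustration: drain buffered SSE chunks before emitting the
# error payload so already-generated tokens still reach the client.
from collections.abc import AsyncIterator


async def buffered_sse_stream(
    result_generator: AsyncIterator[str],
    stream_interval: int,
) -> AsyncIterator[str]:
    chunks_buffered: list[str] = []
    try:
        async for chunk in result_generator:
            chunks_buffered.append(chunk)
            if len(chunks_buffered) >= stream_interval:
                for buffered in chunks_buffered:
                    yield f"data: {buffered}\n\n"
                chunks_buffered.clear()
        # Normal completion: flush whatever is left.
        for buffered in chunks_buffered:
            yield f"data: {buffered}\n\n"
    except Exception as e:
        # Flush partial output first, then report the error.
        for buffered in chunks_buffered:
            yield f"data: {buffered}\n\n"
        yield f'data: {{"error": "{e}"}}\n\n'
```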
vllm/vllm/entrypoints/openai/serving_completion.py, lines 522 to 525 in 21694ae:

```python
except Exception as e:
    # TODO: Use a vllm-specific Validation Error
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"
```
Similar to the chat path, completion streaming now buffers multiple SSE payloads per choice and only flushes when the first token, a full buffer, or a finish signal is seen. When result_generator raises an exception before a flush point, the code jumps to the except block and yields the error without first emitting buffered entries, so clients miss tokens that were already generated but not yet flushed. This regression means any mid-stream failure causes token loss that did not happen before the batching change. Flushing the per-choice buffers before yielding the error response would preserve already generated output.
/gemini review
Code Review
This pull request introduces a performance optimization for streaming responses by adding a --stream-interval flag. This allows buffering a specified number of tokens before sending them to the client, which can significantly reduce host overhead and improve throughput at high concurrency, as demonstrated by the performance tests. The implementation is well-executed across both chat and text completion endpoints. The buffering logic correctly handles key edge cases, such as immediately sending the first token, flushing the buffer when it's full, and sending the final chunk upon completion. The changes are clean and the new command-line argument is appropriately documented. I found no high or critical severity issues in this pull request.
Could you add to the perf results the command lines used for the server and client, and note which GPU it was tested on?
Force-pushed from 21694ae to fdbb14e.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed from fdbb14e to ba65149.
Potentially a duplicate of #27376
Force-pushed from 04e4a88 to ef71ab7.
@codex review
Codex Review: Didn't find any major issues. Keep it up!
/gemini review
Code Review
This pull request introduces a --stream-interval flag to buffer generated tokens before sending them to the client, aiming to reduce host overhead and improve performance. The changes are well-implemented across the configuration, argument parsing, and engine layers. The core logic in vllm/v1/engine/output_processor.py correctly handles token buffering for streaming outputs, including edge cases like request completion and different output kinds. The implementation appears robust and the performance gains demonstrated in the description are significant. I have reviewed the code and found no critical or high-severity issues.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: elvischenv <[email protected]>
Force-pushed from ef71ab7 to 00e9dcb.
Signed-off-by: elvischenv <[email protected]>
Force-pushed from 00e9dcb to 103677a.
CC @aarnphm @chaunceyjiang: is it possible to implement all the logic in the API server?
```python
finished = finish_reason is not None
final_only = self.output_kind == RequestOutputKind.FINAL_ONLY

if not finished and final_only:
    # Only the final output is required in FINAL_ONLY mode.
    return None

# Stream Interval buffering: only apply for DELTA mode and stream_interval > 1
is_delta_streaming = self.output_kind == RequestOutputKind.DELTA
if is_delta_streaming and self.stream_interval > 1:
    # Track total tokens generated
    self.total_num_output_tokens += len(new_token_ids)

    # should send output when it is the first token or reach the stream interval
    should_send_output = (
        self.sent_tokens_offset == 0
        or self.total_num_output_tokens - self.sent_tokens_offset
        >= self.stream_interval
    )
```
is it possible to implement all logic in api server?
@heheda12345 why do we want this? I feel like stream interval is kind of an extension of RequestOutputKind.DELTA. There is also RequestOutputKind.FINAL_ONLY, which only outputs once after finishing.
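To make the should_send_output condition quoted above concrete, here is a toy trace, not vLLM code, assuming one new token per decode step and stream_interval=10: the first token is flushed immediately, then a flush happens each time ten or more unsent tokens have accumulated.

```python
# Toy simulation of the flush condition quoted above, one new token per step.
stream_interval = 10
sent_tokens_offset = 0
total_num_output_tokens = 0

flush_steps = []
for step in range(1, 26):
    total_num_output_tokens += 1
    should_send_output = (
        sent_tokens_offset == 0
        or total_num_output_tokens - sent_tokens_offset >= stream_interval
    )
    if should_send_output:
        flush_steps.append(step)
        sent_tokens_offset = total_num_output_tokens

print(flush_steps)  # [1, 11, 21]
```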
Hi @elvischenv. Thanks for proposing this exciting feature! Would you mind sharing some insight into the key contributing factors? Any before/after profiling numbers would also be super helpful! Thanks again for your work on this! 🙌
@FENP Thanks for the question. Initially I didn't expect this PR to bring so much TTFT perf gain, but I got similar perf numbers again after redoing the benchmarking. My implementation just sends the response when the first token arrives, the buffer reaches stream_interval, or the request finishes. One possible reason is system resource contention: if stream_interval=1 and concurrency=1024, then 1024 generated responses simultaneously enter output processing. With stream_interval=10, the number of responses is reduced by 10x, and output processing speed may improve significantly.
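A back-of-the-envelope illustration of that contention argument; the per-request decode speed below is an assumed number for illustration, not a measurement from this PR:

```python
# Rough estimate of how many SSE messages the API server must serialize
# and send per second at a given concurrency and stream interval.
concurrency = 1024            # concurrent streaming requests (from the comment above)
tokens_per_sec_per_req = 50   # assumed per-request decode speed

for stream_interval in (1, 10):
    messages_per_sec = concurrency * tokens_per_sec_per_req / stream_interval
    print(f"stream_interval={stream_interval}: ~{messages_per_sec:,.0f} messages/s")
# stream_interval=1:  ~51,200 messages/s
# stream_interval=10: ~5,120 messages/s
```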
@FENP I think what @elvischenv said makes sense. Basically, in the original case where output token throughput is very large, we are actually bounded by the rate at which output HTTP messages can be issued, so TTFT looks bad because the first output HTTP message is stuck in a long queue (I think).
@elvischenv @nvpohanh Got it — thank you for clarifying!
@FENP Do you mean the "input" token streaming or "output" token streaming?
Purpose
Support --stream-interval <num_of_tokens>. The flag buffers a number of tokens for a single request before sending them back to the client. Buffering reduces the time spent sending responses back to clients, thereby saving host overhead such as network/IO.
cc @benchislett @pavanimajety @nvpohanh
Testing with openai/gpt-oss-20b.
B200 server cmd:
Accuracy Test
Without --stream-interval (by default set to 1):
With --stream-interval 10:
Perf Test
Without --stream-interval (by default set to 1):
With --stream-interval 10: 57% e2e perf gain
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.