
Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Oct 31, 2025

Purpose

Support --stream-interval <num_of_tokens>. The flag buffers the given number of tokens for a single request before sending them back to the client. Buffering reduces the time spent sending responses back to clients, thereby saving host overhead such as network/IO.
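
For example, buffering is enabled by appending the flag to the serve command used below (the value 10 is just the setting used in the tests here):

vllm serve openai/gpt-oss-20b --stream-interval 10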

cc @benchislett @pavanimajety @nvpohanh

Testing with openai/gpt-oss-20b:

B200 server cmd:

VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
vllm serve openai/gpt-oss-20b \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--max-num-seqs 1024 \
--max-model-len 10240 \
--max-num-batched-tokens 8192 \
--max-cudagraph-capture-size 2048 \
-O.pass_config.enable_noop=True \
--async-scheduling \
--no-enable-prefix-caching

Accuracy Test

OPENAI_API_KEY="test" \
python -m gpt_oss.evals \
--sampler chat_completions \
--model openai/gpt-oss-20b \
--reasoning-effort medium \
--n-threads 512 \
--eval aime25

Without --stream-interval (defaults to 1):

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-20b-medium_temp1.0_20251031_095139', 'metric': 0.7083333333333334}]

With --stream-interval 10:

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-20b-medium_temp1.0_20251031_150201', 'metric': 0.7291666666666666}]

Perf Test

vllm bench serve \
--model openai/gpt-oss-20b \
--trust-remote-code \
--dataset-name random \
--ignore-eos \
--max-concurrency 1024 \
--num-prompts 5120 \
--random-input-len 1024 \
--random-output-len 1024

Without --stream-interval (defaults to 1):

============ Serving Benchmark Result ============
Successful requests:                     5120
Failed requests:                         0
Maximum request concurrency:             1024
Benchmark duration (s):                  153.85
Total input tokens:                      5242880
Total generated tokens:                  5242880
Request throughput (req/s):              33.28
Output token throughput (tok/s):         34078.85
Peak output token throughput (tok/s):    42240.00
Peak concurrent requests:                1554.00
Total Token throughput (tok/s):          68157.70
---------------Time to First Token----------------
Mean TTFT (ms):                          2698.80
Median TTFT (ms):                        2300.56
P99 TTFT (ms):                           6826.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.16
Median TPOT (ms):                        27.40
P99 TPOT (ms):                           28.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           27.41
Median ITL (ms):                         25.22
P99 ITL (ms):                            89.70
==================================================

With --stream-interval 10: 57% e2e perf gain

============ Serving Benchmark Result ============
Successful requests:                     5120
Failed requests:                         0
Maximum request concurrency:             1024
Benchmark duration (s):                  98.17
Total input tokens:                      5242880
Total generated tokens:                  5242880
Request throughput (req/s):              52.15
Output token throughput (tok/s):         53403.94
Peak output token throughput (tok/s):    8216.00
Peak concurrent requests:                1180.00
Total Token throughput (tok/s):          106807.88
---------------Time to First Token----------------
Mean TTFT (ms):                          1002.64
Median TTFT (ms):                        262.37
P99 TTFT (ms):                           6749.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.09
Median TPOT (ms):                        18.68
P99 TPOT (ms):                           18.85
---------------Inter-token Latency----------------
Mean ITL (ms):                           179.65
Median ITL (ms):                         137.12
P99 ITL (ms):                            512.61
==================================================

@mergify mergify bot added the frontend label Oct 31, 2025
@elvischenv elvischenv changed the title support stream interval [Perf] Support stream interval forr reducing host overhead Oct 31, 2025
@elvischenv elvischenv changed the title [Perf] Support stream interval forr reducing host overhead [Perf] Support stream interval for reducing host overhead Oct 31, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a --stream-interval feature to buffer streaming responses, which shows promising performance gains. However, the implementation has a critical flaw that causes data loss in streamed responses by dropping intermediate chunks that don't meet the buffering threshold. Additionally, the new stream_interval parameter lacks validation, allowing non-positive values that can lead to unexpected behavior. I have provided detailed comments on these critical and high-severity issues with suggestions for how to address them.
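
A minimal sketch of the kind of validation being suggested (the variable name, location, and error message here are illustrative only, not the code in this PR):

# Illustrative check only: reject non-positive intervals up front
# so a value like 0 or -1 cannot produce surprising buffering behavior.
if stream_interval < 1:
    raise ValueError(
        f"stream_interval must be a positive integer, got {stream_interval}"
    )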

@elvischenv elvischenv force-pushed the elvischenv/support-stream-interval branch from a06a73c to 21694ae Compare October 31, 2025 08:50
@elvischenv
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

except Exception as e:
    # TODO: Use a vllm-specific Validation Error
    logger.exception("Error in chat completion stream generator.")
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"

P1 Badge Flush buffered chat chunks on generator errors

The new stream batching stores JSON chunks in chunks_buffered and only flushes when the first token is sent, the buffer reaches stream_interval, or a finish reason arrives. If the underlying result_generator raises before hitting any of those conditions (e.g., model error or client disconnect after a few tokens), the buffered chunks are never emitted because the except branch returns an error response without draining the buffers. Prior to this change each chunk was sent immediately, so partial generations were still delivered before the error. With buffering enabled the client now silently loses every token after the last flush whenever an exception occurs mid-stream. Consider flushing any remaining buffered chunks before sending the error so streamed responses remain as complete as possible even when the request fails.
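
A minimal sketch of the suggested fix, assuming chunks_buffered is a flat list of the already-serialized SSE payloads described above (the exact buffer shape in the PR may differ):

except Exception as e:
    # Drain buffered-but-unsent chunks first so the client still receives
    # the tokens that were generated before the failure.
    for buffered in chunks_buffered:
        yield f"data: {buffered}\n\n"
    chunks_buffered.clear()
    # TODO: Use a vllm-specific Validation Error
    logger.exception("Error in chat completion stream generator.")
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"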


except Exception as e:
    # TODO: Use a vllm-specific Validation Error
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"

P1 Badge Completion streamer drops buffered tokens on exceptions

Similar to the chat path, completion streaming now buffers multiple SSE payloads per choice and only flushes when the first token, a full buffer, or a finish signal is seen. When result_generator raises an exception before a flush point, the code jumps to the except block and yields the error without first emitting buffered entries, so clients miss tokens that were already generated but not yet flushed. This regression means any mid-stream failure causes token loss that did not happen before the batching change. Flushing the per-choice buffers before yielding the error response would preserve already generated output.
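
The completion path could apply the same pattern per choice before yielding the error (buffered_payloads_per_choice is a placeholder name for the per-choice buffers described above):

except Exception as e:
    # Placeholder buffer name: flush every choice's pending payloads
    # before reporting the error so already-generated tokens are not lost.
    for choice_buffer in buffered_payloads_per_choice:
        for buffered in choice_buffer:
            yield f"data: {buffered}\n\n"
        choice_buffer.clear()
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"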

@elvischenv
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a performance optimization for streaming responses by adding a --stream-interval flag. This allows buffering a specified number of tokens before sending them to the client, which can significantly reduce host overhead and improve throughput at high concurrency, as demonstrated by the performance tests. The implementation is well-executed across both chat and text completion endpoints. The buffering logic correctly handles key edge cases, such as immediately sending the first token, flushing the buffer when it's full, and sending the final chunk upon completion. The changes are clean and the new command-line argument is appropriately documented. I found no high or critical severity issues in this pull request.

@elvischenv elvischenv marked this pull request as draft October 31, 2025 11:27
@vadiklyutiy
Copy link
Collaborator

Could you add to the perf results the command lines used for the server and client, and the GPU it was tested on?

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@elvischenv elvischenv force-pushed the elvischenv/support-stream-interval branch from fdbb14e to ba65149 Compare October 31, 2025 15:43
@hmellor
Member

hmellor commented Oct 31, 2025

Potentially a duplicate of #27376

@elvischenv elvischenv force-pushed the elvischenv/support-stream-interval branch 2 times, most recently from 04e4a88 to ef71ab7 Compare October 31, 2025 16:15
@elvischenv
Contributor Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep it up!

@elvischenv
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a --stream-interval flag to buffer generated tokens before sending them to the client, aiming to reduce host overhead and improve performance. The changes are well-implemented across the configuration, argument parsing, and engine layers. The core logic in vllm/v1/engine/output_processor.py correctly handles token buffering for streaming outputs, including edge cases like request completion and different output kinds. The implementation appears robust and the performance gains demonstrated in the description are significant. I have reviewed the code and found no critical or high-severity issues.

@mergify

mergify bot commented Nov 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 3, 2025
Signed-off-by: elvischenv <[email protected]>
@elvischenv elvischenv force-pushed the elvischenv/support-stream-interval branch from ef71ab7 to 00e9dcb Compare November 3, 2025 03:47
@mergify mergify bot removed the needs-rebase label Nov 3, 2025
Signed-off-by: elvischenv <[email protected]>
@heheda12345
Collaborator

CC @aarnphm @chaunceyjiang is it possible to implement all the logic in the API server?

Comment on lines +207 to +225
finished = finish_reason is not None
final_only = self.output_kind == RequestOutputKind.FINAL_ONLY

if not finished and final_only:
    # Only the final output is required in FINAL_ONLY mode.
    return None

# Stream Interval buffering: only apply for DELTA mode and stream_interval > 1
is_delta_streaming = self.output_kind == RequestOutputKind.DELTA
if is_delta_streaming and self.stream_interval > 1:
    # Track total tokens generated
    self.total_num_output_tokens += len(new_token_ids)

    # should send output when it is the first token or reach the stream interval
    should_send_output = (
        self.sent_tokens_offset == 0
        or self.total_num_output_tokens - self.sent_tokens_offset
        >= self.stream_interval
    )
Contributor Author

is it possible to implement all the logic in the API server?

@heheda12345 why do we want this? I feel like stream interval is kind of an extension of RequestOutputKind.DELTA. There is also RequestOutputKind.FINAL_ONLY, which only outputs once after finishing.

@FENP
Contributor

FENP commented Nov 4, 2025

Hi @elvischenv. Thanks for proposing this exciting feature!
I'm trying to understand how the TTFT gain is achieved with --stream-interval. IIUC, the first token should be sent back to the client as soon as possible to reduce TTFT.

Would you mind sharing some insight into the key contributing factors? Any before/after profiling numbers would also be super helpful!

Thanks again for your work on this! 🙌

@elvischenv
Contributor Author

elvischenv commented Nov 4, 2025

I'm trying to understand how the TTFT gain is achieved with --stream-interval. IIUC, the first token should be sent back to the client as soon as possible to reduce TTFT.

Would you mind sharing some insight into the key contributing factors? Any before/after profiling numbers would also be super helpful!

@FENP Thanks for the question. Initially I didn't expect this PR to bring so much TTFT perf gain, but I got similar perf numbers again after redoing the benchmark. My implementation sends the response immediately when self.sent_tokens_offset == 0, which means it is the first response going back to the client. After that, the offset moves forward and tokens are batched in groups of stream_interval before being sent.

            # should send output when it is the first token or reach the stream interval
            should_send_output = (
                self.sent_tokens_offset == 0
                or self.total_num_output_tokens - self.sent_tokens_offset
                >= self.stream_interval
            )

One possible reason is system resource contention. With stream_interval=1 and concurrency=1024, up to 1024 generated responses enter output processing at the same time. With stream_interval=10, the number of responses is reduced by roughly 10x, so output processing can be significantly faster.
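
As a rough back-of-the-envelope illustration (not the exact flush count, since the real condition depends on how many tokens arrive per step), with the benchmark's 1024 output tokens the number of streamed responses per request drops from 1024 to roughly 104:

# Rough estimate of SSE messages per request under the rule quoted above:
# the first token flushes immediately, then one flush per full interval,
# plus a final flush when the request finishes.
def approx_messages(output_len: int, stream_interval: int) -> int:
    if stream_interval <= 1:
        return output_len
    return 1 + (output_len - 1) // stream_interval + 1

print(approx_messages(1024, 1))   # 1024
print(approx_messages(1024, 10))  # 104, roughly a 10x reduction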

@nvpohanh
Contributor

nvpohanh commented Nov 5, 2025

@FENP I think what @elvischenv said makes sense. Basically, in the original case where the output token throughput is very large, we are actually bounded by the output HTTP issue rate, so TTFT looks bad because the first output HTTP message is stuck in a long queue (I think).

@FENP
Contributor

FENP commented Nov 6, 2025

@elvischenv @nvpohanh Got it, thank you for clarifying!
FYI, I noticed a difference in Peak concurrent requests in the benchmark results. I suspect that --stream-interval indirectly affects it by introducing delays in token streaming, similar to how --request-rate works in bench serve?

@nvpohanh
Contributor

nvpohanh commented Nov 6, 2025

I suspect that --stream-interval indirectly affects it by introducing delays in token streaming

@FENP Do you mean the "input" token streaming or "output" token streaming? --stream-interval controls how the server streams the output tokens to the client. It is not related to how the input tokens are streamed into the server or how the requests are submitted to the server.
