
Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Oct 31, 2025

Purpose

Support --stream-interval <num_of_tokens>. The flag buffers the given number of tokens for a single request before sending them back to the client. Buffering reduces the time spent sending responses back to clients, thereby saving host overhead such as network/IO.
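
For example, buffering is enabled by appending the flag to the serve command used below (the value 10 is just the setting used in the tests here):

vllm serve openai/gpt-oss-20b --stream-interval 10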

cc @benchislett @pavanimajety @nvpohanh

Testing with openai/gpt-oss-20b:

B200 server cmd:

VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
vllm serve openai/gpt-oss-20b \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--max-num-seqs 1024 \
--max-model-len 10240 \
--max-num-batched-tokens 8192 \
--max-cudagraph-capture-size 2048 \
-O.pass_config.enable_noop=True \
--async-scheduling \
--no-enable-prefix-caching

Accuracy Test

OPENAI_API_KEY="test" \
python -m gpt_oss.evals \
--sampler chat_completions \
--model openai/gpt-oss-20b \
--reasoning-effort medium \
--n-threads 512 \
--eval aime25

Without --stream-interval (defaults to 1):

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-20b-medium_temp1.0_20251031_095139', 'metric': 0.7083333333333334}]

With --stream-interval 10:

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-20b-medium_temp1.0_20251031_150201', 'metric': 0.7291666666666666}]

Perf Test

vllm bench serve \
--model openai/gpt-oss-20b \
--trust-remote-code \
--dataset-name random \
--ignore-eos \
--max-concurrency 1024 \
--num-prompts 5120 \
--random-input-len 1024 \
--random-output-len 1024

Without --stream-interval (defaults to 1):

============ Serving Benchmark Result ============
Successful requests:                     5120
Failed requests:                         0
Maximum request concurrency:             1024
Benchmark duration (s):                  153.85
Total input tokens:                      5242880
Total generated tokens:                  5242880
Request throughput (req/s):              33.28
Output token throughput (tok/s):         34078.85
Peak output token throughput (tok/s):    42240.00
Peak concurrent requests:                1554.00
Total Token throughput (tok/s):          68157.70
---------------Time to First Token----------------
Mean TTFT (ms):                          2698.80
Median TTFT (ms):                        2300.56
P99 TTFT (ms):                           6826.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.16
Median TPOT (ms):                        27.40
P99 TPOT (ms):                           28.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           27.41
Median ITL (ms):                         25.22
P99 ITL (ms):                            89.70
==================================================

With --stream-interval 10: 57% e2e perf gain

============ Serving Benchmark Result ============
Successful requests:                     5120
Failed requests:                         0
Maximum request concurrency:             1024
Benchmark duration (s):                  98.17
Total input tokens:                      5242880
Total generated tokens:                  5242880
Request throughput (req/s):              52.15
Output token throughput (tok/s):         53403.94
Peak output token throughput (tok/s):    8216.00
Peak concurrent requests:                1180.00
Total Token throughput (tok/s):          106807.88
---------------Time to First Token----------------
Mean TTFT (ms):                          1002.64
Median TTFT (ms):                        262.37
P99 TTFT (ms):                           6749.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.09
Median TPOT (ms):                        18.68
P99 TPOT (ms):                           18.85
---------------Inter-token Latency----------------
Mean ITL (ms):                           179.65
Median ITL (ms):                         137.12
P99 ITL (ms):                            512.61
==================================================

@mergify mergify bot added the frontend label Oct 31, 2025
@elvischenv elvischenv changed the title support stream interval [Perf] Support stream interval forr reducing host overhead Oct 31, 2025
@elvischenv elvischenv changed the title [Perf] Support stream interval forr reducing host overhead [Perf] Support stream interval for reducing host overhead Oct 31, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a --stream-interval feature to buffer streaming responses, which shows promising performance gains. However, the implementation has a critical flaw that causes data loss in streamed responses by dropping intermediate chunks that don't meet the buffering threshold. Additionally, the new stream_interval parameter lacks validation, allowing non-positive values that can lead to unexpected behavior. I have provided detailed comments on these critical and high-severity issues with suggestions for how to address them.
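
A minimal sketch of the kind of validation being suggested (the variable name, location, and error message here are illustrative only, not the code in this PR):

# Illustrative check only: reject non-positive intervals up front
# so a value like 0 or -1 cannot produce surprising buffering behavior.
if stream_interval < 1:
    raise ValueError(
        f"stream_interval must be a positive integer, got {stream_interval}"
    )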

@elvischenv elvischenv force-pushed the elvischenv/support-stream-interval branch from a06a73c to 21694ae Compare October 31, 2025 08:50
@elvischenv
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

except Exception as e:
    # TODO: Use a vllm-specific Validation Error
    logger.exception("Error in chat completion stream generator.")
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"

P1 Badge Flush buffered chat chunks on generator errors

The new stream batching stores JSON chunks in chunks_buffered and only flushes when the first token is sent, the buffer reaches stream_interval, or a finish reason arrives. If the underlying result_generator raises before hitting any of those conditions (e.g., model error or client disconnect after a few tokens), the buffered chunks are never emitted because the except branch returns an error response without draining the buffers. Prior to this change each chunk was sent immediately, so partial generations were still delivered before the error. With buffering enabled the client now silently loses every token after the last flush whenever an exception occurs mid-stream. Consider flushing any remaining buffered chunks before sending the error so streamed responses remain as complete as possible even when the request fails.
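
A minimal sketch of the suggested fix, assuming chunks_buffered is a flat list of the already-serialized SSE payloads described above (the exact buffer shape in the PR may differ):

except Exception as e:
    # Drain buffered-but-unsent chunks first so the client still receives
    # the tokens that were generated before the failure.
    for buffered in chunks_buffered:
        yield f"data: {buffered}\n\n"
    chunks_buffered.clear()
    # TODO: Use a vllm-specific Validation Error
    logger.exception("Error in chat completion stream generator.")
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"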


except Exception as e:
    # TODO: Use a vllm-specific Validation Error
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"

P1 Badge Completion streamer drops buffered tokens on exceptions

Similar to the chat path, completion streaming now buffers multiple SSE payloads per choice and only flushes when the first token, a full buffer, or a finish signal is seen. When result_generator raises an exception before a flush point, the code jumps to the except block and yields the error without first emitting buffered entries, so clients miss tokens that were already generated but not yet flushed. This regression means any mid-stream failure causes token loss that did not happen before the batching change. Flushing the per-choice buffers before yielding the error response would preserve already generated output.
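
The completion path could apply the same pattern per choice before yielding the error (buffered_payloads_per_choice is a placeholder name for the per-choice buffers described above):

except Exception as e:
    # Placeholder buffer name: flush every choice's pending payloads
    # before reporting the error so already-generated tokens are not lost.
    for choice_buffer in buffered_payloads_per_choice:
        for buffered in choice_buffer:
            yield f"data: {buffered}\n\n"
        choice_buffer.clear()
    data = self.create_streaming_error_response(str(e))
    yield f"data: {data}\n\n"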

@elvischenv
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a performance optimization for streaming responses by adding a --stream-interval flag. This allows buffering a specified number of tokens before sending them to the client, which can significantly reduce host overhead and improve throughput at high concurrency, as demonstrated by the performance tests. The implementation is well-executed across both chat and text completion endpoints. The buffering logic correctly handles key edge cases, such as immediately sending the first token, flushing the buffer when it's full, and sending the final chunk upon completion. The changes are clean and the new command-line argument is appropriately documented. I found no high or critical severity issues in this pull request.

@elvischenv elvischenv marked this pull request as draft October 31, 2025 11:27
@vadiklyutiy
Copy link
Collaborator

Could you add to the perf results the command lines used for the server and client, and the GPU it was tested on?

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@elvischenv elvischenv force-pushed the elvischenv/support-stream-interval branch from fdbb14e to ba65149 Compare October 31, 2025 15:43
@hmellor
Member

hmellor commented Oct 31, 2025

Potentially a duplicate of #27376

@elvischenv elvischenv force-pushed the elvischenv/support-stream-interval branch 2 times, most recently from 04e4a88 to ef71ab7 Compare October 31, 2025 16:15
@elvischenv
Contributor Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep it up!

@elvischenv
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a --stream-interval flag to buffer generated tokens before sending them to the client, aiming to reduce host overhead and improve performance. The changes are well-implemented across the configuration, argument parsing, and engine layers. The core logic in vllm/v1/engine/output_processor.py correctly handles token buffering for streaming outputs, including edge cases like request completion and different output kinds. The implementation appears robust and the performance gains demonstrated in the description are significant. I have reviewed the code and found no critical or high-severity issues.

@mergify

mergify bot commented Nov 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 3, 2025
Signed-off-by: elvischenv <[email protected]>
@elvischenv elvischenv force-pushed the elvischenv/support-stream-interval branch from ef71ab7 to 00e9dcb Compare November 3, 2025 03:47
@mergify mergify bot removed the needs-rebase label Nov 3, 2025
Signed-off-by: elvischenv <[email protected]>
@heheda12345
Collaborator

CC @aarnphm @chaunceyjiang is it possible to implement all the logic in the API server?

Comment on lines +207 to +225
finished = finish_reason is not None
final_only = self.output_kind == RequestOutputKind.FINAL_ONLY

if not finished and final_only:
    # Only the final output is required in FINAL_ONLY mode.
    return None

# Stream Interval buffering: only apply for DELTA mode and stream_interval > 1
is_delta_streaming = self.output_kind == RequestOutputKind.DELTA
if is_delta_streaming and self.stream_interval > 1:
    # Track total tokens generated
    self.total_num_output_tokens += len(new_token_ids)

    # should send output when it is the first token or reach the stream interval
    should_send_output = (
        self.sent_tokens_offset == 0
        or self.total_num_output_tokens - self.sent_tokens_offset
        >= self.stream_interval
    )
Contributor Author

is it possible to implement all the logic in the API server?

@heheda12345 why do we want this? I feel like stream interval is kind of an extension of RequestOutputKind.DELTA. There is also RequestOutputKind.FINAL_ONLY, which only outputs once after finishing.

@FENP
Contributor

FENP commented Nov 4, 2025

Hi @elvischenv. Thanks for proposing this exciting feature!
I'm trying to understand how the TTFT gain is achieved with --stream-interval. IIUC, the first token should be sent back to the client as soon as possible to reduce TTFT.

Would you mind sharing some insight into the key contributing factors? Any before/after profiling numbers would also be super helpful!

Thanks again for your work on this! 🙌

@elvischenv
Contributor Author

elvischenv commented Nov 4, 2025

I'm trying to understand how the TTFT gain is achieved with --stream-interval. IIUC, the first token should be sent back to the client as soon as possible to reduce TTFT.

Would you mind sharing some insight into the key contributing factors? Any before/after profiling numbers would also be super helpful!

@FENP Thanks for the question. Initially I didn't expect this PR to bring so much TTFT perf gain, but I got similar perf numbers again after redoing the benchmark. My implementation sends the response immediately when self.sent_tokens_offset == 0, which means it is the first response going back to the client. After that, the offset moves forward and tokens are batched in groups of stream_interval before being sent.

            # should send output when it is the first token or reach the stream interval
            should_send_output = (
                self.sent_tokens_offset == 0
                or self.total_num_output_tokens - self.sent_tokens_offset
                >= self.stream_interval
            )

One possible reason is system resource contention. With stream_interval=1 and concurrency=1024, up to 1024 generated responses enter output processing at the same time. With stream_interval=10, the number of responses is reduced by roughly 10x, so output processing can be significantly faster.
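
As a rough back-of-the-envelope illustration (not the exact flush count, since the real condition depends on how many tokens arrive per step), with the benchmark's 1024 output tokens the number of streamed responses per request drops from 1024 to roughly 104:

# Rough estimate of SSE messages per request under the rule quoted above:
# the first token flushes immediately, then one flush per full interval,
# plus a final flush when the request finishes.
def approx_messages(output_len: int, stream_interval: int) -> int:
    if stream_interval <= 1:
        return output_len
    return 1 + (output_len - 1) // stream_interval + 1

print(approx_messages(1024, 1))   # 1024
print(approx_messages(1024, 10))  # 104, roughly a 10x reduction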

@nvpohanh
Contributor

nvpohanh commented Nov 5, 2025

@FENP I think what @elvischenv said makes sense. Basically, in the original case where the output token throughput is very large, we are actually bounded by the output HTTP issue rate, so TTFT looks bad because the first output HTTP message is stuck in a long queue (I think).

@FENP
Contributor

FENP commented Nov 6, 2025

@elvischenv @nvpohanh Got it, thank you for clarifying!
FYI, I noticed a difference in Peak concurrent requests in the benchmark results. I suspect that --stream-interval indirectly affects it by introducing delays in token streaming, similar to how --request-rate works in bench serve?

@nvpohanh
Contributor

nvpohanh commented Nov 6, 2025

I suspect that --stream-interval indirectly affects it by introducing delays in token streaming

@FENP Do you mean the "input" token streaming or "output" token streaming? --stream-interval controls how the server streams the output tokens to the client. It is not related to how the input tokens are streamed into the server or how the requests are submitted to the server.
