
Conversation

@daniel-salib
Contributor

@daniel-salib daniel-salib commented Feb 27, 2025

Adds accurate realtime request concurrency tracking, plus a /load API endpoint to retrieve the realtime concurrency count.
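For readers skimming the thread, here is a minimal sketch of what a counter-backed /load endpoint looks like in FastAPI. This is an illustration only, not necessarily the PR's exact code; the response field name is an assumption.

    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()
    app.state.server_load_metrics = 0  # incremented/decremented around each tracked request

    @app.get("/load")
    async def get_server_load(raw_request: Request) -> JSONResponse:
        # Report the number of requests currently in flight on this API server.
        return JSONResponse(
            content={"server_load": raw_request.app.state.server_load_metrics})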

benchmark_serving.py with the load tracking:

============ Serving Benchmark Result ============
Successful requests:                     20000     
Benchmark duration (s):                  581.50    
Total input tokens:                      4516852   
Total generated tokens:                  3751446   
Request throughput (req/s):              34.39     
Output token throughput (tok/s):         6451.36   
Total Token throughput (tok/s):          14218.98  
---------------Time to First Token----------------
Mean TTFT (ms):                          328943.69 
Median TTFT (ms):                        329329.01 
P99 TTFT (ms):                           557303.45 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.84     
Median TPOT (ms):                        31.75     
P99 TPOT (ms):                           77.55     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.21     
Median ITL (ms):                         23.08     
P99 ITL (ms):                            63.13     
==================================================

benchmark_serving.py without the load tracking:

============ Serving Benchmark Result ============
Successful requests:                     20000     
Benchmark duration (s):                  604.39    
Total input tokens:                      4516852   
Total generated tokens:                  3748047   
Request throughput (req/s):              33.09     
Output token throughput (tok/s):         6201.37   
Total Token throughput (tok/s):          13674.76  
---------------Time to First Token----------------
Mean TTFT (ms):                          346449.51 
Median TTFT (ms):                        341887.55 
P99 TTFT (ms):                           581491.11 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.11     
Median TPOT (ms):                        32.22     
P99 TPOT (ms):                           78.94     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.74     
Median ITL (ms):                         23.24     
P99 ITL (ms):                            65.95     
==================================================
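As a quick sanity check on the numbers shown above, the relative throughput difference between the two runs works out to roughly 4% in favor of the run with load tracking, which is part of why the discussion below treats the gap as run-to-run noise rather than a regression. Trivial arithmetic on the reported values:

    # Pure arithmetic on the throughput numbers reported above.
    with_tracking = 34.39      # req/s, load tracking enabled
    without_tracking = 33.09   # req/s, load tracking disabled

    relative_delta = (with_tracking - without_tracking) / without_tracking
    print(f"{relative_delta:+.1%}")  # prints +3.9%; the tracked run is faster here,
                                     # so the difference is attributed to variance.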

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@simon-mo
Collaborator

Middleware does look cleaner to me. Please rename the metric to server_load_metrics to be more readable. I also want to (1) clarify the performance impact and (2) make sure this can work when we run multiple frontend processes (@russellb @njhill).

@russellb
Member

This won't work as-is with multiple frontend processes (API servers), but this also won't be the only metric we have to fix. It's noted as one of the challenges, though we know the expected solution. It's mentioned in the design doc linked from #12705

Comment on lines 903 to 904
Contributor

Since this is just one counter update, I think using regular threading.Lock() might be more efficient than asyncio.Lock, avoiding async context switches.
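For illustration, a minimal sketch of this suggestion (class and method names are hypothetical, not from the PR): a plain threading.Lock held only for the integer update, so no await is needed.

    import threading

    class ServerLoadCounter:
        """Counter guarded by a plain threading.Lock; the lock is held only
        for a single integer update, so there is no async context switch."""

        def __init__(self) -> None:
            self._lock = threading.Lock()
            self._value = 0

        def increment(self) -> None:
            with self._lock:
                self._value += 1

        def decrement(self) -> None:
            with self._lock:
                self._value -= 1

        @property
        def value(self) -> int:
            with self._lock:
                return self._value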

Contributor

We could use a finally block to decrement the counter?
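A small sketch of this suggestion for a non-streaming handler (route and helper names are illustrative, not the PR's code); the finally block guarantees the decrement even if the handler raises.

    from fastapi import FastAPI, Request

    app = FastAPI()
    app.state.server_load_metrics = 0

    async def process(raw_request: Request) -> dict:
        # stand-in for the real completion handler
        return {"ok": True}

    @app.post("/v1/example")
    async def handle_request(raw_request: Request):
        raw_request.app.state.server_load_metrics += 1
        try:
            return await process(raw_request)
        finally:
            # runs on success and on exception, so the counter never leaks
            raw_request.app.state.server_load_metrics -= 1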

Contributor

How will HTTP streaming requests be handled in this case?

Contributor Author

From what I can tell from testing, Starlette will run the background task only after streaming is complete or when the connection is closed.
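That matches Starlette's documented behavior: a task attached to a response's background attribute runs only after the response body has been fully sent. A minimal, simplified sketch (not the PR's code):

    from fastapi import FastAPI, Request
    from starlette.background import BackgroundTask
    from starlette.responses import StreamingResponse

    app = FastAPI()
    app.state.server_load = 0  # simplified counter for illustration

    async def token_stream():
        for chunk in ("Hello", " ", "world"):
            yield chunk

    def decrement_load(request: Request) -> None:
        request.app.state.server_load -= 1

    @app.get("/stream")
    async def stream(raw_request: Request):
        raw_request.app.state.server_load += 1
        # Starlette runs the background task only after the response body has
        # been fully streamed (or the client disconnects), which is what makes
        # this pattern safe for decrementing the in-flight counter.
        return StreamingResponse(
            token_stream(),
            background=BackgroundTask(decrement_load, raw_request),
        )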

@njhill
Member

njhill commented Feb 27, 2025

+1 on making sure there's no performance impact, we've had some nasty middleware-related performance issues before.

@daniel-salib
Contributor Author

Thanks for the reviews :) Made another pass taking all the feedback into consideration

Contributor

@youngkent youngkent left a comment

To address the perf concerns of the middleware, could you paste some benchmark data in the PR summary?

@daniel-salib
Contributor Author

Thanks for the review @youngkent

Resolved all the comments and added a unit test for the /load route.

Also attached the benchmark results showing the latency comparison before and after the middleware.

Contributor

@youngkent youngkent left a comment

Thanks for updating and adding the latency benchmark data!
It seems latency regressed by ~1% with the middleware enabled, which is a bit higher than expected. Can we understand a bit more whether the latency is added to TTFT, TPOT, or just to the end of the HTTP request due to the async background task? (benchmark_serving.py should give some more detailed info.)
If it's really stat-sig and impacting TTFT/TPOT, we might need to consider exploring a more efficient implementation without using middleware, e.g. implementing increment/decrement directly in the API calls in api_server.

Comment on lines 42 to 45
Contributor

Is this endpoint not supported for the non-OpenAI server? If so, should we return an error by default?

Contributor Author

Meant to remove this. From the pattern I see, it's common to have some endpoints such as /ping and /version that are only defined and implemented in the OpenAI server, so I followed that pattern for /load and no longer define it in the non-OpenAI server.

@daniel-salib
Contributor Author

daniel-salib commented Mar 4, 2025

@youngkent I ran benchmark_serving.py twice, with and without the middleware, and added the results to the PR description.

Is it safe to assume the discrepancy between the runs is due to random +/- error and the middleware's impact is negligible?

@daniel-salib daniel-salib changed the title from "add middleware to track concurrent_requests" to "add middleware to track server_load" on Mar 4, 2025
@youngkent
Contributor

@daniel-salib Seems the benchmark_serving.py data has wide variance; it's hard to tell if it's really stat-sig. We might want to increase the sample size, meaning more requests per control and test group. Can we test with 20000 requests per group?

@daniel-salib
Contributor Author

@youngkent updated the PR description with the benchmark comparison across 20,000 requests per group

@youngkent
Contributor

Thanks @daniel-salib, the variance is still a bit high, but at least there is no conclusive signal showing the middleware regresses performance so far.
@simon-mo Do you want to double check the data and see if this implementation looks good to you? Thanks!

@daniel-salib
Contributor Author

daniel-salib commented Mar 5, 2025

@youngkent

I was previously using the random dataset when benchmarking - thought that may have had an effect on the variance.

I updated the description with the results after switching to --dataset-name=sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json, but I still see some variance.

@daniel-salib daniel-salib marked this pull request as ready for review March 5, 2025 22:46
@robertgshaw2-redhat
Collaborator

What type of GPU is the performance test run on?

@daniel-salib
Contributor Author

Thanks for the review @robertgshaw2-redhat

I ran the performance test on 2 x H100 GPUs.

@simon-mo
Collaborator

simon-mo commented Mar 6, 2025

The serving throughput degradation is pretty serious, almost a 3% drop. We cannot merge this as-is due to the performance impact. I added arbitrary middleware support here if you really need it, or alternatively you can figure out a better way.

@daniel-salib daniel-salib changed the title from "add middleware to track server_load" to "track server_load" on Mar 6, 2025
@daniel-salib
Contributor Author

@youngkent I took a different approach that should be much better performance-wise. Updated the description to include the latest benchmarks.

Collaborator

@simon-mo simon-mo left a comment

I think this is a good approach now. Two questions

  1. Please put this behind a feature flag so that by default there is no perf regression.
  2. Please define the semantics of "load": which endpoints does it actually cover? I'm confused about why the tokenize endpoint is covered here, as it doesn't hit the GPU at all.

Comment on lines 270 to 283
Collaborator

the inc and dec can be inlined

Contributor

Yeah, it might be better to NOT count tokenizer as the load.

Contributor

If create_chat_completion throws an exception, it would not decrement the server load?

Contributor Author

Yeah, good catch. I added a dependency for the request that decrements on any exception now.

@daniel-salib
Contributor Author

@youngkent handling exceptions on streaming requests now - PTAL

Contributor

The code is a bit more complex to read now. How about we just make a wrapper function like

async def create_chat_completion(...):
    if not raw_request.app.state.enable_server_load_tracking:
        return await create_chat_completion(...)  # i.e. the existing (wrapped) handler
    else:
        increment_server_load(raw_request)
        try:
            response = await create_chat_completion(...)  # the existing (wrapped) handler
            # decrement once the response (including any stream) has been sent
            response.background = BackgroundTask(decrement_server_load, raw_request)
            return response
        except Exception:
            decrement_server_load(raw_request)
            raise

We can further make the above common logic a method like load_aware_call(http_method, ...)

Contributor Author

Yeah, looks much cleaner - I made another pass with "load_aware_streaming_call" implemented. I only applied it to streaming calls because I thought it might be unnecessary for the non-streaming requests. The non-streaming calls still use the dependency injection.

LMK how it looks and if it's better for everything to use load_aware_call.

thanks!

Contributor

Wondering if we could unify track_server_load_non_streaming and load_aware_streaming_call into a "@load_aware" annotation, placed in front of the HTTP methods?

Contributor Author

Sounds good - I implemented it as a decorator now and removed the dependency injection.
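For readers following along, here is a sketch of what such a decorator can look like. The attribute and helper names follow the discussion above; this is an assumption-laden illustration, and the merged vLLM implementation may differ in details.

    import functools

    from fastapi import Request
    from starlette.background import BackgroundTask
    from starlette.responses import Response


    def _decrement_server_load(raw_request: Request) -> None:
        raw_request.app.state.server_load_metrics -= 1


    def load_aware_call(handler):
        """Wrap an HTTP handler so it maintains the in-flight request counter."""

        @functools.wraps(handler)
        async def wrapper(*args, raw_request: Request, **kwargs):
            if not getattr(raw_request.app.state,
                           "enable_server_load_tracking", False):
                return await handler(*args, raw_request=raw_request, **kwargs)

            raw_request.app.state.server_load_metrics += 1
            try:
                response = await handler(*args, raw_request=raw_request, **kwargs)
            except Exception:
                _decrement_server_load(raw_request)
                raise

            if isinstance(response, Response):
                # Defer the decrement until the body (including a stream) is sent.
                response.background = BackgroundTask(_decrement_server_load,
                                                     raw_request)
            else:
                _decrement_server_load(raw_request)
            return response

        return wrapper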

@daniel-salib
Contributor Author

Implemented @youngkent's latest suggestions.

Also re-ran the benchmark and updated the PR comment to confirm there's no regression.

Contributor Author

Good catch - I added handling to convert from a single BackgroundTask to BackgroundTasks, which will allow us to chain multiple tasks together when needed. Did manual testing locally to confirm it's working.
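A sketch of that chaining (the helper name is hypothetical): if the response already has a background task, both it and the load decrement are wrapped in a BackgroundTasks container so neither is dropped.

    from starlette.background import BackgroundTask, BackgroundTasks
    from starlette.responses import Response


    def attach_background_task(response: Response, task: BackgroundTask) -> Response:
        """Attach `task` without clobbering any background task already set."""
        if response.background is None:
            response.background = task
        elif isinstance(response.background, BackgroundTasks):
            response.background.tasks.append(task)
        else:
            # Promote the single existing BackgroundTask to a BackgroundTasks
            # container that runs the old task first, then the new one.
            response.background = BackgroundTasks(
                tasks=[response.background, task])
        return response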

Contributor

@youngkent youngkent left a comment

One more comment. Otherwise, looks good. Stamping.

@simon-mo
Collaborator

Pre-commit (i.e. lint) is failing.

@simon-mo simon-mo added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Mar 11, 2025
@daniel-salib
Contributor Author

@simon-mo rebased and the pre-commit is passing now

@daniel-salib
Contributor Author

@simon-mo I see some failures in the entrypoints test but they seem unrelated to the PR:

I think we're good to merge but LMK if this is an issue. Thanks!

[2025-03-11T23:16:51Z] =========================== short test summary info ============================
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[-True] - AssertionError: vllm:iteration_tokens_total_sum expected value of 200 did not match found value 219.0
[2025-03-11T23:16:51Z] assert 219.0 == 200
[2025-03-11T23:16:51Z]  +  where 219.0 = Sample(name='vllm:iteration_tokens_total_sum', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=219.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[-False] - AssertionError: vllm:time_to_first_token_seconds_count expected value of 10 did not match found value 11.0
[2025-03-11T23:16:51Z] assert 11.0 == 10
[2025-03-11T23:16:51Z]  +  where 11.0 = Sample(name='vllm:time_to_first_token_seconds_count', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=11.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[--enable-chunked-prefill-False] - AssertionError: vllm:time_to_first_token_seconds_count expected value of 10 did not match found value 11.0
[2025-03-11T23:16:51Z] assert 11.0 == 10
[2025-03-11T23:16:51Z]  +  where 11.0 = Sample(name='vllm:time_to_first_token_seconds_count', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=11.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[--enable-chunked-prefill-True] - AssertionError: vllm:iteration_tokens_total_sum expected value of 200 did not match found value 219.0
[2025-03-11T23:16:51Z] assert 219.0 == 200
[2025-03-11T23:16:51Z]  +  where 219.0 = Sample(name='vllm:iteration_tokens_total_sum', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=219.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[--disable-frontend-multiprocessing-False] - AssertionError: vllm:time_to_first_token_seconds_count expected value of 10 did not match found value 11.0
[2025-03-11T23:16:51Z] assert 11.0 == 10
[2025-03-11T23:16:51Z]  +  where 11.0 = Sample(name='vllm:time_to_first_token_seconds_count', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=11.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[--disable-frontend-multiprocessing-True] - AssertionError: vllm:iteration_tokens_total_sum expected value of 200 did not match found value 219.0
[2025-03-11T23:16:51Z] assert 219.0 == 200
[2025-03-11T23:16:51Z]  +  where 219.0 = Sample(name='vllm:iteration_tokens_total_sum', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=219.0, timestamp=None, exemplar=None).value

@DarkLight1337
Member

DarkLight1337 commented Mar 14, 2025

I think the entrypoints failures are related to this PR because they're not failing on main, please fix them.

Signed-off-by: Daniel Salib <[email protected]>
@daniel-salib
Contributor Author

Ah, I found the issue: adding the extra unit test to test_metrics messed up the expected metric numbers in the metric-count tests because the tests run on the same server instance. I decided it would be best to keep the metrics test just for Prometheus metrics and felt it may be more appropriate to test the server_load arg in test_basic.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) March 14, 2025 11:59
@DarkLight1337 DarkLight1337 changed the title from "track server_load" to "[Frontend] track server_load" on Mar 14, 2025
@vllm-bot vllm-bot merged commit 73deea2 into vllm-project:main Mar 14, 2025
31 of 33 checks passed
richardsliu pushed a commit to richardsliu/vllm that referenced this pull request Mar 14, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

Labels

frontend, ready (ONLY add when PR is ready to merge/full CI is needed)
