
Conversation

@daniel-salib
Contributor

@daniel-salib daniel-salib commented Feb 27, 2025

Adds accurate realtime request concurrency tracking, plus a /load API endpoint to retrieve the realtime concurrency count.
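For readers skimming the thread, here is a minimal sketch of what a counter-backed /load endpoint looks like in FastAPI. This is an illustration only, not necessarily the PR's exact code; the response field name is an assumption.

    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()
    app.state.server_load_metrics = 0  # incremented/decremented around each tracked request

    @app.get("/load")
    async def get_server_load(raw_request: Request) -> JSONResponse:
        # Report the number of requests currently in flight on this API server.
        return JSONResponse(
            content={"server_load": raw_request.app.state.server_load_metrics})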

benchmark_serving.py with the load tracking:

============ Serving Benchmark Result ============
Successful requests:                     20000     
Benchmark duration (s):                  581.50    
Total input tokens:                      4516852   
Total generated tokens:                  3751446   
Request throughput (req/s):              34.39     
Output token throughput (tok/s):         6451.36   
Total Token throughput (tok/s):          14218.98  
---------------Time to First Token----------------
Mean TTFT (ms):                          328943.69 
Median TTFT (ms):                        329329.01 
P99 TTFT (ms):                           557303.45 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.84     
Median TPOT (ms):                        31.75     
P99 TPOT (ms):                           77.55     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.21     
Median ITL (ms):                         23.08     
P99 ITL (ms):                            63.13     
==================================================

benchmark_serving.py without the load tracking:

============ Serving Benchmark Result ============
Successful requests:                     20000     
Benchmark duration (s):                  604.39    
Total input tokens:                      4516852   
Total generated tokens:                  3748047   
Request throughput (req/s):              33.09     
Output token throughput (tok/s):         6201.37   
Total Token throughput (tok/s):          13674.76  
---------------Time to First Token----------------
Mean TTFT (ms):                          346449.51 
Median TTFT (ms):                        341887.55 
P99 TTFT (ms):                           581491.11 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.11     
Median TPOT (ms):                        32.22     
P99 TPOT (ms):                           78.94     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.74     
Median ITL (ms):                         23.24     
P99 ITL (ms):                            65.95     
==================================================
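As a quick sanity check on the numbers shown above, the relative throughput difference between the two runs works out to roughly 4% in favor of the run with load tracking, which is part of why the discussion below treats the gap as run-to-run noise rather than a regression. Trivial arithmetic on the reported values:

    # Pure arithmetic on the throughput numbers reported above.
    with_tracking = 34.39      # req/s, load tracking enabled
    without_tracking = 33.09   # req/s, load tracking disabled

    relative_delta = (with_tracking - without_tracking) / without_tracking
    print(f"{relative_delta:+.1%}")  # prints +3.9%; the tracked run is faster here,
                                     # so the difference is attributed to variance.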

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@simon-mo
Collaborator

Middleware does look cleaner to me. Please rename the metric to server_load_metrics to be more readable. I also want to (1) clarify the performance impact and (2) make sure this can work when we run multiple frontend processes (@russellb @njhill).

@russellb
Member

This won't work as-is with multiple frontend processes (API servers), but this also won't be the only metric we have to fix. It's noted as one of the challenges, though we know the expected solution. It's mentioned in the design doc linked from #12705

Comment on lines 903 to 904
Contributor

Since this is just one counter update, I think using regular threading.Lock() might be more efficient than asyncio.Lock, avoiding async context switches.
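For illustration, a minimal sketch of this suggestion (class and method names are hypothetical, not from the PR): a plain threading.Lock held only for the integer update, so no await is needed.

    import threading

    class ServerLoadCounter:
        """Counter guarded by a plain threading.Lock; the lock is held only
        for a single integer update, so there is no async context switch."""

        def __init__(self) -> None:
            self._lock = threading.Lock()
            self._value = 0

        def increment(self) -> None:
            with self._lock:
                self._value += 1

        def decrement(self) -> None:
            with self._lock:
                self._value -= 1

        @property
        def value(self) -> int:
            with self._lock:
                return self._value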

Contributor

We could use a finally block to decrement the counter?
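A small sketch of this suggestion for a non-streaming handler (route and helper names are illustrative, not the PR's code); the finally block guarantees the decrement even if the handler raises.

    from fastapi import FastAPI, Request

    app = FastAPI()
    app.state.server_load_metrics = 0

    async def process(raw_request: Request) -> dict:
        # stand-in for the real completion handler
        return {"ok": True}

    @app.post("/v1/example")
    async def handle_request(raw_request: Request):
        raw_request.app.state.server_load_metrics += 1
        try:
            return await process(raw_request)
        finally:
            # runs on success and on exception, so the counter never leaks
            raw_request.app.state.server_load_metrics -= 1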

Contributor

How will HTTP streaming requests be handled in this case?

Contributor Author

From what I can tell from testing, Starlette will run the background task only after streaming is complete or when the connection is closed.
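That matches Starlette's documented behavior: a task attached to a response's background attribute runs only after the response body has been fully sent. A minimal, simplified sketch (not the PR's code):

    from fastapi import FastAPI, Request
    from starlette.background import BackgroundTask
    from starlette.responses import StreamingResponse

    app = FastAPI()
    app.state.server_load = 0  # simplified counter for illustration

    async def token_stream():
        for chunk in ("Hello", " ", "world"):
            yield chunk

    def decrement_load(request: Request) -> None:
        request.app.state.server_load -= 1

    @app.get("/stream")
    async def stream(raw_request: Request):
        raw_request.app.state.server_load += 1
        # Starlette runs the background task only after the response body has
        # been fully streamed (or the client disconnects), which is what makes
        # this pattern safe for decrementing the in-flight counter.
        return StreamingResponse(
            token_stream(),
            background=BackgroundTask(decrement_load, raw_request),
        )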

@njhill
Member

njhill commented Feb 27, 2025

+1 on making sure there's no performance impact, we've had some nasty middleware-related performance issues before.

@daniel-salib
Contributor Author

Thanks for the reviews :) Made another pass taking all the feedback into consideration

Contributor

@youngkent youngkent left a comment

To address the perf concerns of the middleware, could you paste some benchmark data in the PR summary?

@daniel-salib
Contributor Author

Thanks for the review @youngkent

Resolved all the comments and added a unit test for the /load route.

Also attached the benchmark results showing the latency comparison before and after the middleware.

Contributor

@youngkent youngkent left a comment

Thanks for updating and adding the latency benchmark data!
It seems latency regressed by ~1% with the middleware enabled, which is a bit higher than expected. Can we understand a bit more whether the latency is added to TTFT, TPOT, or just to the end of the HTTP request due to the async background task? (benchmark_serving.py should give some more detailed info.)
If it's really stat-sig and impacting TTFT/TPOT, we might need to consider exploring a more efficient implementation without using middleware, e.g. implementing increment/decrement directly in the API calls in api_server.

Comment on lines 42 to 45
Contributor

Is this endpoint not supported for the non-OpenAI server? If so, should we return an error by default?

Contributor Author

Meant to remove this. From the pattern I see, it's common to have some endpoints such as /ping and /version that are only defined and implemented in the OpenAI server, so I followed that pattern for /load and no longer define it in the non-OpenAI server.

@daniel-salib
Contributor Author

daniel-salib commented Mar 4, 2025

@youngkent I ran benchmark_serving.py twice, with and without the middleware, and added the results to the PR description.

Is it safe to assume the discrepancy between the runs is due to random +/- error and the middleware's impact is negligible?

@daniel-salib daniel-salib changed the title from "add middleware to track concurrent_requests" to "add middleware to track server_load" on Mar 4, 2025
@youngkent
Contributor

@daniel-salib Seems the benchmark_serving.py data has wide variance; it's hard to tell if it's really stat-sig. We might want to increase the sample size, meaning more requests per control and test group. Can we test with 20000 requests per group?

@daniel-salib
Contributor Author

@youngkent updated the PR description with the benchmark comparison across 20,000 requests per group

@youngkent
Contributor

Thanks @daniel-salib, the variance is still a bit high, but at least there is no conclusive signal showing the middleware regresses performance so far.
@simon-mo Do you want to double check the data and see if this implementation looks good to you? Thanks!

@daniel-salib
Contributor Author

daniel-salib commented Mar 5, 2025

@youngkent

I was previously using the random dataset when benchmarking - thought that may have had an effect on the variance.

I updated the description with the results after switching to --dataset-name=sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json, but I still see some variance.

@daniel-salib daniel-salib marked this pull request as ready for review March 5, 2025 22:46
@robertgshaw2-redhat
Collaborator

What type of GPU is the performance test run on?

@daniel-salib
Contributor Author

Thanks for the review @robertgshaw2-redhat

I ran the performance test on 2 x H100 GPUs.

@simon-mo
Collaborator

simon-mo commented Mar 6, 2025

The serving throughput degradation is pretty serious, almost a 3% drop. We cannot merge this as-is due to the performance impact. I added arbitrary middleware support here if you really need it, or alternatively you can figure out a better way.

@daniel-salib daniel-salib changed the title from "add middleware to track server_load" to "track server_load" on Mar 6, 2025
@daniel-salib
Contributor Author

@youngkent I took a different approach that should be much better performance-wise. Updated the description to include the latest benchmarks.

Collaborator

@simon-mo simon-mo left a comment

I think this is a good approach now. Two questions

  1. Please put this behind a feature flag so that by default there is no perf regression.
  2. Please define the semantics of "load": which endpoints does it actually cover? I'm confused about why the tokenize endpoint is covered here, as it doesn't hit the GPU at all.

Comment on lines 270 to 283
Collaborator

the inc and dec can be inlined

Contributor

Yeah, it might be better to NOT count tokenizer as the load.

Contributor

If create_chat_completion throws an exception, it would not decrement the server load?

Contributor Author

Yeah, good catch. I added a dependency for the request that decrements on any exception now.

@daniel-salib
Contributor Author

@youngkent handling exceptions on streaming requests now - PTAL

Contributor

The code is a bit more complex to read now. How about we just make a wrapper function like

async def create_chat_completion(...):
    if not raw_request.app.state.enable_server_load_tracking:
        return await create_chat_completion(...)  # i.e. the existing (wrapped) handler
    else:
        increment_server_load(raw_request)
        try:
            response = await create_chat_completion(...)  # the existing (wrapped) handler
            # decrement once the response (including any stream) has been sent
            response.background = BackgroundTask(decrement_server_load, raw_request)
            return response
        except Exception:
            decrement_server_load(raw_request)
            raise

We can further make the above common logic a method like load_aware_call(http_method, ...)

Contributor Author

Yeah, looks much cleaner - I made another pass with "load_aware_streaming_call" implemented. I only applied it to streaming calls because I thought it might be unnecessary for the non-streaming requests. The non-streaming calls still use the dependency injection.

LMK how it looks and if it's better for everything to use load_aware_call.

thanks!

Contributor

Wondering if we could unify track_server_load_non_streaming and load_aware_streaming_call into a "@load_aware" annotation, placed in front of the HTTP methods?

Contributor Author

Sounds good - I implemented it as a decorator now and removed the dependency injection.
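For readers following along, here is a sketch of what such a decorator can look like. The attribute and helper names follow the discussion above; this is an assumption-laden illustration, and the merged vLLM implementation may differ in details.

    import functools

    from fastapi import Request
    from starlette.background import BackgroundTask
    from starlette.responses import Response


    def _decrement_server_load(raw_request: Request) -> None:
        raw_request.app.state.server_load_metrics -= 1


    def load_aware_call(handler):
        """Wrap an HTTP handler so it maintains the in-flight request counter."""

        @functools.wraps(handler)
        async def wrapper(*args, raw_request: Request, **kwargs):
            if not getattr(raw_request.app.state,
                           "enable_server_load_tracking", False):
                return await handler(*args, raw_request=raw_request, **kwargs)

            raw_request.app.state.server_load_metrics += 1
            try:
                response = await handler(*args, raw_request=raw_request, **kwargs)
            except Exception:
                _decrement_server_load(raw_request)
                raise

            if isinstance(response, Response):
                # Defer the decrement until the body (including a stream) is sent.
                response.background = BackgroundTask(_decrement_server_load,
                                                     raw_request)
            else:
                _decrement_server_load(raw_request)
            return response

        return wrapper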

@daniel-salib
Contributor Author

Implemented @youngkent's latest suggestions.

Also re-ran the benchmark and updated the PR comment to confirm there's no regression.

Contributor Author

Good catch - I added handling to convert from a single BackgroundTask to BackgroundTasks, which will allow us to chain multiple tasks together when needed. Did manual testing locally to confirm it's working.
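A sketch of that chaining (the helper name is hypothetical): if the response already has a background task, both it and the load decrement are wrapped in a BackgroundTasks container so neither is dropped.

    from starlette.background import BackgroundTask, BackgroundTasks
    from starlette.responses import Response


    def attach_background_task(response: Response, task: BackgroundTask) -> Response:
        """Attach `task` without clobbering any background task already set."""
        if response.background is None:
            response.background = task
        elif isinstance(response.background, BackgroundTasks):
            response.background.tasks.append(task)
        else:
            # Promote the single existing BackgroundTask to a BackgroundTasks
            # container that runs the old task first, then the new one.
            response.background = BackgroundTasks(
                tasks=[response.background, task])
        return response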

Contributor

@youngkent youngkent left a comment

One more comment. Otherwise, looks good. Stamping.

@simon-mo
Collaborator

Pre-commit (i.e. lint) is failing.

@simon-mo simon-mo added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Mar 11, 2025
@daniel-salib
Contributor Author

@simon-mo rebased and the pre-commit is passing now

@daniel-salib
Contributor Author

@simon-mo I see some failures in the entrypoints test but they seem unrelated to the PR:

I think we're good to merge but LMK if this is an issue. Thanks!

[2025-03-11T23:16:51Z] =========================== short test summary info ============================
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[-True] - AssertionError: vllm:iteration_tokens_total_sum expected value of 200 did not match found value 219.0
[2025-03-11T23:16:51Z] assert 219.0 == 200
[2025-03-11T23:16:51Z]  +  where 219.0 = Sample(name='vllm:iteration_tokens_total_sum', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=219.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[-False] - AssertionError: vllm:time_to_first_token_seconds_count expected value of 10 did not match found value 11.0
[2025-03-11T23:16:51Z] assert 11.0 == 10
[2025-03-11T23:16:51Z]  +  where 11.0 = Sample(name='vllm:time_to_first_token_seconds_count', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=11.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[--enable-chunked-prefill-False] - AssertionError: vllm:time_to_first_token_seconds_count expected value of 10 did not match found value 11.0
[2025-03-11T23:16:51Z] assert 11.0 == 10
[2025-03-11T23:16:51Z]  +  where 11.0 = Sample(name='vllm:time_to_first_token_seconds_count', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=11.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[--enable-chunked-prefill-True] - AssertionError: vllm:iteration_tokens_total_sum expected value of 200 did not match found value 219.0
[2025-03-11T23:16:51Z] assert 219.0 == 200
[2025-03-11T23:16:51Z]  +  where 219.0 = Sample(name='vllm:iteration_tokens_total_sum', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=219.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[--disable-frontend-multiprocessing-False] - AssertionError: vllm:time_to_first_token_seconds_count expected value of 10 did not match found value 11.0
[2025-03-11T23:16:51Z] assert 11.0 == 10
[2025-03-11T23:16:51Z]  +  where 11.0 = Sample(name='vllm:time_to_first_token_seconds_count', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=11.0, timestamp=None, exemplar=None).value
[2025-03-11T23:16:51Z] FAILED entrypoints/openai/test_metrics.py::test_metrics_counts[--disable-frontend-multiprocessing-True] - AssertionError: vllm:iteration_tokens_total_sum expected value of 200 did not match found value 219.0
[2025-03-11T23:16:51Z] assert 219.0 == 200
[2025-03-11T23:16:51Z]  +  where 219.0 = Sample(name='vllm:iteration_tokens_total_sum', labels={'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}, value=219.0, timestamp=None, exemplar=None).value

@DarkLight1337
Member

DarkLight1337 commented Mar 14, 2025

I think the entrypoints failures are related to this PR because they're not failing on main, please fix them.

Signed-off-by: Daniel Salib <[email protected]>
@daniel-salib
Contributor Author

Ah, I found the issue: adding the extra unit test to test_metrics messed up the expected metric numbers in the metric-count tests because the tests run on the same server instance. I decided it would be best to keep the metrics test just for Prometheus metrics and felt it may be more appropriate to test the server_load arg in test_basic.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) March 14, 2025 11:59
@DarkLight1337 DarkLight1337 changed the title from "track server_load" to "[Frontend] track server_load" on Mar 14, 2025
@vllm-bot vllm-bot merged commit 73deea2 into vllm-project:main Mar 14, 2025
31 of 33 checks passed
richardsliu pushed a commit to richardsliu/vllm that referenced this pull request Mar 14, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

Labels

frontend, ready (ONLY add when PR is ready to merge/full CI is needed)
