[Frontend] track server_load #13950
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This won't work as-is with multiple frontend processes (API servers), but this also won't be the only metric we have to fix. It's noted as one of the challenges, though we know the expected solution. It's mentioned in the design doc linked from #12705.
Since this is just one counter update, I think using regular threading.Lock() might be more efficient than asyncio.Lock, avoiding async context switches.
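As a rough illustration of that suggestion (a hypothetical module-level counter, not the PR's actual code), a plain threading.Lock is cheap to hold for a single integer update and avoids an async context switch per request:

```python
import threading

# Hypothetical counter guarded by a regular lock; acquiring it for a single
# integer update is cheap compared to awaiting an asyncio.Lock.
_load_lock = threading.Lock()
_server_load = 0

def increment_server_load() -> None:
    global _server_load
    with _load_lock:
        _server_load += 1

def decrement_server_load() -> None:
    global _server_load
    with _load_lock:
        _server_load -= 1

def get_server_load() -> int:
    with _load_lock:
        return _server_load
```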
vllm/entrypoints/utils.py
We could use a finally block to decrement the counter?
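For instance, reusing the hypothetical counter helpers from the sketch above, a try/finally in the middleware guarantees the decrement even when the downstream handler raises:

```python
async def track_server_load(request, call_next):
    # Hypothetical HTTP middleware body: the finally block pairs every
    # increment with a decrement, whether the handler succeeds or raises.
    increment_server_load()
    try:
        return await call_next(request)
    finally:
        decrement_server_load()
```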
vllm/entrypoints/utils.py
How will HTTP streaming requests be handled in this case?
From what I can tell from testing, Starlette will run the background task only after streaming is complete or when the connection is closed.
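A small self-contained Starlette example of that behavior (illustrative only, not the PR's code): the background task attached to a StreamingResponse runs only once the body has been fully sent or the client has disconnected.

```python
from starlette.applications import Starlette
from starlette.background import BackgroundTask
from starlette.responses import StreamingResponse
from starlette.routing import Route

def on_stream_done():
    # In the PR's context, this is where the load counter would be decremented.
    print("stream finished")

async def stream(request):
    async def body():
        for chunk in ("a", "b", "c"):
            yield chunk
    # Starlette invokes the background task after the response body completes.
    return StreamingResponse(body(), background=BackgroundTask(on_stream_done))

app = Starlette(routes=[Route("/stream", stream)])
```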
+1 on making sure there's no performance impact; we've had some nasty middleware-related performance issues before.
Thanks for the reviews :) Made another pass taking all the feedback into consideration.
youngkent
left a comment
To address the perf concerns of the middleware, could you paste some benchmark data in the PR summary?
Thanks for the review @youngkent - resolved all the comments and added a unit test for the /load route. Also attached the benchmark results showing the latency comparison before and after the middleware.
Thanks for updating and adding the latency benchmark data!
It seems latency regressed by ~1% with the middleware enabled, which is a bit higher than expected. Can we understand a bit more whether the latency is added to TTFT, TPOT, or just to the end of the HTTP request due to the async background task? (benchmark_serving.py should give some more detailed info.)
If it's really stat-sig and impacting TTFT/TPOT, we might need to consider a more efficient implementation without middleware, e.g. implementing the increment/decrement directly in the API calls in api_server.
vllm/entrypoints/api_server.py
Is this endpoint not supported for the non-OpenAI server? If so, should we return an error by default?
Meant to remove this - from the pattern I see, it's common to have some endpoints such as /ping and /version that are only defined and implemented in the OpenAI server, so I followed that pattern for /load and no longer define it in the non-OpenAI server.
@youngkent ran benchmark_serving.py twice, with and without the middleware, and added the results to the PR description. Is it safe to assume the discrepancy between the runs is due to random +/- error and the middleware's impact is negligible?
@daniel-salib Seems the benchmark_serving.py data has wide variance; it's hard to tell if it's really stat-sig. We might want to increase the sample size, meaning more requests per control and test group. Can we test with 20000 requests per group?
@youngkent updated the PR description with the benchmark comparison across 20,000 requests per group.
Thanks @daniel-salib, the variance is still a bit high, but at least there is no conclusive signal so far showing the middleware regresses performance.
I was previously using the random dataset when benchmarking and thought that may have had an effect on the variance. I updated the description with the results after switching to --dataset-name=sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json, but I still see some variance.
What type of GPU is the performance test run on?
Thanks for the review @robertgshaw2-redhat - I ran the performance test on 2 x H100 GPUs.
The serving throughput degradation is pretty serious, almost a 3% drop. We cannot merge this as-is due to the performance impact. I added arbitrary middleware support here: vllm/entrypoints/openai/cli_args.py (line 185 at a7ea35a).
@youngkent took a different approach that should be much better performance-wise. Updated the description to include the latest benchmarks.
simon-mo
left a comment
I think this is a good approach now. Two questions:
- Please put this behind a feature flag so that by default there is no perf regression.
- Please define the semantics of "load": which endpoints does it actually cover? I'm confused about why the tokenize endpoint is covered here, as it doesn't hit the GPU at all.
The inc and dec can be inlined.
Yeah, it might be better to NOT count the tokenizer toward the load.
If create_chat_completion throws an exception, would it not decrement the server load?
Yeah, good catch. I added a dependency for the request to decrement on any exception now.
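A hedged sketch of what such a dependency could look like (helper and attribute names are assumptions, not the PR's exact code): a FastAPI dependency that uses yield receives exceptions raised by the endpoint, so the counter can be decremented before the error propagates.

```python
from fastapi import Request

def decrement_server_load(request: Request) -> None:
    # Assumed helper: in this sketch the counter lives on the app state.
    request.app.state.server_load_metrics -= 1

async def decrement_on_error(raw_request: Request):
    # An exception raised inside the endpoint surfaces at this yield,
    # so the load counter is decremented before the error is re-raised.
    try:
        yield
    except Exception:
        decrement_server_load(raw_request)
        raise
```

Wired onto a route via dependencies=[Depends(decrement_on_error)], this only handles the failure path; the increment would still happen at the start of the handler.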
@youngkent handling exceptions on streaming requests now - PTAL.
The code is a bit more complex to read now. How about we just make a wrapper function like:

```python
async def create_chat_completion(...):
    if not raw_request.app.state.enable_server_load_tracking:
        # no tracking: just run the original handler logic
        return await create_chat_completion(...)

    increment_server_load(raw_request)
    try:
        response = await create_chat_completion(...)
        response.background = BackgroundTask(...)
        return response
    except Exception:
        decrement_server_load(raw_request)
        raise
```

We can further turn the above common logic into a method like load_aware_call(http_method, ...).
Yea, looks much cleaner - I made another pass with load_aware_streaming_call implemented. I only applied it to streaming calls because I thought it might be unnecessary for the non-streaming requests; the non-streaming calls still use the dependency injection.
LMK how it looks and if it's better for everything to use load_aware_call. Thanks!
vllm/entrypoints/utils.py
Wondering if we could unify track_server_load_non_streaming and load_aware_streaming_call into a @load_aware decorator placed in front of the HTTP methods?
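A minimal sketch of what such a decorator might look like, assuming the enable_server_load_tracking flag and counter helpers referenced earlier in the thread (everything else is illustrative, not the PR's actual implementation):

```python
import functools
from starlette.background import BackgroundTask

def increment_server_load(request):
    request.app.state.server_load_metrics += 1  # assumed counter location

def decrement_server_load(request):
    request.app.state.server_load_metrics -= 1

def load_aware(handler):
    """Hypothetical @load_aware decorator: count the request toward server
    load for its whole lifetime, including streamed responses."""
    @functools.wraps(handler)
    async def wrapper(*args, raw_request, **kwargs):
        if not raw_request.app.state.enable_server_load_tracking:
            return await handler(*args, raw_request=raw_request, **kwargs)

        increment_server_load(raw_request)
        try:
            response = await handler(*args, raw_request=raw_request, **kwargs)
        except Exception:
            # The handler failed outright, so undo the increment here.
            decrement_server_load(raw_request)
            raise

        # Defer the decrement until the (possibly streaming) response body
        # has been sent; assumes the handler set no background task of its own.
        response.background = BackgroundTask(decrement_server_load, raw_request)
        return response

    return wrapper
```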
Sounds good - I implemented it as a decorator now and removed the dependency injection.
Implemented @youngkent's latest suggestions; also re-ran the benchmark and updated the PR comment to confirm there's no regression.
vllm/entrypoints/utils.py
The BackgroundTask type does not have an add_task method? https://github.com/encode/starlette/blob/49158732f8a96744889f5a7644a42ae08131bd15/starlette/background.py#L17
Good catch - I added handling to convert from a single BackgroundTask to BackgroundTasks, which allows us to chain multiple tasks together when needed. Did manual testing locally to confirm it's working.
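Roughly, the conversion could look like this hypothetical helper (not the PR's exact code): promote whatever single BackgroundTask is already attached into a BackgroundTasks container, then append the new task.

```python
from starlette.background import BackgroundTask, BackgroundTasks
from starlette.responses import Response

def chain_background(response: Response, func, *args, **kwargs) -> None:
    # A single BackgroundTask cannot hold more than one callable, so wrap
    # any existing task in a BackgroundTasks container before appending.
    tasks = BackgroundTasks()
    existing = response.background
    if isinstance(existing, BackgroundTasks):
        tasks.tasks.extend(existing.tasks)
    elif isinstance(existing, BackgroundTask):
        tasks.add_task(existing.func, *existing.args, **existing.kwargs)
    tasks.add_task(func, *args, **kwargs)
    response.background = tasks
```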
youngkent
left a comment
One more comment. Otherwise, looks good. Stamping.
Pre-commit (i.e. lint) is failing.
@simon-mo rebased and pre-commit is passing now.
@simon-mo I see some failures in the entrypoints tests, but they seem unrelated to the PR. I think we're good to merge, but LMK if this is an issue. Thanks!
I think the entrypoints failures are related to this PR because they're not failing on main; please fix them.
Ah, I found the issue: adding the extra unit test to test_metrics messed up the expected metric counts because the tests run on the same server instance. I decided it would be best to keep the metrics test just for Prometheus metrics, and felt it may be more appropriate to test the server_load arg in test_basic.
Accurate real-time request concurrency tracking. Added the /load API to retrieve the real-time concurrency count.
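For illustration, querying the endpoint might look like this; the exact response schema shown is an assumption, not taken from the PR:

```python
import requests

# Assumes a locally running vLLM OpenAI-compatible server with load
# tracking enabled; the response field name below is hypothetical.
resp = requests.get("http://localhost:8000/load")
print(resp.json())  # e.g. {"server_load": 3}
```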
benchmark_serving.py with the load tracking:
benchmark_serving.py without the load tracking: