[Core] Expose API endpoint /is_sleeping (#14312)
Conversation
youkaichao left a comment:
The motivation sounds good to me. @njhill, can you help take a look?
njhill left a comment:
The changes themselves look fine to me; I'm just unsure how commonly this might be needed (same as @youkaichao's thought), especially if we ensure that the sleep/wakeup operations are idempotent (not sure if that's currently the case, but it should be trivial otherwise).
Could we make a change to just fail the requests in this case rather than crashing the engine? That could then also serve as the probe mechanism if needed.
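
As an aside, here is a hedged sketch of the "fail requests while sleeping" idea mentioned above (not something this PR implements); the middleware approach and the `engine_client`/`is_sleeping()` names are assumptions for illustration:

```python
# Illustrative only: reject inference requests with 503 while the engine is
# asleep, instead of letting them reach (and potentially crash) the engine.
# `app.state.engine_client` and `is_sleeping()` are assumed names.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


@app.middleware("http")
async def reject_when_sleeping(request: Request, call_next):
    engine = getattr(request.app.state, "engine_client", None)
    if (engine is not None and request.url.path.startswith("/v1/")
            and await engine.is_sleeping()):
        return JSONResponse(status_code=503,
                            content={"error": "Engine is sleeping"})
    return await call_next(request)
```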
waltforme left a comment:
Thanks for @njhill's review! I absolutely agree that the suggested 'fail requests when sleeping' feature is worth doing. I still think the probe implemented in this PR is necessary even once that feature is done. Think of it from a user's perspective: the user could be a person who can't remember the sleeping status of a fleet of vLLM instances, or a k8s controller that just crashed/restarted and is trying to rebuild its global state. It is more natural to query an API endpoint directly than to send an inference request to each vLLM instance and observe whether it fails or succeeds. Moreover, if an inference-request-as-a-probe is sent to an awake engine, that request will be served and consume extra resources. So IMHO, an API endpoint is not only more natural but also more efficient.
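
To make the controller scenario above concrete, here is a minimal hedged sketch of rebuilding a fleet's sleep state by querying the endpoint; the instance addresses and the `{"is_sleeping": <bool>}` response shape are assumptions for this example:

```python
# Hypothetical sketch: a controller rebuilds the sleep state of a fleet of
# vLLM instances by querying the read-only endpoint directly, instead of
# sending an inference request to each one. Endpoint path and response
# shape are assumptions for illustration.
import requests

INSTANCES = ["http://vllm-0:8000", "http://vllm-1:8000"]  # example addresses


def rebuild_sleep_state(base_urls: list[str]) -> dict[str, bool]:
    state: dict[str, bool] = {}
    for url in base_urls:
        resp = requests.get(f"{url}/is_sleeping", timeout=5)
        resp.raise_for_status()
        state[url] = bool(resp.json()["is_sleeping"])
    return state


if __name__ == "__main__":
    print(rebuild_sleep_state(INSTANCES))
```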
njhill left a comment:
@waltforme actually, could you add a test for this? Probably just adding something to https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_sleep.py should suffice.
waltforme left a comment:
@njhill Absolutely. Added it to the suggested file. Thanks for checking this!
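
For reference, a hedged sketch of what such a test could look like, assuming the `RemoteOpenAIServer` helper and dev-mode environment used by the existing `tests/entrypoints/openai/test_sleep.py`, and the same assumed `GET /is_sleeping` response shape as above (this is not necessarily the exact test added in the PR):

```python
# Sketch only: helper names, env vars and the model mirror what the existing
# sleep test appears to use; the /is_sleeping response shape is an assumption.
import requests

from ...utils import RemoteOpenAIServer  # import path assumed from sibling tests

MODEL_NAME = "meta-llama/Llama-3.2-1B"  # example model


def test_is_sleeping_probe():
    args = ["--enable-sleep-mode"]
    with RemoteOpenAIServer(MODEL_NAME, args,
                            env_dict={"VLLM_SERVER_DEV_MODE": "1"}) as server:
        # A freshly started engine should report that it is awake.
        resp = requests.get(server.url_for("is_sleeping"))
        assert resp.status_code == 200
        assert resp.json().get("is_sleeping") is False

        # After putting the engine to sleep, the probe should reflect it.
        requests.post(server.url_for("sleep"), params={"level": "1"})
        resp = requests.get(server.url_for("is_sleeping"))
        assert resp.json().get("is_sleeping") is True
```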
aarnphm left a comment:
Not sure if this is a standard elsewhere, but we could follow the k8s health API endpoints for this, FWIW. (I also responded in the ticket.)
waltforme left a comment:
@aarnphm Thanks for the pointer!

```console
$ kubectl get --raw='/readyz/poststarthook/generic-apiserver-start-informers'
ok
```

Would you elaborate on what we want to follow for vLLM?
aarnphm left a comment:
See https://kubernetes.io/docs/reference/using-api/health-checks/#individual-health-checks. This is probably also related to production stack, but here is what I have in mind:
This PR exposes a read-only API endpoint to check whether the engine is sleeping. More details are documented in #14311.
FIX #14311
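
For context, here is a minimal sketch of what such a read-only probe could look like on a FastAPI-based API server; the router and engine-client names are illustrative, not necessarily the identifiers used in vLLM's `api_server.py` or in this PR:

```python
# Illustrative sketch of a read-only probe endpoint; `engine_client` and
# `is_sleeping()` are assumed names, not confirmed vLLM internals.
from fastapi import APIRouter, Request
from fastapi.responses import JSONResponse

router = APIRouter()


@router.get("/is_sleeping")
async def is_sleeping(raw_request: Request) -> JSONResponse:
    # Report the engine's sleep state without any side effects.
    engine = raw_request.app.state.engine_client
    sleeping = await engine.is_sleeping()
    return JSONResponse(content={"is_sleeping": sleeping})
```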