Conversation

@waltforme
Contributor

@waltforme waltforme commented Mar 5, 2025

This PR exposes a read-only API to check whether the engine is sleeping. More details are documented in #14311.

FIX #14311

@github-actions

github-actions bot commented Mar 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@youkaichao youkaichao marked this pull request as draft March 6, 2025 05:01
@waltforme waltforme marked this pull request as ready for review March 6, 2025 13:00
Member

@youkaichao youkaichao left a comment


The motivation sounds good to me. @njhill can you help take a look?

@njhill
Member

njhill commented Mar 13, 2025

The changes themselves look fine to me; I'm just unsure of how commonly this might be needed (same as @youkaichao's thought), especially if we ensure that the sleep/wakeup operations are idempotent (not sure if that's currently the case, but it should be trivial otherwise).

  1. Today a sleeping engine crashes if a request is sent to it. This probe will give a good citizen peace of mind before sending a request.

Could we make a change to just fail the requests in this case rather than crashing the engine? That could then also serve as the probe mechanism if needed.

@waltforme
Contributor Author

Thanks for the review, @njhill! I absolutely agree that the suggested 'fail request when sleeping' feature is worth doing.

I still think the probe currently implemented in this PR is necessary, even if the 'fail request when sleeping' feature is done.

We may think from a user's perspective. The user could be a person who can't remember the sleeping status of a fleet of vLLM instances, or a k8s controller that just crashed/restarted and is trying to rebuild the global state. It is more natural to directly query an API endpoint than to send an inference request to each of the vLLM instances and observe whether each request fails or succeeds.

Moreover, if the inference-request-as-a-probe is sent to an awake engine, that request will be served and consume extra resources. So IMHO, using an API endpoint is not only more natural but also more efficient.
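To make the controller scenario concrete, here is a minimal sketch of how a fleet manager could rebuild its view of which instances are awake by hitting a read-only probe endpoint instead of sending throwaway inference requests. The endpoint path `/is_sleeping` and the `{"is_sleeping": <bool>}` response shape are assumptions for illustration, not a guaranteed contract:

```python
import json
from urllib.request import urlopen

def parse_sleep_status(body: str) -> bool:
    """Parse an assumed {"is_sleeping": <bool>} probe response body."""
    return bool(json.loads(body).get("is_sleeping", False))

def awake_instances(base_urls, fetch=lambda url: urlopen(f"{url}/is_sleeping").read()):
    """Return the subset of instances that are awake and safe to route to.

    `fetch` is injectable so a controller can swap in its own HTTP client
    (or a stub in tests) without touching the filtering logic.
    """
    return [url for url in base_urls if not parse_sleep_status(fetch(url))]
```

A controller recovering from a restart would call `awake_instances()` once over its known fleet URLs and route traffic only to the instances it returns, with no inference requests spent on probing.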

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 14, 2025
@njhill
Member

njhill commented Mar 15, 2025

@waltforme actually could you add a test for this? Probably just adding something to https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_sleep.py should suffice.

@waltforme
Contributor Author

> @waltforme actually could you add a test for this? Probably just adding something to https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_sleep.py should suffice.

@njhill Absolutely. Added it to the suggested file. Thanks for checking this!

@aarnphm
Collaborator

aarnphm commented Mar 15, 2025

Not sure if this is a standard elsewhere, but we could follow the k8s health API endpoints for this, fwiw. (I also responded in the ticket.)

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) March 15, 2025 11:34
@vllm-bot vllm-bot merged commit 74bc397 into vllm-project:main Mar 15, 2025
35 of 37 checks passed
@waltforme
Contributor Author

> not sure if this is a standard elsewhere, but we can follow k8s health API endpoint for this fwiw. (i also responded in the ticket)

@aarnphm Thanks for the pointer!
It looks to me, however, that the k8s health API endpoints expose things that are very specific to k8s. For example, I tried one of them:

$ kubectl get --raw='/readyz/poststarthook/generic-apiserver-start-informers'
ok

Could you elaborate on what we'd want to follow for vLLM?

@waltforme waltforme deleted the sleep-probe branch March 16, 2025 06:09
@aarnphm
Collaborator

aarnphm commented Mar 16, 2025

https://kubernetes.io/docs/reference/using-api/health-checks/#individual-health-checks

This is probably also related to production stack, but what I have in mind:

  • /readyz can be used to determine whether the engine is sleeping or not.
  • /livez can be used to determine whether all workers are ready.
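The k8s-style split suggested above could be sketched as two tiny handlers: `/livez` reports whether the process and its workers are up at all, while `/readyz` additionally fails when the engine is sleeping. The handler names, status codes, and state flags here are illustrative assumptions, not vLLM's actual implementation:

```python
# Hypothetical sketch: liveness vs. readiness for a sleepable engine.
# A k8s liveness probe would hit livez(); a readiness probe would hit
# readyz(), so a sleeping instance is taken out of rotation, not restarted.

def livez(all_workers_ready: bool) -> tuple[int, str]:
    """Liveness: the process and its workers are alive, sleeping or not."""
    return (200, "ok") if all_workers_ready else (503, "workers not ready")

def readyz(all_workers_ready: bool, is_sleeping: bool) -> tuple[int, str]:
    """Readiness: alive AND awake, i.e. safe to route inference traffic to."""
    status, msg = livez(all_workers_ready)
    if status != 200:
        return status, msg
    return (503, "engine is sleeping") if is_sleeping else (200, "ok")
```

The key design point is that sleeping flips readiness but not liveness: k8s would stop routing requests to a sleeping instance rather than kill and restart it.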

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

Labels

frontend ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Expose a read-only API to check whether engine is sleeping

5 participants