From 06e26ec7f62d6044f96bbae1dee76c2c9bbbf57c Mon Sep 17 00:00:00 2001 From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com> Date: Thu, 9 Oct 2025 19:27:16 +0000 Subject: [PATCH] Documentation updates from Promptless --- serverless/vllm/get-started.mdx | 92 +++---- serverless/vllm/openai-compatibility.mdx | 183 +++++++++----- serverless/vllm/overview.mdx | 135 ++++++---- serverless/vllm/vllm-requests.mdx | 303 ++++++++++++++++------- 4 files changed, 457 insertions(+), 256 deletions(-) diff --git a/serverless/vllm/get-started.mdx b/serverless/vllm/get-started.mdx index 587cdef3..12209316 100644 --- a/serverless/vllm/get-started.mdx +++ b/serverless/vllm/get-started.mdx @@ -1,16 +1,17 @@ --- title: "Deploy a vLLM worker" +description: "Deploy a large language model using Runpod's vLLM workers and start serving requests in minutes." --- -Learn how to deploy a large language model (LLM) using Runpod's preconfigured vLLM workers. By the end of this guide, you'll have a fully functional API endpoint that you can use to handle LLM inference requests. +Learn how to deploy a large language model using Runpod's vLLM workers. By the end of this guide, you'll have a fully functional Serverless endpoint that can handle LLM inference requests. ## What you'll learn In this tutorial, you'll learn how to: -* Configure and deploy a vLLM worker using Runpod's Serverless platform. +* Configure and deploy a vLLM worker using Runpod Serverless. * Select the appropriate hardware and scaling settings for your model. -* Set up environmental variables to customize your deployment. +* Set up environment variables to customize your deployment. * Test your endpoint using the Runpod API. * Troubleshoot common issues that might arise during deployment. @@ -21,47 +22,33 @@ In this tutorial, you'll learn how to: ## Step 1: Choose your model -First, decide which LLM you want to deploy. The vLLM worker supports most Hugging Face models, including: +First, decide which LLM you want to deploy. The vLLM worker supports most models on Hugging Face, including: -* Llama 3 (e.g., `meta-llama/Llama-3.2-3B-Instruct`) -* Mistral (e.g., `mistralai/Ministral-8B-Instruct-2410`) -* Qwen3 (e.g., `Qwen/Qwen3-8B`) -* OpenChat (e.g., `openchat/openchat-3.5-0106`) -* Gemma (e.g., `google/gemma-3-1b-it`) -* Deepseek-R1 (e.g., `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`) -* Phi-4 (e.g., `microsoft/Phi-4-mini-instruct`) +* Llama 3 (e.g., `meta-llama/Llama-3.2-3B-Instruct`). +* Mistral (e.g., `mistralai/Ministral-8B-Instruct-2410`). +* Qwen3 (e.g., `Qwen/Qwen3-8B`). +* OpenChat (e.g., `openchat/openchat-3.5-0106`). +* Gemma (e.g., `google/gemma-3-1b-it`). +* DeepSeek-R1 (e.g., `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`). +* Phi-4 (e.g., `microsoft/Phi-4-mini-instruct`). -For this walkthrough, we'll use `openchat/openchat-3.5-0106`, but you can substitute this with [any compatible model](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#compatible-model-architectures). +For this walkthrough, we'll use `openchat/openchat-3.5-0106`, but you can substitute this with [any compatible model](https://docs.vllm.ai/en/latest/models/supported_models.html). ## Step 2: Deploy using the Runpod console -The easiest way to deploy a vLLM worker is through the Runpod console: +The easiest way to deploy a vLLM worker is through Runpod's Ready-to-Deploy Repos: -1. Navigate to the [Serverless page](https://www.console.runpod.io/serverless). +1. 
Find the [vLLM repo](https://console.runpod.io/hub/runpod-workers/worker-vllm) in the Runpod Hub.
-2. Under **Quick Deploy**, find **Serverless vLLM** and click **Configure**.
+2. Click **Deploy**, using the latest vLLM worker version.
-3. In the deployment modal:
+3. In the **Model (optional)** field, enter the model name: `openchat/openchat-3.5-0106`.
+4. Click **Advanced** to expand the vLLM settings.
+5. Set **Max Model Length** to `8192` (or an appropriate context length for your model).
+6. Leave other settings at their defaults unless you have specific requirements, then click **Next**.
+7. Click **Create Endpoint**.
-
-   * Select a vLLM version (latest stable recommended).
-   * Under **Hugging Face Models**, enter your model: `openchat/openchat-3.5-0106`.
-   * If using a gated model, enter your **Hugging Face Token**.
-   * Click **Next**.
-
-4. In the vLLM settings modal, under **LLM Settings**:
-
-   * Set **Max Model Length** to `8192` (or an appropriate context length for your model).
-   * Leave other settings at their defaults unless you have specific requirements.
-   * Click **Next**.
-
-5. Make changes to the endpoint settings if you have specific requirements, then click **Deploy**.
-
-
-
-
-
-Your endpoint will now begin initializing. This may take several minutes while Runpod provisions resources and downloads your model.
+Your endpoint will now begin initializing. This may take several minutes while Runpod provisions resources and downloads the selected model.

@@ -71,12 +58,9 @@ For more details on how to optimize your endpoint, see [Endpoint configurations]

## Step 3: Understand your endpoint

-While your endpoint is initializing, let's understand what's happening and what you'll be able to do with it:
-
-* Runpod is creating a Serverless endpoint with your specified configuration.
-* The vLLM worker image is being deployed with your chosen model.
+While your endpoint is initializing, let's understand what's happening and what you'll be able to do with it.

-Once deployment is complete, make a note of your **Endpoint ID**. You'll need this to make API requests.
+Runpod is creating a Serverless endpoint with your specified configuration, and the vLLM worker image is being deployed using your chosen model. Once deployment is complete, make a note of your **Endpoint ID**, as you'll need this to make API requests.

@@ -127,27 +111,27 @@ When the workers finish processing your request, you should see output on the ri
 }
 ```

-## Step 5: Customize your model (optional)
+## Step 5: Customize your deployment with environment variables (optional)

If you need to customize your model deployment, you can edit your endpoint settings to add environment variables. Here are some useful environment variables you might want to set:

-* `MAX_MODEL_LEN`: Maximum context length (e.g., `16384`)
-* `DTYPE`: Data type for model weights (`float16`, `bfloat16`, or `float32`)
-* `GPU_MEMORY_UTILIZATION`: Controls VRAM usage (e.g., `0.95` for 95%)
-* `CUSTOM_CHAT_TEMPLATE`: For models that need a custom chat template
-* `OPENAI_SERVED_MODEL_NAME_OVERRIDE`: Change the model name to use in OpenAI requests
+* `MAX_MODEL_LEN`: Maximum context length (e.g., `16384`).
+* `DTYPE`: Data type for model weights (`float16`, `bfloat16`, or `float32`).
+* `GPU_MEMORY_UTILIZATION`: Controls VRAM usage (e.g., `0.95` for 95%).
+* `CUSTOM_CHAT_TEMPLATE`: For models that need a custom chat template.
+* `OPENAI_SERVED_MODEL_NAME_OVERRIDE`: Change the model name to use in OpenAI requests.

To add or modify environment variables:

1. 
Go to your endpoint details page. 2. Select **Manage**, then select **Edit Endpoint**. 3. Expand the **Public Environment Variables** section. -4. Add/edit your desired variables. +4. Add or edit your desired variables. 5. Click **Save Endpoint**. -You can find a full list of available environment variables in the [vLLM worker GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#environment-variablessettings). +You can find a full list of available environment variables in the [environment variables documentation](/serverless/vllm/environment-variables). -You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per reponse. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests#request-input-parameters). +You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per response. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests). ## Troubleshooting @@ -160,11 +144,11 @@ If you encounter issues with your deployment: ## Next steps -Congratulations! You've successfully deployed a vLLM worker on Runpod's Serverless platform. You now have a powerful, scalable LLM inference API that's compatible with both the OpenAI client and Runpod's native API. +Congratulations! You've successfully deployed a vLLM worker on Runpod Serverless. You now have a powerful, scalable LLM inference API that's compatible with both the OpenAI client and Runpod's native API. Next you can try: -* [Sending requests using the Runpod API.](/serverless/vllm/vllm-requests) -* [Learning about vLLM's OpenAI API compatibility.](/serverless/vllm/openai-compatibility) -* [Customizing your vLLM worker's handler function.](/serverless/workers/handler-functions) -* [Building a custom worker for more specialized workloads.](/serverless/workers/custom-worker) +* [Sending requests using the Runpod API](/serverless/vllm/vllm-requests). +* [Learning about vLLM's OpenAI API compatibility](/serverless/vllm/openai-compatibility). +* [Customizing your vLLM worker's handler function](/serverless/workers/handler-functions). +* [Building a custom worker for more specialized workloads](/serverless/workers/custom-worker). diff --git a/serverless/vllm/openai-compatibility.mdx b/serverless/vllm/openai-compatibility.mdx index 8d3c3d50..0ed8fa8c 100644 --- a/serverless/vllm/openai-compatibility.mdx +++ b/serverless/vllm/openai-compatibility.mdx @@ -1,19 +1,20 @@ --- title: "OpenAI API compatibility guide" -sidebarTitle: "OpenAI API compability" +sidebarTitle: "OpenAI API compatibility" +description: "Integrate vLLM workers with OpenAI client libraries and API-compatible tools." --- -Runpod's [vLLM workers](/serverless/vllm/overview) implement OpenAI API compatibility, allowing you to use familiar [OpenAI client libraries](https://platform.openai.com/docs/libraries) with your deployed models. This guide will help you understand how to leverage this compatibility to integrate your models seamlessly with existing OpenAI-based applications. +Runpod's vLLM workers implement OpenAI API compatibility, allowing you to use familiar [OpenAI client libraries](https://platform.openai.com/docs/libraries) with your deployed models. This guide explains how to leverage this compatibility to integrate your models seamlessly with existing OpenAI-based applications. 
## Endpoint structure -When using the OpenAI-compatible API with Runpod, your requests will be directed to this base URL pattern: +When using the OpenAI-compatible API with Runpod, your requests are directed to this base URL pattern: -```bash -https://api.runpod.ai/v2/[ENDPOINT_ID]/openai/v1 +``` +https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1 ``` -Replace `[ENDPOINT_ID]` with your Serverless endpoint ID. +Replace `ENDPOINT_ID` with your Serverless endpoint ID. ## Supported APIs @@ -29,30 +30,30 @@ The vLLM worker implements these core OpenAI API endpoints: The `MODEL_NAME` environment variable is essential for all OpenAI-compatible API requests. This variable corresponds to either: -1. The [Hugging Face model](https://huggingface.co/models) you've deployed (e.g., `mistralai/Mistral-7B-Instruct-v0.2`) -2. A custom name if you've set `OPENAI_SERVED_MODEL_NAME_OVERRIDE` as an environment variable +1. The [Hugging Face model](https://huggingface.co/models) you've deployed (e.g., `mistralai/Mistral-7B-Instruct-v0.2`). +2. A custom name if you've set `OPENAI_SERVED_MODEL_NAME_OVERRIDE` as an environment variable. + +This model name is used in chat and text completion API requests to identify which model should process your request. -This model name is used in chat/text completion API requests to identify which model should process your request. -## Initilization +## Initialize the OpenAI client -Before you can send API requests, start by setting up an OpenAI client with your Runpod API key and endpoint URL: +Before you can send API requests, set up an OpenAI client with your Runpod API key and endpoint URL: ```python from openai import OpenAI -import os -MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2" # Use your deployed model +MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2" # Use your deployed model client = OpenAI( - api_key=[RUNPOD_API_KEY], - base_url=f"https://api.runpod.ai/v2/[RUNPOD_ENDPOINT_ID]/openai/v1", + api_key="", + base_url="https://api.runpod.ai/v2//openai/v1", ) ``` -## Send a request +## Send requests -You can use Runpod's OpenAI compatible API to send requests to your Runpod endpoint, enabling you to use the same client libraries and code that you use with OpenAI's services. You only need to change the base URL to point to your Runpod endpoint. +You can use Runpod's OpenAI-compatible API to send requests to your Runpod endpoint, enabling you to use the same client libraries and code that you use with OpenAI's services. You only need to change the base URL to point to your Runpod endpoint. @@ -64,20 +65,19 @@ You can also send requests using [Runpod's native API](/serverless/vllm/vllm-req The `/chat/completions` endpoint is designed for instruction-tuned LLMs that follow a chat format. -#### Non-streaming request example +#### Non-streaming request Here's how you can make a basic chat completion request: ```python from openai import OpenAI -import os MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2" # Use your deployed model # Initialize the OpenAI client client = OpenAI( - api_key=[RUNPOD_API_KEY], - base_url=f"https://api.runpod.ai/v2/[RUNPOD_ENDPOINT_ID]/openai/v1", + api_key="", + base_url="https://api.runpod.ai/v2//openai/v1", ) # Chat completion request (for instruction-tuned models) @@ -123,13 +123,11 @@ The API returns responses in this JSON format: } ``` -#### Streaming request example +#### Streaming request Streaming allows you to receive the model's output incrementally as it's generated, rather than waiting for the complete response. 
This real-time delivery enhances responsiveness, making it ideal for interactive applications like chatbots or for monitoring the progress of lengthy generation tasks. ```python -# ... Imports and initialization ... - # Create a streaming chat completion request stream = client.chat.completions.create( model=MODEL_NAME, @@ -154,16 +152,14 @@ print() The `/completions` endpoint is designed for base LLMs and text completion tasks. -#### Non-streaming request example +#### Non-streaming request Here's how you can make a text completion request: ```python -# ... Imports and initialization ... - # Text completion request response = client.completions.create( - model="mistralai/Mistral-7B-Instruct-v0.2", + model=MODEL_NAME, prompt="Write a poem about artificial intelligence:", temperature=0.7, max_tokens=150 @@ -199,11 +195,9 @@ The API returns responses in this JSON format: } ``` -#### Streaming request example +#### Streaming request ```python -# ... Imports and initialization ... - # Create a completion stream response_stream = client.completions.create( model=MODEL_NAME, @@ -212,6 +206,7 @@ response_stream = client.completions.create( max_tokens=100, stream=True, ) + # Stream the response for response in response_stream: print(response.choices[0].text or "", end="", flush=True) @@ -222,8 +217,6 @@ for response in response_stream: The `/models` endpoint allows you to get a list of available models on your endpoint: ```python -# ... Imports and initialization ... - models_response = client.models.list() list_of_models = [model.id for model in models_response] print(list_of_models) @@ -245,9 +238,67 @@ print(list_of_models) } ``` -## Request input parameters - -vLLM workers support various parameters to control generation behavior. You can find a complete list of OpenAI request input parameters on the [GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#openai-request-input-parameters). +## Chat completion parameters + +Here are all available parameters for the `/chat/completions` endpoint: + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `messages` | `list[dict[str, str]]` | Required | List of messages with `role` and `content` keys. The model's chat template will be applied automatically. | +| `model` | `string` | Required | The model repo that you've deployed on your Runpod Serverless endpoint. | +| `temperature` | `float` | `0.7` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. | +| `top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | +| `n` | `int` | `1` | Number of output sequences to return for the given prompt. | +| `max_tokens` | `int` | None | Maximum number of tokens to generate per output sequence. | +| `seed` | `int` | None | Random seed to use for the generation. | +| `stop` | `string` or `list[str]` | `list` | String(s) that stop generation when produced. The returned output will not contain the stop strings. | +| `stream` | `bool` | `false` | Whether to stream the response. | +| `presence_penalty` | `float` | `0.0` | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. 
Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `logit_bias` | `dict[str, float]` | None | Unsupported by vLLM. | +| `user` | `string` | None | Unsupported by vLLM. | + +### Additional vLLM parameters + +vLLM supports additional parameters beyond the standard OpenAI API: + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `best_of` | `int` | None | Number of output sequences generated from the prompt. From these `best_of` sequences, the top `n` sequences are returned. Must be ≥ `n`. Treated as beam width when `use_beam_search` is `true`. | +| `top_k` | `int` | `-1` | Controls the number of top tokens to consider. Set to -1 to consider all tokens. | +| `ignore_eos` | `bool` | `false` | Whether to ignore the EOS token and continue generating tokens after EOS is generated. | +| `use_beam_search` | `bool` | `false` | Whether to use beam search instead of sampling. | +| `stop_token_ids` | `list[int]` | `list` | List of token IDs that stop generation when produced. The returned output will contain the stop tokens unless they are special tokens. | +| `skip_special_tokens` | `bool` | `true` | Whether to skip special tokens in the output. | +| `spaces_between_special_tokens` | `bool` | `true` | Whether to add spaces between special tokens in the output. | +| `add_generation_prompt` | `bool` | `true` | Whether to add generation prompt. Read more [here](https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts). | +| `echo` | `bool` | `false` | Echo back the prompt in addition to the completion. | +| `repetition_penalty` | `float` | `1.0` | Penalizes new tokens based on whether they appear in the prompt and generated text so far. Values > 1 encourage new tokens, values < 1 encourage repetition. | +| `min_p` | `float` | `0.0` | Minimum probability for a token to be considered. | +| `length_penalty` | `float` | `1.0` | Penalizes sequences based on their length. Used in beam search. | +| `include_stop_str_in_output` | `bool` | `false` | Whether to include the stop strings in output text. | + +## Text completion parameters + +Here are all available parameters for the `/completions` endpoint: + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `prompt` | `string` or `list[str]` | Required | The prompt(s) to generate completions for. | +| `model` | `string` | Required | The model repo that you've deployed on your Runpod Serverless endpoint. | +| `temperature` | `float` | `0.7` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. | +| `top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | +| `n` | `int` | `1` | Number of output sequences to return for the given prompt. | +| `max_tokens` | `int` | `16` | Maximum number of tokens to generate per output sequence. | +| `seed` | `int` | None | Random seed to use for the generation. | +| `stop` | `string` or `list[str]` | `list` | String(s) that stop generation when produced. The returned output will not contain the stop strings. | +| `stream` | `bool` | `false` | Whether to stream the response. | +| `presence_penalty` | `float` | `0.0` | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. 
| +| `frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `logit_bias` | `dict[str, float]` | None | Unsupported by vLLM. | +| `user` | `string` | None | Unsupported by vLLM. | + +Text completions support the same additional vLLM parameters as chat completions (see the Additional vLLM parameters section above). ## Environment variables @@ -255,11 +306,11 @@ Use these environment variables to customize the OpenAI compatibility: | Variable | Default | Description | | ----------------------------------- | ----------- | ------------------------------------------- | -| `RAW_OPENAI_OUTPUT` | `1` (true) | Enables raw OpenAI SSE format for streaming | -| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses | -| `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions | +| `RAW_OPENAI_OUTPUT` | `1` (true) | Enables raw OpenAI SSE format for streaming. | +| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses. | +| `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions. | -You can find a complete list of vLLM environment variables on the [GitHub README](https://github.com/runpod-workers/worker-vllm#environment-variables). +For a complete list of vLLM environment variables, see the [environment variables documentation](/serverless/vllm/environment-variables). ## Client libraries @@ -267,16 +318,16 @@ The OpenAI-compatible API works with standard [OpenAI client libraries](https:// ### Python -```py +```python from openai import OpenAI client = OpenAI( - api_key="[RUNPOD_API_KEY]", - base_url=f"https://api.runpod.ai/v2/your_endpoint_id/openai/v1" + api_key="", + base_url="https://api.runpod.ai/v2//openai/v1" ) response = client.chat.completions.create( - model="[MODEL_NAME]", + model="", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"} @@ -286,16 +337,16 @@ response = client.chat.completions.create( ### JavaScript -```js +```javascript import { OpenAI } from "openai"; const openai = new OpenAI({ - apiKey: "[RUNPOD_API_KEY]", - baseURL: "https://api.runpod.ai/v2/your_endpoint_id/openai/v1" + apiKey: "", + baseURL: "https://api.runpod.ai/v2//openai/v1" }); const response = await openai.chat.completions.create({ - model: "[MODEL_NAME]", + model: "", messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Hello!" } @@ -307,16 +358,21 @@ const response = await openai.chat.completions.create({ While the vLLM worker aims for high compatibility, there are some differences from OpenAI's implementation: -1. **Token counting**: Token counts may differ slightly from OpenAI models. -2. **Streaming format**: The exact chunking of streaming responses may vary. -3. **Error format**: Error responses follow a similar but not identical format. -4. **Rate limits**: Rate limits follow Runpod's endpoint policies rather than OpenAI's. +**Token counting** may differ slightly from OpenAI models due to different tokenizers. + +**Streaming format** follows OpenAI's Server-Sent Events (SSE) format, but the exact chunking of streaming responses may vary. + +**Error responses** follow a similar but not identical format to OpenAI's error responses. + +**Rate limits** follow Runpod's endpoint policies rather than OpenAI's rate limiting structure. 
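Because error payloads and throttling behavior can differ slightly from OpenAI's, it helps to handle failures defensively when pointing existing OpenAI-based code at a Runpod endpoint. The sketch below assumes the official `openai` Python package (v1 or later); `<ENDPOINT_ID>`, `<RUNPOD_API_KEY>`, and `<MODEL_NAME>` are placeholders for your own endpoint ID, API key, and deployed model.

```python
from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError

client = OpenAI(
    api_key="<RUNPOD_API_KEY>",  # your Runpod API key, not an OpenAI key
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

try:
    response = client.chat.completions.create(
        model="<MODEL_NAME>",
        messages=[{"role": "user", "content": "Hello!"}],
        timeout=120,  # allow extra time for cold starts on larger models
    )
    print(response.choices[0].message.content)
except RateLimitError:
    # Throttling follows Runpod's endpoint policies, so back off and retry
    print("Too many requests; retry after a short delay.")
except APIConnectionError as e:
    print(f"Could not reach the endpoint: {e}")
except APIStatusError as e:
    # Error bodies are close to OpenAI's format but may not match it field-for-field
    print(f"Request failed with status {e.status_code}: {e.response.text}")
```

This mirrors the error handling you would already use against OpenAI's own API, which is usually all that changes when migrating existing code.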
+ +### Current limitations -The vLLM worker also currently has a few limitations: +The vLLM worker has a few limitations: -* The function and tool APIs are not currently supported. +* Function and tool calling APIs are not currently supported. * Some OpenAI-specific features like moderation endpoints are not available. -* Vision models and multimodal capabilities depend on the underlying model support. +* Vision models and multimodal capabilities depend on the underlying model support in vLLM. ## Troubleshooting @@ -324,14 +380,15 @@ Common issues and their solutions: | Issue | Solution | | ------------------------- | --------------------------------------------------------------------- | -| "Invalid model" error | Verify your model name matches what you deployed | -| Authentication error | Check that you're using your Runpod API key, not an OpenAI key | -| Timeout errors | Increase client timeout settings for large models | -| Incompatible responses | Set `RAW_OPENAI_OUTPUT=1` in your environment variables | -| Different response format | Some models may have different output formatting; use a chat template | +| "Invalid model" error | Verify your model name matches what you deployed. | +| Authentication error | Check that you're using your Runpod API key, not an OpenAI key. | +| Timeout errors | Increase client timeout settings for large models. | +| Incompatible responses | Set `RAW_OPENAI_OUTPUT=1` in your environment variables. | +| Different response format | Some models may have different output formatting; use a chat template. | ## Next steps -* [Learn how to send vLLM requests.](/serverless/vllm/vllm-requests) -* [Explore Runpod endpoint operations.](/serverless/endpoints/operations) -* [Explore the OpenAI API documentation.](https://platform.openai.com/docs/api-reference) +* [Learn how to send vLLM requests using Runpod's native API](/serverless/vllm/vllm-requests). +* [Explore environment variables for customization](/serverless/vllm/environment-variables). +* [Review all Serverless endpoint operations](/serverless/endpoints/send-requests). +* [Explore the OpenAI API documentation](https://platform.openai.com/docs/api-reference). diff --git a/serverless/vllm/overview.mdx b/serverless/vllm/overview.mdx index 0e50052e..9eabe2d4 100644 --- a/serverless/vllm/overview.mdx +++ b/serverless/vllm/overview.mdx @@ -1,81 +1,124 @@ --- -title: "vLLM worker overview" +title: "vLLM workers overview" sidebarTitle: "Overview" +description: "Learn what vLLM is, how it works, and why you should use it for deploying large language models on Runpod Serverless." --- -vLLM workers are specialized containers designed to efficiently deploy and serve large language models (LLMs) on Runpod's [Serverless infrastructure](/serverless/overview). By leveraging Runpod's vLLM workers, you can quickly deploy state-of-the-art language models with optimized performance, flexible scaling, and cost-effective operation. +vLLM workers let you deploy and serve large language models on Runpod Serverless. They use vLLM, a high-performance inference engine, to deliver fast and efficient LLM inference with automatic scaling. -For detailed information on model compatibility and configuration options, check out the [vLLM worker GitHub repository](https://github.com/runpod-workers/worker-vllm). +## What is vLLM? -## Key features +vLLM is an open-source inference engine designed to serve large language models efficiently. It maximizes throughput and minimizes latency when running LLM inference workloads. 
-vLLM workers offer several advantages that make them ideal for LLM deployment: +vLLM workers include the vLLM engine with GPU optimizations and support for both OpenAI's API and Runpod's native API. You can deploy any supported model from Hugging Face with minimal configuration and start serving requests immediately. The workers run on Runpod Serverless, which automatically scales based on demand. -* **Pre-built optimization**: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference. -* **OpenAI API compatibility**: They provide a drop-in replacement for OpenAI's API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key. -* **Hugging Face integration**: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others. -* **Configurable environments**: Extensive customization options through [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variables) allow you to adjust model parameters, performance settings, and other behaviors. -* **Auto-scaling architecture**: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis. +## How vLLM works -## Deployment options +vLLM uses several advanced techniques to achieve high performance when serving LLMs. Understanding these can help you optimize your deployments and troubleshoot issues. + +### PagedAttention for memory efficiency + +PagedAttention is the key innovation in vLLM. It dramatically improves how GPU memory is used during inference. Traditional LLM serving wastes memory by pre-allocating large contiguous blocks for key-value (KV) caches. PagedAttention breaks the KV cache into smaller pages, similar to how operating systems manage memory. + +This reduces memory waste and allows vLLM to serve more requests concurrently on the same GPU. You can handle higher throughput or serve larger models on smaller GPUs. + +### Continuous batching + +vLLM uses continuous batching (also called dynamic batching) to process multiple requests simultaneously. Unlike traditional batching, which waits for a batch to fill up before processing, continuous batching processes requests as they arrive and adds new requests to the batch as soon as previous ones complete. + +This keeps your GPU busy and reduces latency for individual requests, especially during periods of variable traffic. + +### Request lifecycle + +When you send a request to a vLLM worker endpoint: + +1. The request arrives at Runpod Serverless infrastructure. +2. If no worker is available, the request is queued and a worker starts automatically. +3. The worker loads your model from Hugging Face (or from the pre-baked Docker image). +4. vLLM processes the request using PagedAttention and continuous batching. +5. The response is returned to your application. +6. If there are no more requests, the worker scales down to zero after a configured timeout. + +vLLM endpoints use the same `/run` and `/runsync` operations as other Runpod Serverless endpoints. The only difference is the input format and the specialized LLM processing inside the worker. + +## Why use vLLM workers? + +vLLM workers offer several advantages over other LLM deployment options. + +### Performance and efficiency + +vLLM's PagedAttention and continuous batching deliver significantly better throughput than traditional serving methods. 
You can serve 2-3x more requests per GPU compared to naive implementations, which directly translates to lower costs and better user experiences. + +### OpenAI API compatibility -There are two ways to deploy a vLLM worker: +vLLM workers provide a drop-in replacement for OpenAI's API. If you're already using the OpenAI Python client or any other OpenAI-compatible library, you can switch to your Runpod endpoint by changing just two lines of code: the API key and the base URL. Your existing prompts, parameters, and response handling code continue to work without modification. -### Option 1: Quick deploy a vLLM endpoint +### Model flexibility -This is the simplest approach. Use Runpod's UI to deploy a model directly from Hugging Face with minimal configuration. For step-by-step instructions, see [Deploy a vLLM worker](/serverless/vllm/get-started). +You can deploy virtually any model available on Hugging Face, including popular options like Llama, Mistral, Qwen, Gemma, and thousands of others. vLLM supports a wide range of model architectures out of the box, and new architectures are added regularly. - +### Auto-scaling and cost efficiency -Quick-deployed workers will download models during initialization, which can take some time depending on the model selected. If you plan to run a vLLM endpoint in production, it’s best to package your model into a Docker image ahead of time (using the Docker image method below), as this can significantly reduce cold start times. +Runpod Serverless automatically scales your vLLM workers from zero to many based on demand. You only pay for the seconds when workers are actively processing requests. This makes vLLM workers ideal for workloads with variable traffic patterns or when you're getting started and don't want to pay for idle capacity. - +### Production-ready features -### Option 2: Deploy using a Docker image +vLLM workers come with features that make them suitable for production deployments, including streaming responses, configurable context lengths, quantization support (AWQ, GPTQ), multi-GPU tensor parallelism, and comprehensive error handling. -Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variablessettings). +## Deployment options + +There are two ways to deploy vLLM workers on Runpod. + +### Using pre-built Docker images + +This is the fastest and most common approach. Runpod provides pre-built vLLM worker images that you can deploy directly from the console. You specify your model name as an environment variable, and the worker downloads it from Hugging Face during initialization. + +This method is ideal for getting started quickly, testing different models, or deploying models that change frequently. However, model download time adds to your cold start latency. -Follow the instructions in the [vLLM worker README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside) to build a model into your worker image. +### Building custom Docker images with models baked in -You can add new functionality your vLLM worker deployment by customizing its [handler function](/serverless/workers/handler-functions). +For production deployments where cold start time matters, you can build a custom Docker image that includes your model weights. 
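One common way to bake weights in is to download them while the image is built, so nothing has to be fetched at cold start. The snippet below is a minimal sketch of that build-time download step, intended to run from a `RUN` instruction in your Dockerfile; the model name and target directory here are illustrative assumptions, and the worker-vllm repository documents its own build arguments and expected paths, so check it for the exact setup.

```python
# bake_model.py -- run during `docker build` (e.g., `RUN python bake_model.py`)
# so the model weights ship inside the image instead of downloading at cold start.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openchat/openchat-3.5-0106",   # example model to bake into the image
    local_dir="/models/openchat-3.5-0106",  # assumed path; point your worker at this directory
)
```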
This eliminates download time and can reduce cold starts from minutes to seconds. + +This approach requires more upfront work but provides the best performance for production workloads with consistent traffic. ## Compatible models -You can deploy almost any model on [Hugging Face](https://huggingface.co/models?other=LLM) as a vLLM worker. You can find a full list of supported models architectures on the [GitHub README](https://github.com/runpod-workers/worker-vllm/blob/main/README.md#compatible-model-architectures). +vLLM supports most model architectures available on Hugging Face. You can deploy models from families including Llama (1, 2, 3, 3.1, 3.2), Mistral and Mixtral, Qwen2 and Qwen2.5, Gemma and Gemma 2, Phi (2, 3, 3.5, 4), DeepSeek (V2, V3, R1), GPT-2, GPT-J, OPT, BLOOM, Falcon, MPT, StableLM, Yi, and many others. -## How vLLM works +For a complete and up-to-date list of supported model architectures, see the [vLLM supported models documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). + +## Performance considerations -When deployed to a [Serverless endpoint](/serverless/endpoints/overview), vLLM workers: +Several factors affect vLLM worker performance. -1. Download and load the specified LLM from Hugging Face or other compatible sources. -2. Optimize the model for inference using vLLM's techniques like continuous batching and PagedAttention. -3. Expose API endpoints for both [OpenAI-compatible requests](/serverless/vllm/openai-compatibility) and Runpod's native [endpoint request](/serverless/endpoints/send-requests) format. -4. Process incoming requests by dynamically allocating GPU resources. -5. Scale workers up or down based on traffic patterns. +**GPU selection** is the most important factor. Larger models require more VRAM, and inference speed scales with GPU memory bandwidth. For 7B parameter models, an A10G or better is recommended. For 70B+ models, you'll need an A100 or H100. See [GPU types](/references/gpu-types) for details on available GPUs. + +**Model size** directly impacts both loading time and inference speed. Smaller models (7B parameters) load quickly and generate tokens fast. Larger models (70B+ parameters) provide better quality but require more powerful GPUs and have higher latency. + +**Quantization** reduces model size and memory requirements by using lower-precision weights. Methods like AWQ and GPTQ can reduce memory usage by 2-4x with minimal quality loss. This lets you run larger models on smaller GPUs or increase throughput on a given GPU. + +**Context length** affects memory requirements and processing time. Longer contexts require more memory for the KV cache and take longer to process. Set `MAX_MODEL_LEN` to the minimum value that meets your needs. + +**Concurrent requests** benefit from vLLM's continuous batching, but too many concurrent requests can exceed GPU memory and cause failures. The `MAX_NUM_SEQS` environment variable controls the maximum number of concurrent sequences. ## Use cases -vLLM workers are an effective choice for: +vLLM workers are ideal for several types of applications. -* High-performance inference for text generation. -* Cost-effective scaling for LLM workloads. -* Integration with existing OpenAI-based applications. -* Deploying open-source models with commercial licenses. -* AI systems requiring both synchronous and streaming responses. +**Production LLM APIs** benefit from vLLM's high throughput and OpenAI compatibility. 
You can build scalable APIs for chatbots, content generation, code completion, or any other LLM-powered feature. -## Performance considerations +**Cost-effective scaling** is enabled by Serverless auto-scaling. If your traffic varies significantly throughout the day or week, vLLM workers automatically scale down to zero during quiet periods, saving costs compared to always-on servers. + +**OpenAI migration** is straightforward because vLLM provides API compatibility. You can migrate existing OpenAI-based applications to open-source models by changing only your endpoint URL and API key. -The performance of vLLM workers depends on several factors: +**Custom model hosting** lets you deploy fine-tuned or specialized models. If you've trained a custom model or fine-tuned an existing one, vLLM workers make it easy to serve it at scale. -* **GPU selection**: Larger models require more VRAM (A10G or better recommended for 7B+ parameter models). For a list of available GPUs, see [GPU types](/references/gpu-types) -* **Model size**: Affects both loading time and inference speed. -* **Quantization**: Options like AWQ or GPTQ can reduce memory requirements at a small quality cost. -* **Batch size settings**: Impact throughput and latency tradeoffs. -* **Context length**: Longer contexts require more memory and processing time. +**Development and experimentation** is cheaper with pay-per-second billing. You can test multiple models and configurations without worrying about idle costs. ## Next steps -* [Deploy a vLLM worker as a Serverless endpoint.](/serverless/vllm/get-started) -* [Send requests to a vLLM endpoint.](/serverless/vllm/vllm-requests) -* [Learn about Runpod's OpenAI API compatibility.](/serverless/vllm/openai-compatibility) -* [Deploy Google's Gemma model using a vLLM Worker.](/tutorials/serverless/run-gemma-7b) +Ready to deploy your first vLLM worker? Start with the [get started guide](/serverless/vllm/get-started) to deploy a model in minutes. + +Once your endpoint is running, learn how to send requests using [Runpod's native API](/serverless/vllm/vllm-requests) or the [OpenAI-compatible API](/serverless/vllm/openai-compatibility). + +For advanced configuration options, see the [environment variables documentation](/serverless/vllm/environment-variables). diff --git a/serverless/vllm/vllm-requests.mdx b/serverless/vllm/vllm-requests.mdx index 2f245095..dbe9b613 100644 --- a/serverless/vllm/vllm-requests.mdx +++ b/serverless/vllm/vllm-requests.mdx @@ -1,158 +1,275 @@ --- title: "Send requests to vLLM workers" sidebarTitle: "Send vLLM requests" +description: "Send requests to vLLM workers using Runpod's native API." --- -This guide covers different methods for sending requests to vLLM workers on Runpod, including code examples and best practices for Runpod's native API format. Use this guide to effectively integrate LLMs into your applications while maintaining control over performance and cost. +This guide covers how to send requests to vLLM workers using Runpod's native API format. vLLM workers use the same request operations as any other Runpod Serverless endpoint, with specialized input parameters for LLM inference. -## Requirements +## How vLLM requests work -* You've [created a Runpod account](/get-started/manage-accounts). -* You've created a [Runpod API key](/get-started/api-keys). -* You've installed [Python](https://www.python.org/downloads/). -* (For gated models) You've created a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens). 
+vLLM workers are queue-based Serverless endpoints. They use the same `/run` and `/runsync` operations as other Runpod endpoints, following the standard [Serverless request structure](/serverless/endpoints/send-requests). -Many of the code samples below will require you to input your endpoint ID. You can find your endpoint ID on the endpoint details page: +The key difference is the input format. vLLM workers expect specific parameters for language model inference, such as prompts, messages, and sampling parameters. The worker's handler processes these inputs using the vLLM engine and returns generated text. - - - +## Request operations -## Runpod API requests +vLLM endpoints support both synchronous and asynchronous requests. -Runpod's native API provides additional flexibility and control over your requests. These requests follow Runpod's standard [endpoint operations](/serverless/endpoints/operations) format. +### Asynchronous requests with `/run` -### Python Example - -Replace `[RUNPOD_API_KEY]` with your Runpod API key. +Use `/run` to submit a job that processes in the background. You'll receive a job ID immediately, then poll for results using the `/status` endpoint. ```python import requests url = "https://api.runpod.ai/v2//run" -headers = {"Authorization": "Bearer [RUNPOD_API_KEY]", "Content-Type": "application/json"} +headers = { + "Authorization": "Bearer ", + "Content-Type": "application/json" +} data = { "input": { - "messages": [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Write a short poem."} - ], - "sampling_params": {"temperature": 0.7, "max_tokens": 100} + "prompt": "Explain quantum computing in simple terms.", + "sampling_params": { + "temperature": 0.7, + "max_tokens": 200 + } } } response = requests.post(url, headers=headers, json=data) -print(response.json()) +job_id = response.json()["id"] +print(f"Job ID: {job_id}") ``` -### cURL Example +### Synchronous requests with `/runsync` + +Use `/runsync` to wait for the complete response in a single request. The client blocks until processing is complete. + +```python +import requests -Run the following command in your local terminal, replacing `[RUNPOD_API_KEY]` with your Runpod API key and `[RUNPOD_ENDPOINT_ID]` with your vLLM endpoint ID. +url = "https://api.runpod.ai/v2//runsync" +headers = { + "Authorization": "Bearer ", + "Content-Type": "application/json" +} + +data = { + "input": { + "prompt": "Explain quantum computing in simple terms.", + "sampling_params": { + "temperature": 0.7, + "max_tokens": 200 + } + } +} -```sh -curl -X POST "https://api.runpod.ai/v2/[RUNPOD_ENDPOINT_ID]/run" \ - -H "Authorization: Bearer [RUNPOD_API_KEY]" \ - -H "Content-Type: application/json" \ - -d '{ - "input": { - "prompt": "Write a haiku about nature.", - "sampling_params": {"temperature": 0.8, "max_tokens": 50} - } - }' +response = requests.post(url, headers=headers, json=data) +print(response.json()) ``` -## Request formats +For more details on request operations, see [Send API requests to Serverless endpoints](/serverless/endpoints/send-requests). + +## Input formats -vLLM workers accept two primary input formats: +vLLM workers accept two input formats for text generation. ### Messages format (for chat models) +Use the messages format for instruction-tuned models that expect conversation history. The worker automatically applies the model's chat template. 
+ ```json { - "messages": [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Tell me about the solar system."} - ] + "input": { + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "What is the capital of France?"} + ], + "sampling_params": { + "temperature": 0.7, + "max_tokens": 100 + } + } } ``` ### Prompt format (for text completion) +Use the prompt format for base models or when you want to provide raw text without a chat template. + ```json { - "prompt": "Summarize the following text: Climate change is a global challenge that affects..." + "input": { + "prompt": "The capital of France is", + "sampling_params": { + "temperature": 0.7, + "max_tokens": 50 + } + } +} +``` + +### Applying chat templates to prompts + +If you use the prompt format but want the model's chat template applied, set `apply_chat_template` to `true`. + +```json +{ + "input": { + "prompt": "What is the capital of France?", + "apply_chat_template": true, + "sampling_params": { + "temperature": 0.7, + "max_tokens": 100 + } + } } ``` ## Request input parameters -vLLM workers support various parameters to control generation behavior. Here are some commonly used parameters: +Here are all available parameters you can include in the `input` object of your request. + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `prompt` | `string` | None | Prompt string to generate text based on. | +| `messages` | `list[dict[str, str]]` | None | List of messages with `role` and `content` keys. The model's chat template will be applied automatically. Overrides `prompt`. | +| `apply_chat_template` | `bool` | `false` | Whether to apply the model's chat template to the `prompt`. | +| `sampling_params` | `dict` | `{}` | Sampling parameters to control generation (see Sampling parameters section below). | +| `stream` | `bool` | `false` | Whether to enable streaming of output. If `true`, responses are streamed as they are generated. | +| `max_batch_size` | `int` | env `DEFAULT_BATCH_SIZE` | The maximum number of tokens to stream per HTTP POST call. | +| `min_batch_size` | `int` | env `DEFAULT_MIN_BATCH_SIZE` | The minimum number of tokens to stream per HTTP POST call. | +| `batch_size_growth_factor` | `int` | env `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | The growth factor by which `min_batch_size` multiplies for each call until `max_batch_size` is reached. | + +## Sampling parameters + +Sampling parameters control how the model generates text. Include them in the `sampling_params` dictionary in your request. + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `n` | `int` | `1` | Number of output sequences generated from the prompt. The top `n` sequences are returned. | +| `best_of` | `int` | `n` | Number of output sequences generated from the prompt. The top `n` sequences are returned from these `best_of` sequences. Must be ≥ `n`. Treated as beam width in beam search. | +| `presence_penalty` | `float` | `0.0` | Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `repetition_penalty` | `float` | `1.0` | Penalizes new tokens based on their appearance in the prompt and generated text. 
Values > 1 encourage new tokens, values < 1 encourage repetition. | +| `temperature` | `float` | `1.0` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. | +| `top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | +| `top_k` | `int` | `-1` | Controls the number of top tokens to consider. Set to -1 to consider all tokens. | +| `min_p` | `float` | `0.0` | Represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. | +| `use_beam_search` | `bool` | `false` | Whether to use beam search instead of sampling. | +| `length_penalty` | `float` | `1.0` | Penalizes sequences based on their length. Used in beam search. | +| `early_stopping` | `bool` or `string` | `false` | Controls stopping condition in beam search. Can be `true`, `false`, or `"never"`. | +| `stop` | `string` or `list[str]` | `None` | String(s) that stop generation when produced. The output will not contain these strings. | +| `stop_token_ids` | `list[int]` | `None` | List of token IDs that stop generation when produced. Output contains these tokens unless they are special tokens. | +| `ignore_eos` | `bool` | `false` | Whether to ignore the End-Of-Sequence token and continue generating tokens after its generation. | +| `max_tokens` | `int` | `16` | Maximum number of tokens to generate per output sequence. | +| `min_tokens` | `int` | `0` | Minimum number of tokens to generate per output sequence before EOS or stop sequences. | +| `skip_special_tokens` | `bool` | `true` | Whether to skip special tokens in the output. | +| `spaces_between_special_tokens` | `bool` | `true` | Whether to add spaces between special tokens in the output. | +| `truncate_prompt_tokens` | `int` | `None` | If set, truncate the prompt to this many tokens. | + +## Streaming responses + +Enable streaming to receive tokens as they're generated instead of waiting for the complete response. + +```python +import requests +import json -| Parameter | Type | Description | -| -------------------- | ------------------- | ----------------------------------------------------------- | -| `temperature` | `float` | Controls randomness (0.0-1.0) | -| `max_tokens` | `int` | Maximum number of tokens to generate | -| `top_p` | `float` | Nucleus sampling parameter (0.0-1.0) | -| `top_k` | `int` | Limits consideration to top k tokens | -| `stop` | `string` or `array` | Sequence(s) at which to stop generation | -| `repetition_penalty` | `float` | Penalizes repetition (1.0 = no penalty) | -| `presence_penalty` | `float` | Penalizes new tokens already in text | -| `frequency_penalty` | `float` | Penalizes token frequency | -| `min_p` | `float` | Minimum probability threshold relative to most likely token | -| `best_of` | `int` | Number of completions to generate server-side | -| `use_beam_search` | `boolean` | Whether to use beam search instead of sampling | +url = "https://api.runpod.ai/v2/ENDPOINT_ID/run" +headers = { + "Authorization": "Bearer RUNPOD_API_KEY", + "Content-Type": "application/json" +} -You can find a complete list of request input parameters on the [GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#usage-standard-non-openai). 
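If you submit work asynchronously with `/run`, you can also poll the job's status instead of streaming, which is useful for long generations where you only need the final output. This is a minimal sketch that reuses the `job_id` returned by the `/run` example earlier in this guide; replace `ENDPOINT_ID` and `RUNPOD_API_KEY` with your own values, and note that the status strings shown are the common terminal states.

```python
import time
import requests

headers = {"Authorization": "Bearer RUNPOD_API_KEY"}  # same key used for /run
status_url = f"https://api.runpod.ai/v2/ENDPOINT_ID/status/{job_id}"  # job_id returned by /run

# Poll until the job reaches a terminal state
while True:
    status = requests.get(status_url, headers=headers, timeout=30).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(2)

if status["status"] == "COMPLETED":
    print(status["output"])
else:
    print(f"Job ended with status: {status['status']}")
```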
+data = { + "input": { + "prompt": "Write a short story about a robot.", + "sampling_params": { + "temperature": 0.8, + "max_tokens": 500 + }, + "stream": True + } +} -## Error handling +response = requests.post(url, headers=headers, json=data) +job_id = response.json()["id"] + +# Stream the results +stream_url = f"https://api.runpod.ai/v2//stream/{job_id}" +with requests.get(stream_url, headers=headers, stream=True) as r: + for line in r.iter_lines(): + if line: + print(json.loads(line)) +``` + +Replace `ENDPOINT_ID` and `RUNPOD_API_KEY` with your actual values. -When working with vLLM workers, it's crucial to implement proper error handling to address potential issues such as network timeouts, rate limiting, worker initialization delays, and model loading errors. +For more information on streaming, see the [stream operation documentation](/serverless/endpoints/send-requests#stream). -Here is an example error handling implementation: +## Error handling + +Implement proper error handling to manage network timeouts, rate limiting, worker initialization delays, and model loading errors. ```python import requests import time -import backoff # pip install backoff - -@backoff.on_exception(backoff.expo, requests.exceptions.RequestException, max_tries=5) -def send_request(url, headers, payload): - response = requests.post(url, headers=headers, json=payload) - response.raise_for_status() # Raises an exception for 4XX/5XX responses - return response.json() - -try: - result = send_request(url, headers, payload) - print(f"Success: {result}") -except requests.exceptions.HTTPError as e: - if e.response.status_code == 429: - print("Rate limit exceeded. Try again later.") - elif e.response.status_code == 500: - print("Server error. The model may be having trouble loading.") - else: - print(f"HTTP error: {e}") -except requests.exceptions.ConnectionError: - print("Connection error. Check your network and endpoint ID.") -except requests.exceptions.Timeout: - print("Request timed out. The model may be processing a large batch.") -except Exception as e: - print(f"An unexpected error occurred: {e}") + +def send_vllm_request(url, headers, payload, max_retries=3): + for attempt in range(max_retries): + try: + response = requests.post(url, headers=headers, json=payload, timeout=300) + response.raise_for_status() + return response.json() + except requests.exceptions.Timeout: + print(f"Request timed out. Attempt {attempt + 1}/{max_retries}") + if attempt < max_retries - 1: + time.sleep(2 ** attempt) # Exponential backoff + except requests.exceptions.HTTPError as e: + if e.response.status_code == 429: + print("Rate limit exceeded. Waiting before retry...") + time.sleep(5) + elif e.response.status_code >= 500: + print(f"Server error: {e.response.status_code}") + if attempt < max_retries - 1: + time.sleep(2 ** attempt) + else: + raise + except requests.exceptions.RequestException as e: + print(f"Request failed: {e}") + if attempt < max_retries - 1: + time.sleep(2 ** attempt) + + raise Exception("Max retries exceeded") + +# Usage +result = send_vllm_request(url, headers, data) ``` ## Best practices -Here are some best practices to keep in mind when creating your requests: +Follow these best practices when sending requests to vLLM workers. + +**Set appropriate timeouts** based on your model size and expected generation length. Larger models and longer generations require longer timeouts. + +**Implement retry logic** with exponential backoff for failed requests. 
This handles temporary network issues and worker initialization delays. + +**Use streaming for long responses** to provide a better user experience. Users see output immediately instead of waiting for the entire response. + +**Optimize sampling parameters** for your use case. Lower temperature for factual tasks, higher temperature for creative tasks. + +**Monitor response times** to identify performance issues. If requests consistently take longer than expected, consider using a more powerful GPU or optimizing your parameters. + +**Handle rate limits** gracefully by implementing queuing or request throttling in your application. -1. **Use appropriate timeouts**: Set timeouts based on your model size and complexity. -2. **Implement retry logic**: Add exponential backoff for failed requests. -3. **Optimize batch size**: Adjust request frequency based on model inference speed. -4. **Monitor response times**: Track performance to identify optimization opportunities. -5. **Use streaming for long responses**: Improve user experience for lengthy content generation. -6. **Cache frequent requests**: Reduce redundant API calls for common queries. -7. **Handle rate limits**: Implement queuing for high-volume applications. +**Cache common requests** when appropriate to reduce redundant API calls and improve response times. ## Next steps -* [Send requests using the OpenAI-compatible API.](/serverless/vllm/openai-compatibility) -* [Learn how to use Serverless endpoint operations.](/serverless/endpoints/operations) +* [Learn about OpenAI API compatibility](/serverless/vllm/openai-compatibility). +* [Explore environment variables for customization](/serverless/vllm/environment-variables). +* [Review all Serverless endpoint operations](/serverless/endpoints/send-requests).