---
title: "Deploy a vLLM worker"
description: "Deploy a large language model using Runpod's vLLM workers and start serving requests in minutes."
---

Learn how to deploy a large language model using Runpod's vLLM workers. By the end of this guide, you'll have a fully functional Serverless endpoint that can handle LLM inference requests.

## What you'll learn

In this tutorial, you'll learn how to:

* Configure and deploy a vLLM worker using Runpod Serverless.
* Select the appropriate hardware and scaling settings for your model.
* Set up environment variables to customize your deployment.
* Test your endpoint using the Runpod API.
* Troubleshoot common issues that might arise during deployment.


## Step 1: Choose your model

First, decide which LLM you want to deploy. The vLLM worker supports most models on Hugging Face, including:

* Llama 3 (e.g., `meta-llama/Llama-3.2-3B-Instruct`).
* Mistral (e.g., `mistralai/Ministral-8B-Instruct-2410`).
* Qwen3 (e.g., `Qwen/Qwen3-8B`).
* OpenChat (e.g., `openchat/openchat-3.5-0106`).
* Gemma (e.g., `google/gemma-3-1b-it`).
* DeepSeek-R1 (e.g., `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`).
* Phi-4 (e.g., `microsoft/Phi-4-mini-instruct`).

For this walkthrough, we'll use `openchat/openchat-3.5-0106`, but you can substitute this with [any compatible model](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Step 2: Deploy using the Runpod console

The easiest way to deploy a vLLM worker is through Runpod's Ready-to-Deploy Repos:

1. Find the [vLLM repo](https://console.runpod.io/hub/runpod-workers/worker-vllm) in the Runpod Hub.

2. Click **Deploy** to use the latest version of the vLLM worker.

3. In the **Model (optional)** field, enter the model name: `openchat/openchat-3.5-0106`.
4. Click **Advanced** to expand the vLLM settings.
5. Set **Max Model Length** to `8192` (or an appropriate context length for your model).
6. Leave other settings at their defaults unless you have specific requirements, then click **Next**.
7. Click **Create Endpoint**.

Your endpoint will now begin initializing. This may take several minutes while Runpod provisions resources and downloads the selected model.

<Tip>

For more details on how to optimize your endpoint, see [Endpoint configurations].

</Tip>

## Step 3: Understand your endpoint

While your endpoint is initializing, let's understand what's happening and what you'll be able to do with it.

Runpod is creating a Serverless endpoint with your specified configuration, and the vLLM worker image is being deployed using your chosen model. Once deployment is complete, make a note of your **Endpoint ID**, as you'll need this to make API requests.

<Frame>
<img src="/images/4a0706af-serverless-endpoint-id.png" />
</Frame>

## Step 4: Test your endpoint

When the workers finish processing your request, you should see output on the right.
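
If you'd rather test from your terminal than the console, the sketch below sends an equivalent request to the endpoint's `/runsync` route. It's a minimal example, assuming a placeholder `ENDPOINT_ID`, an API key in the `RUNPOD_API_KEY` environment variable, and the native input format described in [Send vLLM requests](/serverless/vllm/vllm-requests).

```python
# Minimal sketch: send a test prompt to the endpoint's /runsync route.
# ENDPOINT_ID is a placeholder; the API key is read from the environment.
import os

import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/runsync"
headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}
payload = {
    "input": {
        "prompt": "Tell me about large language models.",
        "sampling_params": {"max_tokens": 100},
    }
}

response = requests.post(url, headers=headers, json=payload, timeout=120)
print(response.json())
```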

## Step 5: Customize your deployment with environment variables (optional)

If you need to customize your model deployment, you can edit your endpoint settings to add environment variables. Here are some useful variables you might want to set:

* `MAX_MODEL_LEN`: Maximum context length (e.g., `16384`).
* `DTYPE`: Data type for model weights (`float16`, `bfloat16`, or `float32`).
* `GPU_MEMORY_UTILIZATION`: Controls VRAM usage (e.g., `0.95` for 95%).
* `CUSTOM_CHAT_TEMPLATE`: For models that need a custom chat template.
* `OPENAI_SERVED_MODEL_NAME_OVERRIDE`: Change the model name to use in OpenAI requests.
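
As a rough mental model, these variables correspond to vLLM engine arguments of the same name. The snippet below is an illustrative approximation of that mapping, not the worker's actual implementation.

```python
# Illustrative only: roughly how the environment variables above map onto
# vLLM engine arguments (the vLLM worker performs this wiring internally).
import os

engine_args = {
    "max_model_len": int(os.getenv("MAX_MODEL_LEN", "8192")),
    "dtype": os.getenv("DTYPE", "auto"),
    "gpu_memory_utilization": float(os.getenv("GPU_MEMORY_UTILIZATION", "0.95")),
}
print(engine_args)
```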

To add or modify environment variables:

1. Go to your endpoint details page.
2. Select **Manage**, then select **Edit Endpoint**.
3. Expand the **Public Environment Variables** section.
4. Add or edit your desired variables.
5. Click **Save Endpoint**.

You can find a full list of available environment variables in the [environment variables documentation](/serverless/vllm/environment-variables).

You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per response. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests).
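
As a quick illustration (using the same native input format assumed in the earlier sketch), `max_tokens` goes inside `sampling_params`:

```python
# Illustrative payload: raising the per-response generation limit.
payload = {
    "input": {
        "prompt": "Explain quantization in one paragraph.",
        "sampling_params": {
            "max_tokens": 500,  # allow up to 500 generated tokens
        },
    }
}
```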

## Troubleshooting

If you encounter issues with your deployment:

## Next steps

Congratulations! You've successfully deployed a vLLM worker on Runpod Serverless. You now have a powerful, scalable LLM inference API that's compatible with both the OpenAI client and Runpod's native API.
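
For example, here's a sketch of what a chat completion call might look like with the official OpenAI Python client, assuming a placeholder `ENDPOINT_ID`, your API key in `RUNPOD_API_KEY`, and the base URL pattern described in the OpenAI compatibility guide linked below.

```python
# Sketch: calling the endpoint's OpenAI-compatible route with the OpenAI client.
# ENDPOINT_ID is a placeholder; the API key is read from the environment.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

response = client.chat.completions.create(
    model="openchat/openchat-3.5-0106",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```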

Next you can try:

* [Sending requests using the Runpod API](/serverless/vllm/vllm-requests).
* [Learning about vLLM's OpenAI API compatibility](/serverless/vllm/openai-compatibility).
* [Customizing your vLLM worker's handler function](/serverless/workers/handler-functions).
* [Building a custom worker for more specialized workloads](/serverless/workers/custom-worker).