From 5c3166b3fc60c6b521507016f3695f01a9de5dfe Mon Sep 17 00:00:00 2001
From: sergiopaniego
Date: Mon, 29 Sep 2025 11:38:51 +0200
Subject: [PATCH 1/6] Updated vLLM integration guide

---
 docs/source/vllm_integration.md | 152 +++++++++++++++++++++++---------
 1 file changed, 111 insertions(+), 41 deletions(-)

diff --git a/docs/source/vllm_integration.md b/docs/source/vllm_integration.md
index 9240aed62ce..97857244400 100644
--- a/docs/source/vllm_integration.md
+++ b/docs/source/vllm_integration.md
@@ -8,12 +8,27 @@ TRL currently only supports vLLM versions `0.10.0`, `0.10.1`, and `0.10.2`. Plea
+
+
+The following trainers currently support generation with vLLM:
+
+- [`GRPOTrainer`]
+- [`OnlineDPO`]
+- [`NashMD`]
+- [`XPOTrainer`]
+- [`RLOOTrainer`]
+
+
+
 ## 🚀 How can I use vLLM with TRL to speed up training?

 💡 **Note**: Resources required for this specific example: a single node with 8 GPUs.

-vLLM server and TRL trainer must use different CUDA devices to avoid conflicts.
+
+When using vLLM with TRL, the **vLLM server** and the **trainer** must run on **separate CUDA devices** to prevent conflicts.
+For guidance on configuring this properly, see [Modes of using vLLM during training](#modes-of-using-vllm-during-training).
+

 First, install vLLM using the following command:

@@ -67,15 +82,19 @@ And the train command on separate GPUs from the server:
 CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
 ```

-## 🎬 Flashback: Why do we need to use vLLM in online methods?
+## Why use vLLM?
+
+### 🎬 Flashback: Why do we need to use vLLM in online methods?

 Online methods like GRPO or Online DPO require the model to generate completions during training, which are then used to compute reward signals. However, generation can be extremely time-consuming, especially with large or reasoning models. In the default setup (without vLLM), completions are generated using the [(unwrapped) model's `generate` method](https://github.com/huggingface/trl/blob/f3e8c2304428ef16e9ae5de9e5741ed84d533b7b/trl/trainer/grpo_trainer.py#L965C39-L965C66). This approach quickly becomes a major bottleneck — generation is slow and inefficient, particularly for large batches or models. As a result, training times increase significantly, and overall efficiency drops. To address this, we turn to vLLM, which enables much faster and more scalable generation, helping eliminate this bottleneck in online methods.

-## 🤔 How does vLLM solve the slow generation issue?
+### 🤔 How does vLLM solve the slow generation issue?

 If you've ever done autoregressive decoder training, you know all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to later generate subsequent tokens based on them. These cached key and value tensors are often referred to as the KV cache. However, storing the KV cache occupies a lot of memory, so vLLM uses a technique called **PagedAttention** to solve this problem. PagedAttention, which is inspired by the OS’s virtual memory concept, stores continuous keys and values in **non-contiguous memory space**, which is much more efficient. The details of this are beyond the scope of this document, but in short, it allows the model to store the keys and values in a more efficient way, reducing the memory footprint and speeding up the generation process. If you are interested, make sure to check out the [vLLM PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) for more details.

-## 🤔 What exactly happens when you run `trl vllm-serve --model `?
+## How vLLM Works (Under the Hood) 🔍 + +### 🤔 What exactly happens when you run `trl vllm-serve --model `? When you run for example @@ -96,7 +115,7 @@ Each worker operates independently and processes a chunk of the incoming request This GPU-to-GPU communication is managed efficiently by NVIDIA’s NCCL library. The communication mainly ensures that each GPU gets its correct portion of the incoming requests — it’s lightweight and doesn’t interfere with generation itself. Separately, the number of completions to generate per prompt is controlled by the `num_generations` setting in the GRPO config. For instance, if you set `num_generations=2` (like in the picture above), each prompt will have 2 completions. So, with 8 prompts and `num_generations=2`, you would end up with 16 completions total — regardless of the number of GPUs or parallelism settings. -## 🥸 More detail on what happens under the hood when running the server +### 🥸 More detail on what happens under the hood when running the server * The vLLM server starts by running the command: `trl vllm-serve --model Qwen/Qwen2.5-7B`. * Once the server is running, it generates completions based on requests from the client (trainer) using `vllm_client.generate` [here](https://github.com/huggingface/trl/blob/cc044e35b285be7dc062764b3364e1e684db4c7c/trl/trainer/grpo_trainer.py#L1025-L1035). @@ -118,19 +137,21 @@ For example, if you want to use GPUs 4–7 for training while the server runs on CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py ``` -## 🍷 More customization options with vLLM? +## Advanced usage + +### 🍷 More customization options with vLLM? You can customize the server configuration by passing additional arguments. ``` $ trl vllm-serve --help -usage: trl vllm-serve [-h] --model MODEL [--revision REVISION] [--tensor_parallel_size TENSOR_PARALLEL_SIZE] - [--data_parallel_size DATA_PARALLEL_SIZE] [--host HOST] [--port PORT] - [--gpu_memory_utilization GPU_MEMORY_UTILIZATION] [--dtype DTYPE] [--max_model_len MAX_MODEL_LEN] - [--enable_prefix_caching ENABLE_PREFIX_CACHING] [--enforce_eager ENFORCE_EAGER] [--log_level LOG_LEVEL] +usage: trl vllm-serve [-h] --model MODEL [--revision REVISION] [--tensor_parallel_size TENSOR_PARALLEL_SIZE] [--data_parallel_size DATA_PARALLEL_SIZE] [--host HOST] + [--port PORT] [--gpu_memory_utilization GPU_MEMORY_UTILIZATION] [--dtype DTYPE] [--max_model_len MAX_MODEL_LEN] + [--enable_prefix_caching ENABLE_PREFIX_CACHING] [--enforce_eager [ENFORCE_EAGER]] [--kv_cache_dtype KV_CACHE_DTYPE] + [--trust_remote_code [TRUST_REMOTE_CODE]] [--log_level LOG_LEVEL] [--vllm_model_impl VLLM_MODEL_IMPL] options: - -h, --help Show this help message and exit + -h, --help show this help message and exit --model MODEL Model name or path to load the model from. (default: None) --revision REVISION Revision to use for the model. If not specified, the default branch will be used. (default: None) --tensor_parallel_size TENSOR_PARALLEL_SIZE, --tensor-parallel-size TENSOR_PARALLEL_SIZE @@ -140,39 +161,33 @@ options: --host HOST Host address to run the server on. (default: 0.0.0.0) --port PORT Port to run the server on. (default: 8000) --gpu_memory_utilization GPU_MEMORY_UTILIZATION, --gpu-memory-utilization GPU_MEMORY_UTILIZATION - Ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache on the device - dedicated to generation powered by vLLM. Higher values will increase the KV cache size and thus improve the - model's throughput. 
However, if the value is too high, it may cause out-of-memory (OOM) errors during - initialization. (default: 0.9) - --dtype DTYPE Data type to use for vLLM generation. If set to 'auto', the data type will be automatically determined based on - the model configuration. Find the supported values in the vLLM documentation. (default: auto) + Ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache on the device dedicated to generation + powered by vLLM. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, + it may cause out-of-memory (OOM) errors during initialization. (default: 0.9) + --dtype DTYPE Data type to use for vLLM generation. If set to 'auto', the data type will be automatically determined based on the model configuration. + Find the supported values in the vLLM documentation. (default: auto) --max_model_len MAX_MODEL_LEN, --max-model-len MAX_MODEL_LEN - If set, the `max_model_len` to use for vLLM. This can be useful when running with reduced - `vllm_gpu_memory_utilization`, leading to a reduced KV cache size. If not set, vLLM will use the model context - size, which might be much larger than the KV cache, leading to inefficiencies. (default: None) + If set, the `max_model_len` to use for vLLM. This can be useful when running with reduced `vllm_gpu_memory_utilization`, leading to a + reduced KV cache size. If not set, vLLM will use the model context size, which might be much larger than the KV cache, leading to + inefficiencies. (default: None) --enable_prefix_caching ENABLE_PREFIX_CACHING, --enable-prefix-caching ENABLE_PREFIX_CACHING - Whether to enable prefix caching in vLLM. If set to `True`, ensure that the model and the hardware support this - feature. (default: None) - --enforce_eager ENFORCE_EAGER, --enforce-eager ENFORCE_EAGER - Whether to enforce eager execution. If set to `True`, we will disable CUDA graph and always execute the model - in eager mode. If `False` (default behavior), we will use CUDA graph and eager execution in hybrid. (default: - None) + Whether to enable prefix caching in vLLM. If set to `True`, ensure that the model and the hardware support this feature. (default: None) + --enforce_eager [ENFORCE_EAGER], --enforce-eager [ENFORCE_EAGER] + Whether to enforce eager execution. If set to `True`, we will disable CUDA graph and always execute the model in eager mode. If `False` + (default behavior), we will use CUDA graph and eager execution in hybrid. (default: False) + --kv_cache_dtype KV_CACHE_DTYPE, --kv-cache-dtype KV_CACHE_DTYPE + Data type to use for KV cache. If set to 'auto', the dtype will default to the model data type. (default: auto) + --trust_remote_code [TRUST_REMOTE_CODE], --trust-remote-code [TRUST_REMOTE_CODE] + Whether to trust remote code when loading models. Set to True to allow executing code from model repositories. This is required for some + custom models but introduces security risks. (default: False) --log_level LOG_LEVEL, --log-level LOG_LEVEL - Log level for uvicorn. Possible choices: 'critical', 'error', 'warning', 'info', 'debug', 'trace'. (default: - info) + Log level for uvicorn. Possible choices: 'critical', 'error', 'warning', 'info', 'debug', 'trace'. (default: info) + --vllm_model_impl VLLM_MODEL_IMPL, --vllm-model-impl VLLM_MODEL_IMPL + Model implementation to use for vLLM. Must be one of `transformers` or `vllm`. `transformers`: Use the `transformers` backend for model + implementation. 
`vllm`: Use the `vllm` library for model implementation. (default: vllm) ``` -## 🥳 Okay, now that we have the server running, how can we use it to generate completions? - -Run the training script and pass `use_vllm=True` in the training arguments: - -```python -from trl import GRPOConfig - -training_args = GRPOConfig(..., use_vllm=True) -``` - -## 💆🏻‍♀️ What's the best distributed setup? +### 💆🏻‍♀️ What's the best distributed setup? ![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/tp_dp_throughput_8_gpus.png) ![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/tp_dp_throughput_4_gpus.png) @@ -192,11 +207,66 @@ Given these factors, our experiments on the Qwen model family (3B, 7B, 14B, 32B) * For reasonable-sized models (3B–14B) and a moderate context window (`max_len < 8k`), using full capacity for data parallelism gives better throughput. The setup `(tp=1, dp=8)` yields the best results. * For larger models (32B) and longer context windows (`max_len > 8k`), a smaller DP size combined with some model-side parallelism performs better. For example, `(tp=2, dp=4)` is a good setup for 32B models with a larger context window. -## vLLM with Transformers Backend +### vLLM with Transformers Backend + +vLLM can use the **Transformers backend** for model implementations, which works for both LLMs and VLMs. +To enable this, set `vllm_model_impl="transformers"` in your configuration or pass it via the command-line argument. -vLLM now supports transformers backend for model implementations. Simply passing in `transformers` in `vllm_model_impl` in configurations or through argument parser will set use transformers backend. This works for both LLMs and VLMs. See an example below, you can get more information [here](https://blog.vllm.ai/2025/04/11/transformers-backend.html). +For more details, check out [vLLM Transformers Backend](https://blog.vllm.ai/2025/04/11/transformers-backend.html). + +Example: ``` CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen 2.5-VL-3B-Instruct --tensor-parallel-size 1 --port 8000 --enforce_eager --vllm_model_impl transformers ``` + +### Modes of Using vLLM During Training + +TRL supports **two modes** for integrating vLLM during training: **server mode** and **colocate mode**. + +#### Server Mode + +In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP. +This setup is ideal if you have GPUs dedicated to inference. + +Example configuration: + +```python +from trl import GRPOConfig + +training_args = GRPOConfig( + ..., + use_vllm=True, + vllm_mode="server", # default value, can be omitted +) +``` + +#### Colocate Mode + +In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model. +This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. + + +```python +from trl import GRPOConfig + +training_args = GRPOConfig( + ..., + use_vllm=True, + vllm_mode="colocate", +) +``` + + + +Check the documentation of the trainer you are using for specific details on vLLM usage and parameters. + + + + + +To reduce GPU memory usage when running vLLM, consider [enabling vLLM sleep mode](https://huggingface.co/docs/trl/main/en/reducing_memory_usage#vllm-sleep-mode). 
+ + + From 5ce2bd48d62dbddf76191a647960a2cfdf7e14ad Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Mon, 29 Sep 2025 11:42:20 +0200 Subject: [PATCH 2/6] Updated Tips syntax --- docs/source/vllm_integration.md | 50 ++++++++++++--------------------- 1 file changed, 18 insertions(+), 32 deletions(-) diff --git a/docs/source/vllm_integration.md b/docs/source/vllm_integration.md index 97857244400..ff54bcc11f5 100644 --- a/docs/source/vllm_integration.md +++ b/docs/source/vllm_integration.md @@ -2,34 +2,25 @@ This document will guide you through the process of using vLLM with TRL for faster generation in online methods like GRPO and Online DPO. We first summarize a tl;dr on how to use vLLM with TRL, and then we will go into the details of how it works under the hood. Let's go! 🔥 - - -TRL currently only supports vLLM versions `0.10.0`, `0.10.1`, and `0.10.2`. Please ensure you have one of these versions installed to avoid compatibility issues. - - - - - -The following trainers currently support generation with vLLM: - -- [`GRPOTrainer`] -- [`OnlineDPO`] -- [`NashMD`] -- [`XPOTrainer`] -- [`RLOOTrainer`] - - +> [!WARNING] +> TRL currently only supports vLLM versions `0.10.0`, `0.10.1`, and `0.10.2`. Please ensure you have one of these versions installed to avoid compatibility issues. + +> [!TIP] +> The following trainers currently support generation with vLLM: +> +> - [`GRPOTrainer`] +> - [`OnlineDPO`] +> - [`NashMD`] +> - [`XPOTrainer`] +> - [`RLOOTrainer`] ## 🚀 How can I use vLLM with TRL to speed up training? 💡 **Note**: Resources required for this specific example: a single node with 8 GPUs. - - -When using vLLM with TRL, the **vLLM server** and the **trainer** must run on **separate CUDA devices** to prevent conflicts. -For guidance on configuring this properly, see [Modes of using vLLM during training](#modes-of-using-vllm-during-training). - - +> [!WARNING] +> When using vLLM with TRL, the **vLLM server** and the **trainer** must run on **separate CUDA devices** to prevent conflicts. +> For guidance on configuring this properly, see [Modes of using vLLM during training](#modes-of-using-vllm-during-training). First, install vLLM using the following command: @@ -258,15 +249,10 @@ training_args = GRPOConfig( ) ``` - - -Check the documentation of the trainer you are using for specific details on vLLM usage and parameters. - - - - +> [!WARNING] +> Check the documentation of the trainer you are using for specific details on vLLM usage and parameters. -To reduce GPU memory usage when running vLLM, consider [enabling vLLM sleep mode](https://huggingface.co/docs/trl/main/en/reducing_memory_usage#vllm-sleep-mode). - +> [!WARNING] +> To reduce GPU memory usage when running vLLM, consider [enabling vLLM sleep mode](https://huggingface.co/docs/trl/main/en/reducing_memory_usage#vllm-sleep-mode). From ac26853bba60f8f955a93a12d12604688e9b9803 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Mon, 29 Sep 2025 11:53:29 +0200 Subject: [PATCH 3/6] Nits --- docs/source/vllm_integration.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/vllm_integration.md b/docs/source/vllm_integration.md index ff54bcc11f5..e99b89deb91 100644 --- a/docs/source/vllm_integration.md +++ b/docs/source/vllm_integration.md @@ -1,6 +1,6 @@ # vLLM Integration -This document will guide you through the process of using vLLM with TRL for faster generation in online methods like GRPO and Online DPO. 
We first summarize a tl;dr on how to use vLLM with TRL, and then we will go into the details of how it works under the hood. Let's go! 🔥 +This document will guide you through the process of using vLLM with TRL for faster generation in online methods like GRPO and Online DPO. We first summarize a tl;dr on how to use vLLM with TRL, and then we will go into the details of how it works under the hood. > [!WARNING] > TRL currently only supports vLLM versions `0.10.0`, `0.10.1`, and `0.10.2`. Please ensure you have one of these versions installed to avoid compatibility issues. @@ -10,7 +10,7 @@ This document will guide you through the process of using vLLM with TRL for fast > > - [`GRPOTrainer`] > - [`OnlineDPO`] -> - [`NashMD`] +> - [`NashMDTrainer`] > - [`XPOTrainer`] > - [`RLOOTrainer`] @@ -254,5 +254,5 @@ training_args = GRPOConfig( > [!WARNING] -> To reduce GPU memory usage when running vLLM, consider [enabling vLLM sleep mode](https://huggingface.co/docs/trl/main/en/reducing_memory_usage#vllm-sleep-mode). +> To reduce GPU memory usage when running vLLM, consider [enabling vLLM sleep mode](reducing_memory_usage#vllm-sleep-mode). From 08e06913ad96d78f4ca40f6d9e2c6f546ad62e94 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Mon, 29 Sep 2025 17:43:58 +0200 Subject: [PATCH 4/6] Update docs/source/vllm_integration.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> --- docs/source/vllm_integration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/vllm_integration.md b/docs/source/vllm_integration.md index e99b89deb91..dead2168a84 100644 --- a/docs/source/vllm_integration.md +++ b/docs/source/vllm_integration.md @@ -9,7 +9,7 @@ This document will guide you through the process of using vLLM with TRL for fast > The following trainers currently support generation with vLLM: > > - [`GRPOTrainer`] -> - [`OnlineDPO`] +> - [`OnlineDPOTrainer`] > - [`NashMDTrainer`] > - [`XPOTrainer`] > - [`RLOOTrainer`] From 522ee752476278431f05b574433e8c5ced6d3872 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Mon, 29 Sep 2025 17:55:50 +0200 Subject: [PATCH 5/6] Update docs/source/vllm_integration.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> --- docs/source/vllm_integration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/vllm_integration.md b/docs/source/vllm_integration.md index dead2168a84..b670b79ea72 100644 --- a/docs/source/vllm_integration.md +++ b/docs/source/vllm_integration.md @@ -134,7 +134,7 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py You can customize the server configuration by passing additional arguments. 
-```
+```txt
 $ trl vllm-serve --help
 usage: trl vllm-serve [-h] --model MODEL [--revision REVISION] [--tensor_parallel_size TENSOR_PARALLEL_SIZE] [--data_parallel_size DATA_PARALLEL_SIZE] [--host HOST]
                       [--port PORT] [--gpu_memory_utilization GPU_MEMORY_UTILIZATION] [--dtype DTYPE] [--max_model_len MAX_MODEL_LEN]

From 3ec07421e8816e4ea5d765d1f34ab6b094987c7c Mon Sep 17 00:00:00 2001
From: sergiopaniego
Date: Tue, 30 Sep 2025 13:24:41 +0200
Subject: [PATCH 6/6] Added different trainers

---
 docs/source/vllm_integration.md | 245 +++++++++++++++++++++++++++++++-
 1 file changed, 244 insertions(+), 1 deletion(-)

diff --git a/docs/source/vllm_integration.md b/docs/source/vllm_integration.md
index 18853c46357..27c767f0076 100644
--- a/docs/source/vllm_integration.md
+++ b/docs/source/vllm_integration.md
@@ -35,12 +35,15 @@ Then run the server on specific GPUs (e.g., GPUs 0-3):
 CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2
 ```

-Once the server is running, you can use it to generate completions for training. In the example below, we are using the `GRPOTrainer` to train a model using the vLLM server for generation. The `--tensor-parallel-size` and `--data-parallel-size` arguments control how the model and data are sharded across GPUs.
+Once the server is running, you can use it to generate completions for training. In the example below, we show each of the supported trainers using the vLLM server for generation. The `--tensor-parallel-size` and `--data-parallel-size` arguments control how the model and data are sharded across GPUs.

 In this example, we are sharding two copies of the model across 4 GPUs. Increasing data parallelism increases throughput, while increasing tensor parallelism allows for serving larger models.

Then, run the training script on different GPUs (e.g., GPUs 4-7) by passing `use_vllm=True` in the training arguments as follows: Sample of a simple `train.py` script: + + + ```python from datasets import load_dataset from trl import GRPOTrainer, GRPOConfig @@ -68,6 +71,129 @@ trainer = GRPOTrainer( trainer.train() ``` + + + +```python +from datasets import load_dataset +from trl import OnlineDPOTrainer, OnlineDPOConfig + +dataset = load_dataset("trl-lib/tldr", split="train") + +# Dummy reward function: count the number of unique characters in the completions +def reward_num_unique_chars(completions, **kwargs): + return [len(set(c)) for c in completions] + +training_args = OnlineDPOConfig( + output_dir="my_test", + use_vllm=True, + bf16=True, + gradient_checkpointing=True, +) + +trainer = OnlineDPOTrainer( + model="Qwen/Qwen2.5-7B", + args=training_args, + reward_funcs=reward_num_unique_chars, + train_dataset=dataset, +) + +trainer.train() +``` + + + + +```python +from datasets import load_dataset +from trl import NashMDTrainer, NashMDConfig + +dataset = load_dataset("trl-lib/tldr", split="train") + +# Dummy reward function: count the number of unique characters in the completions +def reward_num_unique_chars(completions, **kwargs): + return [len(set(c)) for c in completions] + +training_args = NashMDConfig( + output_dir="my_test", + use_vllm=True, + bf16=True, + gradient_checkpointing=True, +) + +trainer = NashMDTrainer( + model="Qwen/Qwen2.5-7B", + args=training_args, + reward_funcs=reward_num_unique_chars, + train_dataset=dataset, +) + +trainer.train() +``` + + + + +```python +from datasets import load_dataset +from trl import XPOTrainer, XPOConfig + +dataset = load_dataset("trl-lib/tldr", split="train") + +# Dummy reward function: count the number of unique characters in the completions +def reward_num_unique_chars(completions, **kwargs): + return [len(set(c)) for c in completions] + +training_args = XPOConfig( + output_dir="my_test", + use_vllm=True, + bf16=True, + gradient_checkpointing=True, +) + +trainer = XPOTrainer( + model="Qwen/Qwen2.5-7B", + args=training_args, + reward_funcs=reward_num_unique_chars, + train_dataset=dataset, +) + +trainer.train() +``` + + + + +```python +from datasets import load_dataset +from trl import RLOOTrainer, RLOOConfig + +dataset = load_dataset("trl-lib/tldr", split="train") + +# Dummy reward function: count the number of unique characters in the completions +def reward_num_unique_chars(completions, **kwargs): + return [len(set(c)) for c in completions] + +training_args = RLOOConfig( + output_dir="my_test", + use_vllm=True, + bf16=True, + gradient_checkpointing=True, +) + +trainer = RLOOTrainer( + model="Qwen/Qwen2.5-7B", + args=training_args, + reward_funcs=reward_num_unique_chars, + train_dataset=dataset, +) + +trainer.train() +``` + + + + And the train command on separate GPUs from the server: ```sh @@ -224,6 +350,9 @@ This setup is ideal if you have GPUs dedicated to inference. 
Example configuration: + + + ```python from trl import GRPOConfig @@ -234,11 +363,70 @@ training_args = GRPOConfig( ) ``` + + + +```python +from trl import OnlineDPOConfig + +training_args = OnlineDPOConfig( + ..., + use_vllm=True, + vllm_mode="server", # default value, can be omitted +) +``` + + + + +```python +from trl import NashMDConfig + +training_args = NashMDConfig( + ..., + use_vllm=True, + vllm_mode="server", # default value, can be omitted +) +``` + + + + +```python +from trl import XPOConfig + +training_args = XPOConfig( + ..., + use_vllm=True, + vllm_mode="server", # default value, can be omitted +) +``` + + + + +```python +from trl import RLOOConfig + +training_args = RLOOConfig( + ..., + use_vllm=True, + vllm_mode="server", # default value, can be omitted +) +``` + + + + #### Colocate Mode In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. +Example configuration: + + + ```python from trl import GRPOConfig @@ -250,6 +438,61 @@ training_args = GRPOConfig( ) ``` + + + +```python +from trl import OnlineDPOConfig + +training_args = OnlineDPOConfig( + ..., + use_vllm=True, + vllm_mode="colocate", +) +``` + + + + +```python +from trl import NashMDConfig + +training_args = NashMDConfig( + ..., + use_vllm=True, + vllm_mode="colocate", +) +``` + + + + +```python +from trl import XPOConfig + +training_args = XPOConfig( + ..., + use_vllm=True, + vllm_mode="colocate", +) +``` + + + + +```python +from trl import RLOOConfig + +training_args = RLOOConfig( + ..., + use_vllm=True, + vllm_mode="colocate", +) +``` + + + + > [!WARNING] > Check the documentation of the trainer you are using for specific details on vLLM usage and parameters.