# Reducing Memory Usage

Training workflows can often be optimized to **reduce memory consumption**, and TRL provides several built-in features to help achieve this.

Below, we outline these techniques and recommend experimenting with different combinations to determine which configuration works best for your specific setup.

Each method includes examples for the supported trainers. If you're unsure whether a technique is compatible with your trainer, please refer to the corresponding trainer documentation.

For additional strategies, such as **gradient checkpointing**, which is supported across all trainers, see the [`transformers` performance guide](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-checkpointing).
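
For instance, gradient checkpointing is exposed through the standard `transformers` training arguments, which every TRL config inherits. A minimal sketch in the style used throughout this page (the `...` stands for your other arguments):

```python
from trl import SFTConfig

# Recompute activations during the backward pass instead of storing them all,
# trading extra compute for lower peak memory usage.
training_args = SFTConfig(..., gradient_checkpointing=True)
```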

## Truncation
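
In SFT, for example, truncation is controlled by the `max_length` argument of the config. A minimal sketch (the value `512` is only illustrative):

```python
from trl import SFTConfig

# Sequences longer than 512 tokens are truncated.
training_args = SFTConfig(..., max_length=512)
```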


### How to choose the `max_length` value?

If `max_length` is too small, a significant portion of your tokens will be discarded and won't contribute to training. If it's too large, memory usage can spike, potentially leading to out-of-memory (OOM) errors. Without packing or padding-free, a large `max_length` may also result in inefficient training, as many tokens will be padding.

To help you choose an appropriate value, we provide a utility to visualize the sequence length distribution in your dataset.
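
If you prefer to inspect the distribution yourself, here is a quick standalone sketch (not a TRL API; the model and dataset names below are placeholders for your own setup):

```python
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")   # placeholder model
dataset = load_dataset("trl-lib/Capybara", split="train")        # placeholder dataset

# Tokenize each conversation and record its length.
lengths = [
    len(tokenizer.apply_chat_template(example["messages"], tokenize=True))
    for example in dataset
]

# Percentiles give a good starting point for max_length.
print(f"p50={np.percentile(lengths, 50):.0f}  p95={np.percentile(lengths, 95):.0f}  max={max(lengths)}")
```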

## Packing

> [!TIP]
> This technique is available only for **SFT** training and setups that use **FlashAttention** (or its variants).

[Truncation](#truncation) has several drawbacks, the most obvious being that tokens beyond the limit are discarded outright and their training signal is lost.

Packing mitigates these issues by grouping multiple examples into full sequences of length `max_length`, eliminating most padding while preserving the content of each example.
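
It is enabled with a single flag in the SFT config:

```python
from trl import SFTConfig

# Pack multiple examples together into sequences of 512 tokens.
training_args = SFTConfig(..., packing=True, max_length=512)
```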

## Liger for reducing peak memory usage

> [Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%.

For more information, see [Liger Kernel Integration](liger_kernel_integration).

To use Liger for reducing peak memory usage, use the following code snippet:

<hfoptions id="liger">
<hfoption id="DPO">


```python
from trl import DPOConfig

training_args = DPOConfig(..., use_liger_loss=True)
```

</hfoption>
<hfoption id="GRPO">


```python
from trl import GRPOConfig

training_args = GRPOConfig(..., use_liger_loss=True)
```

</hfoption>
<hfoption id="KTO">


```python
from trl import KTOConfig

training_args = KTOConfig(..., use_liger_loss=True)
```

</hfoption>
</hfoptions>

## Activation offloading

Activation offloading reduces peak GPU memory by moving intermediate activations to CPU memory during the forward pass and loading them back when they are needed for the backward pass. In SFT it is enabled via the `activation_offloading` flag:

```python
from trl import SFTConfig

training_args = SFTConfig(..., activation_offloading=True)
```

> [!WARNING]
> When using activation offloading with models that use Liger kernels, you must disable Liger cross entropy: it performs in-place operations that conflict with activation offloading. Keep the default setting (`use_liger_kernel=False`):
>
> ```python
> # When using activation offloading with a model that uses Liger kernels:
> from trl import SFTConfig
>
> training_args = SFTConfig(
>     activation_offloading=True,
>     use_liger_kernel=False,  # Disable Liger cross entropy
>     # Other parameters...
> )
> ```

Under the hood, activation offloading implements PyTorch's [`saved_tensors_hooks`](https://pytorch.org/tutorials/intermediate/autograd_saved_tensors_hooks_tutorial.html#hooks-for-autograd-saved-tensors) to intercept activations during the forward pass. It intelligently manages which tensors to offload based on size and context, avoiding offloading output tensors which would be inefficient. For performance optimization, it can optionally use CUDA streams to overlap computation with CPU-GPU transfers.
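
A minimal, standalone sketch of the same idea using only public PyTorch APIs (this is not TRL's actual implementation, which adds size/context heuristics and optional CUDA streams; `torch.autograd.graph.save_on_cpu` provides a ready-made version of these hooks):

```python
import torch

def pack_to_cpu(tensor):
    # Called when autograd saves an activation for backward: move it to CPU RAM.
    return tensor.to("cpu", non_blocking=True)

def unpack_to_gpu(tensor):
    # Called when backward needs the activation again: move it back to the GPU.
    return tensor.to("cuda", non_blocking=True)

# Requires a CUDA device.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    loss = model(x).sum()  # activations saved during the forward pass are offloaded to CPU
loss.backward()            # ...and reloaded to the GPU as needed
```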

## Padding Sequences to a Multiple

> [!TIP]
> This technique is supported by the **SFT** and **Reward** trainers.

When enabled, this option ensures that all sequences are **padded to a multiple** of the specified value.
This can improve computational efficiency on some hardware by aligning sequence lengths to memory-friendly boundaries.

<hfoptions id="pad_to_multiple_of">
<hfoption id="SFT">

```python
from trl import SFTConfig

training_args = SFTConfig(..., pad_to_multiple_of=2048)
```

</hfoption>
<hfoption id="Reward">

```python
from trl import RewardConfig

training_args = RewardConfig(..., pad_to_multiple_of=2048)
```

</hfoption>
</hfoptions>

## Disabling model gathering for generation in online methods

When using DeepSpeed ZeRO-3, model weights are sharded across multiple GPUs. Online methods involve generating completions from the model as part of the training process. During this step, the model weights are temporarily gathered on a single GPU for generation. For very large models, this gathering can lead to OOM errors, as described in this issue: [#2250](https://github.com/huggingface/trl/issues/2250#issue-2598304204).

If you encounter this issue, you can disable the gathering of model weights for generation by setting the following parameter:
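
For example, with GRPO this is the `ds3_gather_for_generation` flag of the config; the other online trainer configs expose the same flag (sketch):

```python
from trl import GRPOConfig

# Keep the ZeRO-3 shards in place during generation instead of gathering the
# full model on each GPU.
training_args = GRPOConfig(..., ds3_gather_for_generation=False)
```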

This adjustment prevents model weights from being gathered, avoiding OOM errors, though it can slow down generation.

## vLLM sleep mode

When using **vLLM** as the generation backend for online training methods, you can enable _sleep mode_ to offload vLLM parameters and cache to CPU RAM during the optimization step and reload them back to GPU VRAM when needed for weight synchronization and generation.

<hfoptions id="vllm_sleep">
<hfoption id="GRPO">
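
A sketch for GRPO, assuming the config exposes a `vllm_enable_sleep_mode` flag (check the trainer documentation of your TRL version for the exact parameter name):

```python
from trl import GRPOConfig

# `vllm_enable_sleep_mode` is an assumed flag name: offload vLLM weights and KV
# cache to CPU RAM between generation steps.
training_args = GRPOConfig(..., use_vllm=True, vllm_enable_sleep_mode=True)
```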

</hfoption>
</hfoptions>