🚀 The feature, motivation and pitch
foreword and motivation
This is a foreword on mutable states and the forward pass.
Historically models carried little mutable state, but people are now writing models with more types of state that must be managed across consecutive forward passes. Model implementers are generally left to write their own cache implementations for whatever states the forward pass requires. Many types of state have a fixed size per layer, so a static allocation is sufficient.
However, some types of state have a size which is data-dependent. For such states, the size typically depends on the length of the input sequence of time steps. There may be other dependencies, but this feature request focuses on sequence length, as this dependency is common. Using a transformer as an example, the key-value cache is dynamic in sequence length, storing one key and one value state per layer per time step. Pre-allocating a static buffer for this state yields two problems:
(1) the static buffer state must be copied into the layer's key and value states, instead of those key and value states being computed over directly
(2) the static allocation for this cache can be very large, even for a very small input sequence.
Both of these impact the system resources of models running on edge devices, where resources are constrained, as well as the latency of the forward pass.
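To make the two problems concrete, here is a minimal schematic sketch in plain Python (no framework; `MAX_SEQ_LEN`, `HEAD_DIM`, and the helper names are hypothetical, chosen only for illustration) of a statically pre-allocated per-layer key-value cache:

```python
# Schematic sketch of a statically pre-allocated per-layer KV cache.
# All names and sizes here are hypothetical.

MAX_SEQ_LEN = 4096   # compile-time bound on sequence length
HEAD_DIM = 4         # tiny head dimension for illustration

def make_static_cache():
    # Problem (2): the allocation is sized for MAX_SEQ_LEN up front,
    # even if the actual input is only a handful of tokens.
    keys = [[0.0] * HEAD_DIM for _ in range(MAX_SEQ_LEN)]
    values = [[0.0] * HEAD_DIM for _ in range(MAX_SEQ_LEN)]
    return keys, values

def append_step(cache, t, k, v):
    keys, values = cache
    # Problem (1): each step's key/value states are copied through the
    # static buffer rather than being computed over directly.
    keys[t][:] = k
    values[t][:] = v

cache = make_static_cache()
append_step(cache, 0, [1.0] * HEAD_DIM, [2.0] * HEAD_DIM)
```

Even this toy version shows the shape of the cost: the buffer size is fixed by the compile-time bound, and every generated step pays a copy into it.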
feature
Design a dynamic allocation model for caching model states.
An example design:
- Lift cache allocation out of the model, and allocate cache dynamically based on the input sequence length. If generation would overrun the cache, resize the allocation and copy states. This resize is expensive, so start with a cache size which is some function of the input sequence length, and use a reasonable algorithm for resizes, or allow a policy governing resize mechanics to be provided.
- Provide a compile-friendly cache interface, implementation, or markers for dynamo, along with documentation on how the dynamic cache feature can be used.
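As a sketch of what the first bullet could look like, here is a minimal dynamic cache in plain Python (the class name, growth factor, and headroom policy are all hypothetical, not a proposed API): initial capacity is a function of the input sequence length, and an overrun triggers a geometric resize so the amortized copy cost per appended step stays constant.

```python
class DynamicKVCache:
    """Hypothetical sketch: capacity starts as a multiple of the prompt
    length and grows geometrically when generation would overrun it."""

    def __init__(self, prompt_len, head_dim, growth=2.0, headroom=1.25):
        # Initial capacity is a function of the input sequence length,
        # not of a compile-time maximum.
        self.capacity = max(1, int(prompt_len * headroom))
        self.growth = growth
        self.head_dim = head_dim
        self.len = 0
        self.keys = [[0.0] * head_dim for _ in range(self.capacity)]
        self.values = [[0.0] * head_dim for _ in range(self.capacity)]

    def _resize(self, needed):
        # Expensive path: reallocate and copy existing states. Geometric
        # growth keeps the number of resizes logarithmic in sequence length.
        new_cap = self.capacity
        while new_cap < needed:
            new_cap = int(new_cap * self.growth) + 1
        extra = new_cap - self.capacity
        self.keys.extend([0.0] * self.head_dim for _ in range(extra))
        self.values.extend([0.0] * self.head_dim for _ in range(extra))
        self.capacity = new_cap

    def append(self, k, v):
        if self.len + 1 > self.capacity:
            self._resize(self.len + 1)
        self.keys[self.len][:] = k
        self.values[self.len][:] = v
        self.len += 1

cache = DynamicKVCache(prompt_len=4, head_dim=2)
for t in range(16):  # generate well past the initial capacity
    cache.append([float(t)] * 2, [float(t)] * 2)
```

A pluggable resize policy, as suggested above, could replace the geometric rule here with linear growth or a device-specific capacity cap.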
pitch
Users of executorch will likely see shorter initialization times in their applications and a smaller resident memory footprint for the model. If an optimization can be captured around computing directly over values loaded from the cache, forward latency should also be reduced. The sequence length of a compiled model can then be bounded by the resources available on the machine.
alternatives
I've considered using a static cache. A static cache is more or less sufficient for the majority of use cases, but as devices become more portable, their system resources also tend to decrease. In these constrained environments, a static solution requires a different implementation per compile target, which is onerous.