
[RFC]: Per-request metrics for the offline API. #26298

@maxdebayser

Description

Motivation.

In V0, when calling LLM.generate(), the metrics field of the RequestOutput object was set to a RequestMetrics object:

from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestMetrics:
    """Metrics associated with a request.

    Attributes:
        arrival_time: The time when the request arrived.
        first_scheduled_time: The time when the request was first scheduled.
        first_token_time: The time when the first token was generated.
        time_in_queue: The time the request spent in the queue.
        finished_time: The time when the request was finished.
        scheduler_time: The time spent in the scheduler when this request was
                        being considered by the scheduler.
        model_forward_time: The time spent in the model forward pass when this
                            request was in the batch.
        model_execute_time: The time spent in the model execute function. This
                            will include model forward, block/sync across
                            workers, cpu-gpu sync time and sampling time.
    """

    arrival_time: float
    last_token_time: float
    first_scheduled_time: Optional[float]
    first_token_time: Optional[float]
    time_in_queue: Optional[float]
    finished_time: Optional[float] = None
    scheduler_time: Optional[float] = None
    model_forward_time: Optional[float] = None
    model_execute_time: Optional[float] = None

In V1 this was removed and the field was returned as None instead. Since the discussion on the rationale for this removal wasn't easy to find, PR #24947 added stats back, but now as a RequestStateStats object:

from dataclasses import dataclass


@dataclass
class RequestStateStats:
    """Stats that need to be tracked across delta updates."""

    num_generation_tokens: int = 0

    # This is an engine frontend timestamp (wall-clock)
    arrival_time: float = 0.0

    # These are engine core timestamps (monotonic)
    queued_ts: float = 0.0
    scheduled_ts: float = 0.0
    first_token_ts: float = 0.0
    last_token_ts: float = 0.0

    # first token latency
    first_token_latency: float = 0.0
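As a sketch of how the derived latencies fall out of these timestamps (the `derived_intervals` helper below is hypothetical, not part of vLLM; only timestamps from the same clock are subtracted, since `arrival_time` is wall-clock while the engine-core fields are monotonic):

```python
import time
from dataclasses import dataclass


@dataclass
class RequestStateStats:
    """Illustrative subset of the stats above."""
    num_generation_tokens: int = 0
    arrival_time: float = 0.0     # engine frontend, wall-clock (time.time())
    queued_ts: float = 0.0        # engine core, monotonic (time.monotonic())
    scheduled_ts: float = 0.0
    first_token_ts: float = 0.0
    last_token_ts: float = 0.0


def derived_intervals(s: RequestStateStats) -> dict[str, float]:
    # All engine-core timestamps share the monotonic clock, so these
    # differences are valid; never subtract arrival_time (wall-clock)
    # from a monotonic timestamp.
    return {
        "time_in_queue": s.scheduled_ts - s.queued_ts,
        "time_to_first_token": s.first_token_ts - s.queued_ts,
        "decode_time": s.last_token_ts - s.first_token_ts,
    }


stats = RequestStateStats(
    num_generation_tokens=8,
    arrival_time=time.time(),
    queued_ts=100.0, scheduled_ts=100.5,
    first_token_ts=101.0, last_token_ts=103.0,
)
print(derived_intervals(stats))
# {'time_in_queue': 0.5, 'time_to_first_token': 1.0, 'decode_time': 2.0}
```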

That PR prompted a discussion on whether this is a useful feature to support and, if so, what the best way to present these metrics is. The purpose of this RFC is to structure that discussion.

Proposed Change.

The expected results of this RFC are:

  1. Input from the community about the use cases for this feature
  2. A decision on whether to support this feature
  3. A decision on how to present the metrics, taking the nature of the different kinds of timestamps into account (see https://docs.vllm.ai/en/latest/design/metrics.html#interval-calculations)
  4. A decision on whether, if the offline API supports this feature, the online API should also support it
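On point 3, the interval-calculations doc cited above turns on the fact that wall-clock and monotonic clocks have unrelated epochs, so a timestamp from one cannot meaningfully be subtracted from the other. A hypothetical sketch of one way to present engine-core monotonic timestamps to users as wall-clock times, by anchoring both clocks at the same instant:

```python
import time

# The two clocks have unrelated epochs, so a cross-clock difference is
# meaningless. To report a monotonic engine-core timestamp as wall-clock
# time, sample both clocks at (nearly) the same instant and shift by the
# offset between the anchors.
wall_anchor = time.time()
mono_anchor = time.monotonic()


def monotonic_to_wall(mono_ts: float) -> float:
    """Map a monotonic timestamp onto the wall clock via the shared anchor."""
    return wall_anchor + (mono_ts - mono_anchor)


# A monotonic timestamp taken now should map to (roughly) the current
# wall-clock time.
first_token_wall = monotonic_to_wall(time.monotonic())
assert abs(first_token_wall - time.time()) < 1.0
```

The trade-off is that the anchoring itself is only as accurate as the gap between the two anchor samples, which is one reason the metrics doc keeps the clock domains separate internally.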

Feedback Period.

2 weeks

CC List.

@markmc @robertgshaw2-redhat @huijjj @DarkLight1337 @njhill @frank-wei

Any Other Things.

Related PRs and issues:

