
[RFC]: Per-request metrics for the offline API. #26298

@maxdebayser

Description

Motivation.

In V0, when calling LLM.generate(), the metrics field of the RequestOutput object was set to a RequestMetrics object:

from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestMetrics:
    """Metrics associated with a request.

    Attributes:
        arrival_time: The time when the request arrived.
        first_scheduled_time: The time when the request was first scheduled.
        first_token_time: The time when the first token was generated.
        time_in_queue: The time the request spent in the queue.
        finished_time: The time when the request was finished.
        scheduler_time: The time spent in the scheduler when this request was
                        being considered by the scheduler.
        model_forward_time: The time spent in the model forward pass when this
                            request was in the batch.
        model_execute_time: The time spent in the model execute function. This
                            will include model forward, block/sync across
                            workers, cpu-gpu sync time and sampling time.
    """

    arrival_time: float
    last_token_time: float
    first_scheduled_time: Optional[float]
    first_token_time: Optional[float]
    time_in_queue: Optional[float]
    finished_time: Optional[float] = None
    scheduler_time: Optional[float] = None
    model_forward_time: Optional[float] = None
    model_execute_time: Optional[float] = None

In V1 this was removed and the field was returned as None instead. Since the discussion on the rationale for this removal wasn't easy to find, PR #24947 added stats back, but now as a RequestStateStats object:

from dataclasses import dataclass


@dataclass
class RequestStateStats:
    """Stats that need to be tracked across delta updates."""

    num_generation_tokens: int = 0

    # This is an engine frontend timestamp (wall-clock)
    arrival_time: float = 0.0

    # These are engine core timestamps (monotonic)
    queued_ts: float = 0.0
    scheduled_ts: float = 0.0
    first_token_ts: float = 0.0
    last_token_ts: float = 0.0

    # first token latency
    first_token_latency: float = 0.0
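As a sketch of how the derived latencies fall out of these timestamps (the `derived_intervals` helper below is hypothetical, not part of vLLM; only timestamps from the same clock are subtracted, since `arrival_time` is wall-clock while the engine-core fields are monotonic):

```python
import time
from dataclasses import dataclass


@dataclass
class RequestStateStats:
    """Illustrative subset of the stats above."""
    num_generation_tokens: int = 0
    arrival_time: float = 0.0     # engine frontend, wall-clock (time.time())
    queued_ts: float = 0.0        # engine core, monotonic (time.monotonic())
    scheduled_ts: float = 0.0
    first_token_ts: float = 0.0
    last_token_ts: float = 0.0


def derived_intervals(s: RequestStateStats) -> dict[str, float]:
    # All engine-core timestamps share the monotonic clock, so these
    # differences are valid; never subtract arrival_time (wall-clock)
    # from a monotonic timestamp.
    return {
        "time_in_queue": s.scheduled_ts - s.queued_ts,
        "time_to_first_token": s.first_token_ts - s.queued_ts,
        "decode_time": s.last_token_ts - s.first_token_ts,
    }


stats = RequestStateStats(
    num_generation_tokens=8,
    arrival_time=time.time(),
    queued_ts=100.0, scheduled_ts=100.5,
    first_token_ts=101.0, last_token_ts=103.0,
)
print(derived_intervals(stats))
# {'time_in_queue': 0.5, 'time_to_first_token': 1.0, 'decode_time': 2.0}
```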

That PR prompted a discussion on whether this is a useful feature to support and, if so, what the best way to present these metrics is. The purpose of this RFC is to structure that discussion.

Proposed Change.

The expected results of this RFC are:

  1. Input from the community about the use cases for this feature
  2. A decision on whether to support this feature
  3. A decision on how to present the metrics, taking the nature of the different kinds of timestamps into account (see https://docs.vllm.ai/en/latest/design/metrics.html#interval-calculations)
  4. A decision on whether, if the offline API supports this feature, the online API should also support it
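On point 3, the interval-calculations doc cited above turns on the fact that wall-clock and monotonic clocks have unrelated epochs, so a timestamp from one cannot meaningfully be subtracted from the other. A hypothetical sketch of one way to present engine-core monotonic timestamps to users as wall-clock times, by anchoring both clocks at the same instant:

```python
import time

# The two clocks have unrelated epochs, so a cross-clock difference is
# meaningless. To report a monotonic engine-core timestamp as wall-clock
# time, sample both clocks at (nearly) the same instant and shift by the
# offset between the anchors.
wall_anchor = time.time()
mono_anchor = time.monotonic()


def monotonic_to_wall(mono_ts: float) -> float:
    """Map a monotonic timestamp onto the wall clock via the shared anchor."""
    return wall_anchor + (mono_ts - mono_anchor)


# A monotonic timestamp taken now should map to (roughly) the current
# wall-clock time.
first_token_wall = monotonic_to_wall(time.monotonic())
assert abs(first_token_wall - time.time()) < 1.0
```

The trade-off is that the anchoring itself is only as accurate as the gap between the two anchor samples, which is one reason the metrics doc keeps the clock domains separate internally.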

Feedback Period.

2 weeks

CC List.

@markmc @robertgshaw2-redhat @huijjj @DarkLight1337 @njhill @frank-wei

Any Other Things.

Related PRs and issues:

