Pass down metrics regarding RequestMetrics #26210
Conversation
Summary: Support passing some inference metrics through request_output, so that these metrics can be passed down through the LLM engine's step(). This is useful when users need to leverage these metrics.
Test Plan: NA
Reviewed By: 22quinn
Differential Revision: D83576815
Code Review
This pull request adds support for passing more detailed inference metrics via RequestOutput. This is achieved by adding new fields to the RequestMetrics dataclass and populating it in the OutputProcessor.
My review has identified a couple of issues:
- The new logic for populating RequestMetrics misses some available data from req_state.stats and does not handle unset timestamps correctly. I've provided a suggestion to fix this.
- Crucially, this new functionality is not covered by any tests. Adding tests is essential to ensure the feature works as expected and to prevent future breakages.
Please address these points to improve the quality and reliability of the changes.
```python
request_output.metrics = RequestMetrics(
    arrival_time=req_state.stats.arrival_time,
    last_token_time=req_state.stats.last_token_ts,
    first_token_time=req_state.stats.first_token_ts,
    num_generation_tokens=req_state.stats.num_generation_tokens,
    first_token_latency=req_state.stats.first_token_latency,
)
```
This new logic for populating request_output.metrics is not covered by any tests, as indicated by 'Test Plan: NA'. New features, especially those involving data structures and logic like this, must be accompanied by tests to ensure correctness and prevent future regressions.
Please add unit tests to verify that the RequestMetrics object is populated correctly. A good place for this would be in tests/v1/engine/test_output_processor.py, for example by extending test_iteration_stats to check the contents of request_output.metrics.
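A unit test along these lines could look like the following sketch. The `FakeStats` and simplified `RequestMetrics` classes here are stand-ins I'm introducing for illustration, not vLLM's real types; an actual test would build on the existing fixtures in `tests/v1/engine/test_output_processor.py`.

```python
# Sketch of a unit test for metrics population, using simplified
# stand-ins for vLLM's RequestStateStats and RequestMetrics types.
from dataclasses import dataclass


@dataclass
class FakeStats:
    arrival_time: float = 100.0
    last_token_ts: float = 100.5
    first_token_ts: float = 100.1
    num_generation_tokens: int = 7
    first_token_latency: float = 0.1


@dataclass
class RequestMetrics:
    arrival_time: float
    last_token_time: float
    first_token_time: float
    num_generation_tokens: int
    first_token_latency: float


def build_metrics(stats: FakeStats) -> RequestMetrics:
    # Mirrors the assignment added in this PR's diff.
    return RequestMetrics(
        arrival_time=stats.arrival_time,
        last_token_time=stats.last_token_ts,
        first_token_time=stats.first_token_ts,
        num_generation_tokens=stats.num_generation_tokens,
        first_token_latency=stats.first_token_latency,
    )


def test_metrics_populated():
    metrics = build_metrics(FakeStats())
    assert metrics.arrival_time == 100.0
    assert metrics.num_generation_tokens == 7
    assert metrics.first_token_latency == 0.1


test_metrics_populated()
```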
```python
request_output.metrics = RequestMetrics(
    arrival_time=req_state.stats.arrival_time,
    last_token_time=req_state.stats.last_token_ts,
    first_token_time=req_state.stats.first_token_ts,
    num_generation_tokens=req_state.stats.num_generation_tokens,
    first_token_latency=req_state.stats.first_token_latency,
)
```
The new RequestMetrics object is missing some available metrics. req_state.stats contains scheduled_ts and queued_ts, which can be used to populate first_scheduled_time and time_in_queue. Also, timestamp fields like first_token_ts can be 0.0 if not set, which should probably be converted to None for consistency.
Consider populating more fields and handling the 0.0 case for timestamps.
```python
request_output.metrics = RequestMetrics(
    arrival_time=req_state.stats.arrival_time,
    last_token_time=req_state.stats.last_token_ts,
    first_scheduled_time=req_state.stats.scheduled_ts
    if req_state.stats.scheduled_ts > 0 else None,
    first_token_time=req_state.stats.first_token_ts
    if req_state.stats.first_token_ts > 0 else None,
    time_in_queue=(req_state.stats.scheduled_ts - req_state.stats.queued_ts)
    if req_state.stats.scheduled_ts > 0 and req_state.stats.queued_ts > 0
    else None,
    num_generation_tokens=req_state.stats.num_generation_tokens,
    first_token_latency=req_state.stats.first_token_latency)
```
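The 0.0-to-None normalization could also be factored into small helpers so the constructor call stays readable. This is an illustrative sketch only; the helper names `ts_or_none` and `queue_time` are not part of the PR.

```python
from typing import Optional


def ts_or_none(ts: float) -> Optional[float]:
    # Timestamp fields default to 0.0 when never recorded; expose None
    # instead so consumers can distinguish "not yet set" from a real time.
    return ts if ts > 0 else None


def queue_time(scheduled_ts: float, queued_ts: float) -> Optional[float]:
    # Queue time is only meaningful when both timestamps were recorded.
    if scheduled_ts > 0 and queued_ts > 0:
        return scheduled_ts - queued_ts
    return None
```

With these helpers, the suggested diff reduces to calls like `first_token_time=ts_or_none(req_state.stats.first_token_ts)`.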
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if request_output := req_state.make_request_output(
        new_token_ids, pooling_output, finish_reason, stop_reason,
        kv_transfer_params):
    request_output.metrics = RequestMetrics(
        arrival_time=req_state.stats.arrival_time,
        last_token_time=req_state.stats.last_token_ts,
        first_token_time=req_state.stats.first_token_ts,
        num_generation_tokens=req_state.stats.num_generation_tokens,
        first_token_latency=req_state.stats.first_token_latency,
```
Guard RequestMetrics creation when stats logging is disabled
The new metrics assignment assumes req_state.stats is always populated, but RequestState.__init__ sets self.stats to None whenever log_stats=False (e.g. the default for vllm.LLM() in entrypoints/llm.py). In that configuration, request_output.metrics = RequestMetrics(arrival_time=req_state.stats.arrival_time, …) will raise an AttributeError on the first token because req_state.stats is None. Previously the code only touched self.stats inside logging code paths guarded by the same flag. Consider skipping this assignment when req_state.stats is None or synthesizing defaults to avoid breaking non-logging runs.
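One way to address this is the guard the comment describes, sketched below with simplified stand-in dataclasses rather than vLLM's real `RequestState` and `RequestOutput` types; `attach_metrics` is a hypothetical helper name used only for illustration.

```python
# Sketch: skip the metrics assignment when per-request stats tracking
# is disabled (req_state.stats is None under log_stats=False).
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stats:  # stand-in for vLLM's per-request stats object
    arrival_time: float = 0.0


@dataclass
class RequestMetrics:  # simplified stand-in
    arrival_time: float


@dataclass
class RequestState:  # stand-in; stats is None when log_stats=False
    stats: Optional[Stats] = None


@dataclass
class RequestOutput:  # stand-in
    metrics: Optional[RequestMetrics] = None


def attach_metrics(req_state: RequestState, out: RequestOutput) -> None:
    # Guard: with log_stats=False there are no stats to copy, so leave
    # out.metrics as None instead of raising an AttributeError.
    if req_state.stats is None:
        return
    out.metrics = RequestMetrics(arrival_time=req_state.stats.arrival_time)
```

A non-logging run then simply leaves `request_output.metrics` as None, which callers already have to handle.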
See also #24947 - at a glance, it does something similar to this PR.