
Conversation

raptorsun (Contributor) commented on Sep 3, 2025

do not merge

Summary by CodeRabbit

  • New Features

    • Added token usage tracking for AI responses, enabling provider- and model-aware metrics without changing user-visible behavior.
    • Metrics update occurs for both standard and streaming responses, with safe fallbacks to avoid disrupting replies.
  • Tests

    • Expanded unit tests to cover token counting logic and to mock metrics in endpoint tests for stability.
  • Chores

    • Improved observability by integrating token metrics into response handling while preserving current response formats and performance.

coderabbitai bot (Contributor) commented on Sep 3, 2025

Walkthrough

Adds token-counting instrumentation for LLM turns. Introduces update_llm_token_count_from_turn in metrics.utils. Wires token metrics into query and streaming endpoints. Extends retrieve_response with a provider_id parameter and forwards it. Updates tests to mock metrics and adds a unit test for the new utility.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Query endpoint metrics + signature (src/app/endpoints/query.py) | Import metrics helper; forward provider_id to response retrieval; extend retrieve_response signature with *, provider_id: str = ""; compute model_label and call update_llm_token_count_from_turn after the LLM response; expand pylint disables. |
| Streaming endpoint metrics (src/app/endpoints/streaming_query.py) | Import metrics helper; on turn_complete, derive system_prompt and call update_llm_token_count_from_turn; wrap in try/except with a log on failure; no other flow changes. |
| Metrics utility addition (src/metrics/utils.py) | Add update_llm_token_count_from_turn(turn, model, provider, system_prompt=""); use Tokenizer/ChatFormat to count tokens for output and input (with optional system prompt) and increment llm_token_received_total/llm_token_sent_total. A hedged sketch follows this table. |
| Query endpoint tests updated (tests/unit/app/endpoints/test_query.py) | Add a mock_metrics(mocker) helper patching app.endpoints.query.update_llm_token_count_from_turn; invoke the helper across tests to no-op metrics. |
| Streaming endpoint tests updated (tests/unit/app/endpoints/test_streaming_query.py) | Add mock_metrics(mocker) patching app.endpoints.streaming_query.update_llm_token_count_from_turn; use it in the affected tests. |
| Metrics utility tests (tests/unit/metrics/test_utis.py) | Import the new function; add test_update_llm_token_count_from_turn mocking Tokenizer/ChatFormat; assert sent/received token metric increments with labels. |
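
For orientation, here is a minimal sketch of what the new update_llm_token_count_from_turn helper plausibly looks like, pieced together from the summary above and the review comments below. The llama-models import paths, Tokenizer.get_instance(), the .tokens attribute on the encoded prompt, and the (provider, model) label order on the counters are assumptions, not taken from the diff itself.

from typing import cast

from llama_models.llama3.api.chat_format import ChatFormat
from llama_models.llama3.api.datatypes import RawMessage
from llama_models.llama3.api.tokenizer import Tokenizer

import metrics  # assumed module exposing the Prometheus counters


def update_llm_token_count_from_turn(turn, model, provider, system_prompt=""):
    """Count tokens for a completed turn and update the sent/received counters."""
    tokenizer = Tokenizer.get_instance()
    formatter = ChatFormat(tokenizer)

    # Tokens produced by the model (received by the service).
    output_tokens = formatter.encode_dialog_prompt(
        [cast(RawMessage, turn.output_message)]
    ).tokens
    metrics.llm_token_received_total.labels(provider, model).inc(len(output_tokens))

    # Tokens sent to the model: system prompt plus the turn's input messages
    # (the review below suggests role="system" for the prompt message).
    input_messages = [RawMessage(role="user", content=system_prompt)] + cast(
        list[RawMessage], turn.input_messages
    )
    input_tokens = formatter.encode_dialog_prompt(input_messages).tokens
    metrics.llm_token_sent_total.labels(provider, model).inc(len(input_tokens))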

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as QueryEndpoint
  participant Core as retrieve_response
  participant LLM as Provider
  participant Metrics as metrics.utils

  Client->>API: POST /query
  API->>Core: retrieve_response(..., provider_id)
  Core->>LLM: invoke(model_id, input)
  LLM-->>Core: assistant output
  Note over Core,Metrics: New: token counting for turn
  Core->>Metrics: update_llm_token_count_from_turn(turn, model_label, provider_id, system_prompt)
  Core-->>API: TurnSummary, conversation_id
  API-->>Client: 200 OK
sequenceDiagram
  autonumber
  actor Client
  participant API as StreamingEndpoint
  participant Stream as StreamLoop
  participant Metrics as metrics.utils

  Client->>API: POST /streaming_query (SSE)
  API->>Stream: start stream loop
  loop tokens/events
    Stream-->>Client: events...
  end
  Note over Stream,Metrics: New: on turn_complete
  Stream->>Metrics: update_llm_token_count_from_turn(turn, model_id, provider_id, system_prompt)
  Stream-->>Client: stream end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • manstis
  • tisnik
  • umago

Poem

Little paws count tokens, hop by hop,
Bytes like carrots pile up in a crop.
Query streams, metrics gleam, tick-tock—
Models whisper, logs keep stock.
With every turn, I thump in cheer:
Sent and received, the numbers are clear. 🥕🐇

raptorsun closed this on Sep 3, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
tests/unit/app/endpoints/test_streaming_query.py (1)

357-358: Reduce repetition by auto-mocking metrics for this module.
Consider a module-scoped autouse fixture so future tests don’t forget to patch metrics.

Apply:

+@pytest.fixture(autouse=True)
+def _mock_metrics_autouse(mocker):
+    mocker.patch(
+        "app.endpoints.streaming_query.update_llm_token_count_from_turn",
+        return_value=None,
+    )

Also applies to: 364-365

tests/unit/app/endpoints/test_query.py (1)

454-454: Avoid repetitive calls to mock_metrics across tests.
Use an autouse fixture to patch once per module and prevent omissions.

Apply:

@@
-def mock_metrics(mocker):
-    """Helper function to mock metrics operations for query endpoints."""
-    mocker.patch(
-        "app.endpoints.query.update_llm_token_count_from_turn",
-        return_value=None,
-    )
+@pytest.fixture(autouse=True)
+def _mock_metrics_autouse(mocker):
+    """Auto-mock metrics for all tests in this module."""
+    mocker.patch("app.endpoints.query.update_llm_token_count_from_turn", return_value=None)

Then remove individual mock_metrics(...) calls in tests.

Also applies to: 486-486, 519-519, 559-559, 609-609, 662-662, 716-716, 774-774, 829-829, 885-885, 955-955, 1016-1016, 1351-1351, 1402-1402

src/app/endpoints/streaming_query.py (1)

625-631: Token metrics on turn_complete: OK; minor nit on recomputing system_prompt.
You recompute system_prompt inside the loop; it’s constant per request. Compute once before the async for and close over it.

Example (outside the generator’s loop):

system_prompt = get_system_prompt(query_request, configuration)
async for chunk in turn_response:
    ...

tests/unit/metrics/test_utis.py (1)

1-1: Nit: filename typo.
Consider renaming test_utis.py → test_utils.py for clarity.

src/app/endpoints/query.py (1)

392-400: Make provider_id default safer and explicit.

Avoid empty-string labels in metrics; prefer a sentinel and validate early.

Apply:

-async def retrieve_response(  # pylint: disable=too-many-locals,too-many-branches,too-many-arguments
+async def retrieve_response(  # pylint: disable=too-many-locals,too-many-branches,too-many-arguments
     client: AsyncLlamaStackClient,
     model_id: str,
     query_request: QueryRequest,
     token: str,
     mcp_headers: dict[str, dict[str, str]] | None = None,
     *,
-    provider_id: str = "",
+    provider_id: str = "unknown",
 ) -> tuple[TurnSummary, str]:
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between f8da8e0 and 0c5e610.

📒 Files selected for processing (6)
  • src/app/endpoints/query.py (5 hunks)
  • src/app/endpoints/streaming_query.py (2 hunks)
  • src/metrics/utils.py (2 hunks)
  • tests/unit/app/endpoints/test_query.py (16 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
  • tests/unit/metrics/test_utis.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
src/metrics/utils.py (2)
src/client.py (1)
  • AsyncLlamaStackClientHolder (18-50)
src/configuration.py (1)
  • configuration (60-64)
tests/unit/app/endpoints/test_streaming_query.py (1)
tests/unit/app/endpoints/test_query.py (1)
  • mock_metrics (49-54)
src/app/endpoints/streaming_query.py (3)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/utils/endpoints.py (1)
  • get_system_prompt (70-101)
src/configuration.py (1)
  • configuration (60-64)
tests/unit/app/endpoints/test_query.py (1)
tests/unit/app/endpoints/test_streaming_query.py (1)
  • mock_metrics (61-66)
tests/unit/metrics/test_utis.py (1)
src/metrics/utils.py (2)
  • setup_model_metrics (20-57)
  • update_llm_token_count_from_turn (60-77)
src/app/endpoints/query.py (3)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/app/endpoints/streaming_query.py (1)
  • retrieve_response (685-796)
src/models/requests.py (1)
  • QueryRequest (72-222)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests
🔇 Additional comments (10)
tests/unit/app/endpoints/test_streaming_query.py (1)

61-67: Good isolation of metrics side effects in streaming tests.
Mock cleanly patches update_llm_token_count_from_turn, keeping tests deterministic.

tests/unit/app/endpoints/test_query.py (1)

49-55: Good helper to null metrics during unit tests.
Keeps tests focused on behavior without counting tokens.

src/app/endpoints/streaming_query.py (1)

29-29: Import of token-counting utility is appropriate.
Matches usage in the generator; import location is fine.

src/metrics/utils.py (2)

3-10: New imports align with the added functionality.
Scope and ordering look good.


60-77: I've run the checks and confirmed that turn.output_message and turn.input_messages are already RawMessage instances and that ChatFormat.encode_dialog_prompt accepts them directly, so there is no runtime mismatch. The system prompt should indeed use role="system", not "user". Instead of the previously proposed defensive normalization, simply update the system prompt's role:

-    input_messages = [RawMessage(role="user", content=system_prompt)] + cast(
-        list[RawMessage], turn.input_messages
-    )
+    input_messages = []
+    if system_prompt:
+        input_messages.append(RawMessage(role="system", content=system_prompt))
+    input_messages += cast(list[RawMessage], turn.input_messages)

No provider-specific mapping needed here.

tests/unit/metrics/test_utis.py (2)

3-3: Import coverage looks good.
You’re testing both setup and the new token-count function.


79-124: Solid unit test for token-count updates with clear expectations.
Mocks are sufficient and assertions target the right metric labels.

src/app/endpoints/query.py (3)

27-27: Import for token metrics is correct.

Import location and symbol name match usage below. No issues.


222-223: Good: provider_id forwarded explicitly.

Keyword-only usage avoids positional breakage and keeps call sites clear.


418-419: Docstring updated correctly.

Parameter docs align with signature and intent.

Comment on lines +518 to +521
# Update token count metrics for the LLM call
model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)

🛠️ Refactor suggestion

Harden metrics update to never break the endpoint; use last path segment.

Instrumentation must be best-effort. If Tokenizer/formatter or turn fields are missing, this can raise and 500 the request. Also, rsplit is safer for nested identifiers.

Apply:

-    # Update token count metrics for the LLM call
-    model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
-    update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)
+    # Update token count metrics for the LLM call (best-effort; do not disrupt response)
+    try:
+        model_label = model_id.rsplit("/", 1)[-1]
+        update_llm_token_count_from_turn(
+            response, model_label, provider_id or "unknown", system_prompt
+        )
+    except Exception as err:  # pylint: disable=broad-except
+        logger.warning("Failed to update token count metrics: %s", err)

If you prefer, we can centralize this “safe update” in a tiny helper to reuse here and in streaming_query.

🤖 Prompt for AI Agents
In src/app/endpoints/query.py around lines 518-521, the metrics update can raise
if tokenizer/formatter or expected fields on the turn/response are missing and
using split("/") can pick the wrong segment for nested identifiers; change to
use rsplit("/", 1) to get the last path segment for model_label and wrap the
call to update_llm_token_count_from_turn in a try/except that swallows
exceptions (and logs a debug/warn) so instrumentation is best-effort and cannot
500 the endpoint; optionally factor this logic into a small helper (used here
and in streaming_query) that extracts the safe model_label and invokes the
update inside a non-throwing guard.
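
For completeness, a possible shape for the shared "safe update" helper floated in the refactor suggestion above; the helper name, its module location, and the logger wiring are hypothetical, while the guarded logic mirrors the diff in the suggestion.

import logging

from metrics.utils import update_llm_token_count_from_turn

logger = logging.getLogger(__name__)


def safe_update_llm_token_count(turn, model_id, provider_id, system_prompt):
    """Best-effort token metrics update that never propagates exceptions to the endpoint."""
    try:
        model_label = model_id.rsplit("/", 1)[-1]
        update_llm_token_count_from_turn(
            turn, model_label, provider_id or "unknown", system_prompt
        )
    except Exception as err:  # pylint: disable=broad-except
        logger.warning("Failed to update token count metrics: %s", err)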
