LCORE-220: Add metrics to lightspeed-stack #256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Walkthrough
This update introduces a comprehensive metrics and observability layer to the application. Prometheus-compatible metrics are added, tracked, and exposed via a new /metrics endpoint.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant FastAPI App
    participant Metrics Middleware
    participant Endpoint Handler
    participant Prometheus Client
    Client->>FastAPI App: HTTP request (any endpoint)
    FastAPI App->>Metrics Middleware: Pass request
    Metrics Middleware->>Endpoint Handler: Forward request
    Endpoint Handler-->>Metrics Middleware: Response
    Metrics Middleware->>Prometheus Client: Record metrics (duration, status, path)
    Metrics Middleware-->>FastAPI App: Return response
    FastAPI App-->>Client: Response
    Client->>FastAPI App: GET /metrics
    FastAPI App->>metrics_endpoint_handler: Handle request
    metrics_endpoint_handler->>Prometheus Client: Collect latest metrics
    Prometheus Client-->>metrics_endpoint_handler: Metrics data
    metrics_endpoint_handler-->>FastAPI App: PlainTextResponse (metrics)
    FastAPI App-->>Client: Prometheus metrics
```
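The flow above maps onto FastAPI's HTTP middleware hook. Below is a rough sketch of that flow, assuming a `metrics` module exposing `rest_api_calls_total` and `response_duration_seconds` objects and a `routers.include_routers()` helper (names taken from the review comments further down); the real `src/app/main.py` may differ in details.

```python
import time

from fastapi import FastAPI, Request, Response
from starlette.routing import Mount, Route, WebSocketRoute

import metrics  # assumed: Prometheus Counter/Histogram objects live here
import routers  # assumed: registers all endpoint routers, including /metrics

app = FastAPI()
routers.include_routers(app)

# Paths of registered routes, so the middleware only measures known endpoints.
app_routes_paths = [
    route.path
    for route in app.routes
    if isinstance(route, (Mount, Route, WebSocketRoute))
]


@app.middleware("http")
async def rest_api_metrics(request: Request, call_next) -> Response:
    """Count each REST API call and record how long it took to handle."""
    path = request.url.path
    if path not in app_routes_paths:
        return await call_next(request)

    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start

    metrics.rest_api_calls_total.labels(path, str(response.status_code)).inc()
    metrics.response_duration_seconds.labels(path).observe(duration)
    return response
```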
Estimated code review effort: 3 (~40 minutes)
Suggested reviewers
Need to add more unit tests; also waiting on #215 to merge to add the token metrics.
Force-pushed from ca00143 to ed6c493
it looks ok
src/app/endpoints/query.py (Outdated)

```diff
 import os
 from pathlib import Path
-from typing import Any
+from typing import Any, Tuple
```
`tuple` does not need to be imported ;)
Oops, bad habits haha, I will fix it
src/app/endpoints/query.py (Outdated)

```python
def select_model_and_provider_id(
    models: ModelListResponse, query_request: QueryRequest
) -> Tuple[str, str | None]:
    """Select the model ID based on the request or available models."""
```
tuple
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
src/metrics/utils.py (1)
13-33: Add error handling for client operations.
The function lacks error handling for potential failures when accessing the client or listing models, which could cause the application to crash during startup.
```diff
 def setup_model_metrics() -> None:
     """Perform setup of all metrics related to LLM model and provider."""
-    client = LlamaStackClientHolder().get_client()
-    models = [
-        model
-        for model in client.models.list()
-        if model.model_type == "llm"  # pyright: ignore[reportAttributeAccessIssue]
-    ]
+    try:
+        client = LlamaStackClientHolder().get_client()
+        all_models = client.models.list()
+        models = [
+            model
+            for model in all_models
+            if hasattr(model, 'model_type') and model.model_type == "llm"
+        ]
+    except Exception as e:
+        logger.error("Failed to retrieve models for metrics setup: %s", e)
+        return

     for model in models:
-        provider = model.provider_id
-        model_name = model.identifier
+        provider = getattr(model, 'provider_id', None)
+        model_name = getattr(model, 'identifier', None)
         if provider and model_name:
-            label_key = (provider, model_name)
-            metrics.provider_model_configuration.labels(*label_key).set(1)
-            logger.debug(
-                "Set provider/model configuration for %s/%s to 1",
-                provider,
-                model_name,
-            )
+            try:
+                label_key = (provider, model_name)
+                metrics.provider_model_configuration.labels(*label_key).set(1)
+                logger.debug(
+                    "Set provider/model configuration for %s/%s to 1",
+                    provider,
+                    model_name,
+                )
+            except Exception as e:
+                logger.error("Failed to set metric for %s/%s: %s", provider, model_name, e)
```
🧹 Nitpick comments (5)
tests/unit/app/endpoints/test_metrics.py (1)
6-26: Review the test structure and metric validations.
The test correctly validates the basic endpoint functionality, but there are a few concerns:
- The test passes None as the request parameter, which works but suggests the endpoint doesn't use the request data.
- Lines 21 and 23 check for _created gauge metrics, which are typically auto-generated by Prometheus client libraries; these might be implementation-dependent and could make the test brittle.

Consider these improvements:
```diff
 def test_metrics_endpoint():
     """Test the metrics endpoint handler."""
-    response = metrics_endpoint_handler(None)
+    # Create a mock request object or use a proper Request instance
+    from fastapi import Request
+    mock_request = Request({"type": "http", "method": "GET", "url": "http://test/metrics"})
+    response = metrics_endpoint_handler(mock_request)
     assert response is not None
     assert response.status_code == 200
     assert "text/plain" in response.headers["Content-Type"]

     response_body = response.body.decode()

     # Check if the response contains Prometheus metrics format
     assert "# TYPE ls_rest_api_calls_total counter" in response_body
     assert "# TYPE ls_response_duration_seconds histogram" in response_body
     assert "# TYPE ls_provider_model_configuration gauge" in response_body
     assert "# TYPE ls_llm_calls_total counter" in response_body
     assert "# TYPE ls_llm_calls_failures_total counter" in response_body
-    assert "# TYPE ls_llm_calls_failures_created gauge" in response_body
     assert "# TYPE ls_llm_validation_errors_total counter" in response_body
-    assert "# TYPE ls_llm_validation_errors_created gauge" in response_body
     assert "# TYPE ls_llm_token_sent_total counter" in response_body
     assert "# TYPE ls_llm_token_received_total counter" in response_body
+
+    # Verify the response follows Prometheus exposition format
+    lines = response_body.split('\n')
+    assert any(line.startswith('# HELP') for line in lines), "Should contain HELP comments"
```

tests/unit/metrics/test_utis.py (2)
1-1: Fix the filename typo.
The filename should be test_utils.py instead of test_utis.py to match the module being tested (utils.py).

6-24: Good test structure with comprehensive mocking.
The test properly mocks external dependencies and verifies the expected behavior. Consider adding edge case tests for better coverage.
Consider adding these additional test cases:
```python
def test_setup_model_metrics_no_models(mocker):
    """Test setup_model_metrics when no LLM models are available."""
    mock_client = mocker.patch("client.LlamaStackClientHolder.get_client").return_value
    mock_metric = mocker.patch("metrics.provider_model_configuration")
    mock_client.models.list.return_value = []

    setup_model_metrics()

    # Should not call set on any metrics
    mock_metric.labels.assert_not_called()


def test_setup_model_metrics_missing_provider_or_model(mocker):
    """Test setup_model_metrics with models missing provider_id or identifier."""
    mock_client = mocker.patch("client.LlamaStackClientHolder.get_client").return_value
    mock_metric = mocker.patch("metrics.provider_model_configuration")
    fake_models = [
        mocker.Mock(provider_id=None, identifier="test_model", model_type="llm"),
        mocker.Mock(provider_id="test_provider", identifier=None, model_type="llm"),
    ]
    mock_client.models.list.return_value = fake_models

    setup_model_metrics()

    # Should not call set for models with missing data
    mock_metric.labels.assert_not_called()
```

src/app/endpoints/streaming_query.py (1)
437-438: Consider the timing of metrics incrementation.
The llm_calls_total metric is incremented immediately after model selection but before the actual streaming response is generated. This means successful model selection is counted as a successful LLM call, even if the streaming fails later.

Consider moving the metrics increment after the streaming completes successfully:
```diff
     response, conversation_id = await retrieve_response(
         client,
         model_id,
         query_request,
         token,
         mcp_headers=mcp_headers,
     )
-    # Update metrics for the LLM call
-    metrics.llm_calls_total.labels(provider_id, model_id).inc()

     metadata_map: dict[str, dict[str, Any]] = {}

     async def response_generator(turn_response: Any) -> AsyncIterator[str]:
         """Generate SSE formatted streaming response."""
         chunk_id = 0
         complete_response = ""

         # Send start event
         yield stream_start_event(conversation_id)

         async for chunk in turn_response:
             for event in stream_build_event(chunk, chunk_id, metadata_map):
                 complete_response += json.loads(event.replace("data: ", ""))[
                     "data"
                 ]["token"]
                 chunk_id += 1
                 yield event

         yield stream_end_event(metadata_map)
+
+        # Update metrics for successful LLM call completion
+        metrics.llm_calls_total.labels(provider_id, model_id).inc()
```

src/app/endpoints/query.py (1)
302-309: Validation error detection logic is sound.
The post-response validation error checking iterates through response steps and correctly identifies shield violations. The metrics increment and early break are appropriate.
However, consider adding a debug log for when validation errors are detected:
```diff
     # Check for validation errors in the response
     steps = getattr(response, "steps", [])
     for step in steps:
         if step.step_type == "shield_call" and step.violation:
             # Metric for LLM validation errors
+            logger.debug("Shield validation violation detected: %s", step.violation)
             metrics.llm_calls_validation_errors_total.inc()
             break
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
- pyproject.toml (1 hunks)
- src/app/endpoints/metrics.py (1 hunks)
- src/app/endpoints/query.py (9 hunks)
- src/app/endpoints/streaming_query.py (5 hunks)
- src/app/main.py (2 hunks)
- src/app/routers.py (2 hunks)
- src/metrics/__init__.py (1 hunks)
- src/metrics/utils.py (1 hunks)
- tests/unit/app/endpoints/test_metrics.py (1 hunks)
- tests/unit/app/endpoints/test_query.py (15 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (5 hunks)
- tests/unit/app/test_routers.py (5 hunks)
- tests/unit/metrics/__init__.py (1 hunks)
- tests/unit/metrics/test_utis.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
- tests/unit/metrics/test_utis.py (1)
  - src/metrics/utils.py (1): setup_model_metrics (13-32)
- src/metrics/utils.py (2)
  - src/client.py (1): LlamaStackClientHolder (19-52)
  - src/log.py (1): get_logger (7-13)
- tests/unit/app/endpoints/test_metrics.py (1)
  - src/app/endpoints/metrics.py (1): metrics_endpoint_handler (14-16)
- src/app/main.py (3)
  - src/log.py (1): get_logger (7-13)
  - src/metrics/utils.py (1): setup_model_metrics (13-32)
  - src/app/routers.py (1): include_routers (20-38)
- src/app/endpoints/query.py (2)
  - src/app/endpoints/streaming_query.py (1): retrieve_response (490-567)
  - src/models/requests.py (1): QueryRequest (61-146)
🔇 Additional comments (42)
tests/unit/metrics/__init__.py (1)
1-1: LGTM! Clean package initialization.
The docstring clearly describes the package purpose and follows Python conventions for test package organization.
pyproject.toml (1)
16-17: Dependencies are secure and up-to-date.
Both added packages meet the latest stable release criteria and have no known vulnerabilities:
- In pyproject.toml (lines 16–17):
- prometheus-client >= 0.22.1 is already at the most recent stable 0.22.1.
- starlette >= 0.47.1 will pull in 0.47.2 by default, the current stable release.
No further changes are required.
src/app/routers.py (2)
16-16: LGTM! Proper import addition.
The metrics module import is correctly added to the existing endpoint imports.

38-38: LGTM! Consistent router inclusion.
The metrics router is properly included without a version prefix, following the same pattern as other infrastructure endpoints like health and authorized.
src/app/endpoints/metrics.py (1)
1-17: LGTM! Standard Prometheus metrics endpoint implementation.
The metrics endpoint is correctly implemented following Prometheus best practices (a minimal sketch of this pattern follows the list below):
- Proper HTTP GET method for metrics scraping
- PlainTextResponse with correct Prometheus content type
- Uses official prometheus_client library functions
- Unused request parameter properly handled with underscore prefix
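As referenced above, here is a minimal sketch of such a handler. It uses only `prometheus_client` helpers known to exist (`generate_latest`, `CONTENT_TYPE_LATEST`); the actual `src/app/endpoints/metrics.py` may differ in naming and structure.

```python
from fastapi import APIRouter, Request
from fastapi.responses import PlainTextResponse
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

router = APIRouter(tags=["metrics"])


@router.get("/metrics", response_class=PlainTextResponse)
def metrics_endpoint_handler(_request: Request) -> PlainTextResponse:
    """Serve all registered metrics in the Prometheus text exposition format."""
    return PlainTextResponse(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```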
tests/unit/app/test_routers.py (2)
18-18: LGTM! Proper test import addition.
The metrics import is correctly added to match the router configuration changes.

48-48: LGTM! Comprehensive test updates for new router.
The tests are properly updated to account for the new metrics router:
- Router count correctly incremented from 10 to 11
- Both test functions consistently updated
- Proper assertions for router presence and prefix verification
Also applies to: 59-59, 68-68, 79-79
src/app/endpoints/streaming_query.py (4)
26-26: Good addition of metrics import.
The metrics module import is correctly placed and follows the existing import structure.

427-429: Verify the model selection function update.
The change from select_model_id to select_model_and_provider_id is consistent with the updates in query.py and properly handles the tuple return value.

233-234: Appropriate placement of validation error metrics.
The validation error metric is correctly incremented when a shield violation is detected during streaming.

478-479: Proper error metrics handling.
The failure metric is correctly incremented when an API connection error occurs.
src/app/endpoints/query.py (6)
8-8: Tuple import is now used.
The Tuple import is now properly utilized in the function signature at line 172, addressing the previous review comment.

27-27: Good addition of metrics import.
The metrics module import is correctly placed and follows the existing import structure.

126-128: Function call properly updated for tuple return.
The call to select_model_and_provider_id correctly unpacks the tuple return value into model_id and provider_id.

136-137: Metrics tracking well-placed for successful calls.
The llm_calls_total metric is incremented after the response is successfully retrieved, which is appropriate timing for non-streaming queries.

158-159: Proper failure metrics handling.
The failure metric is correctly incremented when an API connection error occurs, maintaining consistency with the streaming endpoint.

170-216: Function signature update implemented correctly.
The function signature change from select_model_id to select_model_and_provider_id is properly implemented with the correct return type Tuple[str, str | None]. The logic correctly handles both auto-selection and explicit model/provider specification.

src/app/main.py (4)
3-15: LGTM - Import statements are correct and necessary.
All new imports are properly utilized in the code: typing imports for the middleware function signature, FastAPI imports for request/response handling, Starlette routing imports for route filtering, and metrics imports for observability functionality.

71-76: Correct implementation of dynamic route path extraction.
The list comprehension properly:
- Extracts paths after all routers are included
- Filters for relevant route types (Mount, Route, WebSocketRoute)
- Provides the necessary data structure for middleware filtering

77-77: Proper placement of model metrics initialization.
The setup_model_metrics() call is correctly positioned after router inclusion, ensuring all application components are initialized before setting up model-related metrics.

43-66: No issues found - middleware metrics logic is sound and app_routes_paths is properly initialized prior to registration.
The startup order is correct:
- Routers are included (routers.include_routers(app))
- app_routes_paths is defined immediately afterward
- The rest_api_metrics middleware is registered last

All metrics collection, timing, and filtering behavior can proceed safely.
tests/unit/app/endpoints/test_streaming_query.py (4)
225-227: Correct update to reflect API signature change.
The test properly adapts to the new select_model_and_provider_id function that returns a tuple (model_id, provider_id) instead of a single model ID.

736-768: Proper integration of metrics validation in shield call tests.
The test correctly mocks the validation error metric and asserts it's not incremented when there's no violation, ensuring metrics accuracy.

771-811: Correct verification of metric increment on shield violation.
The test properly validates that the validation error metric is incremented exactly once when a shield violation occurs.

1541-1542: Consistent API signature update in auth tuple test.
The test correctly adapts to the new function signature, maintaining test integrity while reflecting the API change to return model and provider IDs as a tuple.
src/metrics/__init__.py (5)
1-7: Clean module structure with appropriate Prometheus imports.
The module properly imports all necessary Prometheus metric types for comprehensive observability.

9-20: Well-designed REST API metrics with appropriate labels.
The counter and histogram metrics provide comprehensive coverage of API usage patterns and performance characteristics. The naming conventions follow Prometheus standards.

22-28: Appropriate gauge metric for tracking model/provider configurations.
The gauge correctly represents active provider/model combinations with descriptive labels for popularity analysis.

30-41: Comprehensive LLM call tracking with appropriate granularity.
The metrics provide good coverage: successful calls are tracked by provider/model, while failures and validation errors are tracked as totals. The label strategy is well-reasoned.

43-51: Token metrics properly structured for future implementation.
The counter definitions are correctly designed with appropriate labels. The TODO comments clearly indicate these are placeholders pending the token counting feature from PR #215.
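For orientation, the module reviewed above could be sketched roughly as follows. Metric names match those asserted in the tests, but label names and help strings are assumptions, not copied from src/metrics/__init__.py.

```python
from prometheus_client import Counter, Gauge, Histogram

rest_api_calls_total = Counter(
    "ls_rest_api_calls_total", "REST API calls counter", ["path", "status_code"]
)
response_duration_seconds = Histogram(
    "ls_response_duration_seconds", "Response durations", ["path"]
)
provider_model_configuration = Gauge(
    "ls_provider_model_configuration",
    "Provider/model combinations configured for use",
    ["provider", "model"],
)
llm_calls_total = Counter(
    "ls_llm_calls_total", "LLM calls counter", ["provider", "model"]
)
llm_calls_failures_total = Counter("ls_llm_calls_failures_total", "LLM call failures")
llm_calls_validation_errors_total = Counter(
    "ls_llm_validation_errors_total", "LLM calls with validation errors"
)

# TODO: placeholders until token counting lands (PR #215)
llm_token_sent_total = Counter(
    "ls_llm_token_sent_total", "Tokens sent to the LLM", ["provider", "model"]
)
llm_token_received_total = Counter(
    "ls_llm_token_received_total", "Tokens received from the LLM", ["provider", "model"]
)
```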
tests/unit/app/endpoints/test_query.py (12)
15-15: Correct import update reflecting API signature change.
The import properly updates to the new function name that returns both model and provider IDs.

66-66: Good defensive programming in test fixture.
Initializing mock_agent.create_turn.return_value.steps as an empty list prevents AttributeError in tests that access this property.

111-111: Proper metrics integration testing.
The test correctly mocks and verifies that the LLM calls metric is incremented with the appropriate provider and model labels.
Also applies to: 152-154

182-200: Complete test adaptation to new function signature.
The test correctly updates the function name, unpacks the tuple return value, and asserts both model_id and provider_id values.

202-226: Proper handling of default model selection test.
The test correctly verifies that when no model is specified, the first available LLM model and its provider are returned.

228-246: Improved exception handling with HTTPException.
The test correctly updates to expect HTTPException instead of a generic Exception, providing better error specificity.

248-259: Consistent error handling for no available models.
The test maintains proper error validation while adapting to the new function signature and exception type.

321-321: Proper negative testing for validation error metrics.
The test correctly verifies that validation error metrics are not incremented when no shield violations occur.
Also applies to: 345-346

867-909: Comprehensive test for shield violation metrics tracking.
The test properly simulates a shield violation scenario and verifies that the validation error metric is incremented correctly while maintaining full coverage of the function behavior.

1002-1003: Correct metrics tracking for connection failures.
The test properly verifies that the failure metric is incremented when API connection errors occur.
Also applies to: 1020-1020

775-777: Good refactoring to use consistent test fixtures.
The test now uses the standard prepare_agent_mocks fixture for better consistency and maintainability.

1323-1326: Consistent API signature update in auth testing.
The test correctly adapts to the new function signature while maintaining verification of proper auth token handling.
This patch is adding the /metrics endpoint to the project.

Differences with road-core/service:
* The metrics are prefixed with "ls_" instead of "ols_"
* The "provider_model_configuration" does not set the non-default model/providers to 0 because we currently do not have a way to set a default model/provider in the configuration. A TODO was left in the code.

Supported metrics:
* rest_api_calls_total: Counter to track REST API calls
* response_duration_seconds: Histogram to measure how long it takes to handle requests
* provider_model_configuration: Indicates what provider + model customers are using
* llm_calls_total: How many LLM calls were made for each provider + model
* llm_calls_failures_total: How many LLM calls failed
* llm_calls_validation_errors_total: How many LLM calls had validation errors

Missing metrics:
* llm_token_sent_total: How many tokens were sent
* llm_token_received_total: How many tokens were received

The above metrics are missing because token counting PR (lightspeed-core#215) is not merged yet.

Signed-off-by: Lucas Alvares Gomes <[email protected]>
Description
This patch is adding the /metrics endpoint to the project.
Differences with road-core/service:
Supported metrics:
Missing metrics:
The above metrics are missing because token counting PR (#215) is not merged yet.
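For reference, the new endpoint serves the standard Prometheus text exposition format. Based on the metric names above, scraped output would look roughly like this (label names and values are illustrative, not taken from the real service):

```
# TYPE ls_rest_api_calls_total counter
ls_rest_api_calls_total{path="/v1/query",status="200"} 12.0
# TYPE ls_response_duration_seconds histogram
ls_response_duration_seconds_sum{path="/v1/query"} 4.2
ls_response_duration_seconds_count{path="/v1/query"} 12.0
# TYPE ls_provider_model_configuration gauge
ls_provider_model_configuration{provider="openai",model="gpt-4-turbo"} 1.0
# TYPE ls_llm_calls_total counter
ls_llm_calls_total{provider="openai",model="gpt-4-turbo"} 12.0
# TYPE ls_llm_calls_failures_total counter
ls_llm_calls_failures_total 0.0
# TYPE ls_llm_validation_errors_total counter
ls_llm_validation_errors_total 0.0
```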
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
- /metrics API endpoint providing real-time Prometheus-compatible metrics for API usage, performance, and LLM interactions.

Bug Fixes
Chores