
Conversation


@luis5tb commented Sep 15, 2025

Description

Add v2 endpoints to interact with Llama Stack using the Responses API instead of the Agent API.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

Testing this requires the Llama Stack changes in llamastack/llama-stack#3385. I used assisted-chat to test the end-to-end solution and created new scripts to exercise the new v2 endpoints (make query_v2 or make streaming_query_v2).

Summary by CodeRabbit

  • New Features

    • Added V2 Query APIs: POST /v2/query and /v2/streaming_query (SSE). Includes RBAC, model/provider selection, RAG/MCP tool integration, attachment validation, conversation persistence, metrics, and robust error handling.
  • Improvements

    • Updated routing to clearly separate V1/V2/V3, exposing new V2 endpoints and adding a V3 conversations router.
  • Chores

    • Bumped maximum supported Llama Stack version to 0.2.22.
  • Tests

    • Added comprehensive unit tests for V2 query and streaming endpoints and updated router tests to cover new versions.

Contributor

coderabbitai bot commented Sep 15, 2025

Walkthrough

Adds v2 query and streaming endpoints using the Responses API, wires them under /v2 in routers, updates supported Llama Stack version, enhances SUID validation to accept Response IDs, and introduces comprehensive unit tests for new endpoints and routing.

Changes

  • V2 Query Endpoint (src/app/endpoints/query_v2.py):
    New POST /v2/query endpoint with RBAC, model/provider resolution, Response API call, RAG/MCP tool assembly, transcript persistence, metrics, and error handling. Exposes router and helper functions (retrieve_response, get_rag_tools, get_mcp_tools).
  • V2 Streaming Endpoint (src/app/endpoints/streaming_query_v2.py):
    New streaming /v2/streaming_query endpoint emitting SSE events. Builds tools, streams tokens/tool calls, persists conversations, updates metrics, and handles connection errors.
  • Routers Wiring (src/app/routers.py):
    Imports and includes v2 routers (query_v2, streaming_query_v2) under /v2; retains v1; adds v3 conversations router inclusion (see the sketch after this list).
  • Constants (src/constants.py):
    Bumps MAXIMAL_SUPPORTED_LLAMA_STACK_VERSION from "0.2.20" to "0.2.22".
  • Utilities (src/utils/suid.py):
    check_suid now accepts Response IDs (resp-/resp_), normalizes and validates as UUIDs; input guard for non-strings/empty. Signature unchanged.
  • Unit Tests — Query V2 (tests/unit/app/endpoints/test_query_v2.py):
    Tests RAG/MCP tool builders, retrieve_response behavior (no-tools, combined tools, parsing output/tool calls, attachment validation), endpoint success and APIConnectionError paths with metrics.
  • Unit Tests — Streaming Query V2 (tests/unit/app/endpoints/test_streaming_query_v2.py):
    Tests retrieve_response tool building, no-tools path, streaming flow event sequence, metrics, conversation persistence, and APIConnectionError handling.
  • Unit Tests — Routers (tests/unit/app/test_routers.py):
    Verifies presence and /v2 prefixes for new routers (query_v2, streaming_query_v2, conversations_v2), and updated router count.
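
The router wiring amounts to mounting the new endpoints under a versioned prefix. A minimal sketch, assuming FastAPI and the module layout listed above; the actual import paths and registration helper in src/app/routers.py may differ:

# Hypothetical sketch of the versioned wiring; module and helper names
# are assumptions based on the file summaries above.
from fastapi import FastAPI

from app.endpoints import query, query_v2, streaming_query, streaming_query_v2

def include_routers(app: FastAPI) -> None:
    # v1 endpoints (Agent API) keep their existing prefix
    app.include_router(query.router, prefix="/v1")
    app.include_router(streaming_query.router, prefix="/v1")
    # v2 endpoints (Responses API) are mounted side by side under /v2
    app.include_router(query_v2.router, prefix="/v2")
    app.include_router(streaming_query_v2.router, prefix="/v2")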

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant API as API (/v2/query)
  participant Auth as RBAC/Auth
  participant Selector as Model/Provider Selector
  participant Tools as Tool Builder (RAG/MCP)
  participant Llama as Responses API Client
  participant Store as Transcripts/Conversations
  Client->>API: POST /v2/query (QueryRequest)
  API->>Auth: Validate + authorize
  Auth-->>API: Auth tuple
  API->>Selector: Resolve model/provider
  Selector-->>API: model_id, provider_id
  API->>Tools: Build tools (RAG, MCP) + validate attachments
  Tools-->>API: tools[]
  API->>Llama: responses.create(...)
  Llama-->>API: response (text, tool_calls, response_id)
  API->>Store: Persist transcript/conversation
  Store-->>API: ack
  API-->>Client: 200 QueryResponse (conversation_id, text)
  note over API,Llama: On APIConnectionError → 500 and failure metric
sequenceDiagram
  autonumber
  participant Client
  participant API as API (/v2/streaming_query)
  participant Auth as RBAC/Auth
  participant Selector as Model/Provider Selector
  participant Tools as Tool Builder (RAG/MCP)
  participant Llama as Responses API Stream
  participant Store as Conversations/Transcripts
  Client->>API: POST /v2/streaming_query
  API->>Auth: Validate + authorize
  Auth-->>API: Auth tuple
  API->>Selector: Resolve model/provider
  Selector-->>API: model_id
  API->>Tools: Build tools / validate
  Tools-->>API: tools[]
  API->>Llama: responses.create(stream=True)
  Llama-->>API: SSE chunks (start, tokens, tool_calls, end)
  API-->>Client: SSE stream
  API->>Store: Persist conversation on first id
  note over API,Llama: On connection error → 500 JSON error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • manstis
  • jrobertboos
  • umago

Poem

A bunny taps the v2 drum—thump-thump, hooray!
Tools hop in: RAG and MCP join the play.
Streams ripple tokens, a conversational brook,
IDs validated with a whisker’s look.
Versions nudge forward, carrots aligned—
Ship the routes, metrics chimed, all nicely defined! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 41.38%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title “Responses v2 support” succinctly captures the central feature of the pull request, which is adding support for the version 2 Responses API endpoints. It is concise, directly relevant to the main changes in the diff, and avoids extraneous details or ambiguous phrasing. A teammate reviewing the history would immediately understand that this PR introduces v2 of the Responses API.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/app/endpoints/streaming_query.py (1)

212-212: SSE response is missing correct content type and streaming headers

Same issue as v2; events may be buffered or ignored.

-        return StreamingResponse(response_generator(response))
+        return StreamingResponse(
+            response_generator(response),
+            media_type="text/event-stream",
+            headers={
+                "Cache-Control": "no-cache",
+                "Connection": "keep-alive",
+                "X-Accel-Buffering": "no",
+            },
+        )
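
A quick client-side check for the fix above, as a sketch (host, port, and payload shape are assumptions; adjust the path for v1 or v2):

# Hypothetical SSE smoke test using httpx; endpoint URL and body are illustrative.
import asyncio
import httpx

async def main() -> None:
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            "http://localhost:8080/v1/streaming_query",
            json={"query": "hello"},
        ) as response:
            # With the headers fix, the stream advertises SSE and arrives unbuffered
            assert response.headers["content-type"].startswith("text/event-stream")
            async for line in response.aiter_lines():
                if line:
                    print(line)

asyncio.run(main())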
src/app/endpoints/conversations.py (1)

88-99: Access session_id robustly; returned items may be models, not dicts

Using .get() on a Pydantic/typed object will fail.

-        session_id = str(agent_sessions[0].get("session_id"))
+        first = agent_sessions[0]
+        session_id = (
+            first.get("session_id") if isinstance(first, dict) else getattr(first, "session_id", None)
+        )
+        if not session_id:
+            raise HTTPException(
+                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                detail={"response": "Invalid session object", "cause": "session_id missing"},
+            )
+        session_id = str(session_id)
🧹 Nitpick comments (17)
src/utils/query.py (4)

309-310: Remove unused variable ‘token’.

Minor cleanup.

-    user_id, _, _, token = auth
+    user_id, _, _, _ = auth

338-350: Return 503 to match the documented OpenAPI response for connection errors.

Doc block maps connection failures to 503; handler currently uses 500.

-    raise HTTPException(
-        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+    raise HTTPException(
+        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
         detail={
             "response": "Unable to connect to Llama Stack",
             "cause": str(e),
         },
     ) from e

33-52: Use 401 for Unauthorized in OpenAPI mapping.

400 is misleading for auth failures; map UnauthorizedResponse under 401.

-    400: {
-        "description": "Missing or invalid credentials provided by client",
-        "model": UnauthorizedResponse,
-    },
+    401: {
+        "description": "Missing or invalid credentials provided by client",
+        "model": UnauthorizedResponse,
+    },

258-288: Consider normalizing attachment metadata (lowercasing) before validation.

Prevents case-related rejections of otherwise valid client input.

-    for attachment in attachments:
+    for attachment in attachments:
+        # normalize without mutating original pydantic model
+        atype = attachment.attachment_type.lower()
+        ctype = attachment.content_type.lower()
-        if attachment.attachment_type not in constants.ATTACHMENT_TYPES:
+        if atype not in constants.ATTACHMENT_TYPES:
@@
-        if attachment.content_type not in constants.ATTACHMENT_CONTENT_TYPES:
+        if ctype not in constants.ATTACHMENT_CONTENT_TYPES:
src/app/endpoints/streaming_query_v2.py (2)

96-99: Pass the validated conversation context to model selection hints

Currently always passing None; prefer the actual user_conversation.

-            *evaluate_model_hints(user_conversation=None, query_request=query_request),
+            *evaluate_model_hints(
+                user_conversation=user_conversation, query_request=query_request
+            ),

108-109: Avoid metadata_map shadowing

Outer variable is unused and shadowed by inner definition.

-        metadata_map: dict[str, dict[str, Any]] = {}
+        # metadata captured within response_generator

Also applies to: 128-129

src/app/endpoints/streaming_query.py (1)

5-6: Remove unused regex now that metadata parsing lives in utils.streaming_query

Dead code after centralization.

-import re
-...
-METADATA_PATTERN = re.compile(r"\nMetadata: (\{.+})\n")
+# metadata extraction handled by utils.streaming_query

Also applies to: 59-59

src/utils/streaming_query.py (2)

164-172: Guard error payload shape

chunk.error may not always be a dict with "message".

-            "token": chunk.error["message"],
+            "token": getattr(chunk.error, "message", None)
+            or (chunk.error.get("message") if isinstance(chunk.error, dict) else str(chunk.error)),

80-100: End-event placeholders: consider carrying token counts when available

If tokens are tracked upstream, thread them through metadata_map or function args.

Happy to wire this if you expose counts from metrics/turn payloads.

src/app/endpoints/query_v2.py (2)

100-102: Prune unused params/imports and dead code

provider_id and mcp_headers aren’t used in retrieve_response; imported token-count util also unused; stray blank lines.

-        summary, conversation_id = await retrieve_response(
+        summary, conversation_id = await retrieve_response(
             client,
             llama_stack_model_id,
             query_request,
             token,
-            mcp_headers=mcp_headers,
-            provider_id=provider_id,
+            mcp_headers=mcp_headers,
         )
-    mcp_headers: dict[str, dict[str, str]] | None = None,
-    provider_id: str = "",
+    mcp_headers: dict[str, dict[str, str]] | None = None,
-    model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
-    #update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)
+    # Token accounting TBD for Responses API
-from metrics.utils import update_llm_token_count_from_turn

Also collapse extra blank import lines around utils.query.

Also applies to: 140-141, 264-266, 20-21, 31-39


220-233: Guard when response.output is None

Avoid iterating over None.

-    for idx, output_item in enumerate(response.output):
+    for idx, output_item in enumerate(response.output or []):
src/app/endpoints/conversations_v2.py (1)

58-79: Docs/status coverage: add 400/403 error responses.

This handler can raise 400 (invalid ID) and 403 (access denied) via validate_conversation_id/access, but conversation_responses doesn’t document them.

I recommend extending conversation_responses in utils.conversations to include 400 and 403 entries so OpenAPI reflects actual behavior.

src/utils/conversations.py (4)

59-66: OpenAPI example mismatch with model (message vs response).

ConversationDeleteResponse uses the field response, but the 200 example here uses message. This misleads API consumers.

 conversation_delete_responses: dict[int | str, dict[str, Any]] = {
         "conversation_id": "123e4567-e89b-12d3-a456-426614174000",
         "success": True,
-        "message": "Conversation deleted successfully",
+        "response": "Conversation deleted successfully",
     },

31-57: Enhance responses spec: document 400/403 for conversation retrieval.

The retrieval flow can emit:

  • 400 for invalid ID (validate_conversation_id)
  • 403 for unauthorized access (validate_conversation_access)
 conversation_responses: dict[int | str, dict[str, Any]] = {
         "conversation_id": "123e4567-e89b-12d3-a456-426614174000",
         "chat_history": [
             {
                 "messages": [
                     {"content": "Hi", "type": "user"},
                     {"content": "Hello!", "type": "assistant"},
                 ],
                 "started_at": "2024-01-01T00:00:00Z",
                 "completed_at": "2024-01-01T00:00:05Z",
             }
         ],
     },
+    400: {
+        "detail": {
+            "response": "Invalid conversation ID format",
+            "cause": "Conversation ID is not a valid SUID/Response ID",
+        }
+    },
+    403: {
+        "detail": {
+            "response": "Access denied",
+            "cause": "You do not have permission to read this conversation",
+        }
+    },
         "detail": {
             "response": "Conversation not found",
             "cause": "The specified conversation ID does not exist.",
         }
     },

109-149: Guard against None output messages in simplified history.

Currently appends an output message even when output_message is missing, producing {"content": None, "type": None}.

-    # Extract only essential data from each turn
+    # Extract only essential data from each turn
     for turn in session_data.get("turns", []):
         # Clean up input messages
         cleaned_messages = []
         for msg in turn.get("input_messages", []):
             cleaned_msg = {
                 "content": msg.get("content"),
                 "type": msg.get("role"),  # Rename role to type
             }
             cleaned_messages.append(cleaned_msg)

         # Clean up output message
         output_msg = turn.get("output_message", {})
-        cleaned_messages.append(
-            {
-                "content": output_msg.get("content"),
-                "type": output_msg.get("role"),  # Rename role to type
-            }
-        )
+        if output_msg and (output_msg.get("content") is not None):
+            cleaned_messages.append(
+                {
+                    "content": output_msg.get("content"),
+                    "type": output_msg.get("role"),  # Rename role to type
+                }
+            )

210-221: Clarify 400 error message for non-UUID IDs.

Since check_suid accepts opaque resp-/resp_ IDs, the current cause text “not a valid UUID” can be misleading.

-                "cause": f"Conversation ID {conversation_id} is not a valid UUID",
+                "cause": f"Conversation ID {conversation_id} is not a valid SUID/Response ID",
src/app/endpoints/query.py (1)

116-120: Align upstream-connection status codes across endpoints.

handle_api_connection_error() raises 500, while conversation deletion maps APIConnectionError to 503. Pick one (prefer 503 Service Unavailable for upstream outages) and standardize.

Would you like me to open a follow-up to change handle_api_connection_error() to raise 503 and update OpenAPI docs accordingly?
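
For reference, a minimal sketch of the standardized handler, assuming 503 is chosen for upstream outages (the detail shape mirrors the existing handler):

# Sketch only; not the committed implementation.
from fastapi import HTTPException, status

def handle_api_connection_error(e: Exception) -> None:
    """Map an upstream APIConnectionError to 503 Service Unavailable."""
    raise HTTPException(
        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
        detail={
            "response": "Unable to connect to Llama Stack",
            "cause": str(e),
        },
    ) from e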

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b14d91c and c3d8748.

📒 Files selected for processing (13)
  • pyproject.toml (1 hunks)
  • src/app/endpoints/conversations.py (7 hunks)
  • src/app/endpoints/conversations_v2.py (1 hunks)
  • src/app/endpoints/query.py (6 hunks)
  • src/app/endpoints/query_v2.py (1 hunks)
  • src/app/endpoints/streaming_query.py (2 hunks)
  • src/app/endpoints/streaming_query_v2.py (1 hunks)
  • src/app/routers.py (2 hunks)
  • src/constants.py (1 hunks)
  • src/utils/conversations.py (1 hunks)
  • src/utils/query.py (1 hunks)
  • src/utils/streaming_query.py (1 hunks)
  • src/utils/suid.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (10)
src/app/endpoints/streaming_query_v2.py (8)
src/auth/__init__.py (1)
  • get_auth_dependency (14-43)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/utils/endpoints.py (3)
  • check_configuration_loaded (62-68)
  • get_system_prompt (71-102)
  • validate_model_provider_override (105-125)
src/utils/query.py (5)
  • evaluate_model_hints (94-133)
  • is_transcripts_enabled (55-61)
  • persist_user_conversation_details (64-91)
  • select_model_and_provider_id (136-213)
  • validate_attachments_metadata (258-287)
src/utils/streaming_query.py (3)
  • format_stream_data (26-37)
  • stream_start_event (40-61)
  • stream_end_event (64-100)
src/utils/transcripts.py (1)
  • store_transcript (33-86)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • ToolCallSummary (43-56)
src/app/endpoints/query_v2.py (2)
  • get_rag_tools (274-283)
  • get_mcp_tools (286-304)
src/utils/streaming_query.py (4)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • append_tool_calls_from_llama (65-78)
src/utils/endpoints.py (1)
  • get_system_prompt (71-102)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/app/endpoints/streaming_query.py (1)
  • response_generator (139-200)
src/utils/conversations.py (8)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/configuration.py (1)
  • configuration (60-64)
src/app/database.py (1)
  • get_session (34-40)
src/models/config.py (2)
  • config (130-136)
  • Action (303-342)
src/models/database/conversations.py (1)
  • UserConversation (11-36)
src/models/responses.py (4)
  • ConversationDeleteResponse (425-458)
  • ConversationsListResponse (493-550)
  • ConversationDetails (461-490)
  • UnauthorizedResponse (343-357)
src/utils/endpoints.py (3)
  • check_configuration_loaded (62-68)
  • delete_conversation (22-36)
  • validate_conversation_ownership (39-59)
src/utils/suid.py (1)
  • check_suid (15-44)
src/app/routers.py (1)
tests/unit/app/test_routers.py (1)
  • include_router (31-46)
src/app/endpoints/streaming_query.py (3)
src/utils/endpoints.py (2)
  • validate_model_provider_override (105-125)
  • validate_conversation_ownership (39-59)
src/utils/query.py (1)
  • select_model_and_provider_id (136-213)
src/utils/streaming_query.py (4)
  • format_stream_data (26-37)
  • stream_start_event (40-61)
  • stream_end_event (64-100)
  • stream_build_event (103-146)
src/app/endpoints/conversations_v2.py (5)
src/auth/__init__.py (1)
  • get_auth_dependency (14-43)
src/authorization/middleware.py (1)
  • authorize (111-122)
src/models/config.py (2)
  • config (130-136)
  • Action (303-342)
src/models/responses.py (3)
  • ConversationDeleteResponse (425-458)
  • ConversationResponse (375-422)
  • ConversationsListResponse (493-550)
src/utils/conversations.py (4)
  • get_conversations_list_base (151-207)
  • delete_conversation_base (262-360)
  • validate_conversation_id (210-220)
  • validate_conversation_access (223-259)
src/utils/query.py (9)
src/configuration.py (4)
  • configuration (60-64)
  • user_data_collection_configuration (81-85)
  • inference (121-125)
  • llama_stack_configuration (74-78)
src/app/database.py (1)
  • get_session (34-40)
src/models/config.py (2)
  • config (130-136)
  • Action (303-342)
src/models/database/conversations.py (1)
  • UserConversation (11-36)
src/models/requests.py (2)
  • QueryRequest (72-222)
  • Attachment (15-69)
src/models/responses.py (2)
  • UnauthorizedResponse (343-357)
  • ForbiddenResponse (360-372)
src/utils/endpoints.py (3)
  • check_configuration_loaded (62-68)
  • validate_conversation_ownership (39-59)
  • validate_model_provider_override (105-125)
src/utils/transcripts.py (1)
  • store_transcript (33-86)
src/utils/types.py (1)
  • TurnSummary (59-78)
src/app/endpoints/query_v2.py (8)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/models/requests.py (1)
  • QueryRequest (72-222)
src/models/responses.py (1)
  • QueryResponse (48-79)
src/utils/endpoints.py (3)
  • check_configuration_loaded (62-68)
  • get_system_prompt (71-102)
  • validate_model_provider_override (105-125)
src/utils/mcp_headers.py (1)
  • mcp_headers_dependency (15-26)
src/utils/query.py (4)
  • evaluate_model_hints (94-133)
  • process_transcript_and_persist_conversation (352-383)
  • select_model_and_provider_id (136-213)
  • validate_attachments_metadata (258-287)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • ToolCallSummary (43-56)
src/app/endpoints/query.py (1)
src/utils/query.py (8)
  • evaluate_model_hints (94-133)
  • select_model_and_provider_id (136-213)
  • is_input_shield (242-255)
  • is_output_shield (229-239)
  • validate_attachments_metadata (258-287)
  • validate_query_request (290-335)
  • handle_api_connection_error (338-349)
  • process_transcript_and_persist_conversation (352-383)
src/app/endpoints/conversations.py (3)
src/utils/conversations.py (5)
  • simplify_session_data (109-148)
  • get_conversations_list_base (151-207)
  • validate_conversation_id (210-220)
  • validate_conversation_access (223-259)
  • delete_conversation_base (262-360)
src/models/config.py (1)
  • Action (303-342)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
🔇 Additional comments (7)
src/constants.py (1)

5-5: Version cap bump looks good. Ensure CI exercises 0.2.21 explicitly.

src/app/routers.py (1)

17-20: Mounting V2 under /v2 while preserving V1 looks good.

Route composition is clear; ordering avoids conflicts. Please ensure unit tests assert presence of the new /v2 routes.

Also applies to: 41-44

src/app/endpoints/conversations.py (1)

42-43: LGTM: list endpoint delegates cleanly to shared base

Good reuse and RBAC is handled upstream.

src/app/endpoints/query_v2.py (1)

274-304: LGTM: tool adapters for Responses API

Straightforward mapping for RAG and MCP tools; debug logging is helpful.

src/app/endpoints/conversations_v2.py (1)

48-56: LGTM: list conversations v2 delegates to shared base cleanly.

src/app/endpoints/query.py (2)

63-68: LGTM: clear docstring about Agent API usage.


102-109: Nice: centralizing transcript persistence.

Good separation via process_transcript_and_persist_conversation() and consistent metrics updates.

Comment on lines 32 to 43
async def get_response_sessions(_client, _conversation_id: str):
    """Placeholder for Response API session listing.

    For now, we rely on the shared base which works without direct
    backend calls until a dedicated Responses API endpoint is wired.
    """
    return [
        {"session_id": _conversation_id},
    ]

Contributor

⚠️ Potential issue

Fix: session helper functions must be synchronous (they’re called without await).

delete_conversation_base() invokes get_session_func synchronously. Defining these as async returns coroutines and breaks truthiness checks and deletion flow.

Apply this diff:

-async def get_response_sessions(_client, _conversation_id: str):
+def get_response_sessions(_client, _conversation_id: str):
     """Placeholder for Response API session listing.

     For now, we rely on the shared base which works without direct
     backend calls until a dedicated Responses API endpoint is wired.
     """
     return [
         {"session_id": _conversation_id},
     ]
🤖 Prompt for AI Agents
In src/app/endpoints/conversations_v2.py around lines 32 to 41, the session
helper is defined as async but is called synchronously by
delete_conversation_base; change the function to a regular synchronous function
(replace "async def" with "def"), keep the same return value (a list of dicts),
update the docstring to remove async wording, and ensure no callers await this
function so it returns a concrete list instead of a coroutine.

Comment on lines 43 to 50
async def delete_response_sessions(_client, _conversation_id: str, _sessions):
    """Placeholder for Response API session deletion."""
    return None

Contributor

⚠️ Potential issue

Same issue: make deletion helper synchronous.

Matches the calling convention in delete_conversation_base().

-async def delete_response_sessions(_client, _conversation_id: str, _sessions):
+def delete_response_sessions(_client, _conversation_id: str, _sessions):
     """Placeholder for Response API session deletion."""
     return None
🤖 Prompt for AI Agents
In src/app/endpoints/conversations_v2.py around lines 43 to 46, the helper
delete_response_sessions is currently async but should be synchronous to match
the calling convention in delete_conversation_base(); change the function to a
regular def, remove async/await usage, and return None (or perform synchronous
deletion logic) so callers expecting a synchronous function get a normal return
value instead of a coroutine.

Comment on lines 154 to 164
"""Delete agent sessions for a conversation."""
agent_id = conversation_id
session_id = str(sessions[0].get("session_id"))
await client.agents.session.delete(agent_id=agent_id, session_id=session_id)
Contributor

⚠️ Potential issue

Same issue when deleting sessions

Avoid .get on possibly typed objects.

-    session_id = str(sessions[0].get("session_id"))
+    first = sessions[0]
+    session_id = first.get("session_id") if isinstance(first, dict) else getattr(first, "session_id", None)
+    if not session_id:
+        raise RuntimeError("session_id missing on session")
+    session_id = str(session_id)
🤖 Prompt for AI Agents
In src/app/endpoints/conversations.py around lines 154 to 157, the code uses
sessions[0].get("session_id") which relies on .get on a potentially typed
object; replace this with explicit, type-safe access: first validate that
sessions is a non-empty list and that sessions[0] is the expected type, then
extract the session_id with the appropriate accessor (e.g.,
sessions[0]["session_id"] if it's a dict or sessions[0].session_id if it's an
object), handle missing/invalid cases by raising a clear error or returning
early, and pass the resulting stringified session_id to
client.agents.session.delete.

Comment on lines 55 to 62
def is_transcripts_enabled() -> bool:
    """Check if transcripts is enabled.

    Returns:
        bool: True if transcripts is enabled, False otherwise.
    """
    return configuration.user_data_collection_configuration.transcripts_enabled

Contributor

⚠️ Potential issue

Call configuration accessors; current code treats methods as attributes and will fail at runtime.

These are methods (see configuration.py). This will raise AttributeError.

-    return configuration.user_data_collection_configuration.transcripts_enabled
+    return configuration.user_data_collection_configuration().transcripts_enabled
🤖 Prompt for AI Agents
In src/utils/query.py around lines 55 to 62, the code treats
configuration.user_data_collection_configuration as an attribute but it's a
method; change the return to call the accessor (i.e., return
configuration.user_data_collection_configuration().transcripts_enabled) so it
invokes the method and returns the boolean; ensure the function signature/type
hints remain bool and consider adding a defensive try/except or logging if
configuration could be None or the call may raise.

Comment on lines 136 to 214
def select_model_and_provider_id(
    models: ModelListResponse, model_id: str | None, provider_id: str | None
) -> tuple[str, str, str]:
    """
    Select the model ID and provider ID based on the request or available models.

    Determine and return the appropriate model and provider IDs for
    a query request.

    If the request specifies both model and provider IDs, those are used.
    Otherwise, defaults from configuration are applied. If neither is
    available, selects the first available LLM model from the provided model
    list. Validates that the selected model exists among the available models.

    Returns:
        A tuple containing the combined model ID (in the format
        "provider/model") and the provider ID.

    Raises:
        HTTPException: If no suitable LLM model is found or the selected model is not available.
    """
    # If model_id and provider_id are provided in the request, use them

    # If model_id is not provided in the request, check the configuration
    if not model_id or not provider_id:
        logger.debug(
            "No model ID or provider ID specified in request, checking configuration"
        )
        model_id = configuration.inference.default_model  # type: ignore[reportAttributeAccessIssue]
        provider_id = (
            configuration.inference.default_provider  # type: ignore[reportAttributeAccessIssue]
        )

    # If no model is specified in the request or configuration, use the first available LLM
    if not model_id or not provider_id:
        logger.debug(
            "No model ID or provider ID specified in request or configuration, "
            "using the first available LLM"
        )
        try:
            model = next(
                m
                for m in models
                if m.model_type == "llm"  # pyright: ignore[reportAttributeAccessIssue]
            )
            model_id = model.identifier
            provider_id = model.provider_id
            logger.info("Selected model: %s", model)
            return model_id, model_id, provider_id
        except (StopIteration, AttributeError) as e:
            message = "No LLM model found in available models"
            logger.error(message)
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail={
                    "response": constants.UNABLE_TO_PROCESS_RESPONSE,
                    "cause": message,
                },
            ) from e

    llama_stack_model_id = f"{provider_id}/{model_id}"
    # Validate that the model_id and provider_id are in the available models
    logger.debug("Searching for model: %s, provider: %s", model_id, provider_id)
    if not any(
        m.identifier == llama_stack_model_id and m.provider_id == provider_id
        for m in models
    ):
        message = f"Model {model_id} from provider {provider_id} not found in available models"
        logger.error(message)
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail={
                "response": constants.UNABLE_TO_PROCESS_RESPONSE,
                "cause": message,
            },
        )

    return llama_stack_model_id, model_id, provider_id

Contributor

⚠️ Potential issue

Fix model/provider resolution: wrong early return tuple and improper config accessor usage.

  • Early return sets both first and second elements to the combined identifier.
  • Config accessors should be called.
-    if not model_id or not provider_id:
+    if not model_id or not provider_id:
         logger.debug(
             "No model ID or provider ID specified in request, checking configuration"
         )
-        model_id = configuration.inference.default_model  # type: ignore[reportAttributeAccessIssue]
-        provider_id = (
-            configuration.inference.default_provider  # type: ignore[reportAttributeAccessIssue]
-        )
+        model_id = configuration.inference().default_model
+        provider_id = configuration.inference().default_provider
@@
-        try:
+        try:
             model = next(
                 m
                 for m in models
                 if m.model_type == "llm"  # pyright: ignore[reportAttributeAccessIssue]
             )
-            model_id = model.identifier
-            provider_id = model.provider_id
-            logger.info("Selected model: %s", model)
-            return model_id, model_id, provider_id
+            llama_stack_model_id = model.identifier
+            provider_id = model.provider_id
+            # Derive plain model id when identifier is "provider/model"
+            model_id = llama_stack_model_id.split("/", 1)[1] if "/" in llama_stack_model_id else llama_stack_model_id
+            logger.info("Selected model: %s", llama_stack_model_id)
+            return llama_stack_model_id, model_id, provider_id

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/utils/query.py around lines 136–214, the function improperly overwrites
missing model/provider by unconditionally assigning config attributes without
calling them, and returns the wrong tuple when selecting the first LLM (it
returns model_id twice). Fix by calling the config accessor (e.g.
configuration.inference().default_model and
configuration.inference().default_provider) and only assign each of model_id or
provider_id when that specific value is missing, then when selecting the first
available LLM construct the combined id (provider/model) and return
(combined_id, model_id, provider_id) so the returned tuple is
(llama_stack_model_id, model_id, provider_id).

Comment on lines 307 to 308
logger.info("Llama stack config: %s", configuration.llama_stack_configuration)

Contributor

⚠️ Potential issue

Avoid logging the full Llama Stack config; may include sensitive data. Also call the accessor.

Log only safe fields (e.g., version/provider list), not the entire object.

-    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
+    cfg = configuration.llama_stack_configuration()
+    logger.info("Llama Stack version: %s; providers: %s", getattr(cfg, "version", "n/a"), getattr(cfg, "providers", "n/a"))
🤖 Prompt for AI Agents
In src/utils/query.py around lines 307-308, the code currently logs the entire
Llama stack configuration object, which may contain sensitive data; instead call
the configuration accessor (configuration.llama_stack_configuration()) and log
only safe fields such as its version and provider list, formatting a concise
message like "Llama stack config: version=%s providers=%s" using those values;
remove or replace the existing logger.info invocation so full object contents
are not written to logs.

Comment on lines 16 to 37
"""Check if given string is a proper session ID or response ID.
Args:
suid: The string to check.
Returns True if the string is a valid UUID, False otherwise.
Rules:
- If it starts with "resp-" or "resp_", accept as a valid Responses API ID (opaque).
- Otherwise, require a valid UUID string.
"""
if not isinstance(suid, str) or not suid:
return False

# Handle Responses API IDs
if suid.startswith("resp-") or suid.startswith("resp_"):
token = suid[5:]
if not token:
return False
# If truncated (e.g., shell cut reduced length), pad to canonical UUID length
if len(token) < 36:
token = token + ("0" * (36 - len(token)))
try:
uuid.UUID(token)
return True
except (ValueError, TypeError):
return False
Contributor

⚠️ Potential issue

Don’t coerce opaque resp-* IDs into UUIDs; the zero-padding path is incorrect and unsafe.

Response API IDs are opaque; padding/truncation can accept malformed IDs and reject valid ones.

 def check_suid(suid: str) -> bool:
-    """Check if given string is a proper session ID or response ID.
+    """Check if given string is a proper session ID or response ID.
@@
-    if not isinstance(suid, str) or not suid:
+    if not isinstance(suid, str) or not suid:
         return False
-
-    # Handle Responses API IDs
-    if suid.startswith("resp-") or suid.startswith("resp_"):
-        token = suid[5:]
-        if not token:
-            return False
-        # If truncated (e.g., shell cut reduced length), pad to canonical UUID length
-        if len(token) < 36:
-            token = token + ("0" * (36 - len(token)))
-        try:
-            uuid.UUID(token)
-            return True
-        except (ValueError, TypeError):
-            return False
+    # Handle Responses API IDs – treat as opaque, minimally validated
+    if suid.startswith(("resp-", "resp_")):
+        token = suid[5:]
+        # Require a non-empty token with sane charset
+        return bool(token) and all(ch.isalnum() or ch in "-_." for ch in token)
@@
     try:
         uuid.UUID(suid)
         return True
     except (ValueError, TypeError):
         return False
🤖 Prompt for AI Agents
In src/utils/suid.py around lines 16 to 37, the code currently treats resp- or
resp_ IDs as if they were UUIDs by zero-padding truncated tokens and attempting
uuid.UUID(token); remove that behavior and treat Responses API IDs as opaque: if
suid is a non-empty string and starts with "resp-" or "resp_", simply return
True when the token after the prefix is non-empty (do not pad or call
uuid.UUID); for all other values keep the existing UUID validation path (i.e.,
require a valid UUID string). Ensure type/empty checks remain and do not
introduce any coercion or padding logic.
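
If the opaque-ID variant above is adopted, a few parametrized tests would pin the behavior down. A sketch; the allowed charset is an assumption taken from the suggested diff:

# Hypothetical tests against the reviewer's suggested check_suid.
import pytest

from utils.suid import check_suid

@pytest.mark.parametrize(
    "value,expected",
    [
        ("123e4567-e89b-12d3-a456-426614174000", True),  # plain UUID still valid
        ("resp-abc123", True),   # opaque Responses API ID
        ("resp_", False),        # empty token after prefix
        ("resp-abc/123", False), # character outside the allowed set
        ("", False),             # empty string rejected by the input guard
    ],
)
def test_check_suid(value: str, expected: bool) -> None:
    assert check_suid(value) is expected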

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/app/endpoints/query.py (3)

200-217: Security: avoid sending Bearer tokens to non‑HTTPS MCP servers by default.

The code forwards Authorization: Bearer <token> to every configured MCP server when headers are empty. This risks credential exposure if any server uses http:// or is misconfigured.

Apply:

+from urllib.parse import urlparse
@@
-        if not mcp_headers and token:
-            for mcp_server in configuration.mcp_servers:
-                mcp_headers[mcp_server.url] = {
-                    "Authorization": f"Bearer {token}",
-                }
+        if not mcp_headers and token:
+            for mcp_server in configuration.mcp_servers:
+                url = mcp_server.url
+                scheme = urlparse(url).scheme.lower()
+                if scheme != "https":
+                    logger.warning("Skipping non-HTTPS MCP server for auth header: %s", url)
+                    continue
+                mcp_headers[url] = {"Authorization": f"Bearer {token}"}

Optionally gate this behind a config flag (default off) and/or send tokens only to whitelisted servers that explicitly require auth.


178-182: Attachments privacy TODO is a compliance risk; prioritize redaction.

There’s a TODO to redact attachments before sending to the LLM. If attachments can include PII or secrets, add at least basic redaction or a feature flag to block unredacted uploads.

I can draft a pluggable redactor interface and a default regex‑based redactor for common PII (emails, phones, SSNs), gated by config.


54-56: Fix query_response to a valid FastAPI/OpenAPI responses mapping

src/utils/query.py defines query_response but several entries aren’t in FastAPI’s expected shape:

  • 200 is a raw example body (conversation_id/response) instead of a response mapping — use "model": QueryResponse or "content": {"application/json": {"example": {...}}}.
  • 503 uses a "detail" key with nested payload; replace with "description" and/or "content" (or a "model") to match FastAPI.
  • 400/403 already include "model" (OK).

Location: src/utils/query.py — query_response definition.
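
A sketch of the conformant shape, assuming QueryResponse is the 200 model (the 503 example body is illustrative):

# Hypothetical FastAPI-shaped responses mapping for query_response.
from typing import Any

from models.responses import QueryResponse

query_response: dict[int | str, dict[str, Any]] = {
    200: {"model": QueryResponse},
    503: {
        "description": "Unable to connect to Llama Stack",
        "content": {
            "application/json": {
                "example": {
                    "detail": {
                        "response": "Unable to connect to Llama Stack",
                        "cause": "Connection refused",
                    }
                }
            }
        },
    },
}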

♻️ Duplicate comments (6)
src/app/endpoints/conversations.py (3)

100-100: Typed vs dict session object: .get will fail at runtime.
client.agents.session.list(...) likely returns typed models, not dicts. Accessing via .get risks AttributeError. Use type-safe access with a fallback and validate presence.

Apply this diff:

-        session_id = str(agent_sessions[0].get("session_id"))
+        first = agent_sessions[0]
+        session_id = (
+            first.get("session_id") if isinstance(first, dict) else getattr(first, "session_id", None)
+        )
+        if not session_id:
+            raise RuntimeError("session_id missing on session")
+        session_id = str(session_id)

155-159: Same typed vs dict issue on delete; also consider multiple sessions.
Using .get may break; fix similarly. Optionally, delete all sessions for the conversation instead of just the first.

Apply this minimal fix:

-    session_id = str(sessions[0].get("session_id"))
-    await client.agents.session.delete(agent_id=agent_id, session_id=session_id)
+    first = sessions[0]
+    session_id = (
+        first.get("session_id") if isinstance(first, dict) else getattr(first, "session_id", None)
+    )
+    if not session_id:
+        raise RuntimeError("session_id missing on session")
+    await client.agents.session.delete(agent_id=agent_id, session_id=str(session_id))

Optional improvement (delete all sessions):

-    first = sessions[0]
-    session_id = (
-        first.get("session_id") if isinstance(first, dict) else getattr(first, "session_id", None)
-    )
-    if not session_id:
-        raise RuntimeError("session_id missing on session")
-    await client.agents.session.delete(agent_id=agent_id, session_id=str(session_id))
+    for s in sessions:
+        sid = s.get("session_id") if isinstance(s, dict) else getattr(s, "session_id", None)
+        if not sid:
+            continue
+        await client.agents.session.delete(agent_id=agent_id, session_id=str(sid))

182-188: Awaiting a sync base + passing async helpers will crash.
delete_conversation_base is synchronous in utils.conversations, yet you await it and pass async helpers; this will raise at runtime and the async helpers won’t be awaited inside the base.

Preferred fix: make the base async and await the helpers; keep this call site as await.

Patch (in src/utils/conversations.py):

-def delete_conversation_base(
+async def delete_conversation_base(
     request: Request,
     conversation_id: str,
     auth: Any,
-    get_session_func: callable,
-    delete_session_func: callable,
+    get_session_func: callable,
+    delete_session_func: callable,
 ) -> ConversationDeleteResponse:
@@
-        sessions = get_session_func(client, conversation_id)
+        sessions = await get_session_func(client, conversation_id)
@@
-        delete_session_func(client, conversation_id, sessions)
+        await delete_session_func(client, conversation_id, sessions)

If you must keep the base sync, then do not await it here and instead provide sync wrappers that offload the async client calls, or run the entire base in a threadpool. Async base is cleaner.

Run this to confirm current signature and callers:

#!/bin/bash
rg -nC2 -P '\bdef\s+delete_conversation_base\s*\(' src | sed -n '1,120p'
rg -nP 'await\s+delete_conversation_base\(' src | sed -n '1,200p'
pyproject.toml (1)

31-32: Add upper bounds for llama-stack(+client) to match MAXIMAL_SUPPORTED_LLAMA_STACK_VERSION (0.2.21).

Prevents unsupported upgrades that will be rejected at runtime by the version gate.

-    "llama-stack>=0.2.19",
-    "llama-stack-client>=0.2.19",
+    "llama-stack>=0.2.19,<=0.2.21",
+    "llama-stack-client>=0.2.19,<=0.2.21",
src/utils/suid.py (1)

16-37: Treat resp-* IDs as opaque; remove UUID padding/coercion.

Zero‑padding and UUID parsing can accept malformed IDs and reject valid ones.

 def check_suid(suid: str) -> bool:
@@
-    # Handle Responses API IDs
-    if suid.startswith("resp-") or suid.startswith("resp_"):
-        token = suid[5:]
-        if not token:
-            return False
-        # If truncated (e.g., shell cut reduced length), pad to canonical UUID length
-        if len(token) < 36:
-            token = token + ("0" * (36 - len(token)))
-        try:
-            uuid.UUID(token)
-            return True
-        except (ValueError, TypeError):
-            return False
+    # Handle Responses API IDs – treat as opaque
+    if suid.startswith(("resp-", "resp_")):
+        token = suid[5:]
+        return bool(token) and all(ch.isalnum() or ch in "-_." for ch in token)
src/app/endpoints/query_v2.py (1)

25-29: Validate conversation ownership before chaining via previous_response_id.

Prevents cross‑user context leakage; mirrors v1 parity.

 from utils.endpoints import (
     check_configuration_loaded,
     get_system_prompt,
     validate_model_provider_override,
+    validate_conversation_ownership,
 )
@@
-    user_conversation: UserConversation | None = None
-    if query_request.conversation_id:
-        # TODO: Implement conversation once Llama Stack supports its API
-        pass
+    user_conversation: UserConversation | None = None
+    if query_request.conversation_id:
+        user_conversation = validate_conversation_ownership(
+            user_id=user_id, conversation_id=query_request.conversation_id
+        )
+        if user_conversation is None:
+            raise HTTPException(
+                status_code=status.HTTP_403_FORBIDDEN,
+                detail={
+                    "response": "Access denied",
+                    "cause": "You do not have permission to access this conversation",
+                },
+            )

Also applies to: 79-83

🧹 Nitpick comments (11)
src/utils/streaming_query.py (2)

80-100: Use resilient keys for doc URL in end event.

Metadata may use "doc_url" or "docs_url". Accept both to avoid silent drops.

-                "referenced_documents": [
-                    {
-                        "doc_url": v["docs_url"],
-                        "doc_title": v["title"],
-                    }
-                    for v in filter(
-                        lambda v: ("docs_url" in v) and ("title" in v),
-                        metadata_map.values(),
-                    )
-                ],
+                "referenced_documents": [
+                    {
+                        "doc_url": v.get("docs_url") or v.get("doc_url"),
+                        "doc_title": v["title"],
+                    }
+                    for v in filter(
+                        lambda v: (("docs_url" in v) or ("doc_url" in v)) and ("title" in v),
+                        metadata_map.values(),
+                    )
+                ],

304-329: Handle dict-shaped tool_call deltas.

Some providers emit tool_call deltas as dicts; add a safe branch.

-        if chunk.event.payload.delta.type == "tool_call":
-            if isinstance(chunk.event.payload.delta.tool_call, str):
+        if chunk.event.payload.delta.type == "tool_call":
+            if isinstance(chunk.event.payload.delta.tool_call, str):
                 ...
-            elif isinstance(chunk.event.payload.delta.tool_call, ToolCall):
+            elif isinstance(chunk.event.payload.delta.tool_call, ToolCall):
                 ...
+            elif isinstance(chunk.event.payload.delta.tool_call, dict):
+                name = (
+                    chunk.event.payload.delta.tool_call.get("tool_name")
+                    or chunk.event.payload.delta.tool_call.get("function", {}).get("name")
+                    or "tool_call"
+                )
+                yield format_stream_data(
+                    {"event": "tool_call",
+                     "data": {"id": chunk_id, "role": chunk.event.payload.step_type, "token": name}}
+                )
src/app/endpoints/query_v2.py (4)

197-206: Guard previous_response_id to only pass resp-* IDs (avoid mixing v1 UUIDs).

Prevents API errors when a v1 conversation UUID is supplied.

-    response = await client.responses.create(
+    prev_resp_id = (
+        query_request.conversation_id
+        if query_request.conversation_id and query_request.conversation_id.startswith(("resp-", "resp_"))
+        else None
+    )
+    response = await client.responses.create(
         input=query_request.query,
         model=model_id,
         instructions=system_prompt,
-        previous_response_id=query_request.conversation_id,
+        previous_response_id=prev_resp_id,
         tools=tools if tools else None,
         stream=False,
         store=True,
     )

233-235: Lower response-content logging to DEBUG to reduce sensitive data in logs.

-                logger.info("Model response content: '%s'",
+                logger.debug("Model response content: '%s'",
                            llm_response[:200] + "..." if len(llm_response) > 200 else llm_response)

190-195: Propagate per-server MCP headers into tool definitions.

The dependency already provides headers; merge them with Authorization.

-        mcp_tools = get_mcp_tools(configuration.mcp_servers, token)
+        mcp_tools = get_mcp_tools(configuration.mcp_servers, token, mcp_headers)
-def get_mcp_tools(mcp_servers: list, token: str | None = None) -> list[dict]:
+def get_mcp_tools(
+    mcp_servers: list,
+    token: str | None = None,
+    mcp_headers: dict[str, dict[str, str]] | None = None,
+) -> list[dict]:
@@
-        # Add authentication if token provided (Response API format)
-        if token:
-            tool_def["headers"] = {
-                "Authorization": f"Bearer {token}"
-            }
+        # Compose headers: Authorization + per-server forwarded headers
+        headers: dict[str, str] = {}
+        if token:
+            headers["Authorization"] = f"Bearer {token}"
+        if mcp_headers and mcp_server.name in mcp_headers:
+            headers.update(mcp_headers[mcp_server.name])
+        if headers:
+            tool_def["headers"] = headers

Also applies to: 284-303


6-11: Avoid runtime dependency on server-side types for casting.

Use TYPE_CHECKING to keep imports type-only and prevent version coupling with llama_stack server package at runtime.

-from llama_stack.apis.agents.openai_responses import (
-    OpenAIResponseObject,
-)
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:  # type-only to avoid runtime coupling
+    from llama_stack_client.types.responses import OpenAIResponseObject  # pragma: no cover
@@
-    response = cast(OpenAIResponseObject, response)
+    response = cast("OpenAIResponseObject", response)  # type: ignore[name-defined]

Also applies to: 207-208

src/app/endpoints/query.py (5)

85-90: Cache model listing to cut per‑request latency.

await client.models.list() on every call can be expensive. Add a short TTL cache (e.g., 30–60s) in the client holder or utils to avoid repeated network lists in hot paths.
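
As a sketch, assuming an async client whose models.list() is awaitable (the helper name and 30s TTL below are illustrative), a module-level TTL cache could look like this; the same pattern would fit the per-request vector DB listing flagged further down:

import time

_MODELS_CACHE: tuple[float, list] | None = None  # hypothetical module-level cache
_MODELS_TTL_SECONDS = 30.0

async def list_models_cached(client) -> list:
    """Return client.models.list() results, re-fetching at most once per TTL window."""
    global _MODELS_CACHE  # pylint: disable=global-statement
    now = time.monotonic()
    if _MODELS_CACHE is not None and now - _MODELS_CACHE[0] < _MODELS_TTL_SECONDS:
        return _MODELS_CACHE[1]
    models = await client.models.list()
    _MODELS_CACHE = (now, models)
    return models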


91-99: Name clarity: pass stack model id as such.

You pass llama_stack_model_id into retrieve_response(model_id=...). Consider renaming the parameter in retrieve_response to stack_model_id (and the callsite) to avoid confusion with model_id used for metrics/transcripts.


158-165: Avoid double network call to list shields.

await client.shields.list() is invoked twice. Fetch once and derive both input/output lists.

Apply:

-    available_input_shields = [
-        shield.identifier
-        for shield in filter(is_input_shield, await client.shields.list())
-    ]
-    available_output_shields = [
-        shield.identifier
-        for shield in filter(is_output_shield, await client.shields.list())
-    ]
+    shields = await client.shields.list()
+    available_input_shields = [s.identifier for s in filter(is_input_shield, shields)]
+    available_output_shields = [s.identifier for s in filter(is_output_shield, shields)]

175-176: Don’t log full system prompts.

System prompts may include sensitive policy text or secrets. Log a hash or length, or truncate to a small prefix.

Apply:

-    logger.debug("Using system prompt: %s", system_prompt)
+    logger.debug("Using system prompt (len=%d, sha256=%s)", len(system_prompt), __import__("hashlib").sha256(system_prompt.encode()).hexdigest()[:12])

218-224: Confirm typing; avoid per-request vector DB listing.

  • Typing: Toolgroup = Union[str, ToolgroupAgentToolGroupWithArgs]; agent.create_turn accepts List[Toolgroup], so mixing plain toolgroup id strings and structured dicts is valid — no change required.
  • Performance: calling await client.vector_dbs.list() on every request (src/app/endpoints/query.py: lines 218–224) can add latency. Cache the vector DB IDs with a short TTL or fetch them only when RAG is enabled/needed.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3d8748 and 692ede7.

📒 Files selected for processing (13)
  • pyproject.toml (1 hunks)
  • src/app/endpoints/conversations.py (7 hunks)
  • src/app/endpoints/conversations_v2.py (1 hunks)
  • src/app/endpoints/query.py (6 hunks)
  • src/app/endpoints/query_v2.py (1 hunks)
  • src/app/endpoints/streaming_query.py (2 hunks)
  • src/app/endpoints/streaming_query_v2.py (1 hunks)
  • src/app/routers.py (2 hunks)
  • src/constants.py (1 hunks)
  • src/utils/conversations.py (1 hunks)
  • src/utils/query.py (1 hunks)
  • src/utils/streaming_query.py (1 hunks)
  • src/utils/suid.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/app/routers.py
  • src/utils/conversations.py
  • src/utils/query.py
  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/conversations_v2.py
🧰 Additional context used
🧬 Code graph analysis (4)
src/utils/streaming_query.py (4)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • append_tool_calls_from_llama (65-78)
src/utils/endpoints.py (1)
  • get_system_prompt (71-111)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/app/endpoints/streaming_query.py (1)
  • response_generator (140-201)
src/app/endpoints/query_v2.py (6)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/configuration.py (3)
  • configuration (60-64)
  • llama_stack_configuration (74-78)
  • mcp_servers (88-92)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/utils/endpoints.py (3)
  • check_configuration_loaded (62-68)
  • get_system_prompt (71-111)
  • validate_model_provider_override (114-134)
src/utils/query.py (4)
  • evaluate_model_hints (94-133)
  • process_transcript_and_persist_conversation (352-383)
  • select_model_and_provider_id (136-213)
  • validate_attachments_metadata (258-287)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • ToolCallSummary (43-56)
src/app/endpoints/query.py (6)
src/models/requests.py (1)
  • QueryRequest (72-222)
src/models/responses.py (1)
  • QueryResponse (48-79)
src/utils/endpoints.py (2)
  • get_agent (138-187)
  • get_system_prompt (71-111)
src/utils/mcp_headers.py (2)
  • mcp_headers_dependency (15-26)
  • handle_mcp_headers_with_toolgroups (56-90)
src/utils/types.py (1)
  • TurnSummary (59-78)
src/utils/query.py (8)
  • evaluate_model_hints (94-133)
  • select_model_and_provider_id (136-213)
  • is_input_shield (242-255)
  • is_output_shield (229-239)
  • validate_attachments_metadata (258-287)
  • validate_query_request (290-335)
  • handle_api_connection_error (338-349)
  • process_transcript_and_persist_conversation (352-383)
src/app/endpoints/conversations.py (3)
src/utils/conversations.py (5)
  • simplify_session_data (109-148)
  • get_conversations_list_base (151-207)
  • validate_conversation_id (210-220)
  • validate_conversation_access (223-259)
  • delete_conversation_base (262-360)
src/models/config.py (1)
  • Action (305-344)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
🔇 Additional comments (12)
src/app/endpoints/conversations.py (9)

1-1: Docstring update aligns with Agent API usage.
Clear and accurate.


21-30: Good centralization on utils.conversations.
Imports look correct and promote reuse.


55-55: Docstring clarifications look good.
Accurately describes Agent API flow and error semantics.

Also applies to: 58-58


72-72: Early ID validation is correct.
Good to fail fast on malformed IDs.


76-81: Access check is consistent with READ_OTHERS policy.
Looks correct given Action.READ_OTHERS_CONVERSATIONS.


84-84: Logging is fine.
Message is informative and scoped.


89-89: (no functional change)


149-153: Helper is correct and concise.
Returns the raw .data which aligns with usage.


172-177: Deletion docstring is accurate.
Matches the behavior and mapped error codes.

src/constants.py (1)

5-5: Version cap aligns with the proposed dependency bounds.

LGTM. Once pyproject upper bounds are added, this gate will be consistent.

src/app/endpoints/query.py (2)

100-101: No change needed — label order is correct.

llm_calls_total is declared with labels ["provider","model"] in src/metrics/init.py and call sites pass (provider_id, model_id) in the endpoints — no changes required.


77-81: Avoid positional unpacking of AuthTuple; use named access or a helper.

Replace "_, _, _, token = auth" with a named attribute (e.g., auth.token) or a small helper that safely extracts the token — positional unpacking is brittle if AuthTuple changes. Location: src/app/endpoints/query.py (lines 77–81). Verification attempt failed: running rg for "AuthTuple" returned "authentication: No such file or directory (os error 2)"; AuthTuple definition not found in repo — confirm its shape and update the unpacking accordingly.

"cause": f"Unknown error while getting conversations for user {user_id}",
},
) from e
return get_conversations_list_base(request, auth)

⚠️ Potential issue

Avoid blocking the event loop when calling a sync base.
get_conversations_list_base is synchronous and does DB I/O; calling it directly from an async endpoint will block the event loop. Run it in a threadpool.

Apply this diff:

@@
-from fastapi import APIRouter, HTTPException, Request, status, Depends
+from fastapi import APIRouter, HTTPException, Request, status, Depends
+from starlette.concurrency import run_in_threadpool
@@
-    return get_conversations_list_base(request, auth)
+    return await run_in_threadpool(get_conversations_list_base, request, auth)
🤖 Prompt for AI Agents
In src/app/endpoints/conversations.py around line 44, the endpoint currently
calls the synchronous get_conversations_list_base directly which blocks the
event loop; change the call to run the sync DB I/O in a threadpool (use
asyncio.to_thread or loop.run_in_executor) and await its result instead. Import
asyncio if needed, replace the direct return with an awaited
asyncio.to_thread(get_conversations_list_base, request, auth) (or equivalent
run_in_executor call) so the blocking work executes off the event loop and the
endpoint remains async.

Comment on lines +75 to +77
logger.info("Llama stack config: %s", configuration.llama_stack_configuration)


⚠️ Potential issue

Avoid logging full Llama Stack configuration at INFO (may expose secrets/URLs).

Log only mode or move to DEBUG without printing the object.

-    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
+    logger.debug(
+        "Llama Stack configured (mode=%s)",
+        "library" if configuration.llama_stack_configuration.use_as_library_client else "service",
+    )
🤖 Prompt for AI Agents
In src/app/endpoints/query_v2.py around lines 75-76, the current INFO log prints
the entire Llama stack configuration which may expose secrets/URLs; change this
to either logger.debug(...) instead of logger.info(...) or log only
non-sensitive fields (e.g., configuration.llama_stack_configuration.mode or an
explicit safe subset), and ensure any sensitive fields are omitted or masked
before logging.

Comment on lines 123 to 129
if hasattr(chunk, "error"):
yield from _handle_error_event(chunk, chunk_id)

event_type = chunk.event.payload.event_type
step_type = getattr(chunk.event.payload, "step_type", None)

match (event_type, step_type):

⚠️ Potential issue

Error chunks should short‑circuit to avoid AttributeError.

After yielding the error event, the function continues and accesses chunk.event, which may not exist on error chunks.

-    if hasattr(chunk, "error"):
-        yield from _handle_error_event(chunk, chunk_id)
+    if hasattr(chunk, "error"):
+        yield from _handle_error_event(chunk, chunk_id)
+        return

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/utils/streaming_query.py around lines 123 to 129, after yielding the
error event the function continues and then accesses chunk.event which may not
exist on error chunks; modify the control flow so that once hasattr(chunk,
"error") yields the error (yield from _handle_error_event(...)) the function
short‑circuits (e.g., return or continue the loop) to avoid further access to
chunk.event, ensuring error chunks do not execute the subsequent
event_type/step_type logic.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unit/app/endpoints/test_query.py (1)

113-124: Patch mock target to utils.query.configuration (not app.endpoints.query.configuration)

Replace all occurrences in tests/unit/app/endpoints/test_query.py (lines: 121, 129, 151, 446, 478, 511, 550, 600, 653, 708, 757, 807, 876, 946, 1008, 1105, 1176, 1232, 1283, 1342, 1393, 1559): change
mocker.patch("app.endpoints.query.configuration", ...) -> mocker.patch("utils.query.configuration", ...)

src/app/endpoints/query.py (1)

246-248: Token metrics label mismatch — normalize model label (no provider prefix) across metrics.

Observed mismatch: query.py derives model_label by splitting "provider/model" and passes (model_label, provider_id) to update_llm_token_count_from_turn, but src/metrics/utils.py uses model.identifier directly when registering provider/model metrics, and src/utils/query.py sometimes returns inconsistent values from select_model_and_provider_id.

  • Problem locations: src/app/endpoints/query.py (lines ~244–248), src/metrics/utils.py (update_llm_token_count_from_turn & setup_model_metrics), src/utils/query.py (select_model_and_provider_id).
  • Impact: metrics.provider_model_configuration and runtime token metrics may use different label keys (e.g., ("provider","provider/model") vs ("provider","model")), causing scattered/incorrect metric counts.
  • Fixes:
    • In src/metrics/utils.py, normalize model_name before labeling: model_name = model.identifier.split("/", 1)[1] if "/" in model.identifier else model.identifier.
    • In src/utils/query.py, ensure select_model_and_provider_id always returns a consistent tuple: (llama_stack_model_id as "provider/model", plain_model_name, provider_id) instead of returning model_id twice in the first-LLM branch.
    • Alternatively, add a single helper (e.g., utils.query.get_model_label(llama_stack_model_id)) and use it everywhere to compute the metric label.
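
A minimal sketch of that helper (the name comes from the bullet above; treat placement and naming as illustrative):

def get_model_label(llama_stack_model_id: str) -> str:
    """Strip a leading 'provider/' prefix so metric labels carry the plain model name."""
    if "/" in llama_stack_model_id:
        return llama_stack_model_id.split("/", 1)[1]
    return llama_stack_model_id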
♻️ Duplicate comments (12)
src/app/endpoints/conversations.py (4)

37-45: Don’t block the event loop: run the sync base in a threadpool

get_conversations_list_base is synchronous and does DB I/O. Calling it directly from an async route will block.

Apply:

-    return get_conversations_list_base(request, auth)
+    return await run_in_threadpool(get_conversations_list_base, request, auth)

Add import (outside this hunk):

from starlette.concurrency import run_in_threadpool

92-105: Type-safe session_id extraction (avoid .get on typed SDK models)

SDK objects are typed; .get may raise at runtime. Validate and handle missing IDs explicitly.

-        session_id = str(agent_sessions[0].get("session_id"))
+        first = agent_sessions[0]
+        session_id = (
+            first.get("session_id") if isinstance(first, dict) else getattr(first, "session_id", None)
+        )
+        if not session_id:
+            raise HTTPException(
+                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                detail={"response": "Invalid session", "cause": "session_id missing"},
+            )
+        session_id = str(session_id)
#!/bin/bash
# Find any remaining fragile `.get("session_id")` usages across the repo.
rg -nP -C2 "\.get\(\s*['\"]session_id['\"]\s*\)" || true

187-193: LGTM: async base awaited correctly

Awaiting delete_conversation_base with async helpers resolves the earlier await-mismatch.


158-165: Same typed-access bug in delete helper

Avoid .get; validate session_id before delete.

-    session_id = str(sessions[0].get("session_id"))
-    await client.agents.session.delete(agent_id=agent_id, session_id=session_id)
+    first = sessions[0]
+    session_id = (
+        first.get("session_id") if isinstance(first, dict) else getattr(first, "session_id", None)
+    )
+    if not session_id:
+        raise RuntimeError("session_id missing on session")
+    await client.agents.session.delete(agent_id=agent_id, session_id=str(session_id))
src/utils/query.py (4)

55-62: BUG: configuration accessors must be called (will raise at runtime).

Use configuration.user_data_collection_configuration() accessor before reading transcripts_enabled.

Apply:

-    return configuration.user_data_collection_configuration.transcripts_enabled
+    return configuration.user_data_collection_configuration().transcripts_enabled

158-169: BUG: Don’t overwrite provided model/provider; call config accessors.

Current branch assigns both defaults when either is missing and uses attributes instead of accessors.

Apply:

-    if not model_id or not provider_id:
+    if not model_id or not provider_id:
         logger.debug(
             "No model ID or provider ID specified in request, checking configuration"
         )
-        model_id = configuration.inference.default_model  # type: ignore[reportAttributeAccessIssue]
-        provider_id = (
-            configuration.inference.default_provider  # type: ignore[reportAttributeAccessIssue]
-        )
+        if not model_id:
+            model_id = configuration.inference().default_model
+        if not provider_id:
+            provider_id = configuration.inference().default_provider

175-185: BUG: Wrong tuple and identifiers when auto‑selecting first LLM.

Return value duplicates model_id and doesn’t derive plain model id from provider/model identifier.

Apply:

-            model = next(
+            model = next(
                 m
                 for m in models
                 if m.model_type == "llm"  # pyright: ignore[reportAttributeAccessIssue]
             )
-            model_id = model.identifier
-            provider_id = model.provider_id
-            logger.info("Selected model: %s", model)
-            return model_id, model_id, provider_id
+            llama_stack_model_id = model.identifier
+            provider_id = model.provider_id
+            model_id = (
+                llama_stack_model_id.split("/", 1)[1]
+                if "/" in llama_stack_model_id
+                else llama_stack_model_id
+            )
+            logger.info("Selected model: %s", llama_stack_model_id)
+            return llama_stack_model_id, model_id, provider_id

307-308: Don’t log entire Llama Stack config; call accessor and log safe fields.

Full config may include secrets; also needs accessor call.

Apply:

-    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
+    cfg = configuration.llama_stack_configuration()
+    logger.info(
+        "Llama Stack version: %s; providers: %s",
+        getattr(cfg, "version", "n/a"),
+        getattr(cfg, "providers", "n/a"),
+    )
src/app/endpoints/streaming_query_v2.py (1)

308-308: SSE headers missing (clients may buffer or misinterpret stream).

Return StreamingResponse with proper media type and no‑buffering headers.

Apply:

-        return StreamingResponse(response_generator(response))
+        return StreamingResponse(
+            response_generator(response),
+            media_type="text/event-stream",
+            headers={
+                "Cache-Control": "no-cache",
+                "Connection": "keep-alive",
+                "X-Accel-Buffering": "no",
+            },
+        )
src/utils/streaming_query.py (1)

123-129: Short-circuit on error chunks to avoid AttributeError.

After yielding the error event, the function still accesses chunk.event.* which may not exist on error chunks.

     if hasattr(chunk, "error"):
-        yield from _handle_error_event(chunk, chunk_id)
+        yield from _handle_error_event(chunk, chunk_id)
+        return
src/app/endpoints/query_v2.py (2)

66-69: Avoid logging full Llama Stack configuration at INFO in validate_query_request.

validate_query_request currently logs the entire configuration object at INFO, which risks leaking secrets/URLs. Downgrade to DEBUG and log only safe fields.

Proposed change in src/utils/query.py:

-    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
+    logger.debug(
+        "Llama Stack configured (mode=%s)",
+        "library"
+        if configuration.llama_stack_configuration.use_as_library_client
+        else "service",
+    )

226-235: Don’t log user/model content at INFO.

Even truncated, response text can contain sensitive data. Move to DEBUG.

-                logger.info(
+                logger.debug(
                     "Model response content: '%s'",
                     (
                         llm_response[:200] + "..."
                         if len(llm_response) > 200
                         else llm_response
                     ),
                 )
🧹 Nitpick comments (18)
src/app/endpoints/conversations.py (3)

92-101: 404 isn’t an error condition; lower log level

“No sessions found” maps to 404; use warning to reduce error-noise.

-            logger.error("No sessions found for conversation %s", conversation_id)
+            logger.warning("No sessions found for conversation %s", conversation_id)

152-156: Fix return type: client.agents.session.list(...).data returns typed models, not dicts

Current annotation misleads and encourages dict-style access.

-async def get_agent_sessions(client: Any, conversation_id: str) -> list[dict[str, Any]]:
+async def get_agent_sessions(client: Any, conversation_id: str) -> list[Any]:

92-105: Prefer latest session, not arbitrary first

If multiple sessions exist, select the most recent by created_at (if available).

# before picking first:
sessions_sorted = sorted(
    agent_sessions,
    key=lambda s: (getattr(s, "created_at", None) or (s.get("created_at") if isinstance(s, dict) else None)) or "",
    reverse=True,
)
first = sessions_sorted[0]
src/utils/query.py (4)

150-156: Docstring is inaccurate (function returns 3 values).

Update to document (llama_stack_model_id, model_id, provider_id).

Apply:

-    Returns:
-        A tuple containing the combined model ID (in the format
-        "provider/model") and the provider ID.
+    Returns:
+        tuple[str, str, str]: (llama_stack_model_id "provider/model",
+        model_id, provider_id).

242-256: Tighten shield classification; current logic labels unknowns as input.

Use explicit prefix checks; avoid “not output ⇒ input”.

Apply:

-def is_input_shield(shield: Shield) -> bool:
+def is_input_shield(shield: Shield) -> bool:
@@
-    return _is_inout_shield(shield) or not is_output_shield(shield)
+    return _is_inout_shield(shield) or shield.identifier.startswith("input_")

33-52: Align OpenAPI response codes with behavior.

Docs list 400/403/503, while handlers raise 401/500 in places. Consider 401 for UnauthorizedResponse, and 500 (or 503 consistently) for upstream connectivity.

Apply (example):

-    400: {
-        "description": "Missing or invalid credentials provided by client",
-        "model": UnauthorizedResponse,
-    },
+    401: {
+        "description": "Missing or invalid credentials provided by client",
+        "model": UnauthorizedResponse,
+    },
@@
-    503: {
+    500: {
         "detail": {
             "response": "Unable to connect to Llama Stack",
             "cause": "Connection error.",
         }
     },

290-299: Minor: Drop Depends annotation in utility signature.

This is not a FastAPI endpoint; the Depends marker is misleading.

Apply:

-def validate_query_request(
+def validate_query_request(
     request: Request,
     query_request: QueryRequest,
-    auth: Annotated[AuthTuple, Depends],
+    auth: AuthTuple,
 ) -> tuple[str, UserConversation | None, str]:
tests/unit/app/test_routers.py (2)

67-67: Avoid brittle magic number for router count.

Asserting exact length will break when routers are added/removed. Prefer asserting presence only.

Apply:

-    assert len(app.routers) == 14
+    # avoid brittle count; presence checks below suffice

91-91: Same here: drop hardcoded router count.

Apply:

-    assert len(app.routers) == 14
+    # avoid brittle count; verify prefixes instead
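
For example, a presence check instead of a count (assuming the test double records (router, prefix) pairs in app.routers; adjust to the mock's actual shape):

registered_prefixes = {prefix for _, prefix in app.routers}
assert "/v1" in registered_prefixes
assert "/v2" in registered_prefixes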
tests/unit/app/endpoints/test_conversations.py (1)

273-274: Be consistent about patch targets.

You patch utils.conversations.check_suid here but configuration is patched under app.endpoints.conversations elsewhere. Prefer a single module boundary (utils.conversations) to reduce surprises.

tests/unit/app/endpoints/test_streaming_query_v2.py (1)

156-166: Consider asserting SSE headers.

Add checks for media_type text/event-stream and no-buffering headers to catch regressions.

Example:

assert resp.media_type == "text/event-stream"
assert resp.headers.get("X-Accel-Buffering") == "no"
src/app/endpoints/streaming_query_v2.py (2)

329-396: MCP headers param is unused; either plumb it or drop it.

You accept mcp_headers but never apply it to tool definitions.

Option A: merge headers per server URL:

-        mcp_tools = get_mcp_tools(configuration.mcp_servers(), token)
+        mcp_tools = get_mcp_tools(configuration.mcp_servers(), token, mcp_headers=mcp_headers)

…paired with updating get_mcp_tools signature to merge provided headers:

-def get_mcp_tools(mcp_servers: list, token: str | None = None) -> list[dict[str, Any]]:
+def get_mcp_tools(mcp_servers: list, token: str | None = None, mcp_headers: dict[str, dict[str, str]] | None = None) -> list[dict[str, Any]]:
@@
-        if token:
-            tool_def["headers"] = {"Authorization": f"Bearer {token}"}
+        headers = {}
+        if token:
+            headers["Authorization"] = f"Bearer {token}"
+        if mcp_headers:
+            headers.update(mcp_headers.get(mcp_server.url, {}))
+        if headers:
+            tool_def["headers"] = headers

Option B: remove mcp_headers from signatures to avoid dead parameters (less flexible).


12-15: Nit: unused import “status” from starlette.

Remove if not used.

-from starlette import status
tests/unit/app/endpoints/test_query.py (1)

237-243: Wrong default model/provider patch target.

select_model_and_provider_id reads defaults from utils.query.configuration, not metrics.utils.configuration. Patch the correct target so configuration-driven tests behave as intended.

-    mocker.patch(
-        "metrics.utils.configuration.inference.default_provider",
-        "default_provider",
-    )
-    mocker.patch(
-        "metrics.utils.configuration.inference.default_model",
-        "default_model",
-    )
+    mocker.patch("utils.query.configuration.inference.default_provider", "default_provider")
+    mocker.patch("utils.query.configuration.inference.default_model", "default_model")

Also applies to: 277-283

src/utils/streaming_query.py (1)

23-24: Make METADATA_PATTERN non-greedy and EOL-tolerant.

Current pattern can over-capture and requires a trailing newline. Use a non-greedy match and optional EOL.

-METADATA_PATTERN = re.compile(r"\nMetadata: (\{.+})\n")
+METADATA_PATTERN = re.compile(r"\nMetadata:\s*(\{.*?\})(?:\n|$)")
src/app/endpoints/conversations_v2.py (1)

27-29: Nit: use module-specific logger name for filtering.

Helps log routing and ops.

-logger = logging.getLogger("app.endpoints.handlers")
+logger = logging.getLogger("app.endpoints.conversations_v2")
src/app/endpoints/streaming_query.py (1)

62-63: Remove unused METADATA_PATTERN.

SSE parsing moved to utils.streaming_query; this constant is now dead.

-METADATA_PATTERN = re.compile(r"\nMetadata: (\{.+})\n")
src/app/endpoints/query_v2.py (1)

124-131: Wire mcp_headers into Responses API tool headers.

mcp_headers is accepted by the endpoint but unused in retrieve_response(). Merge provided headers with token-derived headers so callers can scope per-server credentials.

@@
-async def retrieve_response(  # pylint: disable=too-many-locals,too-many-branches
+async def retrieve_response(  # pylint: disable=too-many-locals,too-many-branches
@@
-    mcp_headers: dict[str, dict[str, str]] | None = None,
+    mcp_headers: dict[str, dict[str, str]] | None = None,
@@
-        mcp_tools = get_mcp_tools(configuration.mcp_servers, token)
+        mcp_tools = get_mcp_tools(configuration.mcp_servers, token, mcp_headers)
         if mcp_tools:
             tools.extend(mcp_tools)
@@
-def get_mcp_tools(mcp_servers: list, token: str | None = None) -> list[dict[str, Any]]:
+def get_mcp_tools(
+    mcp_servers: list, token: str | None = None, extra_headers: dict[str, dict[str, str]] | None = None
+) -> list[dict[str, Any]]:
@@
-    for mcp_server in mcp_servers:
+    for mcp_server in mcp_servers:
         tool_def = {
             "type": "mcp",
             "server_label": mcp_server.name,
             "server_url": mcp_server.url,
             "require_approval": "never",
         }
 
-        # Add authentication if token provided (Response API format)
-        if token:
-            tool_def["headers"] = {"Authorization": f"Bearer {token}"}
+        # Merge auth from explicit headers and/or token
+        headers: dict[str, str] = {}
+        if extra_headers and (h := extra_headers.get(mcp_server.url) or extra_headers.get(mcp_server.name)):
+            headers.update(h)
+        if token and "Authorization" not in headers:
+            headers["Authorization"] = f"Bearer {token}"
+        if headers:
+            tool_def["headers"] = headers

Also applies to: 176-185, 312-329

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 692ede7 and 61301f7.

📒 Files selected for processing (18)
  • src/app/endpoints/conversations.py (7 hunks)
  • src/app/endpoints/conversations_v2.py (1 hunks)
  • src/app/endpoints/query.py (5 hunks)
  • src/app/endpoints/query_v2.py (1 hunks)
  • src/app/endpoints/streaming_query.py (2 hunks)
  • src/app/endpoints/streaming_query_v2.py (1 hunks)
  • src/app/routers.py (3 hunks)
  • src/constants.py (1 hunks)
  • src/utils/conversations.py (1 hunks)
  • src/utils/query.py (1 hunks)
  • src/utils/streaming_query.py (1 hunks)
  • src/utils/suid.py (1 hunks)
  • tests/unit/app/endpoints/test_conversations.py (14 hunks)
  • tests/unit/app/endpoints/test_query.py (7 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (1 hunks)
  • tests/unit/app/endpoints/test_streaming_query_v2.py (1 hunks)
  • tests/unit/app/test_routers.py (5 hunks)
  • tests/unit/conftest.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/utils/suid.py
  • src/app/routers.py
  • src/utils/conversations.py
🧰 Additional context used
🧬 Code graph analysis (11)
tests/unit/app/endpoints/test_streaming_query_v2.py (4)
src/models/requests.py (1)
  • QueryRequest (72-222)
src/models/config.py (3)
  • config (132-138)
  • ModelContextProtocolServer (161-166)
  • Action (305-344)
src/app/endpoints/streaming_query_v2.py (1)
  • streaming_query_endpoint_handler_v2 (59-321)
src/configuration.py (1)
  • mcp_servers (88-92)
tests/unit/app/endpoints/test_query_v2.py (4)
src/models/requests.py (2)
  • QueryRequest (72-222)
  • Attachment (15-69)
src/models/config.py (2)
  • config (132-138)
  • ModelContextProtocolServer (161-166)
src/app/endpoints/query_v2.py (3)
  • get_rag_tools (298-309)
  • get_mcp_tools (312-328)
  • query_endpoint_handler_v2 (45-116)
src/configuration.py (1)
  • mcp_servers (88-92)
src/app/endpoints/conversations_v2.py (5)
src/authentication/__init__.py (1)
  • get_auth_dependency (14-43)
src/authorization/middleware.py (1)
  • authorize (111-122)
src/models/config.py (2)
  • config (132-138)
  • Action (305-344)
src/models/responses.py (3)
  • ConversationDeleteResponse (443-476)
  • ConversationResponse (393-440)
  • ConversationsListResponse (540-597)
src/utils/conversations.py (4)
  • get_conversations_list_base (150-206)
  • delete_conversation_base (259-357)
  • validate_conversation_id (209-219)
  • validate_conversation_access (222-256)
src/app/endpoints/conversations.py (5)
src/models/config.py (2)
  • config (132-138)
  • Action (305-344)
src/models/responses.py (3)
  • ConversationResponse (393-440)
  • ConversationDeleteResponse (443-476)
  • ConversationsListResponse (540-597)
src/utils/endpoints.py (1)
  • check_configuration_loaded (62-68)
src/utils/conversations.py (5)
  • simplify_session_data (108-147)
  • get_conversations_list_base (150-206)
  • validate_conversation_id (209-219)
  • validate_conversation_access (222-256)
  • delete_conversation_base (259-357)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/app/endpoints/streaming_query_v2.py (14)
src/authentication/__init__.py (1)
  • get_auth_dependency (14-43)
src/authorization/middleware.py (1)
  • authorize (111-122)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/configuration.py (3)
  • configuration (60-64)
  • llama_stack_configuration (74-78)
  • mcp_servers (88-92)
src/models/config.py (2)
  • config (132-138)
  • Action (305-344)
src/models/database/conversations.py (1)
  • UserConversation (11-36)
src/models/requests.py (1)
  • QueryRequest (72-222)
src/utils/endpoints.py (4)
  • check_configuration_loaded (62-68)
  • get_system_prompt (71-111)
  • validate_model_provider_override (114-134)
  • validate_conversation_ownership (39-59)
src/utils/mcp_headers.py (1)
  • mcp_headers_dependency (15-26)
src/utils/query.py (5)
  • evaluate_model_hints (94-133)
  • is_transcripts_enabled (55-61)
  • persist_user_conversation_details (64-91)
  • select_model_and_provider_id (136-213)
  • validate_attachments_metadata (258-287)
src/utils/streaming_query.py (3)
  • format_stream_data (26-37)
  • stream_start_event (40-61)
  • stream_end_event (64-100)
src/utils/transcripts.py (1)
  • store_transcript (33-86)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • ToolCallSummary (43-56)
src/app/endpoints/query_v2.py (2)
  • get_rag_tools (298-309)
  • get_mcp_tools (312-328)
tests/unit/app/endpoints/test_query.py (1)
src/utils/query.py (2)
  • is_transcripts_enabled (55-61)
  • select_model_and_provider_id (136-213)
src/app/endpoints/query_v2.py (6)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/models/requests.py (1)
  • QueryRequest (72-222)
src/models/responses.py (1)
  • QueryResponse (48-79)
src/utils/endpoints.py (1)
  • get_system_prompt (71-111)
src/utils/query.py (4)
  • evaluate_model_hints (94-133)
  • process_transcript_and_persist_conversation (352-383)
  • select_model_and_provider_id (136-213)
  • validate_attachments_metadata (258-287)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • ToolCallSummary (43-56)
src/app/endpoints/streaming_query.py (3)
src/app/endpoints/query.py (1)
  • get_rag_toolgroups (266-293)
src/utils/endpoints.py (3)
  • get_system_prompt (71-111)
  • validate_model_provider_override (114-134)
  • validate_conversation_ownership (39-59)
src/utils/streaming_query.py (3)
  • stream_start_event (40-61)
  • stream_end_event (64-100)
  • stream_build_event (103-146)
src/utils/query.py (8)
src/configuration.py (4)
  • configuration (60-64)
  • user_data_collection_configuration (81-85)
  • inference (121-125)
  • llama_stack_configuration (74-78)
src/app/database.py (1)
  • get_session (34-40)
src/models/database/conversations.py (1)
  • UserConversation (11-36)
src/models/requests.py (2)
  • QueryRequest (72-222)
  • Attachment (15-69)
src/models/responses.py (2)
  • UnauthorizedResponse (361-375)
  • ForbiddenResponse (378-390)
src/utils/endpoints.py (3)
  • check_configuration_loaded (62-68)
  • validate_conversation_ownership (39-59)
  • validate_model_provider_override (114-134)
src/utils/transcripts.py (1)
  • store_transcript (33-86)
src/utils/types.py (1)
  • TurnSummary (59-78)
src/utils/streaming_query.py (4)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • append_tool_calls_from_llama (65-78)
src/utils/endpoints.py (1)
  • get_system_prompt (71-111)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/app/endpoints/streaming_query.py (1)
  • response_generator (142-203)
src/app/endpoints/query.py (7)
src/authentication/__init__.py (1)
  • get_auth_dependency (14-43)
src/client.py (1)
  • AsyncLlamaStackClientHolder (18-55)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/models/requests.py (1)
  • QueryRequest (72-222)
src/models/responses.py (1)
  • QueryResponse (48-79)
src/utils/endpoints.py (2)
  • get_agent (138-187)
  • get_system_prompt (71-111)
src/utils/query.py (8)
  • evaluate_model_hints (94-133)
  • select_model_and_provider_id (136-213)
  • is_input_shield (242-255)
  • is_output_shield (229-239)
  • validate_attachments_metadata (258-287)
  • validate_query_request (290-335)
  • handle_api_connection_error (338-349)
  • process_transcript_and_persist_conversation (352-383)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: e2e_tests
🔇 Additional comments (13)
tests/unit/conftest.py (1)

26-36: LGTM: stable Turn mock prevents “did not complete” errors.

This makes downstream tests resilient to missing output_message.

tests/unit/app/test_routers.py (1)

21-24: LGTM: v2 routers included in import surface.

tests/unit/app/endpoints/test_conversations.py (1)

79-80: LGTM: context‑managed get_session mock is correct.

Matches production usage of get_session() with a context manager.

tests/unit/app/endpoints/test_query_v2.py (2)

57-82: LGTM: no-tools path verifies tools=None and prompt wiring.


118-157: LGTM: output parsing and tool call capture looks robust.

src/utils/query.py (1)

1-384: Repo-wide sweep for configuration accessor misuse — verify and fix accesses.

Found non-call attribute accesses to the configuration object; confirm each is a true attribute (leave) or should be invoked (call) and update accordingly:

  • src/utils/query.py:307 — logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
  • src/lightspeed_stack.py:62 — logger.info("Configuration: %s", configuration.configuration)
  • src/lightspeed_stack.py:64 — logger.info("Llama stack configuration: %s", configuration.llama_stack_configuration)
  • src/app/endpoints/streaming_query.py:100 — logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
  • src/app/endpoints/streaming_query_v2.py:89 — logger.info("Llama stack config: %s", configuration.llama_stack_configuration)

If any referenced symbol is a callable (e.g., an accessor method), replace the attribute access with a call and adjust call sites; otherwise keep as-is.

tests/unit/app/endpoints/test_query.py (2)

95-101: Good: configuration-not-loaded path tested via utils.query.configuration None.

This aligns with the new utils-based organization. No changes needed.


62-64: Mocks now point to centralized utils — nice.

Redirecting ownership and persistence to utils.* keeps tests aligned with the refactor.

src/app/endpoints/conversations_v2.py (2)

32-43: Async placeholders match awaited base signature.

delete_conversation_base awaits the session helpers; keeping these async is correct.


85-101: Resolved: the base is async now, so awaiting it is correct.

Past comments to drop await are obsolete given utils.conversations.delete_conversation_base is async.

src/app/endpoints/streaming_query.py (1)

151-186: SSE generation correctly delegates to shared utilities.

Centralization reduces drift between v1/v2. Looks good.

src/app/endpoints/query.py (2)

100-107: Good consolidation: transcript + conversation persistence via utils.

Reduces duplication and keeps v1/v2 behavior aligned.


191-223: Toolgroups assembly is correct and preserves no_tools semantics.

MCP headers are merged; empty toolgroups coerced to None. LGTM.

Comment on lines +186 to +209
    # Create OpenAI response using responses API
    response = await client.responses.create(
        input=query_request.query,
        model=model_id,
        instructions=system_prompt,
        previous_response_id=query_request.conversation_id,
        tools=(cast(Any, tools) if tools else cast(Any, None)),
        stream=False,
        store=True,
    )

⚠️ Potential issue

🧩 Analysis chain

Consider gating Responses API persistence on transcripts setting.

responses.create(store=True) persists data server-side regardless of local transcripts_enabled. Gate or make configurable to avoid unintended storage.

-    response = await client.responses.create(
+    store_resp = configuration.user_data_collection_configuration.transcripts_enabled
+    response = await client.responses.create(
         input=query_request.query,
         model=model_id,
         instructions=system_prompt,
         previous_response_id=query_request.conversation_id,
         tools=(cast(Any, tools) if tools else cast(Any, None)),
         stream=False,
-        store=True,
+        store=store_resp,
     )

Please confirm product intent: should server-side storage follow local transcript policy?


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "PWD: $(pwd)"
echo "---- rg version ----"
rg --version 2>/dev/null || true

echo "---- Search for responses.create / client.responses.create occurrences ----"
rg -n --hidden --no-ignore "responses\.create\(" || true
rg -n --hidden --no-ignore "client\.responses\.create\(" || true

echo "---- Search for 'store=' usages ----"
rg -n --hidden --no-ignore "store\s*=" -S || true

echo "---- Search for transcripts / user_data_collection_configuration ----"
rg -n --hidden --no-ignore "transcripts_enabled|transcripts|user_data_collection_configuration|user_data_collection" -S || true

echo "---- Search for 'configuration.' references ----"
rg -n --hidden --no-ignore "configuration\." -S || true

echo "---- Show src/app/endpoints/query_v2.py (if present) around lines 1-260 ----"
if [ -f src/app/endpoints/query_v2.py ]; then
  sed -n '1,260p' src/app/endpoints/query_v2.py || true
else
  echo "src/app/endpoints/query_v2.py not found"
fi

Length of output: 84095


Gate Responses API persistence on transcripts setting

responses.create(store=True) persists data server-side regardless of local transcripts_enabled. Calls found in src/app/endpoints/query_v2.py (≈lines 186–195) and src/app/endpoints/streaming_query_v2.py (≈lines 388–395) should pass store=configuration.user_data_collection_configuration.transcripts_enabled (or use utils.query.is_transcripts_enabled()) to avoid unintended remote storage.

-    response = await client.responses.create(
+    store_resp = configuration.user_data_collection_configuration.transcripts_enabled
+    response = await client.responses.create(
         input=query_request.query,
         model=model_id,
         instructions=system_prompt,
         previous_response_id=query_request.conversation_id,
         tools=(cast(Any, tools) if tools else cast(Any, None)),
         stream=False,
-        store=True,
+        store=store_resp,
     )

Confirm product intent: should server-side storage follow local transcript policy?

🤖 Prompt for AI Agents
In src/app/endpoints/query_v2.py around lines 186 to 195, the call to
client.responses.create currently sets store=True which forces server-side
persistence regardless of local transcript settings; change the store argument
to respect the app configuration by using
configuration.user_data_collection_configuration.transcripts_enabled (or
utils.query.is_transcripts_enabled() if preferred) so the call becomes
store=configuration.user_data_collection_configuration.transcripts_enabled, and
ensure configuration is in scope or import the utils helper if using it.

Comment on lines +88 to +90
    # log Llama Stack configuration
    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)


⚠️ Potential issue

Don’t log entire config; call accessor and log safe fields.

Same concern as in utils.query.

Apply:

-    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
+    cfg = configuration.llama_stack_configuration()
+    logger.info(
+        "Llama Stack version: %s; providers: %s",
+        getattr(cfg, "version", "n/a"),
+        getattr(cfg, "providers", "n/a"),
+    )
🤖 Prompt for AI Agents
In src/app/endpoints/streaming_query_v2.py around lines 88 to 90, avoid logging
the entire configuration object; instead call the accessor method that returns
only safe-to-log fields and log those specific fields (e.g., model name,
temperature, and any non-sensitive flags). Replace the direct logger.info of
configuration.llama_stack_configuration with a call like
configuration.get_safe_llama_stack_config() (or the existing accessor used in
utils.query) and log the returned safe dict/values so no secrets or sensitive
internal state are emitted.

Comment on lines +116 to +119
        llama_stack_model_id, model_id, provider_id = select_model_and_provider_id(
            await client.models.list(),
            *evaluate_model_hints(user_conversation=None, query_request=query_request),
        )

⚠️ Potential issue

BUG: evaluate_model_hints ignores validated conversation context.

You pass user_conversation=None instead of the resolved user_conversation.

Apply:

-            *evaluate_model_hints(user_conversation=None, query_request=query_request),
+            *evaluate_model_hints(
+                user_conversation=user_conversation, query_request=query_request
+            ),
🤖 Prompt for AI Agents
In src/app/endpoints/streaming_query_v2.py around lines 116 to 119,
evaluate_model_hints is being called with user_conversation=None which ignores
the validated/resolved conversation context; replace the None with the actual
resolved user_conversation variable (the one validated/constructed earlier in
this function) so evaluate_model_hints receives the real conversation, ensuring
model selection uses the correct context.

Comment on lines +379 to +381
        mcp_tools = get_mcp_tools(configuration.mcp_servers, token)
        if mcp_tools:
            tools.extend(mcp_tools)

⚠️ Potential issue

BUG: call configuration.mcp_servers() accessor.

Attribute access will fail; calling the accessor also matches the other config accessors.

Apply:

-        mcp_tools = get_mcp_tools(configuration.mcp_servers, token)
+        mcp_tools = get_mcp_tools(configuration.mcp_servers(), token)
🤖 Prompt for AI Agents
In src/app/endpoints/streaming_query_v2.py around lines 379 to 381, the code
incorrectly accesses configuration.mcp_servers as an attribute; change it to
call the accessor configuration.mcp_servers() to match other config accessors
and avoid an attribute access error — call
get_mcp_tools(configuration.mcp_servers(), token) and keep the conditional and
tools.extend logic the same.

src/constants.py (Outdated)
 # Minimal and maximal supported Llama Stack version
 MINIMAL_SUPPORTED_LLAMA_STACK_VERSION = "0.2.17"
-MAXIMAL_SUPPORTED_LLAMA_STACK_VERSION = "0.2.20"
+MAXIMAL_SUPPORTED_LLAMA_STACK_VERSION = "0.2.21"

⚠️ Potential issue

🧩 Analysis chain

Confirm upstream compatibility for 0.2.21 before merge.

Bumping MAXIMAL_SUPPORTED_LLAMA_STACK_VERSION impacts gating and client initialization. Please verify upstream supports 0.2.21 across your deployment matrix and adjust docs/tests if needed.

Run to sanity‑check version gates and any hardcoded expectations:


🏁 Script executed:

#!/bin/bash
# Find version guards and hardcoded mentions
rg -nP 'MAXIMAL_SUPPORTED_LLAMA_STACK_VERSION|MINIMAL_SUPPORTED_LLAMA_STACK_VERSION|0\.2\.2[0-9]' -C2

Length of output: 9300


Confirm upstream compatibility for llama-stack 0.2.21 and align repository pins/docs/tests

MAXIMAL_SUPPORTED_LLAMA_STACK_VERSION was bumped to "0.2.21" but the repo currently pins/mentions 0.2.20. Resolve before merge: either upgrade the pinned deps, lock file, docs, and tests to 0.2.21 (and regenerate uv.lock) if upstream actually provides 0.2.21, or revert the constant to "0.2.20".

  • Affected locations (update or verify): pyproject.toml, uv.lock, examples/pyproject.llamastack.toml, README.md, docs/getting_started.md, docs/deployment_guide.md, docs/output.md, docs/openapi.md/openapi.json, src/models/responses.py, tests/e2e/features/info.feature, tests/unit/utils/test_llama_stack_version.py.
🤖 Prompt for AI Agents
In src/constants.py around line 5 the MAXIMAL_SUPPORTED_LLAMA_STACK_VERSION was
changed to "0.2.21" but the repo still pins/mentions "0.2.20"; either update all
repository pins, locks, docs, tests and examples to 0.2.21 (pyproject.toml,
uv.lock, examples/pyproject.llamastack.toml, README.md, docs/* including
getting_started.md, deployment_guide.md, output.md, openapi.md/openapi.json,
src/models/responses.py, tests/e2e/features/info.feature,
tests/unit/utils/test_llama_stack_version.py) and regenerate uv.lock, or revert
this constant back to "0.2.20" so pins/docs/tests remain consistent before
merging.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/app/endpoints/streaming_query.py (1)

350-354: Call configuration accessor for MCP servers.

Fix attribute vs method usage in both places.

-            for mcp_server in configuration.mcp_servers:
+            for mcp_server in configuration.mcp_servers():
                 mcp_headers[mcp_server.url] = {
                     "Authorization": f"Bearer {token}",
                 }
@@
-        toolgroups = (get_rag_toolgroups(vector_db_ids) or []) + [
-            mcp_server.name for mcp_server in configuration.mcp_servers
+        toolgroups = (get_rag_toolgroups(vector_db_ids) or []) + [
+            mcp_server.name for mcp_server in configuration.mcp_servers()
         ]

Also applies to: 367-368

🧹 Nitpick comments (15)
src/app/endpoints/conversations_v2.py (2)

27-27: Use module logger name.

Prefer module-scoped logger for consistency.

-logger = logging.getLogger("app.endpoints.handlers")
+logger = logging.getLogger(__name__)

55-57: Annotate auth dependency with AuthTuple.

Improves type clarity and matches project conventions.

-    auth: Any = Depends(auth_dependency),
+    auth: Annotated[AuthTuple, Depends(auth_dependency)],

Add missing import at top of file:

from authentication.interface import AuthTuple

Also applies to: 65-68, 90-93

src/app/endpoints/query.py (1)

47-47: Use module logger name.

Align with logging guideline.

-logger = logging.getLogger("app.endpoints.handlers")
+logger = logging.getLogger(__name__)
src/utils/query.py (1)

30-30: Use module logger name.

Consistent logging naming.

-logger = logging.getLogger("utils.query")
+logger = logging.getLogger(__name__)
src/app/endpoints/query_v2.py (1)

38-38: Use module logger name.

Consistent logging naming.

-logger = logging.getLogger("app.endpoints.handlers")
+logger = logging.getLogger(__name__)
tests/unit/app/endpoints/test_conversations.py (1)

24-24: Use shared MOCK_AUTH constant.

Tests should use the shared auth mock for consistency.

-MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token")
+# Prefer: from tests.unit.conftest import MOCK_AUTH
src/app/endpoints/streaming_query_v2.py (1)

52-52: Use module logger name.

Consistent logging naming.

-logger = logging.getLogger("app.endpoints.handlers")
+logger = logging.getLogger(__name__)
src/app/endpoints/streaming_query.py (3)

141-142: Avoid logging full config; log safe fields and call accessor.

Prevents secrets leakage.

-    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
+    cfg = configuration.llama_stack_configuration()
+    logger.info(
+        "Llama Stack version: %s; providers: %s",
+        getattr(cfg, "version", "n/a"),
+        getattr(cfg, "providers", "n/a"),
+    )

256-256: Set SSE media type and headers.

Improves client compatibility and reduces buffering.

-        return StreamingResponse(response_generator(response))
+        return StreamingResponse(
+            response_generator(response),
+            media_type="text/event-stream",
+            headers={
+                "Cache-Control": "no-cache",
+                "Connection": "keep-alive",
+                "X-Accel-Buffering": "no",
+            },
+        )

58-58: Use module logger name.

Consistent logging naming.

-logger = logging.getLogger("app.endpoints.handlers")
+logger = logging.getLogger(__name__)
src/utils/streaming_query.py (1)

21-21: Use module logger name.

Consistent logging naming.

-logger = logging.getLogger("utils.streaming_query")
+logger = logging.getLogger(__name__)
src/app/endpoints/conversations.py (1)

32-32: Use module-standard logger name

Prefer __name__ per guidelines for a consistent logger hierarchy.

Apply this diff:

-logger = logging.getLogger("app.endpoints.handlers")
+logger = logging.getLogger(__name__)
src/utils/conversations.py (3)

239-244: Clarify error message for non-UUID SUIDs

check_suid accepts resp-* IDs; the message should not claim “not a valid UUID”.

Apply this diff:

-            detail={
-                "response": "Invalid conversation ID format",
-                "cause": f"Conversation ID {conversation_id} is not a valid UUID",
-            },
+            detail={
+                "response": "Invalid conversation ID format",
+                "cause": f"Conversation ID {conversation_id} is not a valid conversation ID format",
+            },

345-346: Avoid blocking in async path: run local DB delete in threadpool

delete_conversation performs DB I/O; run it off the event loop.

Apply this diff:

-from fastapi import HTTPException, Request, status
+from fastapi import HTTPException, Request, status
+from starlette.concurrency import run_in_threadpool
@@
-        delete_conversation(conversation_id=conversation_id)
+        await run_in_threadpool(delete_conversation, conversation_id=conversation_id)

Also applies to: 8-12


28-28: Use module-standard logger name

Align with guideline to use __name__.

Apply this diff:

-logger = logging.getLogger("utils.conversations")
+logger = logging.getLogger(__name__)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61301f7 and d16d834.

📒 Files selected for processing (18)
  • src/app/endpoints/conversations.py (7 hunks)
  • src/app/endpoints/conversations_v2.py (1 hunks)
  • src/app/endpoints/query.py (5 hunks)
  • src/app/endpoints/query_v2.py (1 hunks)
  • src/app/endpoints/streaming_query.py (2 hunks)
  • src/app/endpoints/streaming_query_v2.py (1 hunks)
  • src/app/routers.py (3 hunks)
  • src/constants.py (1 hunks)
  • src/utils/conversations.py (1 hunks)
  • src/utils/query.py (1 hunks)
  • src/utils/streaming_query.py (1 hunks)
  • src/utils/suid.py (1 hunks)
  • tests/unit/app/endpoints/test_conversations.py (14 hunks)
  • tests/unit/app/endpoints/test_query.py (7 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (1 hunks)
  • tests/unit/app/endpoints/test_streaming_query_v2.py (1 hunks)
  • tests/unit/app/test_routers.py (5 hunks)
  • tests/unit/conftest.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/test_routers.py
  • tests/unit/app/endpoints/test_query.py
  • src/app/routers.py
  • src/utils/suid.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • src/constants.py
  • tests/unit/conftest.py
🧰 Additional context used
📓 Path-based instructions (7)
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use absolute imports for internal modules (e.g., from auth import get_auth_dependency)

Files:

  • src/app/endpoints/conversations_v2.py
  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/query.py
  • src/utils/query.py
  • src/utils/streaming_query.py
  • src/utils/conversations.py
  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/conversations.py
src/app/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use standard FastAPI imports (from fastapi import APIRouter, HTTPException, Request, status, Depends) in FastAPI app code

Files:

  • src/app/endpoints/conversations_v2.py
  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/query.py
  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/conversations.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: All modules start with descriptive module-level docstrings explaining purpose
Use logger = logging.getLogger(__name__) for module logging after import logging
Define type aliases at module level for clarity
All functions require docstrings with brief descriptions
Provide complete type annotations for all function parameters and return types
Use typing_extensions.Self in model validators where appropriate
Use modern union syntax (str | int) and Optional[T] or T | None consistently
Function names use snake_case with descriptive, action-oriented prefixes (get_, validate_, check_)
Avoid in-place parameter modification; return new data structures instead of mutating arguments
Use appropriate logging levels: debug, info, warning, error with clear messages
All classes require descriptive docstrings explaining purpose
Class names use PascalCase with conventional suffixes (Configuration, Error/Exception, Resolver, Interface)
Abstract base classes should use abc.ABC and @abstractmethod for interfaces
Provide complete type annotations for all class attributes
Follow Google Python docstring style for modules, classes, and functions, including Args, Returns, Raises, Attributes sections as needed

Files:

  • src/app/endpoints/conversations_v2.py
  • src/app/endpoints/streaming_query.py
  • tests/unit/app/endpoints/test_conversations.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/query.py
  • src/utils/query.py
  • src/utils/streaming_query.py
  • src/utils/conversations.py
  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/conversations.py
src/{app/**/*.py,client.py}

📄 CodeRabbit inference engine (CLAUDE.md)

Use async def for I/O-bound operations and external API calls

Files:

  • src/app/endpoints/conversations_v2.py
  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/query.py
  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/conversations.py
src/app/endpoints/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

In API endpoints, raise FastAPI HTTPException with appropriate status codes for error handling

Files:

  • src/app/endpoints/conversations_v2.py
  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/query.py
  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/conversations.py
tests/{unit,integration}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/{unit,integration}/**/*.py: Use pytest for all unit and integration tests
Do not use unittest in tests; pytest is the standard

Files:

  • tests/unit/app/endpoints/test_conversations.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest-mock to create AsyncMock objects for async interactions in tests
Use the shared auth mock constant: MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") in tests

Files:

  • tests/unit/app/endpoints/test_conversations.py
🧠 Learnings (1)
📚 Learning: 2025-09-18T16:46:33.353Z
Learnt from: CR
PR: lightspeed-core/lightspeed-stack#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-18T16:46:33.353Z
Learning: Applies to src/client.py : Handle Llama Stack APIConnectionError when interacting with the Llama client

Applied to files:

  • tests/unit/app/endpoints/test_conversations.py
🧬 Code graph analysis (9)
src/app/endpoints/conversations_v2.py (5)
src/authentication/__init__.py (1)
  • get_auth_dependency (14-43)
src/authorization/middleware.py (1)
  • authorize (111-122)
src/models/config.py (2)
  • config (138-144)
  • Action (311-350)
src/models/responses.py (3)
  • ConversationDeleteResponse (443-476)
  • ConversationResponse (393-440)
  • ConversationsListResponse (540-597)
src/utils/conversations.py (4)
  • get_conversations_list_base (175-231)
  • delete_conversation_base (284-382)
  • validate_conversation_id (234-244)
  • validate_conversation_access (247-281)
src/app/endpoints/streaming_query.py (4)
src/app/endpoints/query.py (1)
  • get_rag_toolgroups (266-293)
src/utils/endpoints.py (5)
  • check_configuration_loaded (63-69)
  • get_agent (139-192)
  • get_system_prompt (72-112)
  • validate_model_provider_override (115-135)
  • validate_conversation_ownership (40-60)
src/utils/query.py (7)
  • is_input_shield (242-255)
  • is_output_shield (229-239)
  • is_transcripts_enabled (55-61)
  • select_model_and_provider_id (136-213)
  • validate_attachments_metadata (258-287)
  • persist_user_conversation_details (64-91)
  • evaluate_model_hints (94-133)
src/utils/streaming_query.py (3)
  • stream_start_event (40-61)
  • stream_end_event (64-100)
  • stream_build_event (104-146)
src/app/endpoints/query_v2.py (9)
src/authentication/__init__.py (1)
  • get_auth_dependency (14-43)
src/authorization/middleware.py (1)
  • authorize (111-122)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/models/requests.py (1)
  • QueryRequest (72-222)
src/models/responses.py (1)
  • QueryResponse (48-79)
src/utils/endpoints.py (1)
  • get_system_prompt (72-112)
src/utils/mcp_headers.py (1)
  • mcp_headers_dependency (15-26)
src/utils/query.py (5)
  • evaluate_model_hints (94-133)
  • process_transcript_and_persist_conversation (357-388)
  • select_model_and_provider_id (136-213)
  • validate_attachments_metadata (258-287)
  • validate_query_request (290-340)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • ToolCallSummary (43-56)
src/app/endpoints/query.py (3)
src/client.py (1)
  • AsyncLlamaStackClientHolder (18-55)
src/utils/endpoints.py (1)
  • get_agent (139-192)
src/utils/query.py (8)
  • evaluate_model_hints (94-133)
  • select_model_and_provider_id (136-213)
  • is_input_shield (242-255)
  • is_output_shield (229-239)
  • validate_attachments_metadata (258-287)
  • validate_query_request (290-340)
  • handle_api_connection_error (343-354)
  • process_transcript_and_persist_conversation (357-388)
src/utils/query.py (9)
src/configuration.py (4)
  • configuration (65-69)
  • user_data_collection_configuration (86-90)
  • inference (126-130)
  • llama_stack_configuration (79-83)
src/app/database.py (1)
  • get_session (34-40)
src/models/config.py (2)
  • config (138-144)
  • Action (311-350)
src/models/database/conversations.py (1)
  • UserConversation (11-36)
src/models/requests.py (2)
  • QueryRequest (72-222)
  • Attachment (15-69)
src/models/responses.py (2)
  • UnauthorizedResponse (361-375)
  • ForbiddenResponse (378-390)
src/utils/endpoints.py (3)
  • check_configuration_loaded (63-69)
  • validate_conversation_ownership (40-60)
  • validate_model_provider_override (115-135)
src/utils/transcripts.py (1)
  • store_transcript (33-86)
src/utils/types.py (1)
  • TurnSummary (59-78)
src/utils/streaming_query.py (3)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • append_tool_calls_from_llama (65-78)
src/utils/endpoints.py (1)
  • get_system_prompt (72-112)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/utils/conversations.py (7)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/app/database.py (1)
  • get_session (34-40)
src/models/config.py (2)
  • config (138-144)
  • Action (311-350)
src/models/database/conversations.py (1)
  • UserConversation (11-36)
src/models/responses.py (4)
  • ConversationDeleteResponse (443-476)
  • ConversationsListResponse (540-597)
  • ConversationDetails (479-537)
  • UnauthorizedResponse (361-375)
src/utils/endpoints.py (3)
  • check_configuration_loaded (63-69)
  • delete_conversation (23-37)
  • validate_conversation_ownership (40-60)
src/utils/suid.py (1)
  • check_suid (15-44)
src/app/endpoints/streaming_query_v2.py (9)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/configuration.py (3)
  • configuration (65-69)
  • llama_stack_configuration (79-83)
  • mcp_servers (93-97)
src/models/requests.py (1)
  • QueryRequest (72-222)
src/utils/endpoints.py (4)
  • check_configuration_loaded (63-69)
  • get_system_prompt (72-112)
  • validate_model_provider_override (115-135)
  • validate_conversation_ownership (40-60)
src/utils/query.py (5)
  • evaluate_model_hints (94-133)
  • is_transcripts_enabled (55-61)
  • persist_user_conversation_details (64-91)
  • select_model_and_provider_id (136-213)
  • validate_attachments_metadata (258-287)
src/utils/streaming_query.py (3)
  • format_stream_data (26-37)
  • stream_start_event (40-61)
  • stream_end_event (64-100)
src/utils/transcripts.py (1)
  • store_transcript (33-86)
src/utils/types.py (2)
  • TurnSummary (59-78)
  • ToolCallSummary (43-56)
src/app/endpoints/query_v2.py (2)
  • get_rag_tools (298-309)
  • get_mcp_tools (312-328)
src/app/endpoints/conversations.py (6)
src/configuration.py (1)
  • configuration (65-69)
src/models/config.py (2)
  • config (138-144)
  • Action (311-350)
src/models/responses.py (3)
  • ConversationResponse (393-440)
  • ConversationDeleteResponse (443-476)
  • ConversationsListResponse (540-597)
src/utils/endpoints.py (1)
  • check_configuration_loaded (63-69)
src/utils/conversations.py (5)
  • simplify_session_data (133-172)
  • get_conversations_list_base (175-231)
  • validate_conversation_id (234-244)
  • validate_conversation_access (247-281)
  • delete_conversation_base (284-382)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: e2e_tests
🔇 Additional comments (15)
src/app/endpoints/conversations_v2.py (3)

32-43: Resolved: helpers are correctly async now (previous sync suggestion is obsolete).

get_response_sessions aligns with async delete_conversation_base and is awaited at call site.


45-49: Resolved: deletion helper is correctly async.

Matches the awaited usage and async base signature.


95-101: Resolved: using await on async base is correct.

delete_conversation_base is async in utils; keeping await avoids TypeError.

src/utils/query.py (3)

307-308: Avoid logging full config; log safe fields and call accessor.

Prevents secrets leakage and fixes accessor usage.

-    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
+    cfg = configuration.llama_stack_configuration()
+    logger.info(
+        "Llama Stack version: %s; providers: %s",
+        getattr(cfg, "version", "n/a"),
+        getattr(cfg, "providers", "n/a"),
+    )

136-214: Fix model/provider selection: wrong config access and wrong early return.

  • Call configuration.inference() accessor.
  • Return (llama_stack_model_id, model_id, provider_id) when defaulting to first LLM.
 def select_model_and_provider_id(
     models: ModelListResponse, model_id: str | None, provider_id: str | None
 ) -> tuple[str, str, str]:
@@
-    if not model_id or not provider_id:
+    if not model_id or not provider_id:
         logger.debug(
             "No model ID or provider ID specified in request, checking configuration"
         )
-        model_id = configuration.inference.default_model  # type: ignore[reportAttributeAccessIssue]
-        provider_id = (
-            configuration.inference.default_provider  # type: ignore[reportAttributeAccessIssue]
-        )
+        model_id = configuration.inference().default_model
+        provider_id = configuration.inference().default_provider
@@
-    if not model_id or not provider_id:
+    if not model_id or not provider_id:
         logger.debug(
             "No model ID or provider ID specified in request or configuration, "
             "using the first available LLM"
         )
         try:
             model = next(
                 m
                 for m in models
                 if m.model_type == "llm"  # pyright: ignore[reportAttributeAccessIssue]
             )
-            model_id = model.identifier
-            provider_id = model.provider_id
-            logger.info("Selected model: %s", model)
-            return model_id, model_id, provider_id
+            llama_stack_model_id = model.identifier
+            provider_id = model.provider_id
+            # derive plain model id if identifier is "provider/model"
+            model_id = (
+                llama_stack_model_id.split("/", 1)[1]
+                if "/" in llama_stack_model_id
+                else llama_stack_model_id
+            )
+            logger.info("Selected model: %s", llama_stack_model_id)
+            return llama_stack_model_id, model_id, provider_id
@@
-    llama_stack_model_id = f"{provider_id}/{model_id}"
+    llama_stack_model_id = f"{provider_id}/{model_id}"
@@
-    return llama_stack_model_id, model_id, provider_id
+    return llama_stack_model_id, model_id, provider_id

55-61: Bug: configuration accessor not called (runtime AttributeError).

user_data_collection_configuration is a method.

-def is_transcripts_enabled() -> bool:
+def is_transcripts_enabled() -> bool:
@@
-    return configuration.user_data_collection_configuration.transcripts_enabled
+    return configuration.user_data_collection_configuration().transcripts_enabled
src/app/endpoints/streaming_query_v2.py (4)

88-90: Avoid logging full config; log safe fields and call accessor.

Prevents secrets leakage.

-    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
+    cfg = configuration.llama_stack_configuration()
+    logger.info(
+        "Llama Stack version: %s; providers: %s",
+        getattr(cfg, "version", "n/a"),
+        getattr(cfg, "providers", "n/a"),
+    )

306-309: Set proper SSE media type and headers.

Ensures clients process events without buffering.

-        return StreamingResponse(response_generator(response))
+        return StreamingResponse(
+            response_generator(response),
+            media_type="text/event-stream",
+            headers={
+                "Cache-Control": "no-cache",
+                "Connection": "keep-alive",
+                "X-Accel-Buffering": "no",
+            },
+        )

116-119: Pass the resolved conversation to evaluate_model_hints.

Currently ignores validated conversation context.

-            *evaluate_model_hints(user_conversation=None, query_request=query_request),
+            *evaluate_model_hints(
+                user_conversation=user_conversation, query_request=query_request
+            ),

379-381: Call configuration accessor for MCP servers.

Fix attribute vs method usage.

-        mcp_tools = get_mcp_tools(configuration.mcp_servers, token)
+        mcp_tools = get_mcp_tools(configuration.mcp_servers(), token)
src/utils/streaming_query.py (1)

123-126: Short-circuit on error chunks to avoid AttributeError.

Don’t access chunk.event after handling an error chunk.

-    if hasattr(chunk, "error"):
-        yield from _handle_error_event(chunk, chunk_id)
+    if hasattr(chunk, "error"):
+        yield from _handle_error_event(chunk, chunk_id)
+        return
src/app/endpoints/query_v2.py (1)

194-195: Gate server-side persistence to transcripts setting.

Avoid unintended storage when transcripts are disabled.

-        stream=False,
-        store=True,
+        stream=False,
+        store=configuration.user_data_collection_configuration().transcripts_enabled,
src/app/endpoints/conversations.py (3)

103-105: Don’t call .get on possibly typed session objects; extract session_id safely

Agent SDK may return typed objects; .get will raise. Handle dict/attr and fail fast if missing.

Apply this diff:

-        session_id = str(agent_sessions[0].get("session_id"))
+        first = agent_sessions[0]
+        session_id = (
+            first.get("session_id") if isinstance(first, dict) else getattr(first, "session_id", None)
+        )
+        if not session_id:
+            raise HTTPException(
+                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                detail={"response": "session_id missing on session"},
+            )
+        session_id = str(session_id)

44-44: Run sync base in a threadpool to avoid blocking the event loop

get_conversations_list_base does DB I/O and is sync; calling it directly from an async endpoint will block.

Apply this diff:

-from fastapi import APIRouter, HTTPException, Request, status, Depends
+from fastapi import APIRouter, HTTPException, Request, status, Depends
+from starlette.concurrency import run_in_threadpool
@@
-    return get_conversations_list_base(request, auth)
+    return await run_in_threadpool(get_conversations_list_base, request, auth)

Also applies to: 8-8


152-156: Broaden helper typings and fix session_id extraction for delete path

Avoid assuming dicts; accept list[Any] and handle dict/attr for session_id.

Apply this diff:

-async def get_agent_sessions(client: Any, conversation_id: str) -> list[dict[str, Any]]:
+async def get_agent_sessions(client: Any, conversation_id: str) -> list[Any]:
@@
-async def delete_agent_sessions(
-    client: Any, conversation_id: str, sessions: list[dict[str, Any]]
-) -> None:
+async def delete_agent_sessions(
+    client: Any, conversation_id: str, sessions: list[Any]
+) -> None:
@@
-    session_id = str(sessions[0].get("session_id"))
+    first = sessions[0]
+    session_id = (
+        first.get("session_id") if isinstance(first, dict) else getattr(first, "session_id", None)
+    )
+    if not session_id:
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail={"response": "session_id missing on session"},
+        )
+    session_id = str(session_id)

Also applies to: 158-164

Comment on lines +177 to +193
mcp_tools = get_mcp_tools(configuration.mcp_servers, token)
if mcp_tools:
tools.extend(mcp_tools)

⚠️ Potential issue

Call configuration accessor for MCP servers.

mcp_servers is a method.

-        mcp_tools = get_mcp_tools(configuration.mcp_servers, token)
+        mcp_tools = get_mcp_tools(configuration.mcp_servers(), token)
🤖 Prompt for AI Agents
In src/app/endpoints/query_v2.py around lines 177 to 179, the code treats
configuration.mcp_servers as an attribute but it is a method; call
configuration.mcp_servers() and pass its return value to get_mcp_tools (i.e.,
replace configuration.mcp_servers with configuration.mcp_servers()), ensuring
any required arguments or async/await usage are respected if the accessor is
async.

Comment on lines +388 to +396
response = await client.responses.create(
input=query_request.query,
model=model_id,
instructions=system_prompt,
previous_response_id=query_request.conversation_id,
tools=(cast(Any, tools) if tools else cast(Any, None)),
stream=True,
store=True,
)

⚠️ Potential issue

Gate server-side persistence to transcripts setting.

Avoid persisting when transcripts are disabled.

-    response = await client.responses.create(
+    response = await client.responses.create(
         input=query_request.query,
         model=model_id,
         instructions=system_prompt,
         previous_response_id=query_request.conversation_id,
         tools=(cast(Any, tools) if tools else cast(Any, None)),
         stream=True,
-        store=True,
+        store=configuration.user_data_collection_configuration().transcripts_enabled,
     )
🤖 Prompt for AI Agents
In src/app/endpoints/streaming_query_v2.py around lines 388 to 396, the call to
client.responses.create always sets store=True which persists responses even
when server-side transcripts are disabled; change this to gate persistence
behind the transcripts/analytics setting (e.g. use the app config or request
flag that indicates transcripts are enabled) and pass store=(True if
transcripts_enabled else False) (or omit the store parameter when disabled) so
responses are only persisted when the transcripts feature is enabled.

Comment on lines 68 to 73
200: {
"conversation_id": "123e4567-e89b-12d3-a456-426614174000",
"success": True,
"message": "Conversation deleted successfully",
},
400: {

⚠️ Potential issue

OpenAPI schema mismatch: use response field (not message)

ConversationDeleteResponse uses response; docs example shows message.

Apply this diff:

 conversation_delete_responses: dict[int | str, dict[str, Any]] = {
         "conversation_id": "123e4567-e89b-12d3-a456-426614174000",
         "success": True,
-        "message": "Conversation deleted successfully",
+        "response": "Conversation deleted successfully",
     },
🤖 Prompt for AI Agents
In src/utils/conversations.py around lines 68 to 73, the OpenAPI example for the
200 response uses the key "message" but the ConversationDeleteResponse schema
expects "response"; update the example to replace the "message" field with
"response" (e.g., "response": "Conversation deleted successfully") so the docs
match the schema.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (7)
src/utils/suid.py (2)

21-25: Docstring incorrect for Responses IDs.

States “follows a valid UUID string” but we should treat resp-* as opaque; update wording accordingly.

Apply:

-    Returns True if the string is a valid UUID or if it starts with resp-/resp_
-    and it follows a valid UUID string, False otherwise.
+    Returns True if the string is a valid UUID, or if it starts with resp-/resp_
+    and contains a non-empty opaque token, False otherwise.

26-33: Parameter type mismatch in docstring.

Docstring says str | bytes, function accepts str. Align docstring (or add bytes support by decoding).

Apply either:

  • Update docstring to str only, or
  • Accept bytes:
-def check_suid(suid: str) -> bool:
+def check_suid(suid: str | bytes) -> bool:
@@
-    if not isinstance(suid, str) or not suid:
+    if isinstance(suid, (bytes, bytearray)):
+        try:
+            suid = suid.decode("utf-8")
+        except Exception:
+            return False
+    if not isinstance(suid, str) or not suid:
         return False
src/app/routers.py (1)

32-41: Section label/order mismatch.

conversations_v2 is mounted under /v2 but appears in the “V1 endpoints” block. Move it under the V2 section or fix the comment to avoid confusion.
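
A minimal sketch of the intended grouping (include style assumed; adjust to the actual calls in src/app/routers.py):

# V1 endpoints
app.include_router(conversations.router, prefix="/v1")

# V2 endpoints
app.include_router(query_v2.router, prefix="/v2")
app.include_router(streaming_query_v2.router, prefix="/v2")
app.include_router(conversations_v2.router, prefix="/v2")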

tests/unit/app/test_routers.py (2)

65-80: Reduce brittleness: avoid asserting exact router count.

Exact counts break when routes evolve. Prefer presence checks against an expected set.

Example:

-    assert len(app.routers) == 15
+    # keep flexible: assert presence instead of exact count

83-104: Add coverage for /v3 conversations mount.

conversations.router is also included under /v3; extend the test to validate that additional prefix and prevent regressions.

You can add a helper that returns all prefixes for a router and assert "/v1" in prefixes and "/v3" in prefixes.
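
A sketch of that helper, assuming the suite's include_router stub records (router, prefix) pairs in app.routers; MockFastAPI and mount_routers below are hypothetical stand-ins for the test module's existing fixtures:

from typing import Any

def get_router_prefixes(app: Any, router: Any) -> set[str]:
    """Collect every prefix a given router was mounted under."""
    return {prefix for mounted, prefix in app.routers if mounted is router}

def test_conversations_router_mounted_under_v1_and_v3() -> None:
    """The conversations router must be reachable under both /v1 and /v3."""
    app = MockFastAPI()  # hypothetical test double
    mount_routers(app)   # hypothetical: however the suite includes routers
    prefixes = get_router_prefixes(app, conversations.router)
    assert "/v1" in prefixes
    assert "/v3" in prefixes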

src/app/endpoints/streaming_query_v2.py (2)

52-53: Use module logger name for consistency.

Prefer getLogger(__name__) per guidelines.

-logger = logging.getLogger("app.endpoints.handlers")
+logger = logging.getLogger(__name__)

365-387: Optional: propagate per-server MCP headers.

You accept mcp_headers but don’t use them when building tools; consider merging headers for each server.
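
One possible shape for the merge (a sketch only; the tool-dict schema mirrors the existing get_mcp_tools helper and OpenAI-style Responses API MCP tools, so the field names here are assumptions):

from typing import Any

def get_mcp_tools_with_headers(
    mcp_servers: list[Any],
    token: str,
    mcp_headers: dict[str, dict[str, str]],
) -> list[dict[str, Any]]:
    """Build MCP tool definitions, letting per-server headers override the default token."""
    tools: list[dict[str, Any]] = []
    for server in mcp_servers:
        headers = {"Authorization": f"Bearer {token}"}
        headers.update(mcp_headers.get(server.url, {}))  # per-server overrides win
        tools.append(
            {
                "type": "mcp",
                "server_label": server.name,
                "server_url": server.url,
                "headers": headers,
            }
        )
    return tools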

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d16d834 and b9ae829.

📒 Files selected for processing (8)
  • src/app/endpoints/query_v2.py (1 hunks)
  • src/app/endpoints/streaming_query_v2.py (1 hunks)
  • src/app/routers.py (3 hunks)
  • src/constants.py (1 hunks)
  • src/utils/suid.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (1 hunks)
  • tests/unit/app/endpoints/test_streaming_query_v2.py (1 hunks)
  • tests/unit/app/test_routers.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • src/app/endpoints/query_v2.py
  • tests/unit/app/endpoints/test_query_v2.py
🧰 Additional context used
📓 Path-based instructions (8)
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use absolute imports for internal modules (e.g., from auth import get_auth_dependency)

Files:

  • src/app/routers.py
  • src/utils/suid.py
  • src/app/endpoints/streaming_query_v2.py
  • src/constants.py
src/app/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use standard FastAPI imports (from fastapi import APIRouter, HTTPException, Request, status, Depends) in FastAPI app code

Files:

  • src/app/routers.py
  • src/app/endpoints/streaming_query_v2.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: All modules start with descriptive module-level docstrings explaining purpose
Use logger = logging.getLogger(__name__) for module logging after import logging
Define type aliases at module level for clarity
All functions require docstrings with brief descriptions
Provide complete type annotations for all function parameters and return types
Use typing_extensions.Self in model validators where appropriate
Use modern union syntax (str | int) and Optional[T] or T | None consistently
Function names use snake_case with descriptive, action-oriented prefixes (get_, validate_, check_)
Avoid in-place parameter modification; return new data structures instead of mutating arguments
Use appropriate logging levels: debug, info, warning, error with clear messages
All classes require descriptive docstrings explaining purpose
Class names use PascalCase with conventional suffixes (Configuration, Error/Exception, Resolver, Interface)
Abstract base classes should use abc.ABC and @abstractmethod for interfaces
Provide complete type annotations for all class attributes
Follow Google Python docstring style for modules, classes, and functions, including Args, Returns, Raises, Attributes sections as needed

Files:

  • src/app/routers.py
  • src/utils/suid.py
  • src/app/endpoints/streaming_query_v2.py
  • src/constants.py
  • tests/unit/app/test_routers.py
src/{app/**/*.py,client.py}

📄 CodeRabbit inference engine (CLAUDE.md)

Use async def for I/O-bound operations and external API calls

Files:

  • src/app/routers.py
  • src/app/endpoints/streaming_query_v2.py
src/app/endpoints/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

In API endpoints, raise FastAPI HTTPException with appropriate status codes for error handling

Files:

  • src/app/endpoints/streaming_query_v2.py
src/constants.py

📄 CodeRabbit inference engine (CLAUDE.md)

Keep shared constants in a central src/constants.py with descriptive comments

Files:

  • src/constants.py
tests/{unit,integration}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/{unit,integration}/**/*.py: Use pytest for all unit and integration tests
Do not use unittest in tests; pytest is the standard

Files:

  • tests/unit/app/test_routers.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest-mock to create AsyncMock objects for async interactions in tests
Use the shared auth mock constant: MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") in tests

Files:

  • tests/unit/app/test_routers.py
🧬 Code graph analysis (2)
src/app/routers.py (1)
tests/unit/app/test_routers.py (1)
  • include_router (34-49)
src/app/endpoints/streaming_query_v2.py (8)
src/app/endpoints/query.py (5)
  • evaluate_model_hints (130-169)
  • is_transcripts_enabled (91-97)
  • persist_user_conversation_details (100-127)
  • select_model_and_provider_id (352-430)
  • validate_attachments_metadata (688-717)
src/utils/endpoints.py (4)
  • validate_conversation_ownership (41-61)
  • check_configuration_loaded (64-70)
  • get_system_prompt (73-113)
  • validate_model_provider_override (116-136)
src/app/endpoints/query_v2.py (2)
  • get_rag_tools (313-324)
  • get_mcp_tools (327-343)
src/app/endpoints/streaming_query.py (3)
  • format_stream_data (105-116)
  • stream_start_event (119-140)
  • stream_end_event (143-179)
src/client.py (2)
  • AsyncLlamaStackClientHolder (18-55)
  • get_client (49-55)
src/utils/mcp_headers.py (1)
  • mcp_headers_dependency (15-26)
src/utils/transcripts.py (1)
  • store_transcript (33-86)
src/utils/types.py (2)
  • TurnSummary (89-163)
  • ToolCallSummary (73-86)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: e2e_tests
🔇 Additional comments (8)
src/constants.py (1)

5-5: Align 0.2.22 bump across pins/docs/tests (or revert).

Make sure upstream 0.2.22 is available and all repo references match; otherwise gate checks and tests will drift. This was flagged earlier for 0.2.21.

Run:

#!/bin/bash
# Find version constants and hardcoded mentions
rg -nP -C2 'MINIMAL_SUPPORTED_LLAMA_STACK_VERSION|MAXIMAL_SUPPORTED_LLAMA_STACK_VERSION|0\.2\.(1[7-9]|2[0-9])'

# Check common pins/mentions
rg -nP -C2 'llama[-_ ]?stack|0\.2\.(1[7-9]|2[0-9])' \
  -g 'pyproject.toml' -g 'uv.lock' -g '*.md' -g '*.toml' -g '*.json'
src/app/endpoints/streaming_query_v2.py (5)

379-381: Remove incorrect accessor change
mcp_servers is defined as a property, so configuration.mcp_servers (not configuration.mcp_servers()) is correct and needs no change.

Likely an incorrect or invalid review comment.


116-119: BUG: model selection ignores resolved conversation context.

Pass the validated user_conversation instead of None.

-            *evaluate_model_hints(user_conversation=None, query_request=query_request),
+            *evaluate_model_hints(
+                user_conversation=user_conversation, query_request=query_request
+            ),

388-396: Gate server-side persistence to transcripts setting.

Avoid persisting when transcripts are disabled.

-        store=True,
+        store=is_transcripts_enabled(),

308-309: SSE requires proper media type and no-buffer headers.

Without text/event-stream, many clients won’t stream tokens.

-        return StreamingResponse(response_generator(response))
+        return StreamingResponse(
+            response_generator(response),
+            media_type="text/event-stream",
+            headers={
+                "Cache-Control": "no-cache",
+                "Connection": "keep-alive",
+                "X-Accel-Buffering": "no",
+            },
+        )

88-90: Avoid logging full configuration — risk of leaking secrets.

Log only safe fields (e.g., version). Don’t serialize the whole object.

-    logger.info("Llama stack config: %s", configuration.llama_stack_configuration)
+    try:
+        cfg = configuration.llama_stack_configuration()  # if accessor; else treat as attr
+    except TypeError:
+        cfg = configuration.llama_stack_configuration  # fallback
+    logger.debug("Llama Stack version: %s", getattr(cfg, "version", "n/a"))
src/utils/suid.py (1)

37-49: Critical: Don’t coerce resp-* IDs into UUIDs (remove zero-padding/UUID parse).

Response IDs are opaque; padding/truncation risks accepting malformed IDs and rejecting valid ones.

Apply:

-    # Handle Responses API IDs
-    if suid.startswith("resp-") or suid.startswith("resp_"):
-        token = suid[5:]
-        if not token:
-            return False
-        # If truncated (e.g., shell cut reduced length), pad to canonical UUID length
-        if len(token) < 36:
-            token = token + ("0" * (36 - len(token)))
-        try:
-            uuid.UUID(token)
-            return True
-        except (ValueError, TypeError):
-            return False
+    # Handle Responses API IDs – treat as opaque, minimally validated
+    if suid.startswith(("resp-", "resp_")):
+        token = suid[5:]
+        # non-empty token with safe charset
+        return bool(token) and all(ch.isalnum() or ch in "-_." for ch in token)
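
Illustrative expectations under that opaque-token check (a hypothetical pytest sketch, not taken from this PR's test suite):

import pytest

from utils.suid import check_suid

@pytest.mark.parametrize(
    "suid,expected",
    [
        ("123e4567-e89b-12d3-a456-426614174000", True),  # plain UUID still accepted
        ("resp-abc123", True),    # opaque Responses API ID
        ("resp_abc.123", True),   # underscore prefix, dot in token allowed
        ("resp_", False),         # empty token after prefix
        ("resp-abc/123", False),  # unsafe character rejected
        ("", False),
    ],
)
def test_check_suid(suid: str, expected: bool) -> None:
    assert check_suid(suid) is expected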
src/app/routers.py (1)

45-46: src/app/routers.py mounts conversations.router under both /v1 (line 39) and /v3 (line 45) — confirm this reuse is intentional and adjust operationIds or tags to prevent duplicate OpenAPI operations.
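
If the double mount is intentional, one way to keep operation ids unique (a sketch; include_router accepts generate_unique_id_function in recent FastAPI releases):

from fastapi import FastAPI
from fastapi.routing import APIRoute

from app.endpoints import conversations

def v3_operation_id(route: APIRoute) -> str:
    # Prefix ids so the /v3 mount does not collide with the /v1 mount.
    return f"v3_{route.name}"

app = FastAPI()
app.include_router(conversations.router, prefix="/v1", tags=["conversations"])
app.include_router(
    conversations.router,
    prefix="/v3",
    tags=["conversations_v3"],
    generate_unique_id_function=v3_operation_id,
)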

@luis5tb luis5tb changed the title [WIP] Responses v2 Responses v2 support Sep 29, 2025
@omertuc
Copy link
Contributor

omertuc commented Sep 29, 2025

Let's hold it until we have a conversations API

@luis5tb
Copy link
Author

luis5tb commented Oct 6, 2025

Let's hold it until we have a conversations API

Conversations (initial support) PR was merged! \o/
