Description
Is your feature request related to a problem?
Query Insights currently lacks the granularity needed for deep performance analysis at the shard level, which makes it hard to pinpoint bottlenecks and attribute resource usage precisely. Specifically, there are two issues:
- We cannot see the actual search latency for each individual shard involved in a query.
- We cannot reliably attribute the reported CPU and memory usage to a specific shard ID when a node hosts multiple shards for the same index.

These issues block deep analysis of historical top queries as well as investigations of in-flight queries.
What solution would you like?
We propose enhancing the top queries feature to capture and expose two key metrics at the shard level:
- Shard-level search latency. This is not the same as phase_took, which tracks phase-level latency on the coordinator node. There are two ways to get shard-level latency:
  - Extend the piggyback logic with one more field, latency (preferred). The piggybacked results currently carry only CPU usage and memory usage.
  - Use the SearchOperationListener, which is triggered when a shard search finishes on a data node. This requires correlating the results offline and merging them back to the coordinator nodes. The listener is also only exposed to the index module right now, so it would need to be exposed for plugins to use.
- Shard ID on each task resource usage. Although the shard can sometimes be inferred from the node and the routing table, one node can host multiple shards of the same index, so it is hard to tell which shard a given task resource usage object belongs to. We can solve this by modifying the task resource usage object to use the shard ID instead of the node ID. ([FEATURE] Add shardID to Top N response #250)
The data above will be useful for in-flight queries as well. If we enhance the TaskResourceInfo object (the piggyback object) to include latency and shard ID, we can display that information for shards that have already finished (see [FEATURE] Add In-Flight Query Tracking and Management to Query Insights Dashboard query-insights-dashboards#152 (comment)). For shard searches that are still running, we will also need to call the task API to get the ongoing searchShardTasks, so that we can compute their latency (time elapsed since start).
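To make the proposal concrete, here is a minimal sketch of what a per-shard usage record with latency and shard ID could enable. It is an illustration only: `ShardTaskUsage` and `ShardUsageAnalysis` are hypothetical names, not the actual `TaskResourceInfo` class in OpenSearch, and every field name and unit is an assumption.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical stand-in for the piggybacked usage object; this is NOT the
// actual OpenSearch TaskResourceInfo class, and all field names are assumed.
final class ShardTaskUsage {
    final String index;
    final int shardId;
    final String nodeId;
    final long cpuTimeNanos;
    final long memoryBytes;
    final long latencyNanos; // the proposed additional field

    ShardTaskUsage(String index, int shardId, String nodeId,
                   long cpuTimeNanos, long memoryBytes, long latencyNanos) {
        this.index = index;
        this.shardId = shardId;
        this.nodeId = nodeId;
        this.cpuTimeNanos = cpuTimeNanos;
        this.memoryBytes = memoryBytes;
        this.latencyNanos = latencyNanos;
    }

    // Index name plus shard number identifies a shard even when one node
    // hosts several shards of the same index.
    String shardKey() {
        return index + "[" + shardId + "]";
    }
}

final class ShardUsageAnalysis {
    private ShardUsageAnalysis() {}

    // Finished shard with the highest reported latency, if any.
    static Optional<ShardTaskUsage> slowestShard(List<ShardTaskUsage> usages) {
        return usages.stream()
                .max(Comparator.comparingLong((ShardTaskUsage u) -> u.latencyNanos));
    }

    // Total CPU time attributed to each shard rather than to each node.
    static Map<String, Long> cpuNanosByShard(List<ShardTaskUsage> usages) {
        return usages.stream().collect(Collectors.groupingBy(
                ShardTaskUsage::shardKey,
                Collectors.summingLong((ShardTaskUsage u) -> u.cpuTimeNanos)));
    }

    // For a shard task that is still running: time elapsed since it started.
    static long elapsedNanos(long startNanos, long nowNanos) {
        return Math.max(0L, nowNanos - startNanos);
    }
}
```

With records shaped like this, two shards of the same index on the same node remain distinguishable, which is the attribution gap described above. For in-flight tasks, the start time would have to come from the task API.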
What alternatives have you considered?
To get shard-level search latency, we can use the SearchOperationListener, but as mentioned, this requires more work than enhancing the piggybacking logic.
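The extra work in the listener-based alternative is mostly the correlation step. A rough sketch of that step follows, assuming each data-node event carries an identifier that ties it back to the coordinator-side query. The classes `ShardLatencyEvent` and `ShardLatencyCorrelator` and the `parentTaskId` key are hypothetical, and the sketch does not use the real SearchOperationListener interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event emitted on a data node when a shard search finishes.
final class ShardLatencyEvent {
    final String parentTaskId; // assumed key linking back to the coordinator query
    final int shardId;
    final long tookNanos;

    ShardLatencyEvent(String parentTaskId, int shardId, long tookNanos) {
        this.parentTaskId = parentTaskId;
        this.shardId = shardId;
        this.tookNanos = tookNanos;
    }
}

// Hypothetical buffer that groups shard events by parent task until the
// coordinator-side record for that query is ready to be enriched.
final class ShardLatencyCorrelator {
    private final Map<String, List<ShardLatencyEvent>> byParent = new HashMap<>();

    // Store one shard event under its parent task.
    void record(ShardLatencyEvent event) {
        byParent.computeIfAbsent(event.parentTaskId, k -> new ArrayList<>()).add(event);
    }

    // Remove and return all shard events collected for one query.
    List<ShardLatencyEvent> drain(String parentTaskId) {
        List<ShardLatencyEvent> events = byParent.remove(parentTaskId);
        return events == null ? Collections.emptyList() : events;
    }
}
```

A real implementation would also have to ship these events across nodes and expire entries whose query record never arrives. The piggyback approach avoids both concerns.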
Do you have any additional context?
opensearch-project/OpenSearch#13172
opensearch-project/OpenSearch#12399
opensearch-project/OpenSearch#17407
opensearch-project/query-insights-dashboards#152