
Commit 4714892

[Feature] Improve and fix Prometheus & Grafana integrations (#895)
The old Prometheus and Grafana document (observability.md) had some issues that made it hard for users to follow. This PR fixes those issues.
1 parent f058924 commit 4714892


7 files changed: +211 -257 lines changed


config/prometheus/podMonitor.yaml

Lines changed: 6 additions & 8 deletions
```diff
@@ -3,21 +3,19 @@ kind: PodMonitor
 metadata:
   name: ray-workers-monitor
   namespace: prometheus-system
+  labels:
+    # `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label.
+    release: prometheus
 spec:
   jobLabel: ray-workers
+  # Only select Kubernetes Pods in the "default" namespace.
   namespaceSelector:
     matchNames:
       - default
-      - ray-system
+  # Only select Kubernetes Pods with "matchLabels".
   selector:
     matchLabels:
       ray.io/node-type: worker
-      ray.io/is-ray-node: "yes"
+  # A list of endpoints allowed as part of this PodMonitor.
   podMetricsEndpoints:
     - port: metrics
-      interval: 1m
-      scrapeTimeout: 10s
-  # - targetPort: 90001
-  #   interval: 1m
-  #   scrapeTimeout: 10s
-
```
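To sanity-check the updated PodMonitor, a minimal sketch is shown below. It assumes the Prometheus Operator CRDs are already installed, that the file is applied from a checkout of this repository, and that the Prometheus stack runs in `prometheus-system`; adjust the path and namespace for your setup.

```bash
# Apply the updated PodMonitor from the repository checkout.
kubectl apply -f config/prometheus/podMonitor.yaml

# Verify the object exists and carries the `release: prometheus` label
# that the Prometheus Operator is expected to select on.
kubectl get podmonitor ray-workers-monitor -n prometheus-system --show-labels
```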

config/prometheus/serviceMonitor.yaml

Lines changed: 5 additions & 3 deletions
```diff
@@ -4,17 +4,19 @@ metadata:
   name: ray-head-monitor
   namespace: prometheus-system
   labels:
-    release: prometheus-operator
-    ray.io/node-type: head
+    # `release: $HELM_RELEASE`: Prometheus can only detect ServiceMonitor with this label.
+    release: prometheus
 spec:
   jobLabel: ray-head
+  # Only select Kubernetes Services in the "default" namespace.
   namespaceSelector:
     matchNames:
       - default
-      - ray-system
+  # Only select Kubernetes Services with "matchLabels".
   selector:
     matchLabels:
       ray.io/node-type: head
+  # A list of endpoints allowed as part of this ServiceMonitor.
   endpoints:
     - port: metrics
   targetLabels:
```
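The `release: prometheus` label only helps if it matches the Helm release name of your Prometheus Operator installation. The sketch below is one way to check, assuming a kube-prometheus-stack style install in the `prometheus-system` namespace (the release name and namespace here are assumptions, not something this commit prescribes).

```bash
# Find the Helm release name of the Prometheus stack; it should match the
# `release:` label used in serviceMonitor.yaml and podMonitor.yaml.
helm list -n prometheus-system

# Show the label selector the Prometheus custom resource uses to pick up
# ServiceMonitors; the ServiceMonitor above must satisfy this selector.
kubectl get prometheus -n prometheus-system \
  -o jsonpath='{.items[0].spec.serviceMonitorSelector}{"\n"}'
```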

docs/guidance/observability.md

Lines changed: 1 addition & 245 deletions
```diff
@@ -51,248 +51,4 @@ curl --request GET '<baseUrl>/apis/v1alpha2/namespaces/<namespace>/clusters/<ray
 
 ## Ray Cluster: Monitoring with Prometheus & Grafana
 
-In this section we will describe how to monitor Ray Clusters in Kubernetes using Prometheus & Grafana.
-
-We also leverage the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) to start the whole monitoring system.
-
-Requirements:
-- Prometheus deployed in Kubernetes
-  - Required CRD: `servicemonitors.monitoring.coreos.com`
-  - Requered CRD: `podmonitors.monitoring.coreos.com`
-- Grafana up and running
-
-### Enable Ray Cluster Metrics
-
-Before we define any Prometheus objects, let us first enable and export metrics to a specific port.
-
-To enable ray metrics on Head node or a worker node, we need to pass the following option `--metrics-expose-port=9001`. We can set the specific option by adding `metrics-export-port: "9001"` to the head node & worker nodes in the rayclusters.ray.io manifest.
-
-We also need to export port `9001` in the head node & worker nodes
-
-```yaml
-apiVersion: ray.io/v1alpha1
-kind: RayCluster
-metadata:
-  ...
-  name: ray-cluster
-spec:
-  enableInTreeAutoscaling: true
-  headGroupSpec:
-    rayStartParams:
-      ...
-      metrics-export-port: "9001" <--- Enable for the head node
-      ...
-    template:
-      metadata:
-        ...
-      spec:
-        ...
-        containers:
-        - ports:
-          - containerPort: 10001
-            name: client
-            protocol: TCP
-          - containerPort: 8265
-            name: dashboard
-            protocol: TCP
-          - containerPort: 8000
-            name: ray-serve
-            protocol: TCP
-          - containerPort: 6379
-            name: redis
-            protocol: TCP
-          - containerPort: 9001
-            name: metrics
-            protocol: TCP
-  workerGroupSpecs:
-  - groupName: workergroup
-    ...
-    rayStartParams:
-      ...
-      metrics-export-port: "9001" <--- Enable for worker nodes
-      ...
-    template:
-      metadata:
-        ...
-      spec:
-        ...
-        containers:
-        - ports:
-          - containerPort: 9001
-            name: metrics
-            protocol: TCP
-          ...
-  ...
-```
-
-If you use `$kuberay/helm-chart/ray-cluster`, then you can add it in the `values.yaml`
-
-```yaml
-head:
-  groupName: headgroup
-  ...
-  initArgs:
-    metrics-export-port: "9001" <--- Enable for the head node
-    ...
-  ports:
-  - containerPort: 10001
-    protocol: TCP
-    name: "client"
-  - containerPort: 8265
-    protocol: TCP
-    name: "dashboard"
-  - containerPort: 8000
-    protocol: TCP
-    name: "ray-serve"
-  - containerPort: 6379
-    protocol: TCP
-    name: "redis"
-  - containerPort: 9001 <--- Enable this port
-    protocol: TCP
-    name: "metrics"
-  ...
-worker:
-  groupName: workergroup
-  ...
-  initArgs:
-    ...
-    metrics-export-port: "9001" <--- Enable for the head node
-  ports:
-  - containerPort: 9001 <--- Enable this port
-    protocol: TCP
-    name: "metrics"
-  ...
-...
-```
-
-Deploying the cluster with the above options should export metrics on port `9001`. To check, we can port-forward port `9001` to our localhost and query via curl.
-
-```bash
-k port-forward <ray-head-node-id> 9001:9001
-```
-
-From a second terminal issue
-
-```bash
-$> curl localhost:9001
-# TYPE ray_pull_manager_object_request_time_ms histogram
-...
-ray_pull_manager_object_request_time_ms_sum{Component="raylet",...
-...
-```
-
-Before we move on, first ensure that the required metrics port is also defined in the Ray's cluster Kubernetes service. This is done automatically via the Ray Operator if you define the metrics port `containerPort: 9001` along with the name and protocol.
-
-```bash
-$> kubectl get svc <ray-cluster-name>-head-svc -o yaml
-NAME   TYPE       ...  PORT(S)                                        ...
-...    ClusterIP  ...  6379/TCP,9001/TCP,10001/TCP,8265/TCP,8000/TCP  ...
-```
-
-We are now ready to create the required Prometheus CRDs to collect metrics
-
-### Collect Head Node metrics with ServiceMonitors
-
-Prometheus provides a CRD that targets Kubernetes services to collect metrics. The idea is that we will define a CRD that will have selectors that match the Ray Cluster Kubernetes service labels and ports, the metrics port.
-
-```yaml
-apiVersion: monitoring.coreos.com/v1
-kind: ServiceMonitor
-metadata:
-  name: <ray-cluster-name>-head-monitor <-- Replace <ray-cluster-name> with the actual Ray Cluster name
-  namespace: <ray-cluster-namespace> <-- Add the namespace of your ray cluster
-spec:
-  endpoints:
-  - interval: 1m
-    path: /metrics
-    scrapeTimeout: 10s
-    port: metrics
-  jobLabel: <ray-cluster-name>-ray-head <-- Replace <ray-cluster-name> with the actual Ray Cluster name
-  namespaceSelector:
-    matchNames:
-    - <ray-cluster-namespace> <-- Add the namespace of your ray cluster
-  selector:
-    matchLabels:
-      ray.io/cluster: <ray-cluster-name> <-- Replace <ray-cluster-name> with the actual Ray Cluster name
-      ray.io/identifier: <ray-cluster-name>-head <-- Replace <ray-cluster-name> with the actual Ray Cluster name
-      ray.io/node-type: head
-  targetLabels:
-  - ray.io/cluster
-```
-
-A notes for the `targetLabels`. We added `spec.targetLabels[0].ray.io/cluster` because we want to include the name of the ray cluster in the metrics that will be generated by this service monitor. The `ray.io/cluster` label is part of the Ray head node service and it will be transformed to a `ray_io_cluster` metric label. That is, any metric that will be imported, will also container the following label `ray_io_cluster=<ray-cluster-name>`. This may seem like optional but it becomes mandatory if you deploy multiple ray clusters.
-
-Create the above service monitor by issuing
-
-```bash
-k apply -f serviceMonitor.yaml
-```
-
-After a while, Prometheus should start scraping metrics from the head node. You can confirm that by visiting the Prometheus web ui and start typing `ray_`. Prometheus should create a dropdown list with suggested Ray metrics.
-
-```bash
-curl 'https://<prometheus-endpoint>/api/v1/query?query=ray_object_store_available_memory' -H 'Accept: */*'
-```
-
-### Collect Worker Node metrics with PodMonitors
-
-Ray operator does not create a Kubernetes service for the ray workers, therefore we can not use a Prometheus ServiceMonitors to scrape the metrics from our workers.
-
-**Note**: We could create a Kubernetes service with selectors a common label subset from our worker pods, however this is not ideal because our workers are independent from each other, that is, they are not a collection of replicas spawned by replicaset controller. Due to that, we should avoid using a Kubernetes service for grouping them together.
-
-To collect worker metrics, we can use `Prometheus PodMonitros CRD`.
-
-```yaml
-apiVersion: monitoring.coreos.com/v1
-kind: PodMonitor
-metadata:
-  labels:
-    ray.io/cluster: <ray-cluster-name> <-- Replace <ray-cluster-name> with the actual Ray Cluster name
-  name: <ray-cluster-name>-workers-monitor <-- Replace <ray-cluster-name> with the actual Ray Cluster name
-  namespace: <ray-cluster-namespace> <-- Add the namespace of your ray cluster
-spec:
-  jobLabel: <ray-cluster-name>-ray-workers <-- Replace <ray-cluster-name> with the actual Ray Cluster name
-  namespaceSelector:
-    matchNames:
-    - <ray-cluster-namespace> <-- Add the namespace of your ray cluster
-  podMetricsEndpoints:
-  - interval: 30s
-    port: metrics
-    scrapeTimeout: 10s
-  podTargetLabels:
-  - ray.io/cluster
-  selector:
-    matchLabels:
-      ray.io/is-ray-node: "yes"
-      ray.io/node-type: worker
-```
-
-Since we are not selecting a Kubernetes service but pods, our `matchLabels` now define a set of labels that is common on all Ray workers.
-
-We also define `metadata.labels` by manually adding `ray.io/cluster: <ray-cluster-name>` and then instructing the PodMonitors resource to add that label in the scraped metrics via `spec.podTargetLabels[0].ray.io/cluster`.
-
-Apply the above PodMonitor manifest
-
-```bash
-k apply -f podMonitor.yaml
-```
-
-Last, wait a bit and then ensure that you can see Ray worker metrics in Prometheus
-
-```bash
-curl 'https://<prometheus-endpoint>/api/v1/query?query=ray_object_store_available_memory' -H 'Accept: */*'
-```
-
-The above http query should yield metrics from the head node and your worker nodes
-
-We have everything we need now and we can use Grafana to create some panels and visualize the scrapped metrics
-
-### Grafana: Visualize ingested Ray metrics
-
-You can use the json in `config/grafana` to import in Grafana the Ray dashboards.
-
-### Custom Metrics & Alerting
-
-We can also define custom metrics, and create alerts by using `prometheusrules.monitoring.coreos.com` CRD. Because custom metrics, and alerting is different for each team and setup, we have included an example under `$kuberay/config/prometheus/rules` that you can use to build custom metrics and alerts
-
-
+See [prometheus-grafana.md](./prometheus-grafana.md) for more details.
```
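The step-by-step walkthrough now lives in prometheus-grafana.md, but the end-to-end check from the removed text still applies once the ServiceMonitor and PodMonitor are in place. A rough sketch, assuming the Prometheus Operator exposes its usual `prometheus-operated` service on port 9090 in `prometheus-system` (both details are assumptions about your installation):

```bash
# Port-forward the Prometheus server created by the Prometheus Operator.
kubectl port-forward svc/prometheus-operated -n prometheus-system 9090:9090

# From a second terminal, query a Ray metric; once the ServiceMonitor and
# PodMonitor are picked up, the result should include series from both the
# head node and the workers.
curl 'http://localhost:9090/api/v1/query?query=ray_object_store_available_memory'
```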
