add gke monitoring helm support. #1600
Conversation
name: {{ $saName }}
namespace: {{ .Release.Namespace }}
roleRef:
kind: ClusterRole
Let's not create new ClusterRoles; #1393 asks for converting the existing cluster-scoped RBAC to namespace-scoped.
If we use namespace-scoped RBAC, make sure all the namespaces align: the EPP, the secret, and the namespace in ClusterPodMonitoring.
Also, since we are using namespace-scoped objects, we can consider using PodMonitoring instead of ClusterPodMonitoring; PodMonitoring is also namespace scoped.
@JeffLuoo does PodMonitoring exist outside of GKE?
I think it would be good if we can find a namespace scoped solution, but one that works for all deployment options.
PodMonitoring is available by default on GKE, but people can install it on any K8s distribution by deploying it manually using https://github.com/GoogleCloudPlatform/prometheus-engine.
@sallyom adds an option using the prometheus-operator: https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1425/files and I believe it is namespace scoped.
or did you mean PodMonitor from Prometheus operator?
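For comparison, the prometheus-operator equivalent mentioned above would look roughly like this. This is a hedged sketch, not part of the PR: the name, namespace, and selector label are illustrative, and it assumes the EPP exposes a port named metrics.

```yaml
# Sketch of a prometheus-operator PodMonitor; like GMP's PodMonitoring
# it is namespace scoped, unlike ClusterPodMonitoring.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: epp-metrics          # illustrative name
  namespace: my-namespace    # illustrative namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: epp   # assumed EPP selector label
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 10s
```

The main practical difference is the API group (monitoring.coreos.com vs monitoring.googleapis.com) and which operator reconciles it.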
I made a new commit (2d9a5e5) that makes most of the cluster-scoped resources namespaced. However, the following two resources are kept so that GMP can scrape metrics:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: {{ $roleName }}
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: {{ $roleBindingName }}
subjects:
  - kind: ServiceAccount
    name: {{ $saName }}
    namespace: {{ .Release.Namespace }}
roleRef:
  kind: ClusterRole
  name: {{ $roleName }}
  apiGroup: rbac.authorization.k8s.io
- If I change the ClusterRole to a Role, it does not let me set nonResourceURLs:
Error: UPGRADE FAILED: failed to create resource: Role.rbac.authorization.k8s.io "infpool-gemma-2b-metrics-reader" is invalid: rules[0].nonResourceURLs: Invalid value: []string{"/metrics"}: namespaced rules cannot apply to non-resource URLs
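For reference, this is roughly the namespaced variant that the API server rejects; a sketch reusing the chart's $roleName helper, not an exact copy of the attempted commit.

```yaml
# Sketch of the rejected namespaced Role: nonResourceURLs such as
# /metrics may only appear in a ClusterRole, so validation fails.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: {{ $roleName }}
  namespace: {{ .Release.Namespace }}
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
```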
- If I remove those two completely, GMP just cannot scrape the metrics because of a permission issue. I thought the following PodMonitoring and secret-reader role binding should work, but they didn't. Is this expected? @JeffLuoo
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: {{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
spec:
  endpoints:
    - port: metrics
      scheme: http
      interval: {{ .Values.inferenceExtension.monitoring.interval }}
      path: /metrics
      authorization:
        type: Bearer
        credentials:
          secret:
            name: {{ $secretName }}
            key: token
  selector:
    matchLabels:
      {{- include "gateway-api-inference-extension.selectorLabels" . | nindent 8 }}
Yes. The nonResourceURLs rule for the metrics URL is required, and nonResourceURLs is cluster scoped: https://github.com/kubernetes/kubernetes/blob/f42b497cf25548aa0f327c675e11c57240bfab4b/staging/src/k8s.io/api/rbac/v1/types.go#L68-L69. Can you try keeping the two roles you removed and see if PodMonitoring works then?
Which two roles are you referring to? The current commit is a working version, keeping only the ClusterRole and ClusterRoleBinding for metrics read.
I attempted a summary of this in #1393 (comment). TL;DR: if cluster-scoped RBAC is unavoidable, we can use uniquely named Cluster RBAC objects to avoid collisions.
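A sketch of the uniquely-named variant, assuming the name is derived from the release namespace and name (the exact naming helper is illustrative, not taken from the commit):

```yaml
# Sketch: embed the release namespace in the ClusterRole name so two
# installs in different namespaces don't collide on cluster-scoped RBAC.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: {{ .Release.Namespace }}-{{ .Release.Name }}-metrics-reader
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
```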
SG, thanks. I amended the commit: now only the ClusterRole and ClusterRoleBinding for GKE metrics read remain, and their names include the namespace. Updated the PR description to reflect what the chart now contains.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: zetxqx The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
a few nits, otherwise lgtm
gke:
  enabled: false
  # Set to true if the cluster is an Autopilot cluster.
  autopilot: false
To be future proof, let's use provider.gke.autopilot, in case we need to parameterize other stuff for Autopilot.
Giving it some thought: the provider field in values.yaml only has a name field. So there are a few options:
Option 1: Keep as is (nested under monitoring)
This is the current approach, where the GKE-specific setting is nested directly under the feature it affects.
values.yaml snippet:
# ...
monitoring:
  interval: "10s"
  # ...
  gke:
    enabled: false
    # Set to true if the cluster is an Autopilot cluster.
    autopilot: false
# ...
Option 2: Centralized provider with the current name field
Given there is a name field under provider.
values.yaml snippet:
provider:
  # The name of the provider. Supported values: "gke", "none".
  name: gke
  # GKE-specific configuration.
  # This block is only used if name is "gke".
  gke:
    # Set to true if the cluster is an Autopilot cluster.
    autopilot: false
Option 3: Exclusive provider block
But this may not be backward compatible; we would need to change the upstream values in llm-d as well.
The user enables a provider by uncommenting its block. Only one block should be active.
values.yaml snippet:
# Cloud provider specific configuration.
# You MUST enable exactly ONE provider.
provider:
  # Google Kubernetes Engine (GKE) specific configuration
  gke:
    # Set to true if the cluster is an Autopilot cluster.
    # This is optional and defaults to false if not set.
    autopilot: false
  # Generic provider for non-cloud-specific setups (would be commented out)
  # none: {}
Given the above three options, I feel keeping it as is may be simplest, and currently autopilot is only needed for monitoring. If a new feature comes in, we can refactor the values structure at that time.
I am OK with this; we probably don't need to treat Helm values as strongly as APIs.
fwiw, I like the second option
Updated to use the second option, please take a look
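With Option 2 in place, GKE-only templates would presumably be gated on the provider name, roughly as follows. This is a sketch of the guard pattern, not the actual template from the commit; the resource body is elided.

```yaml
# Sketch of a template guard for Option 2: GKE-specific monitoring
# resources render only when provider.name is "gke".
{{- if eq .Values.provider.name "gke" }}
apiVersion: monitoring.googleapis.com/v1
kind: ClusterPodMonitoring
# ... GKE-specific monitoring resource body ...
{{- end }}
```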
/lgtm @JeffLuoo can you do another pass?
New changes are detected. LGTM label has been removed.
/lgtm Thanks for adding it!
What type of PR is this?
/kind feature
What this PR does / why we need it:
Following https://cloud.google.com/stackdriver/docs/managed-prometheus/exporters/inference-optimized-gateway,
Add Helm-conditional GKE monitoring: all GKE-specific monitoring resources (ClusterPodMonitoring, ServiceAccount, Secret, and associated RBAC rules) are added and wrapped in a conditional block. They will only be deployed if inferenceExtension.monitoring.gke.enabled is set to true in values.yaml. This prevents the creation of unnecessary resources when GKE monitoring is not required. Tested by using the following command
Which issue(s) this PR fixes:
Fixes #1452
Does this PR introduce a user-facing change?: