add gke monitoring helm support. #1600

zetxqx · 2025-09-16T01:29:47Z

What type of PR is this?
/kind feature

What this PR does / why we need it:

This PR follows https://cloud.google.com/stackdriver/docs/managed-prometheus/exporters/inference-optimized-gateway and adds conditional GKE monitoring to the Helm chart.

All GKE-specific monitoring resources (ClusterPodMonitoring, ServiceAccount, Secret, and associated RBAC rules) are wrapped in a conditional block. They are deployed only if inferenceExtension.monitoring.gke.enabled is set to true in values.yaml, which avoids creating unnecessary resources when GKE monitoring is not required.
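As a rough illustration of that gating, here is a minimal sketch of how a template could guard a GKE-only resource with the values flag named above. The flag path comes from this PR; the resource name pattern is an assumption inferred from the helm status output, not a copy of the chart's template.

```yaml
# Sketch only: rendered when inferenceExtension.monitoring.gke.enabled is true.
{{- if .Values.inferenceExtension.monitoring.gke.enabled }}
apiVersion: v1
kind: ServiceAccount
metadata:
  # Assumed naming pattern; the chart may derive this name differently.
  name: {{ .Release.Name }}-metrics-reader-sa
  namespace: {{ .Release.Namespace }}
{{- end }}
```

With the flag left at false, nothing inside the if block is rendered, so no GKE monitoring objects reach the cluster.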

Tested with the following commands:

❯ export NAMESPACE=inference-demo
export HELM_RELEASE_NAME=infpool-gemma-2b

❯ helm upgrade -i $HELM_RELEASE_NAME \
  config/charts/inferencepool \
  -n $NAMESPACE \
  --create-namespace \
  --set inferencePool.modelServers.matchLabels.app=vllm-gemma2b \
  --set provider.name=gke \
  --set inferenceExtension.monitoring.gke.enabled=true
  
❯ helm status infpool-gemma-2b --show-resources
NAME: infpool-gemma-2b
LAST DEPLOYED: Tue Sep 16 22:07:24 2025
NAMESPACE: inference-demo
STATUS: deployed
REVISION: 12
RESOURCES:
==> v1/InferencePool
NAME               AGE
infpool-gemma-2b   3h21m

==> v1/PodMonitoring
infpool-gemma-2b   3h21m

==> v1/ServiceAccount
NAME                                 SECRETS   AGE
infpool-gemma-2b-metrics-reader-sa   0         3h21m
infpool-gemma-2b-epp                 0         3h21m

==> v1/Secret
NAME                                     TYPE                                  DATA   AGE
infpool-gemma-2b-metrics-reader-secret   kubernetes.io/service-account-token   3      3h21m

==> v1/ConfigMap
NAME                   DATA   AGE
infpool-gemma-2b-epp   1      3h21m

==> v1/ClusterRole
NAME                                             CREATED AT
inference-demo-infpool-gemma-2b-metrics-reader   2025-09-16T22:07:26Z
infpool-gemma-2b-inference-demo-epp              2025-09-16T18:47:18Z

==> v1/ClusterRoleBinding
NAME                                                          ROLE                                                         AGE
inference-demo-infpool-gemma-2b-metrics-reader-role-binding   ClusterRole/inference-demo-infpool-gemma-2b-metrics-reader   74s
infpool-gemma-2b-inference-demo-epp                           ClusterRole/infpool-gemma-2b-inference-demo-epp              3h21m

==> v1/RoleBinding
NAME                                                                              ROLE                                               AGE
gmp-system:collector:inference-demo-infpool-gemma-2b-metrics-reader-secret-read   Role/infpool-gemma-2b-metrics-reader-secret-read   3h21m
infpool-gemma-2b-epp                                                              Role/infpool-gemma-2b-epp                          3h21m

==> v1/Service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
infpool-gemma-2b-epp   ClusterIP   34.118.236.102   <none>        9002/TCP,9090/TCP   3h21m

==> v1/Deployment
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
infpool-gemma-2b-epp   1/1     1            1           3h21m

==> v1/Role
NAME                                          CREATED AT
infpool-gemma-2b-metrics-reader-secret-read   2025-09-16T18:47:19Z
infpool-gemma-2b-epp                          2025-09-16T18:47:19Z

==> v1/Pod(related)
NAME                                    READY   STATUS    RESTARTS   AGE
infpool-gemma-2b-epp-868c7675c6-rbvw9   1/1     Running   0          3h21m

==> v1/HealthCheckPolicy
NAME               AGE
infpool-gemma-2b   3h21m


TEST SUITE: None
NOTES:
InferencePool infpool-gemma-2b deployed.
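For reference, the three --set flags used in the test above could also be expressed as a values override file. This is a sketch under the assumption that the keys map one-to-one to the flags; the file name gke-values.yaml is hypothetical.

```yaml
# gke-values.yaml (hypothetical file name)
# Equivalent to the --set flags in the helm upgrade command above.
inferencePool:
  modelServers:
    matchLabels:
      app: vllm-gemma2b
provider:
  name: gke
inferenceExtension:
  monitoring:
    gke:
      enabled: true
```

It would be passed with -f gke-values.yaml in place of the individual --set arguments.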

Which issue(s) this PR fixes:

Fixes #1452

Does this PR introduce a user-facing change?:

NONE

netlify · 2025-09-16T01:30:18Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`39a943b`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68ccb0e217aa5e0008df9ea5
😎 Deploy Preview	https://deploy-preview-1600--gateway-api-inference-extension.netlify.app

zetxqx · 2025-09-16T01:31:16Z

/assign @JeffLuoo @liu-cong @ahg-g

Configured the helm chart for gke monitoring, could you take a look?

liu-cong · 2025-09-16T16:23:41Z

config/charts/inferencepool/templates/gke.yaml

+  name: {{ $saName }}
+  namespace: {{ .Release.Namespace }}
+roleRef:
+  kind: ClusterRole


Let's not create new ClusterRoles; #1393 asks to convert the existing cluster-scoped RBAC to namespace-scoped.

If we use namespace-scoped RBAC, make sure all the namespaces align: the EPP, the secret, and the namespace in the cluster pod monitoring.

Also, since we are using namespace-scoped objects, we can consider using PodMonitoring instead of ClusterPodMonitoring, since PodMonitoring is also namespace-scoped.

@JeffLuoo does PodMonitoring exist outside of GKE?
I think it would be good if we can find a namespace scoped solution, but one that works for all deployment options.

PodMonitoring is available by default on GKE. But people can install it on any K8s distribution by deploying it manually using https://github.com/GoogleCloudPlatform/prometheus-engine.

@sallyom added an option using the prometheus-operator: https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1425/files and I believe it is namespace-scoped.

Or did you mean PodMonitor from the Prometheus operator?

I made a new commit (2d9a5e5) that makes most of the cluster-scoped resources namespaced. However, the following two resources are kept so that GMP can scrape the metrics:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: {{ $roleName }}
rules:
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: {{ $roleBindingName }}
subjects:
- kind: ServiceAccount
  name: {{ $saName }}
  namespace: {{ .Release.Namespace }}
roleRef:
  kind: ClusterRole
  name: {{ $roleName }}
  apiGroup: rbac.authorization.k8s.io

If I change ClusterRole to Role, it does not let me set nonResourceURLs:

Error: UPGRADE FAILED: failed to create resource: Role.rbac.authorization.k8s.io "infpool-gemma-2b-metrics-reader" is invalid: rules[0].nonResourceURLs: Invalid value: []string{"/metrics"}: namespaced rules cannot apply to non-resource URLs

If I remove those two completely, GMP cannot scrape the metrics because of a permission issue. I thought the following PodMonitoring and secret-read role binding would work, but they didn't. Is this expected? @JeffLuoo

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: {{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
spec:
  endpoints:
  - port: metrics
    scheme: http
    interval: {{ .Values.inferenceExtension.monitoring.interval }}
    path: /metrics
    authorization:
      type: Bearer
      credentials:
        secret:
          name: {{ $secretName }}
          key: token
  selector:
    matchLabels:
      {{- include "gateway-api-inference-extension.selectorLabels" . | nindent 8 }}
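For context, the secret-read role binding mentioned above might look roughly like the sketch below. The object names follow the helm status output in the PR description; the rule verbs and the subject (a ServiceAccount named collector in the gmp-system namespace) are assumptions inferred from the RoleBinding name, not confirmed by this thread.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: {{ .Release.Name }}-metrics-reader-secret-read
  namespace: {{ .Release.Namespace }}
rules:
- apiGroups: [""]
  resources: ["secrets"]
  # Restrict access to the single token secret used for scraping.
  resourceNames: ["{{ .Release.Name }}-metrics-reader-secret"]
  verbs: ["get", "list", "watch"]  # assumed verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gmp-system:collector:{{ .Release.Namespace }}-{{ .Release.Name }}-metrics-reader-secret-read
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: {{ .Release.Name }}-metrics-reader-secret-read
subjects:
- kind: ServiceAccount
  name: collector        # assumed GMP collector ServiceAccount
  namespace: gmp-system  # assumed; may differ by cluster type
```

A binding like this only lets the collector read the token. It does not by itself authorize the token's ServiceAccount to GET /metrics, which would be consistent with scraping failing once the ClusterRole is removed.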

Yes. The nonResourceURLs rule for the metrics URL is required, and nonResourceURLs is cluster-scoped: https://github.com/kubernetes/kubernetes/blob/f42b497cf25548aa0f327c675e11c57240bfab4b/staging/src/k8s.io/api/rbac/v1/types.go#L68-L69. Can you try keeping the two roles you removed and see whether PodMonitoring works then?

Which two roles are you referring to? The current commit is a working version; it keeps only the ClusterRole and ClusterRoleBinding for metrics read.

I attempted a summary of this in #1393 (comment). TL;DR: if cluster RBAC is unavoidable, we can use uniquely named cluster RBAC resources to avoid collisions.

SG, thanks. I amended the commit; now I keep only the ClusterRole and ClusterRoleBinding for GKE metrics read, and I changed their names to include the namespace. I also updated the PR description to reflect what we now have in the Helm chart.
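A minimal sketch of the namespace-qualified naming described here, assuming the name is built from the release namespace and release name. That pattern matches the inference-demo-infpool-gemma-2b-metrics-reader entry in the helm status output, but the exact template expression in the chart may differ.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # Prefixing with the namespace keeps two releases that share a name
  # in different namespaces from colliding on this cluster-scoped object.
  name: {{ .Release.Namespace }}-{{ .Release.Name }}-metrics-reader
rules:
- nonResourceURLs:
  - /metrics
  verbs:
  - get
```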

config/charts/inferencepool/templates/epp-sa-token-secret.yaml

config/charts/inferencepool/templates/gke.yaml

k8s-ci-robot · 2025-09-16T19:59:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zetxqx
Once this PR has been reviewed and has the lgtm label, please ask for approval from ahg-g. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zetxqx · 2025-09-16T22:14:36Z

For review, you can check the helm status output in the PR description to see whether the Helm-installed resources are as expected: #1600 (comment)

zetxqx · 2025-09-18T21:37:09Z

@liu-cong @JeffLuoo can you take another look? We'll have another patch release, #1616. It would be great if this one could make it in.

liu-cong

a few nits, otherwise lgtm

config/charts/inferencepool/README.md

config/charts/inferencepool/templates/epp-sa-token-secret.yaml

liu-cong · 2025-09-18T22:12:52Z

config/charts/inferencepool/values.yaml

+    gke:
+      enabled: false
+      # Set to true if the cluster is an Autopilot cluster.
+      autopilot: false


To be future-proof, let's use provider.gke.autopilot, in case we need to parameterize other things for Autopilot.

After giving it some thought: the provider field in values.yaml only has a name field, so there are a few options:

Option 1: Keep As Is (Nested under monitoring)

This is the current approach, where the GKE-specific setting is nested directly under the feature it affects.

values.yaml Snippet

# ...
monitoring:
  interval: "10s"
  # ...
  gke:
    enabled: false
    # Set to true if the cluster is an Autopilot cluster.
    autopilot: false
# ...

Option 2: Centralized Provider with the current name field

This builds on the name field that already exists under provider.

values.yaml Snippet

provider:
  # The name of the provider. Supported values: "gke", "none".
  name: gke
  # GKE-specific configuration.
  # This block is only used if name is "gke".
  gke:
    # Set to true if the cluster is an Autopilot cluster.
    autopilot: false

Option 3: Exclusive Provider Block

This may not be backward compatible; we would need to change the upstream values in llm-d as well.

values.yaml Snippet

The user enables a provider by uncommenting its block. Only one block should be active.

# Cloud provider specific configuration.
# You MUST enable exactly ONE provider.
provider:
  # Google Kubernetes Engine (GKE) specific configuration
  gke:
    # Set to true if the cluster is an Autopilot cluster.
    # This is optional and defaults to false if not set.
    autopilot: false
  # Generic provider for non-cloud-specific setups (would be commented out)
  # none: {}

Given the three options above, I feel keeping it as is may be the simplest, since autopilot is currently needed only for monitoring. If a new feature comes in, we can refactor the values structure at that time.

I am OK with this; we probably don't need to treat Helm values as strictly as APIs.

fwiw, I like the second option

Updated to use the second option, please take a look

liu-cong · 2025-09-18T23:33:45Z

/lgtm

@JeffLuoo can you do another pass?

k8s-ci-robot · 2025-09-19T01:22:03Z

New changes are detected. LGTM label has been removed.

JeffLuoo · 2025-09-19T01:26:57Z

/lgtm

Thanks for adding it!

fix gke monitoring.

f413e5c

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 16, 2025

k8s-ci-robot requested review from danehans and elevran September 16, 2025 01:29

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 16, 2025

k8s-ci-robot assigned ahg-g, JeffLuoo and liu-cong Sep 16, 2025

liu-cong reviewed Sep 16, 2025

JeffLuoo reviewed Sep 16, 2025

config/charts/inferencepool/templates/epp-sa-token-secret.yaml (resolved)

config/charts/inferencepool/templates/gke.yaml (outdated, resolved)

liu-cong mentioned this pull request Sep 16, 2025

Convert cluster scoped RBAC to namespace scoped #1393

Open

change to namespaced resources as much as possible.

05900ee

zetxqx force-pushed the obserhelm branch from 2d9a5e5 to 05900ee Compare September 16, 2025 22:09

update helm chart readme.

2dc8516

zetxqx mentioned this pull request Sep 18, 2025

v1.0.1 patch release #1616

Open

liu-cong reviewed Sep 18, 2025

resolve nits.

f43ae48

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Sep 18, 2025

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 19, 2025

move autopilot to provider.gke.

39a943b

zetxqx force-pushed the obserhelm branch from c5be571 to 39a943b Compare September 19, 2025 01:24
