feat: Append existing resource labels to all related metrics #2129
Conversation
Welcome @alen-z!
/assign @dgrisonnet
Hi @dashpole, thank you for taking this one further. Hey @dgrisonnet, nice to meet you. Currently there's support for Pod and Service resources. I'll be extending the code to all resources, but until then please let me know if there's anything else that needs to be done or done differently to comply with your expectations. New commits soon...
Planning to allocate some time to finish this PR soon...
Hi @alen-z, sorry for the delay, I somehow missed this PR. One thing I am not sure I understand properly is how different your proposal is from the existing `--metric-labels-allowlist` option.
Hi @dgrisonnet, yes. With the current state, as far as I understand, we can not get arbitrary/custom Prometheus metric labels into all metrics that kube-state-metrics provides; we want to include labels in metrics from, e.g., existing Pod or Service labels. Having custom labels in only one kube-state-metrics metric (e.g. `kube_pod_labels`) is not enough for us.

The specific use case is: we have a certain set of metadata related to our product (a kind of service directory metadata) that we deploy along with our Kubernetes resources in the form of resource manifest labels. Those manifest labels contain information about the teams that own the product, the communication channel, etc. Now, we want to route native Prometheus alerts based on this custom owner label to the proper teams! This is only possible if, among other metrics, kube-state-metrics has the owner label in its metrics.

This is only one use case; there might be others related to visualization of Prometheus data based on filters, or to pulling out some statistics grouped by a custom label... Please let me know if there is a different approach to our use case that I might have missed.
That's totally intentional, because having these labels in all the metrics would increase the cardinality of the metrics tremendously and would cause kube-state-metrics to generate even more timeseries than it currently does, leading to higher memory consumption in Prometheus and potentially higher cost as well for users of a SaaS monitoring solution. Your request is very similar to these two that were opened some time ago:
And both your use case and their requests can be solved today by using joins in the PromQL queries of your alerts on the `kube_<resource>_labels` metrics (for example `kube_pod_labels`).
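As a minimal sketch of such a join (assuming `--metric-labels-allowlist=pods=[team]` is configured so that `kube_pod_labels` carries a `label_team` label; the metric and label names here are illustrative):

```promql
# Copy the owning team from kube_pod_labels onto a restart metric so that
# the resulting alert carries label_team and Alertmanager can route on it.
sum by (namespace, pod, label_team) (
    kube_pod_container_status_restarts_total
  * on (namespace, pod) group_left (label_team)
    kube_pod_labels
)
```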
PR needs rebase.
We have a very similar use case. The problem is we don't use Prometheus for queries, although we do still collect metrics via Prometheus in certain instances (like kube-state-metrics). Really it's via the otel-collector Prometheus receiver. So unfortunately the solution of joining metrics at query time does not currently work for us.
@DanTulovsky you could perhaps use the Prometheus agent to aggregate the metrics in the way you want and then remote-write them to your otel collector?
IMO your needs are outside of KSM's scope and need a middleware, such as the Prometheus agent, to be achieved.
I understand your concerns, but since this is an off-by-default feature, it would not break any existing users. People who choose to use it will need to understand the ramifications. For us, it would make things much simpler. The real solution here is for something like a kube-state-metrics receiver in the otel-collector :)
Still, having something like that is out of scope for kube-state-metrics. The only purpose of kube-state-metrics is to expose metrics about Kubernetes objects with a 1:1 mapping. Aggregation and relabeling should be done by other tools.
This is not a valid reason to add a feature to a project. At the end of the day, we would have to maintain that new functionality and most likely extend it for future needs. Not extending the project past its original purpose is intentional, to keep it maintainable and to avoid making it too complex for users to use.
I guess I don't understand how this is different from the existing `--metric-labels-allowlist` option.
Hi @dgrisonnet 👋
I would let cluster operators decide what they want to do and provide them with the mechanisms to do it. This goes along with what @DanTulovsky is saying with "People who choose to use this will need to understand the ramifications". Nothing would change by default for existing users.
I'd say that in theory you are right, but the reality is that many alert definitions are used from here and are not practical to change (to use joins).
Which is not a very convenient place for us, and I'd argue for a number of other people and use cases too.
It'd still expose information about objects with a 1:1 mapping, only describing them better with their own labels that already exist? Aggregation without it is hard; not impossible, but hard in reality.

Ultimately, @dgrisonnet and @rexagod, I'm interested in whether we should keep this PR open and continue the work here, or close it and fork it privately to support our use case without impacting this project. I suspected this might create some controversy, to be honest, so I'm not really surprised to meet the skepticism :) Good discussion. Looking forward to hearing from you.
In the current case, we add this information only once per resource. That's already very expensive if you expose all the labels of all the resources in your clusters, hence the warning, but if you were to do that for all the metrics exported by kube-state-metrics, the additional cost would be tremendous.
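To illustrate the trade-off with made-up series (assuming `--metric-labels-allowlist=pods=[team]`): today a resource label is stored on a single series per pod, whereas the proposal would repeat it on every series exposed for that pod.

```
# Today: the label appears once, on the dedicated labels metric.
kube_pod_labels{namespace="default", pod="checkout-7d9f", label_team="payments"} 1
kube_pod_info{namespace="default", pod="checkout-7d9f"} 1

# With the proposal: the label would be duplicated on every pod series.
kube_pod_info{namespace="default", pod="checkout-7d9f", team="payments"} 1
kube_pod_status_phase{namespace="default", pod="checkout-7d9f", phase="Running", team="payments"} 1
```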
Ultimately that's the goal with kube-state-metrics: empowering cluster operators to tune the project to match their needs without needing any knowledge about the codebase. That's why we've released support for CRDs and more recently have been discussing making kube-state-metrics configuration-based: #2165.

But, on top of the additional resource consumption from storing the label information multiple times, adding labels to all the metrics would go against the current state of kube-state-metrics, where we have metrics about specific parts of Kubernetes objects. Each one of them has one unique purpose, and that's how we currently keep kube-state-metrics simple to use and consume. Because of that, it would be wrong to add unrelated information to the various metrics.

So with the current state of kube-state-metrics, I don't think it is wise to add a mechanism that goes against the current design. Additionally, if we commit to this feature that goes against the best practices we've put in place, we will never be able to move away from it without breaking users. However, I am not against making that possible once we fully empower users to tweak kube-state-metrics however they want.
Charts are hard to extend, which is why https://github.com/prometheus-operator/kube-prometheus does not provide them but instead provides raw manifests generated from Jsonnet, which are way easier to extend. I would recommend looking into it if you want to update the rules to match your needs.
So then why make an exception for labels and not also add annotations, containers, statuses, ...? This is a far-fetched example, and I know why you specifically want labels, but that would break a boundary that we've set between the information.
I totally hear that, but I don't see a way we could improve that today without breaking the current UX of kube-state-metrics.
As you can probably tell, I am not a fan of adding this feature to ksm as it would be considered out of scope right now, and I'd prefer leaving any kind of aggregation to a third party.
I'd prefer much stronger arguments, but closing the PR.
I mean, the arguments are pretty strong to me:
Although I said that this feature might be reconsidered in the future if we empower our users even more, I would still discourage anyone from doing that, since duplicating data is a waste of resources.
What this PR does / why we need it:
KSM is great, but to make it even better (and fit our use case), here's a proposal for how to expand resource-related metrics with custom labels. Labels are expected to already be defined as part of the resource.
Currently supported resources: Pod, Service.
The issue is that Prometheus relabelings can't be used for KSM, because Prometheus scrapes only the KSM container, which provides metrics for many resources scattered around the cluster.

Ultimately, this should allow us to get custom labels into Alertmanager without any default alert rule changes in Prometheus. Having custom labels can significantly improve alert routing based on more detailed labels.
Example:
Set `--metric-labels-append=regex=meta.example.com/(.*),product` and get:
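(A purely hypothetical sketch of the resulting series, assuming the flag appends any resource label whose key matches the regex; here a pod label `meta.example.com/product: checkout` would be exposed as a `product` label on that pod's metrics.)

```
# Hypothetical output; series names and values are illustrative.
kube_pod_info{namespace="default", pod="checkout-7d9f", product="checkout"} 1
kube_pod_status_phase{namespace="default", pod="checkout-7d9f", phase="Running", product="checkout"} 1
```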
Proposed phases
- `replacement` mechanism when `regex` is used, similar to Prometheus relabelings. This should allow us to rename matched labels.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
Increases. Adding labels to existing metrics.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes prometheus-operator/kube-prometheus#887
Fixes #536