
Conversation

@esensar
Contributor

@esensar esensar commented Dec 21, 2024

Summary

This adds a separate task that runs periodically to emit utilization metrics and collect messages from components that need their utilization metrics calculated. This ensures that the utilization metric is published even when no events are flowing through a component.
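
As a rough sketch of the approach (all names below are hypothetical, not Vector's actual types): the emitter drains timing reports from components over a channel and publishes a utilization value for every known component on a fixed interval, whether or not any reports arrived.

use std::collections::HashMap;
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::{interval, MissedTickBehavior};

// Hypothetical report a component sends whenever it finishes a unit of work.
struct UtilizationReport {
    component_id: String,
    busy: Duration,
}

// Periodic emitter: accumulates busy time per component and publishes a
// utilization value on every tick, even for components that sent nothing.
async fn run_utilization_emitter(mut reports: mpsc::Receiver<UtilizationReport>) {
    let mut busy: HashMap<String, Duration> = HashMap::new();
    let mut ticker = interval(Duration::from_secs(5));
    ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);

    loop {
        tokio::select! {
            maybe_report = reports.recv() => match maybe_report {
                Some(report) => {
                    *busy.entry(report.component_id).or_default() += report.busy;
                }
                // All senders dropped: the topology is shutting down.
                None => break,
            },
            _ = ticker.tick() => {
                for (component_id, busy_time) in busy.iter_mut() {
                    let utilization = busy_time.as_secs_f64() / 5.0;
                    // In Vector this would update the `utilization` gauge with the
                    // component's labels; printing stands in for that here.
                    println!("utilization {component_id}={utilization:.4}");
                    *busy_time = Duration::ZERO;
                }
            }
        }
    }
}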

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Ran Vector with internal metrics and observed that utilization was updated every ~5 seconds, instead of only when events were flowing.

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • Please read our Vector contributor resources.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References



fix(utilization_metric): run a separate task for utilization to ensure it is regularly published

This adds a separate task that runs periodically to emit utilization metrics and collect messages
from components that need their utilization metrics calculated. This ensures that the utilization
metric is published even when no events are flowing through a component.

Fixes: vectordotdev#20216
@github-actions github-actions bot added the domain: topology Anything related to Vector's topology code label Dec 21, 2024
@esensar
Contributor Author

esensar commented Dec 21, 2024

I have left this as a draft, since I am not sure how to handle shutdown (which shutdown signal to use) or how to name the task (or maybe it should be run in a completely different way, so it is not mixed up with component tasks).

Also, the gauge is passed into the timer instead of using the macro inside the timer, to ensure that the correct labels are inherited from the tracing context.
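
Roughly, the pattern is the following (a sketch assuming the metrics crate's handle API and a tracing-aware recorder layer such as metrics-tracing-context; names are illustrative, not the PR's actual code):

use metrics::{gauge, Gauge};
use tracing::info_span;

// Hypothetical timer that owns a pre-registered gauge handle.
struct UtilizationTimer {
    gauge: Gauge,
}

impl UtilizationTimer {
    fn report(&self, utilization: f64) {
        // The handle already carries the labels captured at registration time,
        // so updating it from the shared emitter task keeps the right component labels.
        self.gauge.set(utilization);
    }
}

fn timer_for_component() -> UtilizationTimer {
    // Registering the gauge while the component's span is entered lets a
    // tracing-context layer attach the span's fields (component_id, etc.) as
    // labels. Calling the gauge! macro later, from inside the shared emitter
    // task, would not have those fields in scope.
    let span = info_span!("component", component_id = "console_http");
    let _enter = span.enter();
    UtilizationTimer { gauge: gauge!("utilization") }
}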

@pront pront self-assigned this Jan 2, 2025
@esensar
Contributor Author

esensar commented Jan 9, 2025

@pront
Any suggestions for running this separate task? It is currently started as follows:

running_topology.utilization_task =
    // TODO: how to name this custom task?
    Some(tokio::spawn(Task::new("".into(), "", async move {
        utilization_emitter
            .run_utilization(ShutdownSignal::noop())
            .await;
        // TODO: new task output type for this? Or handle this task in a completely
        // different way
        Ok(TaskOutput::Healthcheck)
    })));

I am not sure how to pass the shutdown signal to it (or whether I should do it at all; it made sense to me, but I might have misunderstood some part of the topology). Also, I currently create a task with an empty name, but maybe it would make more sense to run it in a different way compared to other tasks?

@pront
Member

pront commented Jan 9, 2025

Hi @esensar,

This is a complex change, so I checked out this PR to do some testing.

config:

api:
  enabled: true

sources:
  internal_metrics_1:
    type: internal_metrics

transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization"

sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: json
      json:
        pretty: true

Sample output:

/Users/pavlos.rontidis/.cargo/bin/cargo run --color=always --profile dev -- --config /Users/pavlos.rontidis/CLionProjects/vector/pront/configs/internal_metrics.yaml
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.70s
     Running `target/debug/vector --config /Users/pavlos.rontidis/CLionProjects/vector/pront/configs/internal_metrics.yaml`
2025-01-09T20:46:27.736727Z  INFO vector::app: Log level is enabled. level="info"
2025-01-09T20:46:27.741218Z  INFO vector::app: Loading configs. paths=["/Users/pavlos.rontidis/CLionProjects/vector/pront/configs/internal_metrics.yaml"]
2025-01-09T20:46:27.766384Z  INFO vector::topology::running: Running healthchecks.
2025-01-09T20:46:27.767489Z  INFO vector::topology::builder: Healthcheck passed.
2025-01-09T20:46:27.769222Z  INFO vector: Vector has started. debug="true" version="0.44.0" arch="aarch64" revision=""
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "console",
    "component_kind": "sink",
    "component_type": "console",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:27.770905Z",
  "kind": "absolute",
  "gauge": {
    "value": 1.0
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "filter_utilization",
    "component_kind": "transform",
    "component_type": "filter",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:27.770905Z",
  "kind": "absolute",
  "gauge": {
    "value": 1.0
  }
}
2025-01-09T20:46:27.777873Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=http://127.0.0.1:8686/playground graphql=http://127.0.0.1:8686/graphql
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "filter_utilization",
    "component_kind": "transform",
    "component_type": "filter",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:37.771882Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.010011816446046937
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "console",
    "component_kind": "sink",
    "component_type": "console",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:37.771882Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.01004418815411736
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "filter_utilization",
    "component_kind": "transform",
    "component_type": "filter",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:47.771505Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.0001184493997704478
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "console",
    "component_kind": "sink",
    "component_type": "console",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:47.771505Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.00010693227629135064
  }
}
...

Leaving this here as context. Will follow up with more questions.

@pront
Member

pront commented Jan 9, 2025

cc @lukesteensen (just in case you are interested in this one)

@esensar esensar requested a review from pront January 15, 2025 17:14
@pront pront marked this pull request as ready for review January 15, 2025 19:07
@pront pront requested a review from a team as a code owner January 15, 2025 19:07
@esensar esensar changed the title fix(utilization_metric): run a separate task for utilization to ensure it is regularly published fix(metrics): run a separate task for utilization metric to ensure it is regularly published Jan 17, 2025
@esensar esensar changed the title fix(metrics): run a separate task for utilization metric to ensure it is regularly published fix(metrics): run a separate task for utilization metric to ensure it is regularly updated Jan 17, 2025
@esensar
Contributor Author

esensar commented Jan 20, 2025

I haven't been able to figure out what causes these component validation tests to get stuck when stopping the topology. I can see that the utilization task stops properly, but sink tasks get stuck for some reason :/

@pront
Member

pront commented Jan 24, 2025

I haven't been able to figure out what causes these component validation tests to get stuck when stopping the topology. I can see that the utilization task stops properly, but sink tasks get stuck for some reason :/

I haven't had time to take a look at this yet, but I wouldn't be surprised if the validation framework also needs changes.

@esensar
Contributor Author

esensar commented Jan 27, 2025

This was very weird to fix. I managed to get it to work by returning the original IntervalStream to the Utilization wrapper and just polling it and ignoring its results.

let _ = this.intervals.poll_next_unpin(cx);

I have no idea why this worked, but I hope it helps you understand the issue better, @pront. Sorry about this; I don't really understand Rust streams that well.
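
For context, this is roughly the shape being described (a sketch with assumed names, not the actual Vector code). Polling the inner IntervalStream from the wrapper's poll_next likely keeps the interval's waker registered with the wrapper's task, which would explain why the task keeps getting woken and no longer stalls.

use std::pin::Pin;
use std::task::{Context, Poll};

use futures::{Stream, StreamExt};
use tokio_stream::wrappers::IntervalStream;

// Hypothetical utilization wrapper around a component's event stream.
struct Utilization<S> {
    intervals: IntervalStream,
    inner: S,
}

impl<S: Stream + Unpin> Stream for Utilization<S> {
    type Item = S::Item;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.get_mut();
        // Poll the ticker and discard its output; this keeps its waker registered
        // so the task is woken on the next tick even when no events arrive.
        let _ = this.intervals.poll_next_unpin(cx);
        this.inner.poll_next_unpin(cx)
    }
}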

@esensar
Contributor Author

esensar commented Mar 6, 2025

Hi @pront , does this look alright? Does it need further changes? My latest change has fixed issues with tests, but it is not ideal 😄

@pront
Member

pront commented Mar 11, 2025

Hi @pront , does this look alright? Does it need further changes? My latest change has fixed issues with tests, but it is not ideal 😄

Hi @esensar, this is a complex change and I need more time to dive into the details. I plan to review this before the next release. In the meantime, if you had ideas for further improvements, feel free to update the PR.

@esensar
Contributor Author

esensar commented Mar 11, 2025

Hi @pront , does this look alright? Does it need further changes? My latest change has fixed issues with tests, but it is not ideal 😄

Hi @esensar, this is a complex change and I need more time to dive into the details. I plan to review this before the next release. In the meantime, if you had ideas for further improvements, feel free to update the PR.

The only thing left to reconsider is the latest change I made, but I haven't really had time to figure out an alternative. If I get some time, I will try to clean that part up.

@johnhtodd

Hello Vector dev folks, can this have a once-over sometime? We're still sometimes very stumped by strange results in our utilization graphs that cause our monitoring system to have severe indigestion.

@pront
Member

pront commented Jun 10, 2025

Hi and apologies for the delay on this PR. Can you please do a rebase and ensure the checks are passing? I will review shortly after.

@thomasqueirozb
Contributor

How did you verify this change locally? I wanted to run my own set of tests as well before we merge this one

@esensar
Contributor Author

esensar commented Jun 20, 2025

How did you verify this change locally? I wanted to run my own set of tests as well before we merge this one

Hmm, I can't find the exact configuration I used when I originally tested this. Let me try to set something up again; the general idea was to have a sink connected to a source that I can easily control (send data to it manually, so that I can stop sending and still see the utilization metric get published).

@esensar
Contributor Author

esensar commented Jun 20, 2025

I have tested it with this now:

api:
  enabled: true

sources:
  internal_metrics_1:
    type: internal_metrics

  http:
    type: http_server
    address: 0.0.0.0:59001
    encoding: "text"
    headers: 
      - User-Agent


transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization"

sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: json

  console_http:
    inputs: ["http"]
    type: console
    encoding:
      codec: json

The console sink fed by the utilization metrics produces a bit too much output, but it can be seen that the value changes every 5 seconds for console_http, regardless of whether data is passing through it.

I sent data to it using:

curl -X POST localhost:59001 -d "test"

@github-actions github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Jun 20, 2025
@esensar esensar requested a review from thomasqueirozb June 23, 2025 07:38
@thomasqueirozb
Contributor

I'm trying to run your config from both master and your branch and then checking for changes. I'm seeing utilization metrics coming through for "component_id":"console_http" in both versions and the value is also being changed despite no data being sent to the source. Can you run the same test I'm running and tell me what I should expect to see differently?

@esensar
Contributor Author

esensar commented Jun 23, 2025

I'm trying to run your config from both master and your branch and then checking for changes. I'm seeing utilization metrics coming through for "component_id":"console_http" in both versions and the value is also being changed despite no data being sent to the source. Can you run the same test I'm running and tell me what I should expect to see differently?

Right, I just tested it myself and it worked fine, my bad. I guess it depends on the kind of component used. Here is a configuration that doesn't get updated on master:

api:
  enabled: true

sources:
  internal_metrics_1:
    type: internal_metrics

  http:
    type: http_server
    address: 0.0.0.0:59001
    encoding: "text"
    headers: 
      - User-Agent


transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization" && .tags.component_id == "remap_http" # Just to reduce noise slightly

  remap_http:
    type: remap
    inputs: ["http"]
    source: .test = "test"

sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: json

  console_http:
    inputs: ["remap_http"]
    type: console
    encoding:
      codec: json

@pront
Member

pront commented Jun 23, 2025

I'm trying to run your config from both master and your branch and then checking for changes. I'm seeing utilization metrics coming through for "component_id":"console_http" in both versions and the value is also being changed despite no data being sent to the source. Can you run the same test I'm running and tell me what I should expect to see differently?

Right, I just tested it myself and it worked fine, my bad. I guess it depends on the kind of component used. Here is a configuration that doesn't get updated on master:

This is interesting. Do we understand the root cause here?

@esensar
Contributor Author

esensar commented Jun 24, 2025

I'm trying to run your config from both master and your branch and then checking for changes. I'm seeing utilization metrics coming through for "component_id":"console_http" in both versions and the value is also being changed despite no data being sent to the source. Can you run the same test I'm running and tell me what I should expect to see differently?

Right, I just tested it myself and it worked fine, my bad. I guess it depends on the kind of component used. Here is a configuration that doesn't get updated on master:

This is interesting. Do we understand the root cause here?

I am not sure why it worked with the console sink. I can't remember which components I originally tested with, but I always had the issue of utilization not being updated when data is not passing through. Looking at the console sink, it doesn't seem to have any special behavior that would make it behave differently.

@esensar esensar requested a review from pront June 27, 2025 07:50
@pront
Member

pront commented Jul 2, 2025

I am not sure why it worked with the console sink. I can't remember which components I originally tested with, but I always had the issue of utilization not being updated when data is not passing through. Looking at the console sink, it doesn't seem to have any special behavior that would make it behave differently.

Hi @esensar, I ran your config:

api:
  enabled: true

sources:
  internal_metrics_1:
    type: internal_metrics
    scrape_interval_secs: 5

  http:
    type: http_server
    address: 0.0.0.0:59001
    encoding: "text"
    headers:
      - User-Agent


transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization" && .tags.component_id == "remap_http" # Just to reduce noise slightly

  remap_http:
    type: remap
    inputs: ["http"]
    source: .test = "test"

sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: "json"
      json:
        pretty: true

  console_http:
    inputs: ["remap_http"]
    type: console
    encoding:
      codec: json

and then:

./repeat_command.sh curl -X POST localhost:59001 -d "test"

And I noticed that the master branch and this PR behave differently. On master the metric keeps the old value, but with this PR utilization drops when I stop publishing events, which is the desired behavior.

@pront pront enabled auto-merge July 2, 2025 18:15
Member

@pront pront left a comment

Thank you!

@pront pront added this pull request to the merge queue Jul 2, 2025
auto-merge was automatically disabled July 2, 2025 19:37

Pull Request is not mergeable

Merged via the queue into vectordotdev:master with commit c4ace5b Jul 2, 2025
71 checks passed
thomasqueirozb added a commit that referenced this pull request Oct 23, 2025
Revert "fix(metrics): run a separate task for utilization metric to ensure it is regularly updated (#22070)"

This reverts commit c4ace5b.

Labels

domain: topology Anything related to Vector's topology code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prometheus stats "stuck" on last value seen for transforms using aggregations (vector_utilization)

4 participants