
Conversation

@esensar
Contributor

@esensar esensar commented Dec 21, 2024

Summary

This adds a separate task that runs periodically to emit utilization metrics and collect messages from components that need their utilization metrics calculated. This ensures that the utilization metric is published even when no events are flowing through a component.
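
As a rough sketch of the approach (all names below are hypothetical, not Vector's actual types): the emitter drains timing reports from components over a channel and publishes a utilization value for every known component on a fixed interval, whether or not any reports arrived.

use std::collections::HashMap;
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::{interval, MissedTickBehavior};

// Hypothetical report a component sends whenever it finishes a unit of work.
struct UtilizationReport {
    component_id: String,
    busy: Duration,
}

// Periodic emitter: accumulates busy time per component and publishes a
// utilization value on every tick, even for components that sent nothing.
async fn run_utilization_emitter(mut reports: mpsc::Receiver<UtilizationReport>) {
    let mut busy: HashMap<String, Duration> = HashMap::new();
    let mut ticker = interval(Duration::from_secs(5));
    ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);

    loop {
        tokio::select! {
            maybe_report = reports.recv() => match maybe_report {
                Some(report) => {
                    *busy.entry(report.component_id).or_default() += report.busy;
                }
                // All senders dropped: the topology is shutting down.
                None => break,
            },
            _ = ticker.tick() => {
                for (component_id, busy_time) in busy.iter_mut() {
                    let utilization = busy_time.as_secs_f64() / 5.0;
                    // In Vector this would update the `utilization` gauge with the
                    // component's labels; printing stands in for that here.
                    println!("utilization {component_id}={utilization:.4}");
                    *busy_time = Duration::ZERO;
                }
            }
        }
    }
}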

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Ran Vector with internal metrics and observed that utilization was updated every ~5 seconds, instead of only when events were flowing.

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • Please read our Vector contributor resources.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References



fix(utilization_metric): run a separate task for utilization to ensure it is regularly published

This adds a separate task that runs periodically to emit utilization metrics and collect messages
from components that need their utilization metrics calculated. This ensures that the utilization
metric is published even when no events are flowing through a component.

Fixes: vectordotdev#20216
@github-actions github-actions bot added the domain: topology Anything related to Vector's topology code label Dec 21, 2024
@esensar
Contributor Author

esensar commented Dec 21, 2024

I have left this as a draft, since I am not sure how to handle shutdown (which shutdown signal to use) or how to name the task (or maybe it should be run in a completely different way, so it is not mixed up with component tasks).

Also, the gauge is passed into the timer instead of using the macro inside the timer, to ensure that the correct labels are inherited from the tracing context.
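
Roughly, the pattern is the following (a sketch assuming the metrics crate's handle API and a tracing-aware recorder layer such as metrics-tracing-context; names are illustrative, not the PR's actual code):

use metrics::{gauge, Gauge};
use tracing::info_span;

// Hypothetical timer that owns a pre-registered gauge handle.
struct UtilizationTimer {
    gauge: Gauge,
}

impl UtilizationTimer {
    fn report(&self, utilization: f64) {
        // The handle already carries the labels captured at registration time,
        // so updating it from the shared emitter task keeps the right component labels.
        self.gauge.set(utilization);
    }
}

fn timer_for_component() -> UtilizationTimer {
    // Registering the gauge while the component's span is entered lets a
    // tracing-context layer attach the span's fields (component_id, etc.) as
    // labels. Calling the gauge! macro later, from inside the shared emitter
    // task, would not have those fields in scope.
    let span = info_span!("component", component_id = "console_http");
    let _enter = span.enter();
    UtilizationTimer { gauge: gauge!("utilization") }
}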

@pront pront self-assigned this Jan 2, 2025
@esensar
Contributor Author

esensar commented Jan 9, 2025

@pront
Any suggestions for running this separate task? It is currently started as follows:

running_topology.utilization_task =
    // TODO: how to name this custom task?
    Some(tokio::spawn(Task::new("".into(), "", async move {
        utilization_emitter
            .run_utilization(ShutdownSignal::noop())
            .await;
        // TODO: new task output type for this? Or handle this task in a completely
        // different way
        Ok(TaskOutput::Healthcheck)
    })));

I am not sure how to pass the shutdown signal to it (or whether I should do it at all; it made sense to me, but I might have misunderstood some part of the topology). Also, I currently create a task with an empty name, but maybe it would make more sense to run it in a different way compared to other tasks?

@pront
Member

pront commented Jan 9, 2025

Hi @esensar,

This is a complex change, so I checked out this PR to do some testing.

config:

api:
  enabled: true

sources:
  internal_metrics_1:
    type: internal_metrics

transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization"

sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: json
      json:
        pretty: true

Sample output:

/Users/pavlos.rontidis/.cargo/bin/cargo run --color=always --profile dev -- --config /Users/pavlos.rontidis/CLionProjects/vector/pront/configs/internal_metrics.yaml
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.70s
     Running `target/debug/vector --config /Users/pavlos.rontidis/CLionProjects/vector/pront/configs/internal_metrics.yaml`
2025-01-09T20:46:27.736727Z  INFO vector::app: Log level is enabled. level="info"
2025-01-09T20:46:27.741218Z  INFO vector::app: Loading configs. paths=["/Users/pavlos.rontidis/CLionProjects/vector/pront/configs/internal_metrics.yaml"]
2025-01-09T20:46:27.766384Z  INFO vector::topology::running: Running healthchecks.
2025-01-09T20:46:27.767489Z  INFO vector::topology::builder: Healthcheck passed.
2025-01-09T20:46:27.769222Z  INFO vector: Vector has started. debug="true" version="0.44.0" arch="aarch64" revision=""
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "console",
    "component_kind": "sink",
    "component_type": "console",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:27.770905Z",
  "kind": "absolute",
  "gauge": {
    "value": 1.0
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "filter_utilization",
    "component_kind": "transform",
    "component_type": "filter",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:27.770905Z",
  "kind": "absolute",
  "gauge": {
    "value": 1.0
  }
}
2025-01-09T20:46:27.777873Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=http://127.0.0.1:8686/playground graphql=http://127.0.0.1:8686/graphql
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "filter_utilization",
    "component_kind": "transform",
    "component_type": "filter",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:37.771882Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.010011816446046937
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "console",
    "component_kind": "sink",
    "component_type": "console",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:37.771882Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.01004418815411736
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "filter_utilization",
    "component_kind": "transform",
    "component_type": "filter",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:47.771505Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.0001184493997704478
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "console",
    "component_kind": "sink",
    "component_type": "console",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:47.771505Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.00010693227629135064
  }
}
...

Leaving this here as context. Will follow up with more questions.

@pront
Member

pront commented Jan 9, 2025

cc @lukesteensen (just in case you are interested in this one)

@esensar esensar requested a review from pront January 15, 2025 17:14
@pront pront marked this pull request as ready for review January 15, 2025 19:07
@pront pront requested a review from a team as a code owner January 15, 2025 19:07
@esensar esensar changed the title fix(utilization_metric): run a separate task for utilization to ensure it is regularly published fix(metrics): run a separate task for utilization metric to ensure it is regularly published Jan 17, 2025
@esensar esensar changed the title fix(metrics): run a separate task for utilization metric to ensure it is regularly published fix(metrics): run a separate task for utilization metric to ensure it is regularly updated Jan 17, 2025
@esensar
Contributor Author

esensar commented Jan 20, 2025

I haven't been able to figure out what causes these component validation tests to get stuck when stopping the topology. I can see that the utilization task stops properly, but sink tasks get stuck for some reason :/

@pront
Member

pront commented Jan 24, 2025

I haven't been able to figure out what causes these component validation tests to get stuck when stopping the topology. I can see that the utilization task stops properly, but sink tasks get stuck for some reason :/

I haven't had time to take a look at this yet, but I wouldn't be surprised if the validation framework also needs changes.

@esensar
Contributor Author

esensar commented Jan 27, 2025

This was very weird to fix. I managed to get it to work by returning the original IntervalStream to the Utilization wrapper and just polling it and ignoring its results.

let _ = this.intervals.poll_next_unpin(cx);

I have no idea why this worked, but I hope it helps you understand the issue better, @pront. Sorry about this; I don't really understand Rust streams that well.
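
For context, this is roughly the shape being described (a sketch with assumed names, not the actual Vector code). Polling the inner IntervalStream from the wrapper's poll_next likely keeps the interval's waker registered with the wrapper's task, which would explain why the task keeps getting woken and no longer stalls.

use std::pin::Pin;
use std::task::{Context, Poll};

use futures::{Stream, StreamExt};
use tokio_stream::wrappers::IntervalStream;

// Hypothetical utilization wrapper around a component's event stream.
struct Utilization<S> {
    intervals: IntervalStream,
    inner: S,
}

impl<S: Stream + Unpin> Stream for Utilization<S> {
    type Item = S::Item;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.get_mut();
        // Poll the ticker and discard its output; this keeps its waker registered
        // so the task is woken on the next tick even when no events arrive.
        let _ = this.intervals.poll_next_unpin(cx);
        this.inner.poll_next_unpin(cx)
    }
}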

@esensar
Contributor Author

esensar commented Mar 6, 2025

Hi @pront , does this look alright? Does it need further changes? My latest change has fixed issues with tests, but it is not ideal 😄

@pront
Member

pront commented Mar 11, 2025

Hi @pront , does this look alright? Does it need further changes? My latest change has fixed issues with tests, but it is not ideal 😄

Hi @esensar, this is a complex change and I need more time to dive into the details. I plan to review this before the next release. In the meantime, if you had ideas for further improvements, feel free to update the PR.

@esensar
Contributor Author

esensar commented Mar 11, 2025

Hi @pront , does this look alright? Does it need further changes? My latest change has fixed issues with tests, but it is not ideal 😄

Hi @esensar, this is a complex change and I need more time to dive into the details. I plan to review this before the next release. In the meantime, if you had ideas for further improvements, feel free to update the PR.

The only thing left to reconsider is the latest change I made, but I haven't really had time to figure out an alternative. If I get some time, I will try to clean that part up.

@johnhtodd

Hello Vector dev folks, can this have a once-over sometime? We're still sometimes very stumped by strange results in our utilization graphs that cause our monitoring system to have severe indigestion.

@pront
Member

pront commented Jun 10, 2025

Hi and apologies for the delay on this PR. Can you please do a rebase and ensure the checks are passing? I will review shortly after.

@thomasqueirozb
Contributor

How did you verify this change locally? I wanted to run my own set of tests as well before we merge this one

@esensar
Contributor Author

esensar commented Jun 20, 2025

How did you verify this change locally? I wanted to run my own set of tests as well before we merge this one

Hmm, I can't find the exact configuration I used when I originally tested this. Let me try to set something up again; the general idea was to have a sink connected to a source that I can easily control (send data to it manually, so that I can stop sending and still see the utilization metric get published).

@esensar
Contributor Author

esensar commented Jun 20, 2025

I have tested it with this now:

api:
  enabled: true

sources:
  internal_metrics_1:
    type: internal_metrics

  http:
    type: http_server
    address: 0.0.0.0:59001
    encoding: "text"
    headers: 
      - User-Agent


transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization"

sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: json

  console_http:
    inputs: ["http"]
    type: console
    encoding:
      codec: json

The console sink fed by the utilization metrics produces a bit too much output, but it can be seen that the value changes every 5 seconds for console_http, regardless of whether data is passing through it.

I sent data to it using:

curl -X POST localhost:59001 -d "test"

@github-actions github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Jun 20, 2025
@esensar esensar requested a review from thomasqueirozb June 23, 2025 07:38
@thomasqueirozb
Contributor

I'm trying to run your config from both master and your branch and then checking for changes. I'm seeing utilization metrics coming through for "component_id":"console_http" in both versions and the value is also being changed despite no data being sent to the source. Can you run the same test I'm running and tell me what I should expect to see differently?

@esensar
Contributor Author

esensar commented Jun 23, 2025

I'm trying to run your config from both master and your branch and then checking for changes. I'm seeing utilization metrics coming through for "component_id":"console_http" in both versions and the value is also being changed despite no data being sent to the source. Can you run the same test I'm running and tell me what I should expect to see differently?

Right, I just tested it myself and it worked fine, my bad. I guess it depends on the kind of component used. Here is a configuration that doesn't get updated on master:

api:
  enabled: true

sources:
  internal_metrics_1:
    type: internal_metrics

  http:
    type: http_server
    address: 0.0.0.0:59001
    encoding: "text"
    headers: 
      - User-Agent


transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization" && .tags.component_id == "remap_http" # Just to reduce noise slightly

  remap_http:
    type: remap
    inputs: ["http"]
    source: .test = "test"

sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: json

  console_http:
    inputs: ["remap_http"]
    type: console
    encoding:
      codec: json

@pront
Member

pront commented Jun 23, 2025

I'm trying to run your config from both master and your branch and then checking for changes. I'm seeing utilization metrics coming through for "component_id":"console_http" in both versions and the value is also being changed despite no data being sent to the source. Can you run the same test I'm running and tell me what I should expect to see differently?

Right, I just tested it myself and it worked fine, my bad. I guess it depends on the kind of component used. Here is a configuration that doesn't get updated on master:

This is interesting. Do we understand the root cause here?

@esensar
Contributor Author

esensar commented Jun 24, 2025

I'm trying to run your config from both master and your branch and then checking for changes. I'm seeing utilization metrics coming through for "component_id":"console_http" in both versions and the value is also being changed despite no data being sent to the source. Can you run the same test I'm running and tell me what I should expect to see differently?

Right, I just tested it myself and it worked fine, my bad. I guess it depends on the kind of component used. Here is a configuration that doesn't get updated on master:

This is interesting. Do we understand the root cause here?

I am not sure why it worked with the console sink. I can't remember which components I originally tested with, but I always had the issue of utilization not being updated when data is not passing through. Looking at the console sink, it doesn't seem to have any special behavior that would make it behave differently.

@esensar esensar requested a review from pront June 27, 2025 07:50
@pront
Member

pront commented Jul 2, 2025

I am not sure why it worked with the console sink. I can't remember which components I originally tested with, but I always had the issue of utilization not being updated when data is not passing through. Looking at the console sink, it doesn't seem to have any special behavior that would make it behave differently.

Hi @esensar, I ran your config:

api:
  enabled: true

sources:
  internal_metrics_1:
    type: internal_metrics
    scrape_interval_secs: 5

  http:
    type: http_server
    address: 0.0.0.0:59001
    encoding: "text"
    headers:
      - User-Agent


transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization" && .tags.component_id == "remap_http" # Just to reduce noise slightly

  remap_http:
    type: remap
    inputs: ["http"]
    source: .test = "test"

sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: "json"
      json:
        pretty: true

  console_http:
    inputs: ["remap_http"]
    type: console
    encoding:
      codec: json

and then:

./repeat_command.sh curl -X POST localhost:59001 -d "test"

And I noticed that the master branch and this PR behave differently. On master the metric keeps the old value, but with this PR utilization drops when I stop publishing events, which is the desired behavior.

@pront pront enabled auto-merge July 2, 2025 18:15
Member

@pront pront left a comment

Thank you!

@pront pront added this pull request to the merge queue Jul 2, 2025
auto-merge was automatically disabled July 2, 2025 19:37

Pull Request is not mergeable

Merged via the queue into vectordotdev:master with commit c4ace5b Jul 2, 2025
71 checks passed
thomasqueirozb added a commit that referenced this pull request Oct 23, 2025
Revert "fix(metrics): run a separate task for utilization metric to ensure it is regularly updated (#22070)"

This reverts commit c4ace5b.

Labels

domain: topology Anything related to Vector's topology code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prometheus stats "stuck" on last value seen for transforms using aggregations (vector_utilization)

4 participants