fix(metrics): run a separate task for utilization metric to ensure it is regularly updated #22070
Conversation
…e it is regularly published

This adds a separate task that runs periodically to emit utilization metrics and collect messages from components that need their utilization metrics calculated. This ensures that the utilization metric is published even when no events are running through a component.

Fixes: vectordotdev#20216
|
I have left this as a draft, since I am not sure how to handle shutdown (which shutdown signal to use) or how to name the task (or maybe it should be run in a completely different way, so it is not mixed up with components). Also, the gauge is passed into the timer instead of using the macro inside the timer, to ensure that the correct labels are inherited from the tracing context. |
|
@pront

```rust
running_topology.utilization_task =
    // TODO: how to name this custom task?
    Some(tokio::spawn(Task::new("".into(), "", async move {
        utilization_emitter
            .run_utilization(ShutdownSignal::noop())
            .await;
        // TODO: new task output type for this? Or handle this task in a completely
        // different way
        Ok(TaskOutput::Healthcheck)
    })));
```

I am not sure how to pass the shutdown signal to it (and whether I should do it at all; it made sense to me, but I might have misunderstood some part of the topology). Also, I currently create a task with an empty name, but maybe it would make more sense to run it in a different way compared to other tasks? |
|
Hi @esensar, This is a complex change so I checked out this PR to do some testing; config: Sample output: Leaving this here as context. Will follow up with more questions. |
|
cc @lukesteensen (just in case you are interested in this one) |
|
I haven't been able to figure out what causes these component validation tests to get stuck when stopping the topology. I can see that the utilization task stops properly, but sink tasks get stuck for some reason :/ |
I didn't have time to take a look at this yet. But I wouldn't be surprised if the validation framework also needs changes. |
|
This was very weird to fix. I managed to get it to work by returning that original

```rust
let _ = this.intervals.poll_next_unpin(cx);
```

I have no idea why this worked, but I hope that will help you understand the issue better @pront. Sorry about this, I don't really understand Rust streams that well. |
|
Hi @pront , does this look alright? Does it need further changes? My latest change has fixed issues with tests, but it is not ideal 😄 |
Hi @esensar, this is a complex change and I need more time to dive into the details. I plan to review this before the next release. In the meantime, if you had ideas for further improvements, feel free to update the PR. |
The only thing left to reconsider is the latest change I made, but I haven't really had time to figure out an alternative. If I get some time, I will try to clean that part up. |
|
Hello Vector Dev folks - can this have a once-over sometime? We're still sometimes very stumped with strange results in our utilization graphs that cause our monitoring system to have severe indigestion. |
|
Hi and apologies for the delay on this PR. Can you please do a rebase and ensure the checks are passing? I will review shortly after. |
|
How did you verify this change locally? I wanted to run my own set of tests as well before we merge this one. |
Hmm, I can't find the exact configuration I used back when I originally tested it. Let me try to set something up again; the general idea was to have a sink connected to a source that I can easily control (send data to manually, so that I can stop sending and still see the utilization metric get published). |
|
I have tested it with this now:

```yaml
api:
  enabled: true
sources:
  internal_metrics_1:
    type: internal_metrics
  http:
    type: http_server
    address: 0.0.0.0:59001
    encoding: "text"
    headers:
      - User-Agent
transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization"
sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: json
  console_http:
    inputs: ["http"]
    type: console
    encoding:
      codec: json
```

The console sink based on utilization metrics produces a bit too much output, but it can be seen that the value is changed every 5 seconds for … I sent data to it using: |
|
I'm trying to run your config from both |
Right, I just tested it myself and it worked fine, my bad. I guess it depends on the kind of component used. Here is a configuration that doesn't get updated on …

```yaml
api:
  enabled: true
sources:
  internal_metrics_1:
    type: internal_metrics
  http:
    type: http_server
    address: 0.0.0.0:59001
    encoding: "text"
    headers:
      - User-Agent
transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization" && .tags.component_id == "remap_http" # Just to reduce noise slightly
  remap_http:
    type: remap
    inputs: ["http"]
    source: .test = "test"
sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: json
  console_http:
    inputs: ["remap_http"]
    type: console
    encoding:
      codec: json
```
|
This is interesting. Do we understand the root cause here? |
I am not sure why it worked with console sink. I can't remember what components I tested with originally, but I always had the issue of the utilization not being updated if data is not passing through. Looking at |
Hi @esensar, I ran your config:

```yaml
api:
  enabled: true
sources:
  internal_metrics_1:
    type: internal_metrics
    scrape_interval_secs: 5
  http:
    type: http_server
    address: 0.0.0.0:59001
    encoding: "text"
    headers:
      - User-Agent
transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization" && .tags.component_id == "remap_http" # Just to reduce noise slightly
  remap_http:
    type: remap
    inputs: ["http"]
    source: .test = "test"
sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: "json"
      json:
        pretty: true
  console_http:
    inputs: ["remap_http"]
    type: console
    encoding:
      codec: json
```

and then:

```shell
./repeat_command.sh curl -X POST localhost:59001 -d "test"
```

And I noticed that the master branch and this PR have different behavior. On master it keeps the old value, but on your PR the utilization drops when I stop publishing events, which is the desired behavior. |
Thank you!
Summary
This adds a separate task that runs periodically to emit utilization metrics and collect messages from components that need their utilization metrics calculated. This ensures that the utilization metric is published even when no events are running through a component.
Change Type
Is this a breaking change?
How did you test this PR?
Ran vector with internal metrics and observed that utilization was updated every ~5 seconds, instead of only when events are running.
Does this PR include user facing changes?
Checklist
… (Cargo.lock), please run `dd-rust-license-tool write` to regenerate the license inventory and commit the changes (if any). More details here.
References