@aarogoss aarogoss commented Jun 9, 2025

What this PR does / why we need it:
Rather than relying on recording rules to monitor distributor lag metrics, this PR creates a new Prometheus counter in the push.go module.

This counter allows us to track the time difference between when a distributor receives a log push request and the most recent log timestamp in the ingestion payload.

This difference shows how long before receipt the newest logs in the request were captured, which gives us insight into distributor "lag". If the counter's values remain steady or increase over time, we know the ingestion agents are falling behind and will eventually start dropping logs.

The counter carries an additional label, "userAgent", whose value is extracted from the HTTP request and shows which ingestion agents a particular tenant is using. If incoming log ingestion starts to fall behind, we can use this label to give customers instructions for adjusting that specific agent's configuration.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

aarogoss added 2 commits June 9, 2025 08:10
@aarogoss aarogoss marked this pull request as ready for review June 9, 2025 22:31
@aarogoss aarogoss requested a review from a team as a code owner June 9, 2025 22:31
@aarogoss aarogoss enabled auto-merge (squash) June 10, 2025 15:22
@aarogoss aarogoss disabled auto-merge June 10, 2025 15:22
@aarogoss aarogoss merged commit 6495be0 into main Jun 10, 2025
65 checks passed
@aarogoss aarogoss deleted the agoss/add-distributor-lag-metric branch June 10, 2025 15:22