Closed
Description
Describe the bug
Store-gateway memory consumption and /metrics response time both increase the longer the pod stays running.
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (master@090988c40f3eec21623713dd4403b3bbd46175c6)
- Run store-gateway in shuffle sharding mode
- Ingest data for multiple tenants (upwards of 4000)
- Call /metrics on store-gateway
Expected behavior
- I expect the memory usage to be constant over time, and /metrics response time to stay the same
Environment:
- Infrastructure: kubernetes
- Deployment tool: helm
Storage Engine
- Blocks
- Chunks
Additional Context
What I think is happening:
- we are leaking UserRegistry when store-gateway syncs blocks
- when store-gateway closes a block, it will try to remove the per-user metrics
- when the removal happens, we check whether to keep a snapshot of the metric or not, based on metric type
- these snapshots are never released from memory
- we aggregate all the snapshots every time /metrics is called
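To make the suspected pattern concrete, here is a minimal, self-contained Go sketch. It is not the Cortex code: the names leakyRegistries, snapshot, add, softRemove and gather are hypothetical, and metrics are modelled as plain counter maps. It only illustrates how keeping one snapshot per removal makes both memory and per-scrape work grow with the number of syncs.

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for the last gathered metrics of one
// per-user registry (metric name -> counter value).
type snapshot map[string]float64

// entry models one per-user registry (assumption: simplified from the issue text).
type entry struct {
	user    string
	metrics snapshot
	removed bool // true once the registry was soft-removed
}

// leakyRegistries keeps every entry it has ever seen.
type leakyRegistries struct {
	entries []entry
}

// add registers a new per-user registry.
func (r *leakyRegistries) add(user string, s snapshot) {
	r.entries = append(r.entries, entry{user: user, metrics: s})
}

// softRemove marks the user's live entry as removed but keeps its snapshot,
// so that summed counters never go down. Nothing ever deletes the entry.
func (r *leakyRegistries) softRemove(user string) {
	for i := range r.entries {
		if r.entries[i].user == user && !r.entries[i].removed {
			r.entries[i].removed = true
		}
	}
}

// gather sums every entry, live or removed, on each call. Its cost grows
// with the total number of removals, not with the number of live users.
func (r *leakyRegistries) gather() snapshot {
	out := snapshot{}
	for _, e := range r.entries {
		for name, v := range e.metrics {
			out[name] += v
		}
	}
	return out
}

func main() {
	r := &leakyRegistries{}
	// Simulate one tenant whose blocks are opened and closed on each sync.
	for i := 0; i < 1000; i++ {
		r.add("tenant-a", snapshot{"blocks_loaded_total": 1})
		r.softRemove("tenant-a")
	}
	fmt.Println("entries retained:", len(r.entries))
	fmt.Println("counter value:", r.gather()["blocks_loaded_total"])
}
```

In this model the counter stays correct, but the slice of retained entries never shrinks, which matches the symptom of memory and /metrics latency growing with pod uptime.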
Potential solution:
Do you think it makes sense to keep a single "global expired metric" that is updated every time we sync the blocks?
We could periodically aggregate all the metrics that would otherwise have been dropped, instead of keeping every snapshot and recalculating the metrics on each call. I would be happy to produce a PR if this solution works for you.
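A rough Go sketch of this idea follows, under stated assumptions: foldingRegistries, counters, add, softRemove and gather are hypothetical names, metrics are modelled as plain counter maps, and the fold happens at removal time rather than on a periodic timer. It is one possible shape for the fix, not the actual Cortex implementation.

```go
package main

import "fmt"

// counters is a hypothetical stand-in for gathered counter metrics
// (metric name -> value).
type counters map[string]float64

// foldingRegistries keeps one live entry per user plus a single running
// total for everything that has been removed.
type foldingRegistries struct {
	live    map[string]counters
	expired counters // the "global expired metric"
}

func newFoldingRegistries() *foldingRegistries {
	return &foldingRegistries{live: map[string]counters{}, expired: counters{}}
}

// add registers (or replaces) the live registry for a user.
func (r *foldingRegistries) add(user string, c counters) {
	r.live[user] = c
}

// softRemove folds the user's final values into the expired total and then
// drops the per-user entry, so memory stays bounded by the live users.
func (r *foldingRegistries) softRemove(user string) {
	c, ok := r.live[user]
	if !ok {
		return
	}
	for name, v := range c {
		r.expired[name] += v
	}
	delete(r.live, user)
}

// gather sums the expired total and the live registries only.
func (r *foldingRegistries) gather() counters {
	out := counters{}
	for name, v := range r.expired {
		out[name] = v
	}
	for _, c := range r.live {
		for name, v := range c {
			out[name] += v
		}
	}
	return out
}

func main() {
	r := newFoldingRegistries()
	for i := 0; i < 1000; i++ {
		r.add("tenant-a", counters{"blocks_loaded_total": 1})
		r.softRemove("tenant-a")
	}
	fmt.Println("live entries:", len(r.live))
	fmt.Println("counter value:", r.gather()["blocks_loaded_total"])
}
```

Folding on removal keeps the summed counters monotonic while the work done per /metrics call depends only on the number of live users. A periodic variant, as suggested above, would batch the fold instead of doing it on each removal.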