Potential memory leak on store-gateway metrics #4451

**Describe the bug**
Store-gateway memory consumption and `/metrics` response time both increase the longer the pod stays running.

**To Reproduce**
Steps to reproduce the behavior:
1. Start Cortex (master@090988c40f3eec21623713dd4403b3bbd46175c6)
2. Run store-gateway in shuffle sharding mode
3. Ingest data for multiple tenants (upwards of 4000)
4. Call `/metrics` on store-gateway

**Expected behavior**
* I expect the memory usage to be constant over time, and `/metrics` response time to stay the same

**Environment:**
 - Infrastructure: kubernetes
 - Deployment tool: helm

**Storage Engine**
- [x] Blocks
- [ ] Chunks

**Additional Context**

What I think is happening:
* we are leaking `UserRegistry` instances when the store-gateway syncs blocks
* when store-gateway closes a block, it will try to [remove the per-user metrics](https://github.com/cortexproject/cortex/blob/95a407fa6ffa843bb6331b0a42afe0e77887c8b0/pkg/storegateway/bucket_stores.go#L364)
* when the removal happens, we check whether to [keep a snapshot of the metric or not, based on metric type](https://github.com/cortexproject/cortex/blob/95a407fa6ffa843bb6331b0a42afe0e77887c8b0/pkg/util/metrics_helper.go#L616)
* these snapshots are [never released from memory](https://github.com/cortexproject/cortex/blob/95a407fa6ffa843bb6331b0a42afe0e77887c8b0/pkg/util/metrics_helper.go#L594)
* we [aggregate all the snapshots](https://github.com/cortexproject/cortex/blob/e06154a00682fcd29f8ef5483e1f1bda29616a2c/pkg/util/metrics_helper.go#L659) every time `/metrics` is called
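
To make the suspected pattern concrete, here is a minimal, self-contained sketch. It is not the Cortex code: `snapshot`, `leakyRegistries`, and the metric name are hypothetical stand-ins, and it assumes each sync cycle removes a user's registry and later adds a fresh one. It only shows why both memory and `/metrics` work would grow with the number of removals.

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for the last gathered values of one
// per-user registry (metric name -> counter value).
type snapshot map[string]float64

// leakyRegistries is a simplified, hypothetical model of the behavior
// described above. It is not the Cortex implementation.
type leakyRegistries struct {
	active  map[string]snapshot // live per-user registries
	removed []snapshot          // snapshots of removed registries, never released
}

func newLeakyRegistries() *leakyRegistries {
	return &leakyRegistries{active: map[string]snapshot{}}
}

// add registers a fresh registry for a user.
func (r *leakyRegistries) add(user string, s snapshot) {
	r.active[user] = s
}

// remove drops the live registry but keeps its last values forever,
// so counters do not appear to go backwards.
func (r *leakyRegistries) remove(user string) {
	if s, ok := r.active[user]; ok {
		r.removed = append(r.removed, s)
		delete(r.active, user)
	}
}

// sum models one scrape: it walks every live registry and every retained
// snapshot, and reports how many it had to visit.
func (r *leakyRegistries) sum(name string) (total float64, visited int) {
	for _, s := range r.active {
		total += s[name]
		visited++
	}
	for _, s := range r.removed {
		total += s[name]
		visited++
	}
	return total, visited
}

func main() {
	r := newLeakyRegistries()
	// Three sync cycles for a single tenant (assumed add-then-remove per cycle).
	for i := 0; i < 3; i++ {
		r.add("tenant-a", snapshot{"blocks_loaded_total": 1})
		r.remove("tenant-a")
	}
	total, visited := r.sum("blocks_loaded_total")
	fmt.Printf("total=%v visited=%d retained=%d\n", total, visited, len(r.removed))
}
```

In this model the retained slice and the per-scrape work both grow by one for every removal, even though only one tenant ever existed.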

Potential solution:
Do you think it makes sense to maintain a single "global expired metric" that we update every time we sync the blocks?

We could fold the metrics of each dropped registry into that aggregate, instead of keeping every snapshot and recalculating the totals on every `/metrics` call. I would be happy to produce a PR if this solution works for you.
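
A rough sketch of that idea, again with hypothetical names (`foldingRegistries`, `expired`) rather than the real Cortex types. It folds values in at removal time, which is one possible variant of the periodic aggregation described above, and it assumes the retained metrics are summable (counters); other metric types would need their own merge rule.

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for the last gathered values of one
// per-user registry (metric name -> counter value).
type snapshot map[string]float64

// foldingRegistries is a hypothetical model of the proposed fix: removed
// registries are merged into one running aggregate instead of being kept.
type foldingRegistries struct {
	active  map[string]snapshot // live per-user registries
	expired snapshot            // single aggregate of all removed registries
}

func newFoldingRegistries() *foldingRegistries {
	return &foldingRegistries{
		active:  map[string]snapshot{},
		expired: snapshot{},
	}
}

// add registers a fresh registry for a user.
func (r *foldingRegistries) add(user string, s snapshot) {
	r.active[user] = s
}

// remove folds the registry's counters into the aggregate and then
// releases the registry, so memory does not grow with removals.
func (r *foldingRegistries) remove(user string) {
	s, ok := r.active[user]
	if !ok {
		return
	}
	for name, v := range s {
		r.expired[name] += v
	}
	delete(r.active, user)
}

// sum models one scrape: live registries plus exactly one aggregate.
func (r *foldingRegistries) sum(name string) (total float64, visited int) {
	total = r.expired[name]
	visited = 1 // the single expired aggregate
	for _, s := range r.active {
		total += s[name]
		visited++
	}
	return total, visited
}

func main() {
	r := newFoldingRegistries()
	for i := 0; i < 3; i++ {
		r.add("tenant-a", snapshot{"blocks_loaded_total": 1})
		r.remove("tenant-a")
	}
	total, visited := r.sum("blocks_loaded_total")
	fmt.Printf("total=%v visited=%d\n", total, visited)
}
```

The reported total is unchanged, but the scrape cost depends only on the number of live registries, not on how many have been removed since startup.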
