@abc3
Contributor

@abc3 abc3 commented Jul 1, 2025

This PR introduces monitoring for large process heaps in Supavisor.ErlSysMon using the :large_heap system flag. The threshold is set to 25MB and is computed using the new helper function Supavisor.Helpers.mb_to_words/1.

It enables detection of processes that grow unusually large after garbage collection, helping to catch potential memory issues early.

Test coverage for Supavisor.ErlSysMon has also been added.
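The megabyte-to-word conversion matters because the `:large_heap` threshold is expressed in machine words, not bytes. A minimal sketch of such a conversion follows; the module name `HeapThresholdSketch` is hypothetical and the body is an assumption about how a helper like `Supavisor.Helpers.mb_to_words/1` could work, not the code from this PR.

```elixir
defmodule HeapThresholdSketch do
  # Hypothetical stand-in for a helper like Supavisor.Helpers.mb_to_words/1.
  # The actual implementation in the PR may differ.
  @spec mb_to_words(number()) :: non_neg_integer()
  def mb_to_words(mb) when is_number(mb) and mb >= 0 do
    bytes = mb * 1024 * 1024

    # :erlang.system_info(:wordsize) returns the size of a word in bytes
    # (8 on a 64-bit VM, 4 on a 32-bit VM).
    round(bytes / :erlang.system_info(:wordsize))
  end
end
```

On a 64-bit VM, where a word is 8 bytes, 25 MB corresponds to 3,276,800 words.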

Updated:

  • Added monitoring for long message queues using the :long_message_queue system flag:
{:long_message_queue, {0, 1_000}}
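As a rough illustration of how the two flags fit together, the sketch below registers a process as the system monitor and logs the resulting messages. The module name `SysMonSketch`, the log wording, and the inline threshold calculation are assumptions for illustration and are not taken from `Supavisor.ErlSysMon`. It assumes OTP 27 or later, where the `:long_message_queue` option is available.

```elixir
defmodule SysMonSketch do
  # Illustrative only: not the Supavisor.ErlSysMon implementation.
  use GenServer
  require Logger

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    # 25 MB expressed in words, since :large_heap takes a size in words.
    large_heap_words = div(25 * 1024 * 1024, :erlang.system_info(:wordsize))

    # Only one process per node can be the system monitor at a time.
    :erlang.system_monitor(self(), [
      {:large_heap, large_heap_words},
      # {disable, enable}: report at 1_000 queued messages,
      # report again once the queue has drained to 0.
      {:long_message_queue, {0, 1_000}}
    ])

    {:ok, %{}}
  end

  @impl true
  def handle_info({:monitor, pid, :large_heap, info}, state) do
    Logger.warning("large heap in #{inspect(pid)}: #{inspect(info)}")
    {:noreply, state}
  end

  def handle_info({:monitor, pid, :long_message_queue, long?}, state) do
    # long? is true when the queue reaches the enable length and
    # false when it falls back to the disable length.
    Logger.warning("long message queue in #{inspect(pid)}: #{inspect(long?)}")
    {:noreply, state}
  end

  def handle_info(_other, state), do: {:noreply, state}
end
```

With the pair `{0, 1_000}`, a process is reported once when its queue reaches 1,000 messages and is not reported again until the queue has fully drained, which keeps a single struggling process from flooding the monitor.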

@abc3 abc3 requested a review from a team as a code owner July 1, 2025 13:34
@v0idpwn v0idpwn enabled auto-merge (squash) July 1, 2025 13:44
@v0idpwn v0idpwn disabled auto-merge July 1, 2025 15:11
@abc3 abc3 requested a review from v0idpwn July 1, 2025 15:19
@abc3 abc3 changed the title feat: monitor large process heaps using :large_heap feat: monitor large process heaps and long message queues Jul 1, 2025
@v0idpwn v0idpwn merged commit 0d0eb43 into supabase:main Jul 1, 2025
19 of 22 checks passed
@v0idpwn
Member

v0idpwn commented Jul 1, 2025

Thank you!

@abc3 abc3 deleted the feat/large_heap branch July 1, 2025 17:55
@v0idpwn v0idpwn mentioned this pull request Jul 28, 2025
v0idpwn added a commit that referenced this pull request Jul 29, 2025

### Features
- **Authentication cleartext password support** - Added support for
cleartext password authentication method (#707)
- **Runtime-configurable connection retries** - Support for runtime
configuration of connection retries and infinite retries (#705)
- **Enhanced health checks** - Check database and eRPC capabilities
during health check operations (#691)
- **More consistency with postgres on auth errors** - Improves errors in
some client libraries (#711)

### Performance Improvements

- **Optimized ranch usage** - Supavisor now uses a constant number of
ranch instances for improved performance and resource management when
hosting a large number of pools (#706)

### Monitoring

- **New OS memory metrics** - gives a more accurate picture of memory
usage (#704)
- **Add a promex plugin for cluster metrics** - for tracking latency and
connection status (#690)
- **Client connection lifetime metrics** - adds a metric about how long
each connection is connected for (#688)
- **Process monitoring** - Log when large process heaps and long message
queues are detected (#689)

### Bug Fixes

- **Client handler query cancellation** - Fixed handling of
`:cancel_query` when state is `:idle` (#692)

### Migration Notes

- Instances running a small number of pools may see an increase in
memory usage. This can be mitigated by changing the ranch shard or the
acceptor counts.
- If any of the newly used ports are already in use, you may need to
change the defaults
- Review monitoring dashboards and include new metrics