
Conversation

@justinrixx (Owner) commented Jun 2, 2024

I have been reading the Google SRE book and found this chapter on handling server overload very interesting. The whole chapter is worth a read, but the summary is:

  • A server entering overload is not good. It often creates a feedback loop: falling behind on requests leads to queuing them in memory, which creates memory pressure, which triggers more GC runs, which leaves less CPU for actually serving requests, which causes the server to fall even further behind.
  • Because of this, recovering from overload requires reducing the load to a small fraction of capacity (e.g. if running at 110% of its maximum reqs/sec pushed the system into overload, traffic may need to drop to 20-30% of capacity before it fully recovers).
  • A caller retrying during overload can greatly exacerbate the feedback loop and delay recovery.

The book outlines several overload mitigations in place at Google. Some of these require tighter integration with other parts of the system (standardized request headers, resource utilization reporting, etc.), but there are two that are relevant to this package and can be implemented without controlling more of the stack:

  • Adaptively circuit-breaking requests to a system showing signs of overload (see the client-side throttling equation described in that chapter; a rough sketch follows this list)
  • Implementing a per-client retry budget as a proportion of all requests (described in the Deciding to Retry section later in the same chapter).
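
For reference, the client-side throttling the chapter describes rejects new requests locally with a probability derived from two running counts; as I recall the formula is max(0, (requests − K × accepts) / (requests + 1)). A minimal sketch, where the function name and parameter names are my own placeholders:

```go
// rejectProbability sketches the chapter's client-side throttling formula:
// reject a new request locally with probability
//
//	max(0, (requests - k*accepts) / (requests + 1))
//
// where requests and accepts are counts over a recent window and k is a
// tunable multiplier (the book suggests starting around 2; a larger k
// tolerates more failures before throttling kicks in).
func rejectProbability(requests, accepts, k float64) float64 {
	p := (requests - k*accepts) / (requests + 1)
	if p < 0 {
		return 0
	}
	return p
}
```

The retry budget is simpler: only permit a retry while retries remain below some fraction (the book suggests on the order of 10%) of recent requests.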

Nothing on this branch is final; the interfaces will likely change, but I wanted to gather some initial thoughts and get feedback. My idea is to introduce a new interface, Throttler, which signals when to skip a request. By default the roundtripper won't do any throttling, since throttling costs memory and CPU and should be entirely opt-in. A new transport instantiation option function will let a consumer specify a throttler. It doesn't make sense to allow setting a throttler on the context, so that won't be an option (although maybe there should be a context key to bypass a previously-specified throttler on a per-request basis).
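
To make the shape concrete, here is a rough sketch of the wiring I have in mind. Everything below other than the Throttler methods (the option name, the Transport fields, the contents of Attempt) is a placeholder, not the final API:

```go
package retrythrottle // placeholder package name

import "net/http"

// Attempt stands in for whatever per-attempt details the roundtripper already
// carries (request, response, error, attempt number, ...).
type Attempt struct{}

// Throttler signals when a request should be skipped rather than sent.
type Throttler interface {
	ShouldThrottle(attempt Attempt) bool // consulted before sending an attempt
	RecordStats(attempt Attempt)         // fed the outcome of each attempt
}

// Transport stands in for the package's retrying RoundTripper.
type Transport struct {
	inner     http.RoundTripper
	throttler Throttler // nil by default: no throttling, no extra memory or CPU
}

// Option mirrors the existing transport instantiation option functions.
type Option func(*Transport)

// WithThrottler is the hypothetical new option for opting in to throttling.
func WithThrottler(t Throttler) Option {
	return func(tr *Transport) { tr.throttler = t }
}
```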

My draft default throttler tracks 3 counts over a sliding time window: total requests, requests that came back with an "overloaded" response, and retries. Keeping a sum over a sliding window is a little tricky: since every request goes through the throttler, I wanted to avoid memory allocations and mutex contention. My draft solution is the atomicCounter type in counter.go. It breaks the time window into n buckets. Increments just perform an atomic increment on the currently active bucket, and "reading" the current count sums over every bucket, performing an atomic load on each entry. The last piece is a time.Ticker that periodically zeroes out the oldest bucket atomically, then atomically advances the index to that bucket. This trades granularity for performance, which I think is the right choice here.
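
A sketch of the counter idea (the real implementation is in counter.go on this branch; the code below is simplified and the details may differ):

```go
package retrythrottle // placeholder package name

import (
	"sync/atomic"
	"time"
)

// atomicCounter approximates a count over a sliding time window by splitting
// the window into n buckets and touching them only with atomic operations.
type atomicCounter struct {
	buckets []atomic.Int64
	active  atomic.Int64 // index of the bucket currently receiving increments
}

func newAtomicCounter(window time.Duration, n int) *atomicCounter {
	c := &atomicCounter{buckets: make([]atomic.Int64, n)}
	go func() {
		// Every window/n, zero the oldest bucket and make it the active one.
		t := time.NewTicker(window / time.Duration(n))
		defer t.Stop()
		for range t.C {
			next := (c.active.Load() + 1) % int64(n)
			c.buckets[next].Store(0) // drop the oldest bucket's count first
			c.active.Store(next)     // then advance the index
		}
	}()
	return c
}

// incr atomically bumps the active bucket: no locks, no allocations.
func (c *atomicCounter) incr() {
	c.buckets[c.active.Load()].Add(1)
}

// count sums every bucket with atomic loads. The result is only accurate to
// bucket granularity, which is the granularity-for-performance tradeoff.
func (c *atomicCounter) count() int64 {
	var total int64
	for i := range c.buckets {
		total += c.buckets[i].Load()
	}
	return total
}
```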

@justinrixx (Owner, Author) commented:

@wburningham I'm interested in your thoughts on this

@wburningham left a comment


Overall I think this is a good idea.

// Throttler is an interface that decides if a retry should be throttled
Throttler interface {
	ShouldThrottle(attempt Attempt) bool
	RecordStats(attempt Attempt)
}


I don't like this part of the interface (I can't put my finger on why). I understand its purpose, but I wish there were a way around it.

If the method does stay in the final implementation, I think it would be helpful to explain when in the request lifecycle this method is called.

@justinrixx (Owner, Author) commented:

Thanks. I'll think about this further.
