fix(v2): ignore client timeouts in write-path circuit breaker #3858
Currently, client timeouts may trigger the segment-writer client circuit breaker, causing a specific segment-writer service instance to be temporarily removed from the connection pool. On one hand, this behavior is justified because if one of the instances is temporarily overloaded (e.g., due to a burst), we want to send fewer requests to it.
In practice, however, this is a very rare case, and the adaptive load balancing we have is usually fast enough to prevent hot spots. What we may observe instead is that either all the instances are overloaded (evenly) or the client (the distributor) itself is overloaded. In such cases, removing individual instances from the pool does not help, and the availability of the write path might degrade due to circuit breaker flapping. Another side effect is an increase in the number of segments created: while an instance is out of the pool, its requests go to other instances, so more than one segment-writer ends up uploading data for each shard, which puts additional pressure on the compaction process.
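
The idea can be illustrated with a minimal, standard-library-only sketch. This is not the code changed in this PR: `Breaker`, `countsAsFailure`, `opensAfter`, and the threshold and cooldown values are all hypothetical names and numbers chosen for illustration. It assumes timeouts surface as `context.DeadlineExceeded`; a gRPC client would more likely inspect the status code (`codes.DeadlineExceeded`) instead.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker rejects calls.
var ErrOpen = errors.New("circuit breaker is open")

// Example errors (hypothetical): a wrapped client timeout and a server-side failure.
var (
	errTimeout     = fmt.Errorf("push: %w", context.DeadlineExceeded)
	errUnavailable = errors.New("segment-writer unavailable")
)

// Breaker is a deliberately simplified consecutive-failure circuit breaker.
// It is an illustration only, not the breaker used by the real client.
type Breaker struct {
	mu        sync.Mutex
	threshold int           // consecutive failures needed to open
	cooldown  time.Duration // how long the breaker stays open
	failures  int
	open      bool
	openedAt  time.Time
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// countsAsFailure decides which errors feed the breaker.
// Client timeouts are ignored: they say little about the health of one instance.
func countsAsFailure(err error) bool {
	return err != nil && !errors.Is(err, context.DeadlineExceeded)
}

// IsOpen reports whether calls are currently being rejected.
func (b *Breaker) IsOpen() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.open && time.Since(b.openedAt) < b.cooldown
}

// Do runs call unless the breaker is open, then records the outcome.
func (b *Breaker) Do(call func() error) error {
	b.mu.Lock()
	if b.open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen
		}
		// Cooldown elapsed: close the breaker and let this call through.
		b.open, b.failures = false, 0
	}
	b.mu.Unlock()

	err := call()

	b.mu.Lock()
	defer b.mu.Unlock()
	switch {
	case countsAsFailure(err):
		b.failures++
		if b.failures >= b.threshold {
			b.open, b.openedAt = true, time.Now()
		}
	case err == nil:
		b.failures = 0 // a success resets the streak
	}
	// Timeouts fall through: they neither count as failures nor reset the streak.
	return err
}

// opensAfter feeds n calls that all return err and reports whether the breaker opened.
func opensAfter(n int, err error) bool {
	b := NewBreaker(3, time.Minute)
	for i := 0; i < n; i++ {
		_ = b.Do(func() error { return err })
	}
	return b.IsOpen()
}
```

Under these assumptions, `opensAfter(5, errUnavailable)` trips the breaker, while `opensAfter(5, errTimeout)` leaves it closed, so a burst of client-side timeouts no longer evicts an instance from the pool. Whether an ignored timeout should also reset the failure streak is a design choice; this sketch leaves the streak untouched.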