
Conversation

@blink1073 (Member):

Please complete the following before merging:

  • Is the relevant DRIVERS ticket in the PR title?

blink1073 changed the title from "DRIVERS-318 Avoid clearing the connection pool when the server connection rate limiter triggers" to "DRIVERS-3218 Avoid clearing the connection pool when the server connection rate limiter triggers" on Oct 28, 2025
@blink1073 (Member Author):

I'll get all of the tests passing in mongodb/mongo-python-driver#2598 and then include them in this PR.

- A successful heartbeat does NOT change the state of the pool.
- A failed heartbeat clears the pool.
- A subsequent failed connection will increase the backoff attempt.
- A successful connection will return the pool to the ready state.
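
A minimal sketch of those transitions, assuming a hypothetical pool with ready/backoff/paused states (all names here are illustrative assumptions, not the spec's API):

```python
from enum import Enum, auto

class PoolState(Enum):
    READY = auto()
    BACKOFF = auto()  # illustrative; the spec may name this PoolBackoff
    PAUSED = auto()

class Pool:
    """Illustrative pool; state names and methods are assumptions."""

    def __init__(self):
        self.state = PoolState.READY
        self.backoff_attempt = 0

    def on_heartbeat_success(self):
        # A successful heartbeat does NOT change the state of the pool.
        pass

    def on_heartbeat_failure(self):
        # A failed heartbeat clears (pauses) the pool.
        self.state = PoolState.PAUSED
        self.backoff_attempt = 0

    def on_connection_failure(self):
        # A subsequent failed connection increases the backoff attempt.
        self.state = PoolState.BACKOFF
        self.backoff_attempt += 1

    def on_connection_success(self):
        # A successful connection returns the pool to the ready state.
        self.state = PoolState.READY
        self.backoff_attempt = 0
```
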
Member:

Could we add a description of the exponential backoff + jitter for the backoff duration?
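
For illustration only, here is one common shape such a policy can take: capped exponential backoff with full jitter. The base delay, cap, and jitter style below are assumptions, not what the PR ultimately specifies:

```python
import random

BASE_DELAY_MS = 100      # assumed initial delay; the spec may choose differently
MAX_DELAY_MS = 10_000    # assumed cap on the backoff duration

def backoff_delay_ms(attempt: int) -> float:
    """Capped exponential backoff with full jitter (illustrative policy).

    attempt is 1-based: the first backoff uses the base delay.
    """
    capped = min(MAX_DELAY_MS, BASE_DELAY_MS * (2 ** (attempt - 1)))
    # Full jitter: sample uniformly in [0, capped] so concurrent clients
    # do not retry in lockstep.
    return random.uniform(0, capped)
```
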

@blink1073 (Member Author):

Should we define the backoff and jitter policy in one place and link to it? If so, should I add it in this PR and where?

Member:

I think it's small and simple enough that it should be defined here alongside where it will be used.

@@ -1,40 +1,38 @@
version: 1
style: integration
Contributor:

We can't modify spec tests like this anymore, because doing so will break drivers that track the spec tests via submodules and haven't implemented backoff yet; if those drivers then skip these tests, they lose coverage.

Comment on lines +88 to +93
- poolBackoffEvent: {}
- poolBackoffEvent: {}
- poolBackoffEvent: {}
- poolBackoffEvent: {}
- poolBackoffEvent: {}
- poolBackoffEvent: {}
Contributor:

For this, and the tests below: where are all these pool backoff events coming from? I get zero or one of them in Node (depending on my implementation; see below).

This is related to my comment in the design. Even if I align my implementation with the design, I only get one backoff event because there are no establishment attempts:

  • there are no requests in the wait queue
  • minPoolSize is 0, so no background thread population

Member:

The issue is that these extra backoff attempts occur because of the new retry logic, which is already implemented in the branch Steve is using to test these changes in Python. Is there a way we can test this without tying the two projects together? Otherwise, drivers will need to implement the retry logic first.

Contributor:

How does the retry logic come into play here? I thought we only retried commands with a SystemOverloadError error label. And the insert does have an expectError, so the insert fails on Steve's branch as well.

@ShaneHarvey (Member), Oct 30, 2025:

Similar to how the driver labels connection errors with "RetryableWriteError" for retryable writes, we add the "RetryableError" and "SystemOverloadedError" labels to these errors so that the command can later be retried. We'll need to clarify this rule in this PR.
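
A rough sketch of that labeling rule; the exception type and helper names here are hypothetical, not the Python driver's actual API:

```python
class EstablishmentError(Exception):
    """Hypothetical stand-in for a driver connection-establishment error."""

    def __init__(self, message: str):
        super().__init__(message)
        self.error_labels: set[str] = set()

def label_overload_error(exc: EstablishmentError) -> EstablishmentError:
    # Mirror the retryable-writes pattern ("RetryableWriteError"): attach
    # labels so the retry logic can later decide to retry the command.
    exc.error_labels.add("RetryableError")
    exc.error_labels.add("SystemOverloadedError")
    return exc

def should_retry(exc: EstablishmentError) -> bool:
    return "RetryableError" in exc.error_labels
```
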

Comment on lines +288 to +291

connection checkout fails under conditions that indicate server overload. The rules for entering backoff mode are as follows:

- A network error or network timeout during the TCP handshake or the `hello` message for a new connection MUST trigger the backoff state.
- Other pending connections MUST NOT be canceled.
- In the case of multiple pending connections, the backoff attempt number MUST only be incremented once. This can be done by recording the state prior
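
A minimal sketch of the "incremented once" rule, under the assumption that the pool records a generation before each establishment attempt (all names here are illustrative):

```python
import threading

class Pool:
    """Illustrative only; not the spec's wording or any driver's API."""

    def __init__(self):
        self._lock = threading.Lock()
        self.backoff_attempt = 0
        self._generation = 0  # bumped each time the pool enters backoff

    def before_establishment(self) -> int:
        # Record the state prior to the attempt, as the quoted text suggests.
        with self._lock:
            return self._generation

    def on_handshake_network_error(self, prior_generation: int) -> None:
        with self._lock:
            if self._generation != prior_generation:
                # Another pending connection already triggered backoff for
                # this round; do not increment the attempt again. Pending
                # connections are not canceled.
                return
            self._generation += 1
            self.backoff_attempt += 1
```
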
Contributor:

I was talking with @ShaneHarvey about this (related comments on drivers ticket). Shane's understanding is that we decided to include all timeout errors, regardless of where it originated, during connection establishment. Does that match your understanding, Steve?

And related: the design says:

After a connection establishment failure the pool enters the PoolBackoff state.

We should update the design with whatever the outcome of this thread is.
