DRIVERS-2884: CSOT avoid connection churn when operations timeout #1845
Conversation
I reviewed the unified test schema changes alone and those LGTM. I'll defer to CSOT spec folks for the other files.
@sanych-sun Have we implemented these changes in any driver? Which driver is supposed to be the second implementer? IIRC it was originally Python.
    close connection
    connection = Null
    if connection is in "pending response" state:
        drain the pending response
Do we have pseudocode for "drain the pending response" somewhere? I don't see it in this PR.
Since we already have event emission in the current pseudocode, I'm not sure we need pseudocode for "drain the pending response", as it basically means "consume the bytes from the underlying stream/socket and ignore them".
I'd like the pseudocode; it doesn't seem terribly involved to add, and it would be nice to codify the calculation of the timeout for the pending read in pseudocode:
read_timeout =  timeoutMS set 
    ? csotMin(timeoutMS, remaining static timeout) 
    : waitQueueTimeoutMS set 
        ? csotMin(waitQueueTimeoutMS, remaining static timeout) 
        : remaining static timeout
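For illustration, here is a minimal Python sketch of that selection logic. The function and parameter names (csot_min, pending_read_timeout, remaining_static_timeout) are hypothetical placeholders, not spec-defined API:

    def csot_min(a, b):
        # Return the smaller of two timeouts, where None means "unset"
        # (assumed convention for this sketch).
        if a is None:
            return b
        if b is None:
            return a
        return min(a, b)

    def pending_read_timeout(timeout_ms, wait_queue_timeout_ms, remaining_static_timeout):
        # Mirrors the ternary chain above: prefer timeoutMS, then
        # waitQueueTimeoutMS, falling back to the remaining static timeout.
        if timeout_ms is not None:
            return csot_min(timeout_ms, remaining_static_timeout)
        if wait_queue_timeout_ms is not None:
            return csot_min(wait_queue_timeout_ms, remaining_static_timeout)
        return remaining_static_timeout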
Timeout calculation was added. I've introduced the calculation for the entire check-out according to the CSOT spec (please find the checkout_timeout variable), together with draining_timeout, which is calculated as min(remaining checkout_timeout, remaining expiration window).
    /**
     * The driver-generated request ID of the operation that caused the pending response state.
     */
    requestId: int64;
Seems odd that the events emitted for an operation include a requestId that corresponds to the request that put the connection into the pending-response state. What is the value in this datapoint?
If we decide to keep it, do you think a more precise name would be beneficial? requestId is ambiguous; users could easily assume it refers to the request that is reading the pending response, not the operation that made the connection pending.
The idea of reporting the original requestId is to keep track of how long/how many draining attempts were made before success or failure. Honestly, I would prefer to have BOTH the current requestId and the "original timed-out requestId", but that could make sense only if the other check-out events had a "current requestId" field, which is not there =(
This might depend on driver internals, but in Python the new request ID will not be available at this point because the command has not been serialized yet (we need to check out the connection first).
As for including the old requestId, I'm not sure how useful it is, but it seems harmless to add. The driver needs to validate that requestId == responseTo on the server reply using the old requestId anyway, to make sure the wire protocol isn't violated, so that value will be available.
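For context, that check is cheap because the standard MongoDB wire protocol message header carries both IDs. A hedged Python sketch (the helper name and error handling are illustrative, not from this PR):

    import struct

    def validate_reply_header(header: bytes, original_request_id: int) -> None:
        # The 16-byte wire protocol header is four little-endian int32s:
        # messageLength, requestID, responseTo, opCode.
        message_length, request_id, response_to, op_code = struct.unpack("<iiii", header[:16])
        if response_to != original_request_id:
            # The reply is not for the timed-out request: the wire protocol
            # was violated, so the connection must be closed.
            raise ValueError(
                f"expected responseTo={original_request_id}, got {response_to}"
            )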
Thank you Shane! I suppose we want to keep the field. @baileympearson is this OK with you?
Yeah, fine to keep it - thoughts on workshopping the name to make it clearer?
Open to suggestions for a better name.
    - `sendBytes`: We have 3 possible states here:
      1. Message size was partially read: random value between 1 and 3 inclusive
      2. Message size was read, body was not read at all: use 4
      3. Message size was read, body read partially: random value between 5 and 100 inclusive
The insert is the first command on the connection, so where are the values for sendBytes coming from?
This is the parameter for the proxy, which specifies how many bytes should be streamed before sleeping.
The value of sendBytes is determined based on the state of the connection. But this is the first operation on the connection, so the connection will never have anything in the buffer, and it isn't clear what the value of sendBytes should be.
Unless I misunderstand what sub-bullets 1, 2, and 3 mean?
Those 3 sub-bullets are different test cases for the same test scenario. This is a parameter for the proxy that defines how many bytes of the server's response should be streamed instantly, before the delay.
I rephrased it a little; hope it's more readable now.
    This test verifies that if only part of a response was read before the timeout, the driver can drain the rest of the
    response and reuse the connection for the next operation.
This doesn't match the actual contents of the test, so either the description or the test needs to be updated.
But if it's the description that is inaccurate: with sendAll: true and the events (only one pending read started + finished pair), the full response will be read in the first operation. Don't we have coverage for this scenario from our unified tests?
We are using a proxy server here to control when/how the server response arrives on the client side. So in step 2, in addition to the regular insert command payload, we have to add an additional proxyTest property, which instructs the proxy to emulate a timeout. It works as follows:
- it sends the request to the server
- based on the sendBytes parameter, it streams the requested number of bytes instantly
- it sleeps for delayMS
- it streams the rest of the response.
I think it makes sense to add this explanation to the test summary.
Okay, maybe my confusion stems from the interaction between sendBytes and sendAll.  Does sendAll being enabled not mean that the proxy will forward the full response back to the client?
Yes. I've updated the steps with a sample of the payload; it might help to understand the idea.
Here is an example of the proxyTest parameter:

    proxyTest: {
        actions: [
            { sendBytes: 2 },
            { delayMs: 400 },
            { sendAll: true },
        ]
    }

Which can be read as: "Hey proxy, here are steps for you; do them one by one:
action 1: stream 2 bytes of the server response
action 2: wait for 400 ms
action 3: stream the rest of the response to the client."
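In other words, the proxy just executes the action list in order. A minimal Python sketch of how such a proxy could interpret that list (illustrative only; the real proxy is defined by this PR's test tooling):

    import time

    def run_proxy_actions(actions, response: bytes, client_socket) -> None:
        # `response` is the complete server reply captured by the proxy;
        # `sent` tracks how many bytes have already been streamed onward.
        sent = 0
        for action in actions:
            if "sendBytes" in action:
                n = action["sendBytes"]
                client_socket.sendall(response[sent:sent + n])
                sent += n
            elif "delayMs" in action:
                time.sleep(action["delayMs"] / 1000.0)
            elif action.get("sendAll"):
                client_socket.sendall(response[sent:])
                sent = len(response)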
…-and-pooling.md Co-authored-by: Bailey Pearson <[email protected]>
Co-authored-by: Bailey Pearson <[email protected]>
    tConnectionDrainingStarted = current instant (use a monotonic clock if possible)
    emit PendingResponseStartedEvent and equivalent log message
    drain the pending response
    if error:
The mix of exception/error paradigms in this pseudocode block is a bit confusing. We should rewrite this new pseudocode to use try/catch like the existing code.
Rewritten, thank you!
    some equivalent configuration, but this configuration will also require target frameworks higher than or equal to .net
    5.0. The advantage of using Background Thread to manage perished connections is that it will work regardless of
    environment setup.
Could we add a rationale question to cover the motivation for this feature? Something like "Why introduce draining of pending responses?", where the answer is to reduce connection churn in cases where the configured maxTimeMS does not allow enough time for the driver to read the MaxTimeMSExpired error, and cases where the server or network delays the response.
Done.
Nothing major from me! Still waiting on a second implementation.
    tConnectionDrainingStarted = current instant (use a monotonic clock if possible)
    emit PendingResponseStartedEvent and equivalent log message
    try:
        drain the pending response
The pseudocode notes above that drivers should only allow one thread to run the check-out process at a time:
"Note that in a lock-based implementation of the wait queue would only allow one thread in the following block at a time"
Reading connections in the "pending response" state for 3 seconds while holding an exclusive lock could introduce a huge bottleneck in multi-threaded drivers. Is there a way to move that pending-response read out of the locked block?
Don't we have the same issue for create connection as well? How do we fix that issue with create connection in a lock-based implementation? I suppose we have to apply the same fix/workaround to pending response draining as well.
My understanding is that create connection is a non-blocking operation that creates a connection in the "pending" state. The establish connection in the block below is where the blocking work is done. That block specifically notes:
This MUST NOT block other threads from acquiring connections.
How about we add a note like we do in the maxConnecting case?:
        # this waiting MUST NOT prevent other threads from checking Connections
        # back in to the pool.
        wait until pendingConnectionCount < maxConnecting or a connection is available
        continue
Like
            # this I/O MUST NOT block other threads performing connection checkout
            drain the pending response
I'm concerned adding that note glosses over the complexity of implementing such a locking strategy. However, I also realize that the pseudocode is primarily for communicating the process and may not resemble actual driver code for checkOut.
Would a change like the following work?
...
try:
  enter WaitQueue
  wait until at top of wait queue
  while connection is Null:
    # Note that in a lock-based implementation of the wait queue would
    # only allow one thread in the following block at a time
    if a connection is available:
      while connection is Null and a connection is available:
        connection = next available connection
        if connection is perished:
          close connection
          connection = Null
    else if totalConnectionCount < maxPoolSize:
      if pendingConnectionCount < maxConnecting:
        connection = create connection
      else:
        # this waiting MUST NOT prevent other threads from checking Connections
        # back in to the pool.
        wait until pendingConnectionCount < maxConnecting or a connection is available
        continue
    if connection is in "pending response" state:
      tConnectionDrainingStarted = current instant (use a monotonic clock if possible)
      emit PendingResponseStartedEvent and equivalent log message
      # this I/O MUST NOT block other threads performing connection checkout
      drain the pending response
...
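One way a lock-based driver might honor that "MUST NOT block" comment is to hold the pool lock only while selecting a connection, release it for the drain I/O, and re-acquire it afterwards. A hedged Python-style sketch under that assumption (all names are hypothetical; nothing here is mandated by the spec):

    # Hold the lock only while selecting a connection; never during I/O.
    with pool.lock:
        connection = pool.next_available_connection()

    if connection is not None and connection.state == "pending response":
        emit_pending_response_started(connection)
        try:
            # The blocking drain happens with no pool lock held, so other
            # threads can keep checking connections in and out.
            connection.drain_pending_response(timeout=draining_timeout)
        except TimeoutError:
            with pool.lock:
                if connection.time_since_last_read() > 3.0:
                    connection.close()
                else:
                    pool.check_in(connection)  # progress was made; keep it
            connection = None  # caller re-enters the wait queue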
@matthewdale Bailey and I agreed to adjust the code to be very close to what you are suggesting here, with the only difference that enter WaitQueue will be under the outer loop. I'll push the new version of the pseudocode soon.
Done. An outer loop was added to support draining of the pending response outside of the wait queue.
@matthewdale @baileympearson Could you please re-review the checkout pseudocode?
Thanks!
Looks good! The pseudocode makes it clearer that a checkOut draining a pending response doesn't get to keep its place in the wait queue and must re-enter the wait queue if the draining times out.
    if last read timestamp on connection > 3 seconds old
        close connection
    else
        check in the connection
    connection = Null
If an operation with no timeout picks up a "pending response" connection, are there cases where that operation can wait an indefinitely long time to get a connection?
Consider the scenario:
There is one connection in a connection pool. The one connection is in the  "pending response" state, where the connection is consistently but very slowly receiving data from the server (at least 1 byte every 3 seconds).
- An operation with no timeout tries to check out a connection.
- The first connection in the stack is the above "pending response" connection. The driver tries to drain the pending response.
- Draining the pending response takes longer than 3 seconds, so the connection is checked back into the pool.
- The driver selects the next connection, which is the same connection. Go to step 2 and repeat.
Questions:
A. Is that scenario realistic?
B. For drivers that support the deprecated "waitQueueTimeoutMS", should they apply that timeout to the above check-out process?
C. For drivers that no longer support the deprecated "waitQueueTimeoutMS", should they apply "serverSelectionTimeoutMS" to the above check-out process?
I don't think we need to design a solution to handle the "very slowly receiving data" problem under the assumption that it's extremely unlikely to occur. However that question has come up often enough that we should add justification/rationale.
That scenario could be manufactured by an attacker that controls the network but in that case they can already DOS the system by blocking all traffic.
The proposed changes DO NOT change any requirements regarding timeouts for the whole checkout process. For drivers that support waitQueueTimeoutMS, we should use it to limit checkout time; otherwise it's the remainder of serverSelectionTimeoutMS. Regarding the timeout for draining the pending response, there is a requirement to use a default timeout if no timeout was provided:
- Default timeout: If no user-provided timeout is specified, the driver MUST use the minimum of (a) the remaining 3-second "pending response" window and (b) the remaining timeout for connection checkout.
Which means there is NO unlimited number of retries: even if the connection is checked back into the connection pool, the whole process cannot take longer than the timeout prescribed for connection checkout.
Answers to the questions:
A. As Shane stated, even though this sounds like a potential problem, we are assuming this is an unrealistic scenario for now. To mitigate it we could introduce an additional timeout to limit the time from the "first byte" to the "whole response read", but I'm not sure we should do that just now.
B./C. I believe it was mentioned that we should respect the remaining waitQueueTimeoutMS or serverSelectionTimeoutMS for draining the pending response.
Sounds good. @sanych-sun can you add an explicit note that waitQueueTimeoutMS or serverSelectionTimeoutMS must bound connection checkout? Maybe add a link to the relevant section about using serverSelectionTimeoutMS for connection checkout from the CSOT spec?
Timeout calculations added to the pseudocode.
Thank you!
Thanks for adding the note about the timeout! I checked and confirmed that the Go Driver uses serverSelectionTimeoutMS to bound conn checkout currently, but it's easy to miss and more important with this change.
Changes to the unified test files LGTM. Remember to change the changelog date before merging this PR.
    ## Changelog

    - 2025-09-25: **Schema version 1.28**.
Reminder to change the date before merging.
Looks good! 👍
This PR is based on Preston's work in another PR that was closed by mistake: #1675
This PR implements the design for connection pooling improvements described in DRIVERS-2884, based on the CSOT (Client-Side Operations Timeout) spec. It addresses connection churn caused by network timeouts during operations, especially in environments with low client-side timeouts and high latency.
When a connection is checked out after a network timeout, the driver now attempts to resume and complete reading any pending server response, instead of closing and discarding the connection. This may require multiple connections to be attempted during connection check-out from the pool.
Each pending response drain is subject to a cumulative 3-second static timeout. The timeout is refreshed after each successful read, acknowledging that progress is being made. If no data is read and the timeout is exceeded, the connection is closed.
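A compact sketch of that refresh-on-progress behavior under the stated assumptions (the conn object and helper names are illustrative): the 3-second window restarts whenever a read makes progress, and the connection is closed once a full window elapses with no data.

    import time

    PENDING_RESPONSE_WINDOW = 3.0  # seconds, the static timeout described above

    def drain_pending_response(conn, bytes_remaining: int) -> None:
        deadline = time.monotonic() + PENDING_RESPONSE_WINDOW
        while bytes_remaining > 0:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                conn.close()
                raise TimeoutError("pending response window exhausted")
            # Assumes conn.read raises on timeout and returns b"" on EOF.
            chunk = conn.read(min(bytes_remaining, 4096), timeout=remaining)
            if not chunk:
                conn.close()
                raise ConnectionError("stream closed mid-response")
            bytes_remaining -= len(chunk)
            # A successful read refreshes the static window.
            deadline = time.monotonic() + PENDING_RESPONSE_WINDOW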
This update introduces new CMAP events and logging messages (PendingResponseStarted, PendingResponseSucceeded, PendingResponseFailed) to improve observability of this path.