fix(pool): pool performance #3565
Conversation
fce8679 to 3478f7a
perf(pool): replace hookManager RWMutex with atomic.Pointer and add predefined state slices
- Replace hookManager RWMutex with atomic.Pointer for lock-free reads in hot paths
- Add predefined state slices to avoid allocations (validFromInUse, validFromCreatedOrIdle, etc.)
- Add Clone() method to PoolHookManager for atomic updates
- Update AddPoolHook/RemovePoolHook to use copy-on-write pattern
- Update all hookManager access points to use atomic Load()
Performance improvements:
- Eliminates RWMutex contention in Get/Put/Remove hot paths
- Reduces allocations by reusing predefined state slices
- Lock-free reads allow better CPU cache utilization
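A minimal sketch of the copy-on-write pattern this commit describes, assuming simplified hook and pool types; the real definitions live in internal/pool/hooks.go and internal/pool/pool.go, and the hook interface and `runOnGetHooks` name here are illustrative only.

```go
package pool

import "sync/atomic"

// PoolHook is a stand-in for the real hook interface (illustrative only).
type PoolHook interface {
	OnGet()
}

// PoolHookManager holds the registered hooks.
type PoolHookManager struct {
	hooks []PoolHook
}

// Clone returns a copy that writers can mutate without affecting readers.
func (m *PoolHookManager) Clone() *PoolHookManager {
	c := &PoolHookManager{hooks: make([]PoolHook, len(m.hooks))}
	copy(c.hooks, m.hooks)
	return c
}

// ConnPool keeps the manager behind an atomic.Pointer so hot paths
// (Get/Put/Remove) read it with a single atomic Load, no RWMutex.
type ConnPool struct {
	hookManager atomic.Pointer[PoolHookManager]
}

// AddPoolHook is the copy-on-write update path: clone, mutate, CAS.
func (p *ConnPool) AddPoolHook(h PoolHook) {
	for {
		old := p.hookManager.Load()
		clone := &PoolHookManager{}
		if old != nil {
			clone = old.Clone()
		}
		clone.hooks = append(clone.hooks, h)
		if p.hookManager.CompareAndSwap(old, clone) {
			return
		}
	}
}

// Hot-path read: a single atomic load, lock-free.
func (p *ConnPool) runOnGetHooks() {
	if m := p.hookManager.Load(); m != nil {
		for _, h := range m.hooks {
			h.OnGet()
		}
	}
}
```

Writers pay for a clone and a CAS retry loop, which is acceptable because hooks change rarely; readers never block.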
c914088 to 1b0168d
perf(pool): eliminate mutex overhead in state machine hot path
The state machine was calling notifyWaiters() on EVERY Get/Put operation, which acquired a mutex even when no waiters were present (the common case).
Fix: Use atomic waiterCount to check for waiters BEFORE acquiring the mutex. This eliminates mutex contention in the hot path (Get/Put operations).
Implementation:
- Added atomic.Int32 waiterCount field to ConnStateMachine
- Increment when adding a waiter, decrement when removing
- Check waiterCount atomically before acquiring the mutex in notifyWaiters()
Performance impact:
- Before: mutex lock/unlock on every Get/Put (even with no waiters)
- After: lock-free atomic check, only acquire the mutex if waiters exist
- Expected improvement: ~30-50% for Get/Put operations
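A sketch of that waiter-count fast path, with ConnStateMachine reduced to the fields this commit mentions; the waiter queue itself is omitted.

```go
package pool

import (
	"sync"
	"sync/atomic"
)

// ConnStateMachine, reduced to the fields relevant to this change.
type ConnStateMachine struct {
	state       atomic.Uint32 // current connection state
	waiterCount atomic.Int32  // incremented when a waiter is added, decremented when removed
	mu          sync.Mutex    // protects the waiter queue (omitted here)
}

// notifyWaiters runs after a successful transition. The atomic check lets
// the common Get/Put case (no waiters) skip the mutex entirely.
func (sm *ConnStateMachine) notifyWaiters() {
	if sm.waiterCount.Load() == 0 {
		return // lock-free fast path
	}
	sm.mu.Lock()
	defer sm.mu.Unlock()
	// ... wake waiters whose target state now matches (omitted) ...
}
```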
079c217 to 78bcfbb
perf(pool): use predefined state slices to eliminate allocations in hot path
The pool was creating new slice literals on EVERY Get/Put operation:
- popIdle(): []ConnState{StateCreated, StateIdle}
- putConn(): []ConnState{StateInUse}
- CompareAndSwapUsed(): []ConnState{StateIdle} and []ConnState{StateInUse}
- MarkUnusableForHandoff(): []ConnState{StateInUse, StateIdle, StateCreated}
These allocations were happening millions of times per second in the hot path.
Fix: Use predefined global slices defined in conn_state.go:
- validFromInUse
- validFromCreatedOrIdle
- validFromCreatedInUseOrIdle
Performance impact:
- Before: 4 slice allocations per Get/Put cycle
- After: 0 allocations (use predefined slices)
- Expected improvement: ~30-40% reduction in allocations and GC pressure
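A sketch of what the predefined slices look like, assuming ConnState constants along the lines named in the commit message (the exact values live in conn_state.go).

```go
package pool

// ConnState values are illustrative; the real constants are in conn_state.go.
type ConnState uint32

const (
	StateCreated ConnState = iota
	StateIdle
	StateInUse
	StateUnusable
)

// Allocated once at package init and reused on every Get/Put, instead of
// building []ConnState{...} literals in the hot path.
var (
	validFromInUse              = []ConnState{StateInUse}
	validFromCreatedOrIdle      = []ConnState{StateCreated, StateIdle}
	validFromCreatedInUseOrIdle = []ConnState{StateInUse, StateIdle, StateCreated}
)
```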
perf(pool): optimize TryTransition to reduce atomic operations
Further optimize the hot path by:
1. Remove redundant GetState() call in the loop
2. Only check waiterCount after successful CAS (not before the loop)
3. Inline the waiterCount check to avoid notifyWaiters() call overhead
This reduces atomic operations from 4-5 per Get/Put to 2-3:
- Before: GetState() + CAS + waiterCount.Load() + notifyWaiters mutex check
- After: CAS + waiterCount.Load() (only if CAS succeeds)
Performance impact:
- Eliminates 1-2 atomic operations per Get/Put
- Expected improvement: ~10-15% for Get/Put operations
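Roughly what the optimized TryTransition looks like, continuing the simplified ConnStateMachine and ConnState sketches above (same package, with "errors" added to the imports); errInvalidTransition is an assumed preallocated sentinel, not the library's actual error value.

```go
// errInvalidTransition stands in for a preallocated error value so the hot
// path never allocates on failure (assumed name).
var errInvalidTransition = errors.New("pool: invalid state transition")

// TryTransition attempts validFrom[i] -> target with one CAS per candidate
// state. There is no separate GetState() read, and the waiter check is a
// single atomic load performed only after a successful transition.
func (sm *ConnStateMachine) TryTransition(validFrom []ConnState, target ConnState) error {
	for _, from := range validFrom {
		if sm.state.CompareAndSwap(uint32(from), uint32(target)) {
			if sm.waiterCount.Load() > 0 { // inlined check; rarely true
				sm.notifyWaiters()
			}
			return nil
		}
	}
	return errInvalidTransition
}
```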
perf(pool): add fast path for Get/Put to match master performance
Introduced TryTransitionFast() for the hot path (Get/Put operations):
- Single CAS operation (same as master's atomic bool)
- No waiter notification overhead
- No loop through valid states
- No error allocation
Hot path flow:
1. popIdle(): Try IDLE → IN_USE (fast), fallback to CREATED → IN_USE
2. putConn(): Try IN_USE → IDLE (fast)
This matches master's performance while preserving the state machine for:
- Background operations (handoff/reauth use the UNUSABLE state)
- State validation (TryTransition still available)
- Waiter notification (AwaitAndTransition for blocking)
Performance comparison per Get/Put cycle:
- Master: 2 atomic CAS operations
- State machine (before): 5 atomic operations (2.5x slower)
- State machine (after): 2 atomic CAS operations (same as master!)
Expected improvement: Restore to baseline ~11,373 ops/sec
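And a sketch of the fast path itself, again continuing the simplified types above; acquireForUse and releaseToIdle are illustrative helpers standing in for the relevant parts of popIdle() and putConn().

```go
// TryTransitionFast is the Get/Put hot path: one CAS, no loop over valid
// states, no waiter notification, no error allocation.
func (sm *ConnStateMachine) TryTransitionFast(from, to ConnState) bool {
	return sm.state.CompareAndSwap(uint32(from), uint32(to))
}

// acquireForUse mirrors the popIdle() flow: IDLE -> IN_USE, with a fallback
// to CREATED -> IN_USE for connections pooled before initialization.
func acquireForUse(sm *ConnStateMachine) bool {
	return sm.TryTransitionFast(StateIdle, StateInUse) ||
		sm.TryTransitionFast(StateCreated, StateInUse)
}

// releaseToIdle mirrors the putConn() flow: IN_USE -> IDLE. A false return
// means a background operation (handoff/re-auth) has moved the connection
// to UNUSABLE in the meantime.
func releaseToIdle(sm *ConnStateMachine) bool {
	return sm.TryTransitionFast(StateInUse, StateIdle)
}
```

Background paths keep using TryTransition and AwaitAndTransition, so validation and waiter notification are only paid for where they matter.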
Pull Request Overview
This PR aims to improve pool performance by replacing synchronization primitives with more efficient atomic-based implementations and optimizing hot path operations through inlining and allocation reduction.
Key Changes:
- Introduced FastSemaphore using atomic operations instead of channel-based semaphores for connection limiting (see the sketch below)
- Replaced mutex-protected hook manager with lock-free atomic pointer operations
- Added predefined state slices and inline methods to reduce allocations in critical paths
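A minimal sketch of an atomic-counter semaphore in the spirit of FastSemaphore; names and the exact API are assumptions (the real implementation is in internal/semaphore.go and also supports blocking acquisition with waiter handling, which is omitted here).

```go
package internal

import "sync/atomic"

// fastSemaphore limits concurrency with a single atomic counter, so the
// uncontended TryAcquire/Release path is one CAS or one Add and never
// touches a channel or mutex. Illustrative only.
type fastSemaphore struct {
	free atomic.Int32 // remaining permits
}

func newFastSemaphore(n int32) *fastSemaphore {
	s := &fastSemaphore{}
	s.free.Store(n)
	return s
}

// TryAcquire takes a permit if one is available, without blocking.
func (s *fastSemaphore) TryAcquire() bool {
	for {
		n := s.free.Load()
		if n <= 0 {
			return false
		}
		if s.free.CompareAndSwap(n, n-1) {
			return true
		}
	}
}

// Release returns a permit.
func (s *fastSemaphore) Release() {
	s.free.Add(1)
}
```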
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| internal/semaphore.go | New FastSemaphore implementation using atomics for high-performance connection limiting |
| internal/pool/pool.go | Replaced channel-based queue with FastSemaphore and converted hook manager to atomic pointer |
| internal/pool/hooks_test.go | Updated tests to use atomic loads for hook manager access |
| internal/pool/hooks.go | Added Clone() method to support lock-free hook manager updates |
| internal/pool/export_test.go | Updated test helper to use semaphore instead of channel queue |
| internal/pool/conn_state.go | Added predefined state slices, optimized waiter tracking with atomic counter, and fast transition method |
| internal/pool/conn.go | Added inline TryAcquire() and Release() methods with direct state machine access for hot paths |
| internal/auth/streaming/pool_hook.go | Migrated from channel-based semaphore to FastSemaphore for re-auth worker limiting |
Co-authored-by: Copilot <[email protected]>
e825db8 to da5fe33
6928e52 to 471a828
* perf(pool): replace hookManager RWMutex with atomic.Pointer and add predefined state slices
- Replace hookManager RWMutex with atomic.Pointer for lock-free reads in hot paths
- Add predefined state slices to avoid allocations (validFromInUse, validFromCreatedOrIdle, etc.)
- Add Clone() method to PoolHookManager for atomic updates
- Update AddPoolHook/RemovePoolHook to use copy-on-write pattern
- Update all hookManager access points to use atomic Load()
Performance improvements:
- Eliminates RWMutex contention in Get/Put/Remove hot paths
- Reduces allocations by reusing predefined state slices
- Lock-free reads allow better CPU cache utilization
* perf(pool): eliminate mutex overhead in state machine hot path
The state machine was calling notifyWaiters() on EVERY Get/Put operation,
which acquired a mutex even when no waiters were present (the common case).
Fix: Use atomic waiterCount to check for waiters BEFORE acquiring mutex.
This eliminates mutex contention in the hot path (Get/Put operations).
Implementation:
- Added atomic.Int32 waiterCount field to ConnStateMachine
- Increment when adding waiter, decrement when removing
- Check waiterCount atomically before acquiring mutex in notifyWaiters()
Performance impact:
- Before: mutex lock/unlock on every Get/Put (even with no waiters)
- After: lock-free atomic check, only acquire mutex if waiters exist
- Expected improvement: ~30-50% for Get/Put operations
* perf(pool): use predefined state slices to eliminate allocations in hot path
The pool was creating new slice literals on EVERY Get/Put operation:
- popIdle(): []ConnState{StateCreated, StateIdle}
- putConn(): []ConnState{StateInUse}
- CompareAndSwapUsed(): []ConnState{StateIdle} and []ConnState{StateInUse}
- MarkUnusableForHandoff(): []ConnState{StateInUse, StateIdle, StateCreated}
These allocations were happening millions of times per second in the hot path.
Fix: Use predefined global slices defined in conn_state.go:
- validFromInUse
- validFromCreatedOrIdle
- validFromCreatedInUseOrIdle
Performance impact:
- Before: 4 slice allocations per Get/Put cycle
- After: 0 allocations (use predefined slices)
- Expected improvement: ~30-40% reduction in allocations and GC pressure
* perf(pool): optimize TryTransition to reduce atomic operations
Further optimize the hot path by:
1. Remove redundant GetState() call in the loop
2. Only check waiterCount after successful CAS (not before loop)
3. Inline the waiterCount check to avoid notifyWaiters() call overhead
This reduces atomic operations from 4-5 per Get/Put to 2-3:
- Before: GetState() + CAS + waiterCount.Load() + notifyWaiters mutex check
- After: CAS + waiterCount.Load() (only if CAS succeeds)
Performance impact:
- Eliminates 1-2 atomic operations per Get/Put
- Expected improvement: ~10-15% for Get/Put operations
* perf(pool): add fast path for Get/Put to match master performance
Introduced TryTransitionFast() for the hot path (Get/Put operations):
- Single CAS operation (same as master's atomic bool)
- No waiter notification overhead
- No loop through valid states
- No error allocation
Hot path flow:
1. popIdle(): Try IDLE → IN_USE (fast), fallback to CREATED → IN_USE
2. putConn(): Try IN_USE → IDLE (fast)
This matches master's performance while preserving state machine for:
- Background operations (handoff/reauth use UNUSABLE state)
- State validation (TryTransition still available)
- Waiter notification (AwaitAndTransition for blocking)
Performance comparison per Get/Put cycle:
- Master: 2 atomic CAS operations
- State machine (before): 5 atomic operations (2.5x slower)
- State machine (after): 2 atomic CAS operations (same as master!)
Expected improvement: Restore to baseline ~11,373 ops/sec
* combine cas
* fix linter
* try faster approach
* fast semaphore
* better inlining for hot path
* fix linter issues
* use new semaphore in auth as well
* linter should be happy now
* add comments
* Update internal/pool/conn_state.go
Co-authored-by: Copilot <[email protected]>
* address comment
* slight reordering
* try to cache time for non-critical calculations
* fix wrong benchmark
* add concurrent test
* fix benchmark report
* add additional expect to check output
* comment and variable rename
---------
Co-authored-by: Copilot <[email protected]>
* wip
* wip, used and unusable states
* polish state machine
* correct handling OnPut
* better errors for tests, hook should work now
* fix linter
* improve reauth state management. fix tests
* Update internal/pool/conn.go
Co-authored-by: Copilot <[email protected]>
* Update internal/pool/conn.go
Co-authored-by: Copilot <[email protected]>
* better timeouts
* empty endpoint handoff case
* fix handoff state when queued for handoff
* try to detect the deadlock
* try to detect the deadlock x2
* delete should be called
* improve tests
* fix mark on uninitialized connection
* Update internal/pool/conn_state_test.go
Co-authored-by: Copilot <[email protected]>
* Update internal/pool/conn_state_test.go
Co-authored-by: Copilot <[email protected]>
* Update internal/pool/pool.go
Co-authored-by: Copilot <[email protected]>
* Update internal/pool/conn_state.go
Co-authored-by: Copilot <[email protected]>
* Update internal/pool/conn.go
Co-authored-by: Copilot <[email protected]>
* fix error from copilot
* address copilot comment
* fix(pool): pool performance (#3565)
* initConn sets IDLE state - Handle unexpected conn state changes
* fix precision of time cache and usedAt
* allow e2e tests to run longer
* Fix broken initialization of idle connections
* optimize push notif
* 100ms -> 50ms
* use correct timer for last health check
* verify pass auth on conn creation
* fix assertion
* fix unsafe test
* fix benchmark test
* improve remove conn
* re doesn't support requirepass
* wait more in e2e test
* flaky test
* add missed method in interface
* fix test assertions
* silence logs and faster hooks manager
* address linter comment
* fix flaky test
* use read instead of control
* use pool size for semsize
* CAS instead of reading the state
* preallocate errors and states
* preallocate state slices
* fix flaky test
* fix fast semaphore that could have been starved
* try to fix the semaphore
* should properly notify the waiters - this way a waiter that times out at the same time a releaser is releasing won't throw away its token; the releaser will fail to notify and will pick another waiter. This hybrid approach should be faster than channels and maintains FIFO
* waiter may double-release (if closed/times out)
* priority of operations
* use simple approach of fifo waiters
* use simple channel based semaphores
* address linter and tests
* remove unused benchmarks
* change log message
* address pr comments
* address pr comments
* fix data race
---------
Co-authored-by: Copilot <[email protected]>
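The later commits fall back to a plain channel-based semaphore for correctness under contention; below is a minimal sketch of that shape (type and method names are illustrative, not the actual pool API).

```go
package pool

import "context"

// chanSemaphore is a buffered channel used as a counting semaphore:
// sending takes a permit, receiving returns one. Context cancellation
// comes for free from select, and the runtime queues blocked senders.
type chanSemaphore chan struct{}

func newChanSemaphore(size int) chanSemaphore {
	return make(chanSemaphore, size)
}

// Acquire blocks until a permit is free or the context is done.
func (s chanSemaphore) Acquire(ctx context.Context) error {
	select {
	case s <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// TryAcquire takes a permit only if one is immediately available.
func (s chanSemaphore) TryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release returns a permit taken by Acquire or TryAcquire.
func (s chanSemaphore) Release() {
	<-s
}
```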
While working on #3559 I saw a degradation in pool performance. In this PR I am trying to reduce that degradation a bit: