CBG-4926 deflake topologytests #7824
- Do not use testing.Context() for topologytests, since it gets cancelled before t.Cleanup() runs.
- Switch BlipTesterCollectionClient._seqCond to channels, to avoid a lock and a potential hang between the ctx.Err() != nil check and _seqCond.Wait().
Pull Request Overview
This PR addresses test flakiness in topology tests by fixing context cancellation issues and replacing sync.Cond with a channel-based notification system to prevent race conditions in push replication. The primary issues were that testing.Context() was being cancelled before t.Cleanup() runs, causing test failures, and that a hang was possible between the context cancellation check and the condition variable wait.
Key changes:
- Replaced testing.Context() with context.WithoutCancel() to prevent premature cancellation
- Refactored BlipTesterCollectionClient to use channels instead of sync.Cond for sequence notifications
- Added cancel causes for better debugging of context cancellations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
topologytest/couchbase_lite_mock_peer_test.go | Uses context.WithoutCancel() to prevent test context cancellation issues |
rest/utilities_testing_blip_client.go | Major refactor replacing sync.Cond with channel-based notifications and adding cancel causes |
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
// reawaken any waiting push loops to check for cancellation,
// This is a workaround for a race between ctx.Err() != nil check and btcc._seqCond.Wait()
btcc._seqCond.Broadcast()
}, 20*time.Second, 10*time.Millisecond)
Copilot AI, Oct 14, 2025
The Broadcast() call is executed on every polling iteration (every 10ms for up to 20 seconds). This could lead to unnecessary overhead. Consider moving the Broadcast() call outside the polling loop or adding a mechanism to only call it when needed.
Suggested change:
-// reawaken any waiting push loops to check for cancellation,
-// This is a workaround for a race between ctx.Err() != nil check and btcc._seqCond.Wait()
-btcc._seqCond.Broadcast()
-}, 20*time.Second, 10*time.Millisecond)
+// No need to broadcast on every iteration; a single broadcast before and after the loop suffices.
+}, 20*time.Second, 10*time.Millisecond)
+// Final broadcast to ensure all waiters are woken up after the loop
+btcc._seqCond.Broadcast()
98% of the test failures were from testing.Context() in couchbase_lite_mock_peer_test.go, but running the tests a large number of times showed that this (old) code could also hang.
Pre-review checklist
- (fmt.Print, log.Print, ...)
- (base.UD(docID), base.MD(dbName))
- docs/api
Integration Tests
GSI=true,xattrs=true
https://jenkins.sgwdev.com/job/SyncGatewayIntegration/139/ (known failure)