Skip to content

Conversation

@pranshu-raj-211
Copy link
Contributor

Refer to PR #101 (got closed while renaming branch)

What this PR does

Adds exponential backoff and jitter to replace fixed delays in system_database.go

How it is implemented

Replaced delay interval constant by constants

    _DB_CONNECTION_RETRY_BASE_DELAY = 500 * time.Millisecond
    _DB_CONNECTION_RETRY_FACTOR = 2
    _DB_CONNECTION_RETRY_MAX_RETRIES = 3
    _DB_CONNECTION_MAX_DELAY = 10 * time.Second

Func backoffWithJitter takes an argument attempt - attempts of backoff delay done yet.
Calculates the delay as

exp = baseDelay * math.Pow(retryFactor, attempts)
jitter = 0.75 + rand.Float64()*0.5  // +-25% of exp to be jitter
delay = exp * jitter

Capping provided against delay intervals getting too large.

If number of retries exceeds a certain limit - fallback to fixed base delay (500ms)

Jitter added separately even for fixed intervals.

Constants may need to be changed.

Tests

I do have a basic table driven test for this, unsure of which file to add it in.

not returning errors anymore, if retries exceed max retries then return base fixed delay

Signed-off-by: pranshu-raj-211 <[email protected]>
use percentage of exp rather than small delay

Signed-off-by: pranshu-raj-211 <[email protected]>
Comment on lines 139 to 143
_DB_CONNECTION_RETRY_BASE_DELAY = 500 * time.Millisecond
_DB_CONNECTION_RETRY_FACTOR = 2
_DB_CONNECTION_RETRY_MAX_RETRIES = 3
_DB_CONNECTION_MAX_DELAY = 10000 * time.Millisecond
_DB_RETRY_INTERVAL = 1 * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can be a bit more lenient on these:
Max delay: 2 minutes
Max retries: 10
Also the base retry delay should be = to the base interval

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the base interval do you mean the _DB_RETRY_INTERVAL?
The issue #95 showed the variable that was being used for this, which in the code was _DB_CONNECTION_RETRY_DELAY = 500 * time.Millisecond, which I took as default.

Changed the other values, updating tests to match the same.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make _DB_CONNECTION_RETRY_BASE_DELAY equal 1 second. Thanks!

pranshu-raj-211 and others added 2 commits September 4, 2025 15:54
make base delay constant 1 second,
decrement retryAttempt in notificationListenerLoop,
hardcode sleep duration in shutdown,
slight change in logic - cap delays to max of 120 seconds, but allow for jitter in that too,
add tests according to calculations

Signed-off-by: pranshu-raj-211 <[email protected]>
@pranshu-raj-211
Copy link
Contributor Author

Changes done:

  1. Added tests
  2. Make _DB_CONNECTION_RETRY_BASE_DELAY 1 second
  3. Change _DB_CONNECTION_RETRY_MAX_RETRIES = 10, _DB_CONNECTION_MAX_DELAY = 120 * time.Second
  4. Slight change to logic - instead of directly returning the max delay (120 seconds), add jitter as normal - which can take maximum to 150 seconds.
  5. If max attempts have been done, earlier it used to return the base delay (1 second), changed to the max delay instead.

Changes that can be done:

  1. Remove the _DB_CONNECTION_RETRY_BASE_DELAY constant, replace with existing _DB_RETRY_INTERVAL.
  2. Revert logic change if that behavior is not intended.

@pranshu-raj-211
Copy link
Contributor Author

Also, curious about the security action failing, what does it do, gosec job failed at the jitter calculation in my code, just wanted to know.

@pranshu-raj-211
Copy link
Contributor Author

Ok, the security run fails as gosec wants the use of crypto/rand to replace math/rand. Should I make the changes or is the current version fine(small change)?

@maxdml
Copy link
Collaborator

maxdml commented Sep 4, 2025

Ok, the security run fails as gosec wants the use of crypto/rand to replace math/rand. Should I make the changes or is the current version fine(small change)?

You can disable the warning -- we're not using randomness in a way that needs security here. It can be done with a comment on the line of the warning. See how we do it elsewhere.

@pranshu-raj-211
Copy link
Contributor Author

Added the comment to ignore G404 - rand.
Also fixed issue causing test failure - used time.Second instead of time.Millisecond in the shutdown function, which caused a timeout.

@pranshu-raj-211
Copy link
Contributor Author

Resolved merge conflicts (resulting from #106 push to main).

Tested security and test actions locally (act, using medium size image config).

@pranshu-raj-211
Copy link
Contributor Author

Also, I was curious, how does this relate to #61

@maxdml
Copy link
Collaborator

maxdml commented Sep 5, 2025

Also, I was curious, how does this relate to #61

its a duplicate.

Thanks for your contribution!

@maxdml maxdml merged commit 165b569 into dbos-inc:main Sep 5, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants