server: return draining progress if the server controller is draining #155063
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
stevendanna
left a comment
Overall this looks good to me. But I think we really should try to write a test for this one.
538281e to 1983965 (Compare)
stevendanna
left a comment
Thanks! Glad you tracked this down.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/server/drain.go line 418 at r1 (raw file):
// If this is the first time we are draining the clients, we should delay
// the
Sentence fragment here.
pkg/server/drain_test.go line 52 at r1 (raw file):
require.NoError(tt, err)
_, err = t.tc.ServerConn(0).Exec("ALTER TENANT hello START SERVICE SHARED")
require.NoError(tt, err)
I wonder if we want to run a test connection to the secondary tenant to make sure it has started up before we call drain.
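A minimal sketch of what such a readiness check could look like, assuming the test already has a tenantConn (*gosql.DB) routed to the secondary tenant as in the snippet quoted later in this thread; SucceedsSoon is the usual pkg/testutils retry helper, and everything else here is hypothetical rather than the PR's actual code:

```go
// Hypothetical sketch (not from the PR): after starting the shared service,
// wait until a connection routed to the secondary tenant answers a ping, so
// the tenant server is known to be up before the drain request is issued.
testutils.SucceedsSoon(tt, func() error {
	// Ping fails while the tenant's SQL server is still starting up.
	return tenantConn.Ping()
})
```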
pkg/server/drain_test.go line 85 at r1 (raw file):
}
// Repeat drain commands until we verify that there are zero remaining leases
// (i.e. complete). Also validate that the server did not sleep again.
I wonder if this test would have failed without the fix? It is looking at the drain response itself but not really checking that the node actually had its leases removed. I leave that to your own judgement.
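For context, the repeat-until-complete pattern being discussed looks roughly like the sketch below. sendDrainNoShutdown is a hypothetical stand-in for however the test harness issues a drain request without shutting the node down; DrainRemainingIndicator comes from the snippets quoted later in this thread, and the rest is an assumption:

```go
// Hypothetical sketch: keep issuing drain requests (without shutdown) until
// the server reports nothing left to drain, failing the test if the remaining
// work never reaches zero within the SucceedsSoon window.
testutils.SucceedsSoon(tt, func() error {
	resp := t.sendDrainNoShutdown()
	if resp.DrainRemainingIndicator > 0 {
		return errors.Newf("still %d remaining", resp.DrainRemainingIndicator)
	}
	return nil
})
```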
iskettaneh
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)
pkg/server/drain.go line 418 at r1 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Sentence fragment here.
Done.
pkg/server/drain_test.go line 52 at r1 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I wonder if we want to run a test connection to the secondary tenant to make sure it has started up before we call drain.
Done.
pkg/server/drain_test.go line 85 at r1 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I wonder if this test would have failed without the fix? It is looking at the drain response itself but not really checking that the node actually had its leases removed. I leave that to your own judgement.
It actually fails on this check:
t.assertDraining(resp, true)
Because draining doesn't get set in the response, which is something the CLI doesn't expect in the normal case.
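For readers following along, the shape of the fix is roughly the sketch below: while the server controller still has tenant servers draining, report that progress instead of returning a response the CLI reads as complete. The field names (IsDraining, DrainRemainingIndicator) and the "tenant servers: N" wording come from this thread and the CLI output in the commit message; remainingTenantServers and the description field are placeholders, not the actual diff:

```go
// Hypothetical sketch (not the actual diff): if the server controller still
// has tenant servers draining, surface that progress in the drain response
// instead of returning an empty response the CLI reads as "drain complete".
if remainingTenantServers > 0 {
	resp.IsDraining = true
	resp.DrainRemainingIndicator += uint64(remainingTenantServers)
	resp.DrainRemainingDescription = fmt.Sprintf("tenant servers: %d", remainingTenantServers)
	return resp, nil
}
```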
iskettaneh
left a comment
Thank you for reviewing this!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)
iskettaneh
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)
pkg/server/drain_test.go line 113 at r2 (raw file):
if secondaryTenants {
  // At this point we expect the secondary tenant to be disconnected.
  require.EqualError(t, tenantConn.Ping(), "driver: bad connection")
I am not sure why this failed in CI; I stressed it locally and it passed. Will investigate tomorrow.
iskettaneh
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)
pkg/server/drain_test.go line 104 at r3 (raw file):
}
if resp.DrainRemainingIndicator > 0 {
  return errors.Newf("still %d remaining, desc: %s", resp.DrainRemainingIndicator,
Another issue I found in CI stressrace is that the test sometimes fails with:
drain_test.go:98: condition failed to evaluate within 3m45s: from drain_test.go:104: still 2 remaining, desc: range lease iterations: 2
I wonder what is causing these 2 ranges to get stuck
iskettaneh
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)
pkg/server/drain_test.go line 104 at r3 (raw file):
Previously, iskettaneh wrote…
Another issue I found in CI stressrace is that the test sometimes fails with:
drain_test.go:98: condition failed to evaluate within 3m45s: from drain_test.go:104: still 2 remaining, desc: range lease iterations: 2
I wonder what is causing these 2 ranges to get stuck
As a debugging measure, I added a few log lines and increased the timeout. The hope is to catch this in CI and understand the problem better.
b65a126 to 6dadc3d (Compare)
iskettaneh
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)
pkg/server/drain_test.go line 104 at r3 (raw file):
Previously, iskettaneh wrote…
As a debugging measure, I added a few log lines and increased the timeout. The hope is to catch this in CI and understand the problem better.
I didn't find a clear issue. It just seems that under stress race, the system is under too much load: it takes a long time for the tenant servers to drain, and then draining the leases times out.
This seems like it would be a noticeable bug -- should this change get a release note? (Especially if we plan to backport to all releases.) Also, I'm wondering about the impact of the bug. Does it affect UA clusters?
Yes, I believe it likely does.
Yes, we should definitely have a release note.
stevendanna
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)
pkg/server/drain_test.go line 85 at r1 (raw file):
Previously, iskettaneh wrote…
It actually fails on this check:
t.assertDraining(resp, true)
Because draining doesn't get set in the response, which is something the cli doesn't expect in the normal case.
Ah, I see. I think that is OK for this PR.
I think my overall concern is that we didn't have a test that made sure the actual thing drain is trying to do (remove leases, gracefully end sql connections) actually happened.
-- commits line 6 at r5:
missing a word:
"without populating the IsDraining" -> "without populating the IsDraining field"
if you fancy:
"The CLI interprets this as if the draining is completed." -> "The CLI interprets this to mean draining is complete"
-- commits line 8 at r5:
Looks like something might have happened during a rebase/squash, as this is duplicated.
-- commits line 36 at r5:
We could spell out the consequence a bit more:
Fixed a bug in the drain command where draining a node using virtual clusters (such as clusters running Physical Cluster Replication) could return before the drain was complete, possibly resulting in shutting down the node while it still had active SQL clients and range leases.
pkg/server/drain_test.go line 47 at r5 (raw file):
func doTestDrain(tt *testing.T, secondaryTenants bool) {
  if secondaryTenants {
    // Draining the tenant server takes a long time under stress.
Given the unfortunate state of the server controller, I think this is probably reasonable for now. We could open an issue to track it for the db-server-team.
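For reference, the guard being discussed could plausibly take the usual form below. skip.UnderStress is the standard helper in pkg/testutils/skip, but whether this is exactly what the PR does is an assumption:

```go
// Hypothetical sketch: skip the secondary-tenant variant of the test under
// stress, where draining the tenant server is too slow to finish in time.
func doTestDrain(tt *testing.T, secondaryTenants bool) {
	if secondaryTenants {
		// Draining the tenant server takes a long time under stress.
		skip.UnderStress(tt)
	}
	// ... rest of the drain test ...
}
```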
This commit fixes a bug where if the server controller is being drained,
the drain command would return a response without populating the
IsDraining field. The CLI interprets this as if the draining is
completed.
Output example when running:
```
./cockroach node drain 3 --insecure --url {pgurl:3} --logtostderr=INFO
I251008 15:54:42.613200 15 2@rpc/peer.go:613 [rnode=?,raddr=10.142.1.228:26257,class=system,rpc] 1 connection is now healthy
node is draining... remaining: 3
I251008 15:54:42.622586 1 cli/rpc_node_shutdown.go:184 [-] 2 drain details: tenant servers: 3
node is draining... remaining: 3
I251008 15:54:42.824526 1 cli/rpc_node_shutdown.go:184 [-] 3 drain details: tenant servers: 3
node is draining... remaining: 1
I251008 15:54:43.026405 1 cli/rpc_node_shutdown.go:184 [-] 4 drain details: tenant servers: 1
node is draining... remaining: 1
I251008 15:54:43.228596 1 cli/rpc_node_shutdown.go:184 [-] 5 drain details: tenant servers: 1
node is draining... remaining: 243
I251008 15:54:44.580413 1 cli/rpc_node_shutdown.go:184 [-] 6 drain details: liveness record: 1, range lease iterations: 175, descriptor leases: 67
node is draining... remaining: 0 (complete)
drain ok
```
Release note (bug fix): fixed a bug in the drain command where draining
a node using virtual clusters (such as clusters running Physical
Cluster Replication) could return before the drain was complete,
possibly resulting in shutting down the node while it still had active
SQL clients and range leases.
Epic: None
iskettaneh
left a comment
Done
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)
Previously, stevendanna (Steven Danna) wrote…
missing a word:
"without populating the IsDraining" -> "without populating the IsDraining field"
if you fancy:
"The CLI interprets this as if the draining is completed." -> "The CLI interprets this to mean draining is complete"
Done.
Previously, stevendanna (Steven Danna) wrote…
Looks like something might have happened during a rebase/squash, as this is duplicated.
Done.
Previously, stevendanna (Steven Danna) wrote…
We could spell out the consequence a bit more:
Fixed a bug in the drain command where draining a node using virtual clusters (such as clusters running Physical Cluster Replication) could return before the drain was complete, possibly resulting in shutting down the node while it still had active SQL clients and range leases.
Thank you!
pkg/server/drain_test.go line 85 at r1 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Ah, I see. I think that is OK for this PR.
I think my overall concern is that we didn't have a test that made sure the actual thing drain is trying to do (remove leases, gracefully end sql connections) actually happened.
I agree with you
pkg/server/drain_test.go line 47 at r5 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Given the unfortunate state of the server controller, I think this is probably reasonable for now. We could open an issue to track it for the db-server-team.
Done!
#155229
TFTR! bors r+
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

- Backport to branch release-24.1 failed: error creating merge commit from 46f5bbe to blathers/backport-release-24.1-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []. You may need to manually resolve merge conflicts with the backport tool.
- Backport to branch release-24.3 failed: error creating merge commit from 46f5bbe to blathers/backport-release-24.3-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []. You may need to manually resolve merge conflicts with the backport tool.
- Backport to branch release-25.2 failed: error creating merge commit from 46f5bbe to blathers/backport-release-25.2-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []. You may need to manually resolve merge conflicts with the backport tool.
- Backport to branch release-25.3 failed: error creating merge commit from 46f5bbe to blathers/backport-release-25.3-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []. You may need to manually resolve merge conflicts with the backport tool.
- Backport to branch release-25.4 failed: error creating merge commit from 46f5bbe to blathers/backport-release-25.4-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []. You may need to manually resolve merge conflicts with the backport tool.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.