
@iskettaneh
Contributor

@iskettaneh iskettaneh commented Oct 8, 2025

This commit fixes a bug where if the server controller is being drained, the drain command would return a response without populating the IsDraining field. The CLI interprets this as if the draining is completed.

server: return draining progress if the server controller is draining

This commit fixes a bug where if the server controller is being drained, the drain command would return a response without populating the IsDraining field. The CLI interprets this as if the draining is completed.
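For illustration, a minimal, self-contained sketch of the behavior the fix restores (the field names mirror serverpb.DrainResponse, but the stand-in type and helper below are hypothetical, not the actual drain.go code):

```go
package main

import "fmt"

// drainResponse is a stand-in for serverpb.DrainResponse; only the fields
// relevant to this bug are mirrored here.
type drainResponse struct {
	IsDraining                bool
	DrainRemainingIndicator   uint64
	DrainRemainingDescription string
}

// reportTenantDrainProgress sketches the intended behavior: while the server
// controller still has tenant servers draining, the response must report
// progress; otherwise the CLI stops polling and assumes the drain finished.
func reportTenantDrainProgress(remainingTenantServers int) drainResponse {
	return drainResponse{
		IsDraining:                true, // previously left unset on this code path
		DrainRemainingIndicator:   uint64(remainingTenantServers),
		DrainRemainingDescription: fmt.Sprintf("tenant servers: %d", remainingTenantServers),
	}
}

func main() {
	resp := reportTenantDrainProgress(3)
	fmt.Printf("draining=%v remaining=%d (%s)\n",
		resp.IsDraining, resp.DrainRemainingIndicator, resp.DrainRemainingDescription)
}
```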

Output example when running:

```
./cockroach node drain 3 --insecure --url {pgurl:3} --logtostderr=INFO

I251008 15:54:42.613200 15 2@rpc/peer.go:613  [rnode=?,raddr=10.142.1.228:26257,class=system,rpc] 1  connection is now healthy
node is draining... remaining: 3
I251008 15:54:42.622586 1 cli/rpc_node_shutdown.go:184  [-] 2  drain details: tenant servers: 3
node is draining... remaining: 3
I251008 15:54:42.824526 1 cli/rpc_node_shutdown.go:184  [-] 3  drain details: tenant servers: 3
node is draining... remaining: 1
I251008 15:54:43.026405 1 cli/rpc_node_shutdown.go:184  [-] 4  drain details: tenant servers: 1
node is draining... remaining: 1
I251008 15:54:43.228596 1 cli/rpc_node_shutdown.go:184  [-] 5  drain details: tenant servers: 1
node is draining... remaining: 243
I251008 15:54:44.580413 1 cli/rpc_node_shutdown.go:184  [-] 6  drain details: liveness record: 1, range lease iterations: 175, descriptor leases: 67
node is draining... remaining: 0 (complete)
drain ok
```

Release note (bug fix): fixed a bug in the drain command where draining a node using virtual clusters (such as clusters running Physical Cluster Replication) could return before the drain was complete, possibly resulting in shutting down the node while it still had active SQL clients and range leases.

Epic: None

@blathers-crl

blathers-crl bot commented Oct 8, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@stevendanna stevendanna added the backport-all label (Flags PRs that need to be backported to all supported release branches) Oct 8, 2025
@iskettaneh iskettaneh requested a review from stevendanna October 8, 2025 17:08
Collaborator

@stevendanna stevendanna left a comment


Overall this looks good to me. But I think we really should try to write a test for this one.

@iskettaneh iskettaneh force-pushed the drain branch 2 times, most recently from 538281e to 1983965 Compare October 8, 2025 20:36
@iskettaneh iskettaneh marked this pull request as ready for review October 8, 2025 20:40
@iskettaneh iskettaneh requested review from a team as code owners October 8, 2025 20:40
Collaborator

@stevendanna stevendanna left a comment


Thanks! Glad you tracked this down.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/drain.go line 418 at r1 (raw file):

	// If this is the first time we are draining the clients, we should delay
	// the

Sentence fragment here.


pkg/server/drain_test.go line 52 at r1 (raw file):

		require.NoError(tt, err)
		_, err = t.tc.ServerConn(0).Exec("ALTER TENANT hello START SERVICE SHARED")
		require.NoError(tt, err)

I wonder if we want to run a test connection to the secondary tenant to make sure it has started up before we call drain.
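For illustration, the kind of readiness check being suggested (a sketch only; tenantConn is assumed to be a *sql.DB already opened against the shared-process tenant, and testutils.SucceedsSoon retries the closure until it stops returning an error):

```go
// Wait until the secondary tenant's SQL server accepts connections before
// issuing the drain request, so the test does not race with tenant startup.
testutils.SucceedsSoon(t, func() error {
	return tenantConn.Ping()
})
```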


pkg/server/drain_test.go line 85 at r1 (raw file):

	}
	// Repeat drain commands until we verify that there are zero remaining leases
	// (i.e. complete). Also validate that the server did not sleep again.

I wonder if this test would have failed without the fix? It is looking at the drain response itself but not really checking that the node actually had its leases removed. I leave that to your own judgement.
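For reference, the repeat-until-complete check described in that comment has roughly this shape (a sketch; drainNode is a hypothetical stand-in for whatever issues a single Drain request in the test):

```go
// Keep issuing Drain requests until the server reports nothing left to drain.
// SucceedsSoon retries the closure, with a deadline, until it returns nil.
testutils.SucceedsSoon(t, func() error {
	resp := drainNode(t) // hypothetical helper: one Drain request, returns *serverpb.DrainResponse
	if !resp.IsDraining {
		return errors.New("expected IsDraining to be set while draining")
	}
	if resp.DrainRemainingIndicator > 0 {
		return errors.Newf("still %d remaining, desc: %s",
			resp.DrainRemainingIndicator, resp.DrainRemainingDescription)
	}
	return nil // zero remaining: drain is complete
})
```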

Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain.go line 418 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Sentence fragment here.

Done.


pkg/server/drain_test.go line 52 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

I wonder if we want to run a test connection to the secondary tenant to make sure it has started up before we call drain.

Done.


pkg/server/drain_test.go line 85 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

I wonder if this test would have failed without the fix? It is looking at the drain response itself but not really checking that the node actually had its leases removed. I leave that to your own judgement.

It actually fails on this check:

t.assertDraining(resp, true)

because draining doesn't get set in the response, which is something the CLI doesn't expect in the normal case.
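For reference, that assertion boils down to roughly the following (sketched here as a free function; in the test it is a method on the drain test's wrapper type):

```go
// assertDraining sketches the helper: it checks that the Drain RPC's response
// reports the node as draining. Without the fix, the server controller path
// left IsDraining unset (false), so this assertion tripped.
func assertDraining(t *testing.T, resp *serverpb.DrainResponse, expected bool) {
	t.Helper()
	require.Equal(t, expected, resp.IsDraining)
}
```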

Contributor Author

@iskettaneh iskettaneh left a comment


Thank you for reviewing this!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)

@iskettaneh iskettaneh requested a review from stevendanna October 8, 2025 23:08
Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain_test.go line 113 at r2 (raw file):

	if secondaryTenants {
		// At this point we expect the secondary tenant to be disconnected.
		require.EqualError(t, tenantConn.Ping(), "driver: bad connection")

I am not sure why this failed in CI; I stressed it locally and it passed. I will investigate tomorrow.

Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain_test.go line 104 at r3 (raw file):

		}
		if resp.DrainRemainingIndicator > 0 {
			return errors.Newf("still %d remaining, desc: %s", resp.DrainRemainingIndicator,

Another issue I found in CI stressrace is that the test sometimes fails with:

          drain_test.go:98: condition failed to evaluate within 3m45s: from drain_test.go:104: still 2 remaining, desc: range lease iterations: 2

I wonder what is causing these 2 ranges to get stuck

Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain_test.go line 104 at r3 (raw file):

Previously, iskettaneh wrote…

Another issue I found in CI stressrace is that the test sometimes fails with:

          drain_test.go:98: condition failed to evaluate within 3m45s: from drain_test.go:104: still 2 remaining, desc: range lease iterations: 2

I wonder what is causing these 2 ranges to get stuck

As a debugging measure, I added a few log lines and increased the timeout. The hope is to catch this in CI and understand the problem better.

@iskettaneh iskettaneh force-pushed the drain branch 3 times, most recently from b65a126 to 6dadc3d Compare October 9, 2025 13:25
Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain_test.go line 104 at r3 (raw file):

Previously, iskettaneh wrote…

As a debugging measure, I added a few log lines and increased the timeout. The hope is to catch this in CI and understand the problem better.

I didn't find a clear issue. It just seemed that under stressrace the system is under too much load; it takes a long time for the tenant servers to drain, and then draining the leases times out.

@rafiss
Collaborator

rafiss commented Oct 9, 2025

This seems like it would be a noticeable bug -- should this change get a release note? (Especially if we plan to backport to all releases.)

Also, I'm wondering about the impact of the bug. Does it affect UA clusters?

@stevendanna
Collaborator

Does it affect UA clusters?

Yes, I believe it likely does.

@stevendanna
Collaborator

This seems like it would be a noticeable bug -- should this change get a release note? (Especially if we plan to backport to all releases.)

Yes, we should definitely have a release note.

Collaborator

@stevendanna stevendanna left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)


pkg/server/drain_test.go line 85 at r1 (raw file):

Previously, iskettaneh wrote…

It actually fails on this check:

t.assertDraining(resp, true)

because draining doesn't get set in the response, which is something the CLI doesn't expect in the normal case.

Ah, I see. I think that is OK for this PR.

I think my overall concern is that we didn't have a test that made sure the actual thing drain is trying to do (remove leases, gracefully end SQL connections) actually happened.


-- commits line 6 at r5:
missing a word:

"without populating the IsDraining" -> "without populating the IsDraining field"

if you fancy:

"The CLI interprets this as if the draining is completed." -> "The CLI interprets this to mean draining is complete"


-- commits line 8 at r5:
Looks like something might have happened during a rebase/squash as this is duplicated.


-- commits line 36 at r5:
We could spell out the consequence a bit more:

Fixed a bug in the drain command where draining a node using virtual clusters (such as clusters running Physical Cluster Replication) could return before the drain was complete, possibly resulting in shutting down the node while it still had active SQL clients and range leases.


pkg/server/drain_test.go line 47 at r5 (raw file):

func doTestDrain(tt *testing.T, secondaryTenants bool) {
	if secondaryTenants {
		// Draining the tenant server takes a long time under stress.

Given the unfortunate state of the server controller, I think this is probably reasonable for now. We could open an issue to track it for the db-server-team.

This commit fixes a bug where if the server controller is being drained,
the drain command would return a response without populating the
IsDraining field. The CLI interprets this as if the draining is
completed.

Output example when running:

```
./cockroach node drain 3 --insecure --url {pgurl:3} --logtostderr=INFO

I251008 15:54:42.613200 15 2@rpc/peer.go:613  [rnode=?,raddr=10.142.1.228:26257,class=system,rpc] 1  connection is now healthy
node is draining... remaining: 3
I251008 15:54:42.622586 1 cli/rpc_node_shutdown.go:184  [-] 2  drain details: tenant servers: 3
node is draining... remaining: 3
I251008 15:54:42.824526 1 cli/rpc_node_shutdown.go:184  [-] 3  drain details: tenant servers: 3
node is draining... remaining: 1
I251008 15:54:43.026405 1 cli/rpc_node_shutdown.go:184  [-] 4  drain details: tenant servers: 1
node is draining... remaining: 1
I251008 15:54:43.228596 1 cli/rpc_node_shutdown.go:184  [-] 5  drain details: tenant servers: 1
node is draining... remaining: 243
I251008 15:54:44.580413 1 cli/rpc_node_shutdown.go:184  [-] 6  drain details: liveness record: 1, range lease iterations: 175, descriptor leases: 67
node is draining... remaining: 0 (complete)
drain ok
```

Release note (bug fix): fixed a bug in the drain command where draining
a node using virtual clusters (such as clusters running Physical
Cluster Replication) could return before the drain was complete,
possibly resulting in shutting down the node while it still had active
SQL clients and range leases.

Epic: None
Contributor Author

@iskettaneh iskettaneh left a comment


Done

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


-- commits line 6 at r5:

Previously, stevendanna (Steven Danna) wrote…

missing a word:

"without populating the IsDraining" -> "without populating the IsDraining field"

if you fancy:

"The CLI interprets this as if the draining is completed." -> "The CLI interprets this to mean draining is complete"

Done.


-- commits line 8 at r5:

Previously, stevendanna (Steven Danna) wrote…

Looks like something might have happened during a rebase/squash as this is duplicated.

Done.


-- commits line 36 at r5:

Previously, stevendanna (Steven Danna) wrote…

We could spell out the consequence a bit more:

Fixed a bug in the drain command where draining a node using virtual clusters (such as clusters running Physical Cluster Replication) could return before the drain was complete, possibly resulting in shutting down the node while it still had active SQL clients and range leases.

Thank you!


pkg/server/drain_test.go line 85 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Ah, I see. I think that is OK for this PR.

I think my overall concern is that we didn't have a test that made sure the actual thing drain is trying to do (remove leases, gracefully end SQL connections) actually happened.

I agree with you


pkg/server/drain_test.go line 47 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Given the unfortunate state of the server controller, I think this is probably reasonable for now. We could open an issue to track it for the db-server-team.

Done!
#155229

@iskettaneh
Contributor Author

TFTR!

bors r+

@craig
Contributor

craig bot commented Oct 10, 2025

@craig craig bot merged commit d1ebc69 into cockroachdb:master Oct 10, 2025
24 of 25 checks passed
@blathers-crl

blathers-crl bot commented Oct 10, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-24.1-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-24.1 failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-24.3-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-24.3 failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-25.2-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-25.2 failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-25.3-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-25.3 failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-25.4-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-25.4 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
