
@iskettaneh
Contributor

@iskettaneh iskettaneh commented Oct 8, 2025

This commit fixes a bug where if the server controller is being drained, the drain command would return a response without populating the IsDraining field. The CLI interprets this as if the draining is completed.

server: return draining progress if the server controller is draining

This commit fixes a bug where if the server controller is being drained, the drain command would return a response without populating the IsDraining field. The CLI interprets this as if the draining is completed.
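For illustration, a minimal, self-contained sketch of the behavior the fix restores (the field names mirror serverpb.DrainResponse, but the stand-in type and helper below are hypothetical, not the actual drain.go code):

```go
package main

import "fmt"

// drainResponse is a stand-in for serverpb.DrainResponse; only the fields
// relevant to this bug are mirrored here.
type drainResponse struct {
	IsDraining                bool
	DrainRemainingIndicator   uint64
	DrainRemainingDescription string
}

// reportTenantDrainProgress sketches the intended behavior: while the server
// controller still has tenant servers draining, the response must report
// progress; otherwise the CLI stops polling and assumes the drain finished.
func reportTenantDrainProgress(remainingTenantServers int) drainResponse {
	return drainResponse{
		IsDraining:                true, // previously left unset on this code path
		DrainRemainingIndicator:   uint64(remainingTenantServers),
		DrainRemainingDescription: fmt.Sprintf("tenant servers: %d", remainingTenantServers),
	}
}

func main() {
	resp := reportTenantDrainProgress(3)
	fmt.Printf("draining=%v remaining=%d (%s)\n",
		resp.IsDraining, resp.DrainRemainingIndicator, resp.DrainRemainingDescription)
}
```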

Output example when running:

```
./cockroach node drain 3 --insecure --url {pgurl:3} --logtostderr=INFO

I251008 15:54:42.613200 15 2@rpc/peer.go:613  [rnode=?,raddr=10.142.1.228:26257,class=system,rpc] 1  connection is now healthy
node is draining... remaining: 3
I251008 15:54:42.622586 1 cli/rpc_node_shutdown.go:184  [-] 2  drain details: tenant servers: 3
node is draining... remaining: 3
I251008 15:54:42.824526 1 cli/rpc_node_shutdown.go:184  [-] 3  drain details: tenant servers: 3
node is draining... remaining: 1
I251008 15:54:43.026405 1 cli/rpc_node_shutdown.go:184  [-] 4  drain details: tenant servers: 1
node is draining... remaining: 1
I251008 15:54:43.228596 1 cli/rpc_node_shutdown.go:184  [-] 5  drain details: tenant servers: 1
node is draining... remaining: 243
I251008 15:54:44.580413 1 cli/rpc_node_shutdown.go:184  [-] 6  drain details: liveness record: 1, range lease iterations: 175, descriptor leases: 67
node is draining... remaining: 0 (complete)
drain ok
```

Release note (bug fix): fixed a bug in the drain command where draining a node using virtual clusters (such as clusters running Physical Cluster Replication) could return before the drain was complete, possibly resulting in shutting down the node while it still had active SQL clients and range leases.

Epic: None

@blathers-crl

blathers-crl bot commented Oct 8, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@stevendanna stevendanna added the backport-all label (Flags PRs that need to be backported to all supported release branches) Oct 8, 2025
@iskettaneh iskettaneh requested a review from stevendanna October 8, 2025 17:08
Collaborator

@stevendanna stevendanna left a comment


Overall this looks good to me. But I think we really should try to write a test for this one.

@iskettaneh iskettaneh force-pushed the drain branch 2 times, most recently from 538281e to 1983965 Compare October 8, 2025 20:36
@iskettaneh iskettaneh marked this pull request as ready for review October 8, 2025 20:40
@iskettaneh iskettaneh requested review from a team as code owners October 8, 2025 20:40
Collaborator

@stevendanna stevendanna left a comment


Thanks! Glad you tracked this down.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/drain.go line 418 at r1 (raw file):

	// If this is the first time we are draining the clients, we should delay
	// the

Sentence fragment here.


pkg/server/drain_test.go line 52 at r1 (raw file):

		require.NoError(tt, err)
		_, err = t.tc.ServerConn(0).Exec("ALTER TENANT hello START SERVICE SHARED")
		require.NoError(tt, err)

I wonder if we want to run a test connection to the secondary tenant to make sure it has started up before we call drain.
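For illustration, the kind of readiness check being suggested (a sketch only; tenantConn is assumed to be a *sql.DB already opened against the shared-process tenant, and testutils.SucceedsSoon retries the closure until it stops returning an error):

```go
// Wait until the secondary tenant's SQL server accepts connections before
// issuing the drain request, so the test does not race with tenant startup.
testutils.SucceedsSoon(t, func() error {
	return tenantConn.Ping()
})
```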


pkg/server/drain_test.go line 85 at r1 (raw file):

	}
	// Repeat drain commands until we verify that there are zero remaining leases
	// (i.e. complete). Also validate that the server did not sleep again.

I wonder if this test would have failed without the fix? It is looking at the drain response itself but not really checking that the node actually had its leases removed. I leave that to your own judgement.
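For reference, the repeat-until-complete check described in that comment has roughly this shape (a sketch; drainNode is a hypothetical stand-in for whatever issues a single Drain request in the test):

```go
// Keep issuing Drain requests until the server reports nothing left to drain.
// SucceedsSoon retries the closure, with a deadline, until it returns nil.
testutils.SucceedsSoon(t, func() error {
	resp := drainNode(t) // hypothetical helper: one Drain request, returns *serverpb.DrainResponse
	if !resp.IsDraining {
		return errors.New("expected IsDraining to be set while draining")
	}
	if resp.DrainRemainingIndicator > 0 {
		return errors.Newf("still %d remaining, desc: %s",
			resp.DrainRemainingIndicator, resp.DrainRemainingDescription)
	}
	return nil // zero remaining: drain is complete
})
```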

Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain.go line 418 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Sentence fragment here.

Done.


pkg/server/drain_test.go line 52 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

I wonder if we want to run a test connection to the secondary tenant to make sure it has started up before we call drain.

Done.


pkg/server/drain_test.go line 85 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

I wonder if this test would have failed without the fix? It is looking at the drain response itself but not really checking that the node actually had its leases removed. I leave that to your own judgement.

It actually fails on this check:

t.assertDraining(resp, true)

because draining doesn't get set in the response, which is something the CLI doesn't expect in the normal case.
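For reference, that assertion boils down to roughly the following (sketched here as a free function; in the test it is a method on the drain test's wrapper type):

```go
// assertDraining sketches the helper: it checks that the Drain RPC's response
// reports the node as draining. Without the fix, the server controller path
// left IsDraining unset (false), so this assertion tripped.
func assertDraining(t *testing.T, resp *serverpb.DrainResponse, expected bool) {
	t.Helper()
	require.Equal(t, expected, resp.IsDraining)
}
```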

Contributor Author

@iskettaneh iskettaneh left a comment


Thank you for reviewing this!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)

@iskettaneh iskettaneh requested a review from stevendanna October 8, 2025 23:08
Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain_test.go line 113 at r2 (raw file):

	if secondaryTenants {
		// At this point we expect the secondary tenant to be disconnected.
		require.EqualError(t, tenantConn.Ping(), "driver: bad connection")

I am not sure why this failed in CI; I stressed it locally and it passed. I will investigate tomorrow.

Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain_test.go line 104 at r3 (raw file):

		}
		if resp.DrainRemainingIndicator > 0 {
			return errors.Newf("still %d remaining, desc: %s", resp.DrainRemainingIndicator,

Another issue I found in CI stressrace is that the test sometimes fails with:

          drain_test.go:98: condition failed to evaluate within 3m45s: from drain_test.go:104: still 2 remaining, desc: range lease iterations: 2

I wonder what is causing these 2 ranges to get stuck

Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain_test.go line 104 at r3 (raw file):

Previously, iskettaneh wrote…

Another issue I found in CI stressrace is that the test sometimes fails with:

          drain_test.go:98: condition failed to evaluate within 3m45s: from drain_test.go:104: still 2 remaining, desc: range lease iterations: 2

I wonder what is causing these 2 ranges to get stuck

As a debugging measure, I added a few log lines and increased the timeout. The hope is to catch this in CI and understand the problem better.

@iskettaneh iskettaneh force-pushed the drain branch 3 times, most recently from b65a126 to 6dadc3d Compare October 9, 2025 13:25
Contributor Author

@iskettaneh iskettaneh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/server/drain_test.go line 104 at r3 (raw file):

Previously, iskettaneh wrote…

As a debugging measure, I added a few log lines and increased the timeout. The hope is to catch this in CI and understand the problem better.

I didn't find a clear issue. It just seemed that under stressrace the system is under too much load; it takes a long time for the tenant servers to drain, and then draining the leases times out.

@rafiss
Collaborator

rafiss commented Oct 9, 2025

This seems like it would be a noticeable bug -- should this change get a release note? (Especially if we plan to backport to all releases.)

Also, I'm wondering about the impact of the bug. Does it affect UA clusters?

@stevendanna
Collaborator

Does it affect UA clusters?

Yes, I believe it likely does.

@stevendanna
Collaborator

This seems like it would be a noticeable bug -- should this change get a release note? (Especially if we plan to backport to all releases.)

Yes, we should definitely have a release note.

Collaborator

@stevendanna stevendanna left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)


pkg/server/drain_test.go line 85 at r1 (raw file):

Previously, iskettaneh wrote…

It actually fails on this check:

t.assertDraining(resp, true)

because draining doesn't get set in the response, which is something the CLI doesn't expect in the normal case.

Ah, I see. I think that is OK for this PR.

I think my overall concern is that we didn't have a test that made sure the actual thing drain is trying to do (remove leases, gracefully end SQL connections) actually happened.


-- commits line 6 at r5:
missing a word:

"without populating the IsDraining" -> "without populating the IsDraining field"

if you fancy:

"The CLI interprets this as if the draining is completed." -> "The CLI interprets this to mean draining is complete"


-- commits line 8 at r5:
Looks like something might have happened during a rebase/squash as this is duplicated.


-- commits line 36 at r5:
We could spell out the consequence a bit more:

Fixed a bug in the drain command where draining a node using virtual clusters (such as clusters running Physical Cluster Replication) could return before the drain was complete, possibly resulting in shutting down the node while it still had active SQL clients and range leases.


pkg/server/drain_test.go line 47 at r5 (raw file):

func doTestDrain(tt *testing.T, secondaryTenants bool) {
	if secondaryTenants {
		// Draining the tenant server takes a long time under stress.

Given the unfortunate state of the server controller, I think this is probably reasonable for now. We could open an issue to track it for the db-server-team.

This commit fixes a bug where if the server controller is being drained,
the drain command would return a response without populating the
IsDraining field. The CLI interprets this as if the draining is
completed.

Output example when running:

```
./cockroach node drain 3 --insecure --url {pgurl:3} --logtostderr=INFO

I251008 15:54:42.613200 15 2@rpc/peer.go:613  [rnode=?,raddr=10.142.1.228:26257,class=system,rpc] 1  connection is now healthy
node is draining... remaining: 3
I251008 15:54:42.622586 1 cli/rpc_node_shutdown.go:184  [-] 2  drain details: tenant servers: 3
node is draining... remaining: 3
I251008 15:54:42.824526 1 cli/rpc_node_shutdown.go:184  [-] 3  drain details: tenant servers: 3
node is draining... remaining: 1
I251008 15:54:43.026405 1 cli/rpc_node_shutdown.go:184  [-] 4  drain details: tenant servers: 1
node is draining... remaining: 1
I251008 15:54:43.228596 1 cli/rpc_node_shutdown.go:184  [-] 5  drain details: tenant servers: 1
node is draining... remaining: 243
I251008 15:54:44.580413 1 cli/rpc_node_shutdown.go:184  [-] 6  drain details: liveness record: 1, range lease iterations: 175, descriptor leases: 67
node is draining... remaining: 0 (complete)
drain ok
```

Release note (bug fix): fixed a bug in the drain command where draining
a node using virtual clusters (such as clusters running Physical
Cluster Replication) could return before the drain was complete,
possibly resulting in shutting down the node while it still had active
SQL clients and range leases.

Epic: None
Contributor Author

@iskettaneh iskettaneh left a comment


Done

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


-- commits line 6 at r5:

Previously, stevendanna (Steven Danna) wrote…

missing a word:

"without populating the IsDraining" -> "without populating the IsDraining field"

if you fancy:

"The CLI interprets this as if the draining is completed." -> "The CLI interprets this to mean draining is complete"

Done.


-- commits line 8 at r5:

Previously, stevendanna (Steven Danna) wrote…

Looks like something might have happened during a rebase/squash as this is duplicated.

Done.


-- commits line 36 at r5:

Previously, stevendanna (Steven Danna) wrote…

We could spell out the consequence a bit more:

Fixed a bug in the drain command where draining a node using virtual clusters (such as clusters running Physical Cluster Replication) could return before the drain was complete, possibly resulting in shutting down the node while it still had active SQL clients and range leases.

Thank you!


pkg/server/drain_test.go line 85 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Ah, I see. I think that is OK for this PR.

I think my overall concern is that we didn't have a test that made sure the actual thing drain is trying to do (remove leases, gracefully end SQL connections) actually happened.

I agree with you


pkg/server/drain_test.go line 47 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Given the unfortunate state of the server controller, I think this is probably reasonable for now. We could open an issue to track it for the db-server-team.

Done!
#155229

@iskettaneh
Contributor Author

TFTR!

bors r+

@craig
Contributor

craig bot commented Oct 10, 2025

@craig craig bot merged commit d1ebc69 into cockroachdb:master Oct 10, 2025
24 of 25 checks passed
@blathers-crl

blathers-crl bot commented Oct 10, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-24.1-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-24.1 failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-24.3-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-24.3 failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-25.2-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-25.2 failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-25.3-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-25.3 failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 46f5bbe to blathers/backport-release-25.4-155063: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-25.4 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
