KAFKA-19078: Automatic controller addition to cluster metadata partition #19589

kevin-wu24 · 2025-04-28T21:14:02Z

Add the controller.quorum.auto.join.enable configuration. When enabled
with KIP-853 supported, follower controllers who are observers (their
replica id + directory id are not in the voter set) will:

Automatically remove voter set entries which match their replica id
but not directory id by sending the RemoveVoterRPC to the leader.
Automatically add themselves as a voter when their replica id is not
present in the voter set by sending the AddVoterRPC to the leader.
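
The two rules above can be sketched as a small decision function. This is only an illustrative model, not the code in this PR: the class name `AutoJoinDecision`, the `Action` enum, and the use of a `Map<Integer, UUID>` (with `java.util.UUID` standing in for Kafka's directory id type) are all assumptions made for the example.

```java
import java.util.Map;
import java.util.UUID;

/** Hypothetical model of the observer's auto-join decision; not Kafka's actual classes. */
final class AutoJoinDecision {
    enum Action { REMOVE_VOTER, ADD_VOTER, NONE }

    /**
     * Decides which RPC an observer would send to the leader.
     *
     * @param voters           voter set modeled as replica id -> directory id (assumed shape)
     * @param localId          replica id of the local controller
     * @param localDirectoryId directory id of the local controller
     */
    static Action decide(Map<Integer, UUID> voters, int localId, UUID localDirectoryId) {
        UUID voterDirectoryId = voters.get(localId);
        if (voterDirectoryId == null) {
            // Replica id is not in the voter set: ask the leader to add this replica.
            return Action.ADD_VOTER;
        }
        if (!voterDirectoryId.equals(localDirectoryId)) {
            // Same replica id but a different directory id: remove the stale entry first.
            return Action.REMOVE_VOTER;
        }
        // Replica id and directory id both match: the replica is already a voter.
        return Action.NONE;
    }

    public static void main(String[] args) {
        UUID oldDir = UUID.randomUUID();
        UUID newDir = UUID.randomUUID();
        System.out.println(decide(Map.of(1, oldDir), 2, newDir));
        System.out.println(decide(Map.of(2, oldDir), 2, newDir));
        System.out.println(decide(Map.of(2, newDir), 2, newDir));
    }
}
```

In this model a replica with a stale entry needs two rounds: the remove comes first, and the add only becomes the chosen action once the stale entry is gone from the voter set.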

Reviewers: José Armando García Sancio
[email protected]

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

jsancio

Thanks for the feature, @kevin-wu24. Reviewed src/main. I still need to review the tests.

raft/src/main/java/org/apache/kafka/raft/FollowerState.java

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

raft/src/main/java/org/apache/kafka/raft/VoterSet.java

jsancio · 2025-05-01T17:43:46Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+        } else if (partitionState.lastKraftVersion().isReconfigSupported() && followersAlwaysFlush &&
+            quorumConfig.autoJoinEnable() && state.hasAddRemoveVoterPeriodExpired(currentTimeMs)) {


Okay. I think we should document why we require both followersAlwaysFlush and autoJoinEnable to be true.

raft/src/main/java/org/apache/kafka/raft/QuorumConfig.java

raft/src/test/java/org/apache/kafka/raft/RaftClientTestContext.java

raft/src/main/java/org/apache/kafka/raft/QuorumConfig.java

jsancio · 2025-05-01T17:52:07Z

Feature description: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217391519#KIP853:KRaftControllerMembershipChanges-Controllerautojoining

Add the controller.quorum.auto.join.enable configuration. When enabled with KIP-853 supported, follower controllers who are observers (their replica id + directory id are not in the voter set) will:

Automatically remove voter set entries which match their replica id but not directory id.

Automatically add themselves as a voter when their replica id is not present in the voter set.

I am going to use this description to generate the commit message. I don't think we should include URLs in the commit description. Ideally, the commit message should explain the changes without having to read other documents.

jsancio

Thanks for the changes.

raft/src/main/java/org/apache/kafka/raft/FollowerState.java

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

raft/src/main/java/org/apache/kafka/raft/RaftUtil.java

jsancio · 2025-05-06T16:15:55Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+        RaftResponse.Inbound responseMetadata,
+        long currentTimeMs
+    ) {
+        RemoveRaftVoterResponseData data = (RemoveRaftVoterResponseData) responseMetadata.data();


This casts to RemoveRaftVoterResponseData, yet the method is handleAddVoterResponse. Do the tests pass for you? How is that possible with this cast?

I think this is because we are not testing the response handling anywhere. In the unit tests we are only asserting that the add/remove voter request was sent when we expect it.

I think we should do something like with fetch where we complete the request (I assume this testing method executes the handleFetchResponse). I need to see how we implement this for fetch.

Doesn't KafkaRaftClientAutoJoinTest send add voter and remove voter responses to the replica?

Yeah it does, I just looked at the code again. The tests do pass, but this is because this line in the test is sending a remove voter response: https://github.com/apache/kafka/pull/19589/files#diff-7b1538d9f4f7f1e27a444aac55ced6f780e22bfdd8af4462e82d62c24f778c85R94.
Both the implementation and test need to be fixed, thanks for the catch.
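
A minimal sketch of why the tests still passed: a wrong cast only fails at runtime when the handler receives the type it was supposed to handle. The classes below are hypothetical stand-ins for the generated response data classes, not the real ones.

```java
/** Hypothetical reproduction of the cast bug discussed above. */
final class ResponseCastDemo {
    // Stand-ins for the add voter and remove voter response payloads (assumed names).
    static class AddVoterResponse { }
    static class RemoveVoterResponse { }

    /** Buggy handler: meant for add voter responses but casts to the remove voter type. */
    static boolean buggyHandleAddVoterResponse(Object data) {
        RemoveVoterResponse response = (RemoveVoterResponse) data;
        return response != null;
    }

    /** Returns true if the buggy handler throws ClassCastException for the given payload. */
    static boolean throwsClassCast(Object data) {
        try {
            buggyHandleAddVoterResponse(data);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A test that delivers the wrong response type hides the bug.
        System.out.println("remove response fails: " + throwsClassCast(new RemoveVoterResponse()));
        // Delivering the correct response type exposes it.
        System.out.println("add response fails: " + throwsClassCast(new AddVoterResponse()));
    }
}
```

This is why both sides need the fix: correcting only the test would make it fail, and correcting only the implementation would do the same.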

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

jsancio · 2025-05-06T19:29:55Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

+    private static final int NUMBER_FETCH_TIMEOUTS_IN_ADD_REMOVE_PERIOD = 1;
+
+    @Test
+    public void testAutoRemoveOldVoter() throws Exception {


Do we need a test that fully does a remove followed by an add? E.g.

Start with the local replica not in the voter set, but with its replica id in the voter set under a different directory id.

Remove voter is sent and acknowledged.

The next FETCH response sends the VOTER_RECORD control batch without the old voter in the voter set.

Add voter is sent and acknowledged.

The next FETCH response sends a VOTER_RECORD control batch with the local replica in the voter set.

I think we can add one since it acts like a pseudo-integration test for the feature. I'm not seeing anything in KafkaRaftClientReconfigTest or KafkaRaftClientFetchTest that covers step 2. I see a KafkaRaftClientTest#testFollowerReplication but it doesn't add control records in the FETCH response when kraft.version == 1.

jsancio · 2025-05-06T19:30:41Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

+        // after sending a remove voter the next request should be a fetch
+        context.pollUntilRequest();
+        var fetchRequest = context.assertSentFetchRequest();
+        context.assertFetchRequestData(fetchRequest, epoch, 0L, 0);


After this, I think we should check that remove voter is sent after completeFetch.

I'm a bit confused. Why would we do this check again? I'm already doing this check from L52-56.

I think we need to cover the case where the new voter sent an add/remove voter RPC. The RPC was acknowledged but the log was never updated (fetch responses didn't include the updated voter set). In this case the new voter will send another add/remove voter RPC after the "update voter set period timer" expires, right?

Okay. Since the local replica is still an observer, it should still try to add/remove itself if the log hasn't been updated with fetch.
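
The retry behavior discussed in this thread can be modeled roughly as follows. This is a sketch under assumptions: `PeriodTimer` and `requestsSent` are hypothetical names, and the real client tracks this through its follower state rather than a standalone timer.

```java
/** Hypothetical model of an observer resending add/remove voter requests. */
final class AutoJoinRetryDemo {
    /** Simple period timer that expires periodMs after construction or the last reset. */
    static final class PeriodTimer {
        private final long periodMs;
        private long deadlineMs;

        PeriodTimer(long periodMs, long nowMs) {
            this.periodMs = periodMs;
            this.deadlineMs = nowMs + periodMs;
        }

        boolean isExpired(long nowMs) {
            return nowMs >= deadlineMs;
        }

        void reset(long nowMs) {
            deadlineMs = nowMs + periodMs;
        }
    }

    /**
     * Counts the add/remove voter requests sent across the given poll times when
     * the fetched log never includes an updated voter set, so the replica stays
     * an observer for the whole run.
     */
    static int requestsSent(long periodMs, long[] pollTimesMs) {
        PeriodTimer timer = new PeriodTimer(periodMs, 0L);
        boolean isObserver = true; // never flips because the log is never updated
        int sent = 0;
        for (long nowMs : pollTimesMs) {
            if (isObserver && timer.isExpired(nowMs)) {
                sent++;             // send another add/remove voter request
                timer.reset(nowMs); // wait a full period before trying again
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        long[] polls = {0L, 50L, 100L, 150L, 200L, 250L};
        System.out.println("requests sent: " + requestsSent(100L, polls));
    }
}
```

An acknowledged RPC does not change the outcome here: only a voter set update delivered through fetch would end the retries.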

jsancio · 2025-05-06T19:31:14Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

+        // after sending an add voter the next request should be a fetch
+        context.pollUntilRequest();
+        var fetchRequest = context.assertSentFetchRequest();
+        context.assertFetchRequestData(fetchRequest, epoch, 0L, 0);


After this, I think we should check that add voter is sent after another completeFetch?

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

jsancio

Thanks for the great tests. We should be able to merge this improvement soon.

jsancio · 2025-05-07T15:02:49Z

raft/src/main/java/org/apache/kafka/raft/KafkaNetworkChannel.java

@@ -178,6 +182,7 @@ public void pollOnce() {
        requestThread.doWork();
    }

+    @SuppressWarnings("NPathComplexity")


Try using if (...) { return } else if (...) { return } ... and see if that reduces the path complexity so you can remove this suppression.

You can also try using Java's new pattern matching and lambda syntax for switch statements. E.g.

return switch (requestData) {
    case VoterRequestData voterData -> new VoterRequest.Builder(voterData);
    ...
    default -> throw new IllegalArgumentException(...);
}

I think I did this, but I got a build error when running ./gradlew jar. It's pretty weird, because we're supposed to be on Java 17 according to gradle, and my IDE said that this kind of instanceof switch matching is supported in Java 17+. Let me try this again and put the actual error message here.

This is what I get when running ./gradlew jar:

error: patterns in switch statements are a preview feature and are disabled by default. ... (use --enable-preview to enable patterns in switch statements)

Try using if (...) { return } else if (...) { return } ... and see if that reduces the path complexity so you can remove this suppression.

This worked, thanks!

Yeah. I think I have seen this before. I think Java 21 is the first release where pattern matching for switch is a final feature; in Java 17 it is still a preview.
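
For reference, a minimal sketch of the if/else-if shape that compiles on Java 17 without `--enable-preview`. The payload classes and the returned labels are hypothetical placeholders, not the request types handled by KafkaNetworkChannel.

```java
/** Hypothetical dispatch on request payload type using an early-return if/else chain. */
final class RequestBuilderDemo {
    // Stand-ins for request data classes (assumed names).
    static class FetchData { }
    static class AddVoterData { }
    static class RemoveVoterData { }

    /** Maps a payload to a label; the real code would return a request builder instead. */
    static String builderNameFor(Object requestData) {
        if (requestData instanceof FetchData) {
            return "fetch";
        } else if (requestData instanceof AddVoterData) {
            return "add-voter";
        } else if (requestData instanceof RemoveVoterData) {
            return "remove-voter";
        } else {
            throw new IllegalArgumentException("Unexpected request data: " + requestData);
        }
    }

    public static void main(String[] args) {
        System.out.println(builderNameFor(new AddVoterData()));
    }
}
```

Each branch returns immediately, so the branches do not multiply into separate execution paths the way a sequence of independent if statements would.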

jsancio · 2025-05-07T15:17:45Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

            )
        );
    }

    private long pollFollowerAsObserver(FollowerState state, long currentTimeMs) {
        if (state.hasFetchTimeoutExpired(currentTimeMs)) {
            return maybeSendFetchToAnyBootstrap(currentTimeMs);
+        } else if (partitionState.lastKraftVersion().isReconfigSupported() && canBecomeVoter &&
+            quorumConfig.autoJoin() && state.hasUpdateVoterSetPeriodExpired(currentTimeMs)) {
+            /* Only replicas that are always flushing and are configured to auto join should


Replace "are always flushing" with "can become a voter."

jsancio · 2025-05-07T15:31:28Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

+                    0,
+                    epoch,
+                    BufferSupplier.NO_CACHING.get(300),
+                    VoterSetTest.voterSet(Stream.of(leader)).toVotersRecord((short) 0)),


Missing newline.

VoterSetTest.voterSet(Stream.of(leader)).toVotersRecord((short) 0)
),

jsancio · 2025-05-07T15:33:15Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

+                    0,
+                    epoch,
+                    BufferSupplier.NO_CACHING.get(300),
+                    VoterSetTest.voterSet(Stream.of(leader, newFollowerKey)).toVotersRecord((short) 0)),


Missing newline.

VoterSetTest.voterSet(Stream.of(leader, newFollowerKey)).toVotersRecord((short) 0)
),

jsancio · 2025-05-07T15:34:47Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

+            )
+        );
+        // poll kraft to update the replica's voter set
+        context.client.poll();


How about adding this time advancement and showing that the replica sent a fetch request?

context.advanceTimeAndFetchToUpdateVoterSetTimer(epoch, leader.id());
context.time.sleep(context.fetchTimeoutMs - 1);

Just added and cleaning up this test file for readability, since there's a lot of duplicate code.

I added the sleep method as part of the advanceTimeAndCompleteFetch helper, since that is what actually expires the timer. I also added a boolean flag to the method to determine whether sleep is called again.

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

jsancio

@kevin-wu24, can you resolve the conflicts?

raft/src/main/java/org/apache/kafka/raft/KafkaNetworkChannel.java

jsancio

Thanks for the changes, @kevin-wu24. Partial review.

jsancio · 2025-08-05T14:13:01Z

core/src/test/java/kafka/server/ReconfigurableQuorumIntegrationTest.java

+                    Map<Integer, Uuid> voters = findVoterDirs(admin);
+                    assertEquals(new HashSet<>(List.of(3000, 3001, 3002)), voters.keySet());
+                    for (int replicaId : new int[] {3000, 3001, 3002}) {
+                        assertNotEquals(Uuid.ZERO_UUID, voters.get(replicaId));


Okay. Is there a way to get the exact directory id (UUID) and compare against that instead?

Yeah, I can check that each replica has the exact metadata dir ID as what is in the TestKitNodes.

jsancio · 2025-08-05T14:24:33Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+            quorumConfig.requestTimeoutMs(),
+            quorum.localReplicaKeyOrThrow(),
+            localListeners,
+            !quorumConfig.autoJoin()


Why is this the negation of auto join? Shouldn't it always be false? If KRaft sends an "add voter" request, it should always be version 1 and return before committing.

jsancio · 2025-08-05T14:26:43Z

raft/src/main/java/org/apache/kafka/raft/RaftUtil.java

@@ -524,14 +526,16 @@ public static AddRaftVoterRequestData addVoterRequest(
        String clusterId,
        int timeoutMs,
        ReplicaKey voter,
-        Endpoints listeners
+        Endpoints listeners,
+        boolean ackWhenCommitted


This is the add voter request specific for the kraft implementation. "Ack when committed" should always be false. If that's true then let's remove this parameter and not give the caller the option to set it. In the implementation, the method should always call setAckWhenCommitted(false).

"Ack when committed" should always be false

I don't think this is true, since there are callers of this method in KafkaRaftClientReconfigTest that do test sending an AddVoterRequest with ackWhenCommitted == true since it is testing the leader state when receiving this RPC. I agree that KafkaRaftClient should always call this method with false.

I see. That's fair.
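
One way to picture the outcome of this thread: the shared utility keeps the flag so leader-side tests can exercise both values, while the client's auto-join path pins it to false. The sketch below uses hypothetical types and method names, not the real RaftUtil signature.

```java
/** Hypothetical model of keeping ackWhenCommitted configurable only for the utility. */
final class AddVoterRequestDemo {
    /** Stand-in for the add voter request payload (assumed fields). */
    static final class AddVoterRequest {
        final int voterId;
        final int timeoutMs;
        final boolean ackWhenCommitted;

        AddVoterRequest(int voterId, int timeoutMs, boolean ackWhenCommitted) {
            this.voterId = voterId;
            this.timeoutMs = timeoutMs;
            this.ackWhenCommitted = ackWhenCommitted;
        }
    }

    /** Shared utility: the caller chooses the flag, which leader-side tests rely on. */
    static AddVoterRequest addVoterRequest(int voterId, int timeoutMs, boolean ackWhenCommitted) {
        return new AddVoterRequest(voterId, timeoutMs, ackWhenCommitted);
    }

    /** Client auto-join path: always asks for a reply before the change is committed. */
    static AddVoterRequest autoJoinAddVoterRequest(int voterId, int timeoutMs) {
        return addVoterRequest(voterId, timeoutMs, false);
    }

    public static void main(String[] args) {
        System.out.println(autoJoinAddVoterRequest(3, 2000).ackWhenCommitted);
    }
}
```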

jsancio · 2025-08-05T14:30:43Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+        long currentTimeMs,
+        ReplicaKey replicaKey


Flip the order of these parameters. The kraft module has a pattern of using the last parameter as the current time when needed.

jsancio · 2025-08-05T14:36:32Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

+        pollAndDeliverFetchToUpdateVoterSet(context, epoch,
+            VoterSetTest.voterSet(Stream.of(leader, newVoter)));


Let's fix this indentation. How about:

pollAndDeliverFetchToUpdateVoterSet(
    context,
    epoch,
    VoterSetTest.voterSet(Stream.of(leader, newVoter))
);

jsancio · 2025-08-05T14:39:54Z

raft/src/test/java/org/apache/kafka/raft/RaftClientTestContext.java

+    RaftRequest.Outbound assertSentAddVoterRequest(
+        ReplicaKey replicaKey,
+        Endpoints endpoints,
+        boolean expectedAckWhenCommitted


Let's not give the caller the option to override this. This value should always be false and this method should just check for that explicitly.

jsancio · 2025-08-05T14:55:47Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

+                    BufferSupplier.NO_CACHING.get(300),
+                    newVoterSet.toVotersRecord((short) 0)
+                ),
+                context.log.endOffset().offset() + 1,


This assumes that the voter set only has one voter hence one record.

Why does this assume the voter set only has one voter? Which voter set are you referencing?

testObserverRemovesOldVoterAndAutoJoins has the voter set go from size 2, to size 1, and then back to size 2 by having a follower node complete the whole "auto-join" flow (i.e. remove its old self and add its new self to the voter set).

jsancio · 2025-08-05T14:56:48Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

+                fetchRequest.destination().id(),
+                MemoryRecords.withVotersRecord(
+                    context.log.endOffset().offset(),
+                    0,


Hmm. Maybe we can use context.time.milliseconds().

jsancio

Thanks, @kevin-wu24. I think we should be able to merge this soon.

jsancio · 2025-08-06T15:18:03Z

raft/src/main/java/org/apache/kafka/raft/RaftUtil.java

@@ -524,14 +526,16 @@ public static AddRaftVoterRequestData addVoterRequest(
        String clusterId,
        int timeoutMs,
        ReplicaKey voter,
-        Endpoints listeners
+        Endpoints listeners,
+        boolean ackWhenCommitted


I see. That's fair.

jsancio · 2025-08-06T21:24:08Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+                )
+            );
+        } else {
+            return maybeSendFetchToBestNode(state, currentTimeMs);


This backoff is not correct now that observers can send AddVoter and RemoveVoter requests. Take a look at how I solved it for pollFollowerAsVoter.

jsancio · 2025-08-06T21:27:21Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+        if (partitionState.lastKraftVersion().isReconfigSupported() && canBecomeVoter &&
+            quorumConfig.autoJoin() && state.hasUpdateVoterSetPeriodExpired(currentTimeMs)) {


Let's add a shouldSendAddOrRemoveVoterRequest similar to shouldSendUpdateVoterRequest. This would allow you to better document this predicate.
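
A rough sketch of what such an extracted predicate could look like, with the documentation attached to the method. The four boolean parameters are an assumption made to keep the example self-contained; the diff reads these values from partitionState, quorumConfig, and state instead.

```java
/** Hypothetical standalone version of the auto-join predicate. */
final class AutoJoinPredicateDemo {
    /**
     * Only replicas that can become a voter and are configured to auto join should
     * attempt to automatically join the voter set. The request is only sent when the
     * kraft version supports reconfiguration and the update voter set period has expired.
     */
    static boolean shouldSendAddOrRemoveVoterRequest(
        boolean reconfigSupported,
        boolean canBecomeVoter,
        boolean autoJoin,
        boolean updateVoterSetPeriodExpired
    ) {
        return reconfigSupported && canBecomeVoter && autoJoin && updateVoterSetPeriodExpired;
    }

    public static void main(String[] args) {
        System.out.println(shouldSendAddOrRemoveVoterRequest(true, true, true, true));
        System.out.println(shouldSendAddOrRemoveVoterRequest(true, true, false, true));
    }
}
```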

jsancio · 2025-08-08T14:08:34Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+        return partitionState.lastKraftVersion().isReconfigSupported() && canBecomeVoter &&
+            quorumConfig.autoJoin() && state.hasUpdateVoterSetPeriodExpired(currentTimeMs);


Please document why we need this predicate. See shouldSendUpdateVoterRequest for an example.

jsancio · 2025-08-08T14:09:15Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+            /* Only replicas that can become a voter and are configured to auto join should
+             * attempt to automatically join the voter set for the configured topic partition.
+             */


See my other comment but you can move this comment to shouldSendAddOrRemoveVoterRequest.

jsancio · 2025-08-08T14:10:17Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+        return Math.min(
+            backoffMs,
+            Math.min(
+                state.remainingFetchTimeMs(currentTimeMs),


Observers don't need to back off until the fetch timeout since observers do not read or handle fetch timeouts.
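
A small sketch of the suggested shape, leaving the fetch timeout out of the observer's wait computation. The parameter `remainingUpdateVoterSetTimeMs` is an assumption about the remaining term of the `Math.min`, since the diff hunk above is cut off before it.

```java
/** Hypothetical computation of how long an observer waits before polling again. */
final class ObserverBackoffDemo {
    /**
     * Returns the smaller of the request backoff and the time left until the next
     * add/remove voter attempt. The fetch timeout is intentionally not considered.
     */
    static long observerPollTimeoutMs(long backoffMs, long remainingUpdateVoterSetTimeMs) {
        return Math.min(backoffMs, remainingUpdateVoterSetTimeMs);
    }

    public static void main(String[] args) {
        System.out.println(observerPollTimeoutMs(500L, 200L));
    }
}
```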

jsancio

LGTM

kevin-wu24 added 2 commits April 28, 2025 09:06

setting up new controller.quorum.auto.join.enable config

b71fa4f

implementing controller auto join and adding tests

a09cecc

github-actions bot added triage PRs from the community kraft labels Apr 28, 2025

kevin-wu24 added 2 commits April 29, 2025 14:06

refactoring implementation

b788412

adding some more tests + cleanup

f8cb44c

kevin-wu24 commented Apr 29, 2025

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

github-actions bot removed the triage PRs from the community label Apr 30, 2025

jsancio reviewed May 1, 2025

kevin-wu24 added 2 commits May 1, 2025 14:19

code review

b904145

set add request timeout for auto join with config value

8592429

jsancio reviewed May 6, 2025

kevin-wu24 added 2 commits May 7, 2025 08:17

code review + adding supporting add+remove voter rpcs in KafkaNetworkChannel

8b5950a

validate config and adding config test

244fc33

github-actions bot added the core Kafka Broker label May 7, 2025

jsancio reviewed May 7, 2025

kevin-wu24 added 6 commits May 7, 2025 11:03

cleaning up auto join test

344846b

more cleanup

6be943c

fixing build

6e9604c

adding auto join integration tests and making auto join unit tests check follower can fetch while add voter request is pending

99baac0

merging trunk and fix conflicts, keep both

8fc8ce7

Merge branch 'trunk' into KAFKA-19078

445b9db

kevin-wu24 commented Jun 17, 2025

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientAutoJoinTest.java

kevin-wu24 commented Jun 17, 2025

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

kevin-wu24 added 4 commits June 17, 2025 17:07

merging trunk and fixing conflicts

854ca33

fixing build after merge

2acd001

merging trunk and fixing import conflict

33fa480

auto-joining replicas send AddVoterRequest with ackWhenCommitted = false

5aa9adc

jsancio reviewed Jul 30, 2025

raft/src/main/java/org/apache/kafka/raft/KafkaNetworkChannel.java

merging trunk and fixing conflicts

e72cdeb

kevin-wu24 changed the title ~~KAFKA-19078: Implement automatic controller addition to cluster metadata partition~~ KAFKA-19078: Automatic controller addition to cluster metadata partition Jul 30, 2025

jsancio added the ci-approved label Jul 30, 2025

jsancio reviewed Aug 5, 2025

code review

baf55ba

jsancio reviewed Aug 6, 2025

kevin-wu24 added 3 commits August 7, 2025 09:29

code review

cea72cd

Merge branch 'trunk' into KAFKA-19078

357b889

adding shutdown clause for pollFollowerAsObserver

13a5500

jsancio reviewed Aug 8, 2025

code review

6e46e4d

jsancio approved these changes Aug 8, 2025
