[server] Coordinator Server Supports High-Available #1401

Open

zcoo wants to merge 7 commits into main from 20250723_coordinator_ha

Conversation

zcoo
Contributor

@zcoo zcoo commented Jul 24, 2025

Purpose

The Coordinator Server supports high availability.

Linked issue: close #188

Brief change log

Tests

See com.alibaba.fluss.server.coordinator.CoordinatorServerElectionTest.

API and Format

Documentation

Currently I implement it with one active (leader) coordinator server and several standby (alive) coordinator servers. When several coordinator servers start up at the same time, only one can successfully preempt the ZK node, win the election, and become the active coordinator server (the leader). If the active server fails over, the standby servers take over through a new leader election.
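The preemption step described above can be sketched as a single atomic create of the leader node. In this minimal sketch the ZooKeeper ephemeral node is modeled by an AtomicReference (the real PR talks to ZooKeeper); the class and method names are illustrative, not from the Fluss codebase.

```java
import java.util.concurrent.atomic.AtomicReference;

// The ZooKeeper ephemeral leader node is modeled by an AtomicReference;
// "preempting the ZK node" becomes a compareAndSet from null to the
// server id. Hypothetical names, not from the Fluss codebase.
public class ZkLeaderSlot {
    private final AtomicReference<String> leaderNode = new AtomicReference<>(null);

    /** Returns true iff this server created the leader node first (won the election). */
    public boolean tryBecomeLeader(String serverId) {
        return leaderNode.compareAndSet(null, serverId);
    }

    /** Models the ephemeral node vanishing when the active server fails over. */
    public void releaseLeadership(String serverId) {
        leaderNode.compareAndSet(serverId, null);
    }

    public String currentLeader() {
        return leaderNode.get();
    }
}
```

With several servers racing to call tryBecomeLeader, exactly one compareAndSet succeeds, mirroring the "only one can successfully preempt the ZK node" behavior; releaseLeadership then opens the slot for the standby servers.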

@zcoo
Contributor Author

zcoo commented Jul 24, 2025

not ready for review

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 4 times, most recently from 3afaff9 to 5bf08aa on July 25, 2025 07:26
@michaelkoepf
Contributor

@zcoo if it is not ready for review, you can click Convert to draft in the right sidebar and, optionally, add [WIP] as a prefix to the PR title.

@zcoo
Contributor Author

zcoo commented Jul 28, 2025

@michaelkoepf I get it, thank you.

@zcoo zcoo marked this pull request as draft July 28, 2025 01:50
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 5 times, most recently from 830229d to 9d8f97e on July 29, 2025 01:44
@zcoo zcoo marked this pull request as ready for review July 29, 2025 07:10
@zcoo
Contributor Author

zcoo commented Jul 29, 2025

Ready for review now! @wuchong @swuferhong

@LB-Yu
Contributor

LB-Yu commented Jul 29, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion (KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?
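The fencing rule in the first bullet can be sketched as a monotonic high-water mark on the TabletServer side: requests carrying an epoch below the highest epoch seen so far are rejected. This is an illustrative sketch of the mechanism being discussed, not code from this PR; the names are hypothetical.

```java
// A TabletServer tracks the highest coordinatorEpoch it has seen and
// rejects requests below it, fencing off zombie CoordinatorServers.
// Hypothetical names, not from this PR.
public class EpochFencing {
    private int highestSeenCoordinatorEpoch = 0;

    /** Returns true if the request passes the fence and advances the high-water mark. */
    public synchronized boolean acceptRequest(int requestCoordinatorEpoch) {
        if (requestCoordinatorEpoch < highestSeenCoordinatorEpoch) {
            // Stale epoch: the sender lost leadership at some point; reject.
            return false;
        }
        highestSeenCoordinatorEpoch = requestCoordinatorEpoch;
        return true;
    }
}
```

A newly elected coordinator would bump the epoch in ZK before sending requests, so a zombie still using the old epoch fails this check everywhere.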

@zcoo
Contributor Author

zcoo commented Jul 29, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion (KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?

Thanks for the reminder. I considered coordinatorEpoch and coordinatorEpochZkVersion, but ultimately did not implement them in this PR to keep its scope small. I think we can handle them later.

Do you have any suggestions?
@LB-Yu @wuchong @swuferhong

@LB-Yu
Contributor

LB-Yu commented Aug 6, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion (KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?

Thanks for the reminder. I considered coordinatorEpoch and coordinatorEpochZkVersion, but ultimately did not implement them in this PR to keep its scope small. I think we can handle them later.

Do you have any suggestions? @LB-Yu @wuchong @swuferhong

IMO we can split them into different PRs, but they should be merged together; otherwise, we might introduce other metadata inconsistency issues when introducing HA. What's your opinion?

Comment on lines 75 to 81
// Do not return, otherwise the leadership will be released immediately.
while (!Thread.currentThread().isInterrupted()) {
    try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        // Restore the interrupt flag so shutdown can proceed.
        Thread.currentThread().interrupt();
    }
}
Contributor

Why don't we use LeaderLatch here if we need to hold the leadership?

Contributor Author

Very good suggestion! LeaderLatch seems more suitable in this scenario, and I will try to use it instead.
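For reference, Curator's LeaderLatch holds leadership by blocking on a latch rather than a sleep loop. The self-contained sketch below imitates that pattern with a plain CountDownLatch (the PR itself would use org.apache.curator.framework.recipes.leader.LeaderLatch); the class name here is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Leadership loss is signaled through a latch, so the holder blocks
// without polling; this is the pattern LeaderLatch uses internally.
// Hypothetical class, standing in for Curator's LeaderLatch.
public class LeadershipHold {
    private final CountDownLatch lost = new CountDownLatch(1);

    /** Called by the election listener when leadership is revoked. */
    public void notifyLeadershipLost() {
        lost.countDown();
    }

    /** Blocks until leadership is lost or the timeout expires; true if lost. */
    public boolean holdUntilLost(long timeout, TimeUnit unit) {
        try {
            return lost.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Compared with the reviewed sleep loop, this wakes up exactly once, when leadership actually changes, instead of polling every second.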

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 223a8d4 to 84c1db7 on August 8, 2025 06:47
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 84c1db7 to 18a3ff8 on August 8, 2025 06:55
Development

Successfully merging this pull request may close these issues.

[Feature] CoordinatorServer Supports High-Available
3 participants