[server] Coordinator Server Supports High-Available #1401

Open

zcoo wants to merge 7 commits into main from 20250723_coordinator_ha

Conversation

zcoo
Contributor

@zcoo zcoo commented Jul 24, 2025

Purpose

The Coordinator Server supports high availability.

Linked issue: close #188

Brief change log

Tests

See com.alibaba.fluss.server.coordinator.CoordinatorServerElectionTest.

API and Format

Documentation

Currently I implement it with one active (leader) coordinator server and several standby (alive) coordinator servers. When several coordinator servers start up at the same time, only one can successfully preempt the ZK node, win the election, and become the active coordinator server (the leader). If the active server fails over, the standby servers take over through a new leader election.
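The preemption step described above can be sketched as a single atomic create of the leader node. In this minimal sketch the ZooKeeper ephemeral node is modeled by an AtomicReference (the real PR talks to ZooKeeper); the class and method names are illustrative, not from the Fluss codebase.

```java
import java.util.concurrent.atomic.AtomicReference;

// The ZooKeeper ephemeral leader node is modeled by an AtomicReference;
// "preempting the ZK node" becomes a compareAndSet from null to the
// server id. Hypothetical names, not from the Fluss codebase.
public class ZkLeaderSlot {
    private final AtomicReference<String> leaderNode = new AtomicReference<>(null);

    /** Returns true iff this server created the leader node first (won the election). */
    public boolean tryBecomeLeader(String serverId) {
        return leaderNode.compareAndSet(null, serverId);
    }

    /** Models the ephemeral node vanishing when the active server fails over. */
    public void releaseLeadership(String serverId) {
        leaderNode.compareAndSet(serverId, null);
    }

    public String currentLeader() {
        return leaderNode.get();
    }
}
```

With several servers racing to call tryBecomeLeader, exactly one compareAndSet succeeds, mirroring the "only one can successfully preempt the ZK node" behavior; releaseLeadership then opens the slot for the standby servers.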

@zcoo
Contributor Author

zcoo commented Jul 24, 2025

not ready for review

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 4 times, most recently from 3afaff9 to 5bf08aa on July 25, 2025 07:26
@michaelkoepf
Contributor

@zcoo if it is not ready for review, you can click Convert to draft in the right sidebar and, optionally, add [WIP] as a prefix to the PR title.

@zcoo
Contributor Author

zcoo commented Jul 28, 2025

@michaelkoepf I get it, thank you.

@zcoo zcoo marked this pull request as draft July 28, 2025 01:50
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 5 times, most recently from 830229d to 9d8f97e on July 29, 2025 01:44
@zcoo zcoo marked this pull request as ready for review July 29, 2025 07:10
@zcoo
Contributor Author

zcoo commented Jul 29, 2025

Ready for review now! @wuchong @swuferhong

@LB-Yu
Contributor

LB-Yu commented Jul 29, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion (KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?
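The fencing rule in the first bullet can be sketched as a monotonic high-water mark on the TabletServer side: requests carrying an epoch below the highest epoch seen so far are rejected. This is an illustrative sketch of the mechanism being discussed, not code from this PR; the names are hypothetical.

```java
// A TabletServer tracks the highest coordinatorEpoch it has seen and
// rejects requests below it, fencing off zombie CoordinatorServers.
// Hypothetical names, not from this PR.
public class EpochFencing {
    private int highestSeenCoordinatorEpoch = 0;

    /** Returns true if the request passes the fence and advances the high-water mark. */
    public synchronized boolean acceptRequest(int requestCoordinatorEpoch) {
        if (requestCoordinatorEpoch < highestSeenCoordinatorEpoch) {
            // Stale epoch: the sender lost leadership at some point; reject.
            return false;
        }
        highestSeenCoordinatorEpoch = requestCoordinatorEpoch;
        return true;
    }
}
```

A newly elected coordinator would bump the epoch in ZK before sending requests, so a zombie still using the old epoch fails this check everywhere.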

@zcoo
Contributor Author

zcoo commented Jul 29, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion (KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?

Thanks for the reminder. I considered coordinatorEpoch and coordinatorEpochZkVersion, but ultimately did not implement them in this PR to keep its scope small. I think we can handle them later.

Do you have any suggestions?
@LB-Yu @wuchong @swuferhong

@LB-Yu
Contributor

LB-Yu commented Aug 6, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion (KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?

Thanks for the reminder. I considered coordinatorEpoch and coordinatorEpochZkVersion, but ultimately did not implement them in this PR to keep its scope small. I think we can handle them later.

Do you have any suggestions? @LB-Yu @wuchong @swuferhong

IMO we can split them into different PRs, but they should be merged together; otherwise, we might introduce other metadata inconsistency issues when introducing HA. What's your opinion?

Comment on lines 75 to 81
// Do not return, otherwise the leadership will be released immediately.
while (!Thread.currentThread().isInterrupted()) {
    try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        // Restore the interrupt flag so shutdown can proceed.
        Thread.currentThread().interrupt();
    }
}
Contributor

Why don't we use LeaderLatch here if we need to hold the leadership?

Contributor Author

Very good suggestion! LeaderLatch seems more suitable in this scenario, and I will try to use it instead.
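For reference, Curator's LeaderLatch holds leadership by blocking on a latch rather than a sleep loop. The self-contained sketch below imitates that pattern with a plain CountDownLatch (the PR itself would use org.apache.curator.framework.recipes.leader.LeaderLatch); the class name here is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Leadership loss is signaled through a latch, so the holder blocks
// without polling; this is the pattern LeaderLatch uses internally.
// Hypothetical class, standing in for Curator's LeaderLatch.
public class LeadershipHold {
    private final CountDownLatch lost = new CountDownLatch(1);

    /** Called by the election listener when leadership is revoked. */
    public void notifyLeadershipLost() {
        lost.countDown();
    }

    /** Blocks until leadership is lost or the timeout expires; true if lost. */
    public boolean holdUntilLost(long timeout, TimeUnit unit) {
        try {
            return lost.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Compared with the reviewed sleep loop, this wakes up exactly once, when leadership actually changes, instead of polling every second.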

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 223a8d4 to 84c1db7 on August 8, 2025 06:47
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 84c1db7 to 18a3ff8 on August 8, 2025 06:55
Development

Successfully merging this pull request may close these issues.

[Feature] CoordinatorServer Supports High-Available
3 participants