Closed
Labels: bug, good first issue, help wanted
Description
The following appeared in the logs when the cluster could not recover itself, or even start cleanly, after all etcd pods were killed:
level=warning msg="all etcd pods are dead." cluster-name=etcd-cluster cluster-namespace=default pkg=cluster
etcd-operator does not recover from this situation:
https://github.com/coreos/etcd-operator/blob/8347d27afa18b6c76d4a8bb85ad56a2e60927018/pkg/cluster/cluster.go#L248-L252
Researching further, it looks like there are quite a few cases where etcd-operator cannot recover on its own:
- Fail the cluster when all etcd pods are dead and there is no way to recover. coreos/etcd-operator#1973
- How can the operator recover from self-hosted cluster disasters? coreos/etcd-operator#1559
- etcd-operator does not recover an etcd cluster if it loses quorum coreos/etcd-operator#1972
- EtcdCluster condition is incorrect coreos/etcd-operator#2044
Since this backend is only needed for short-lived coordination locks, should we consider switching to Redis, or back to a single-instance etcd as it was before (#52)?
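
For reference, a minimal sketch of what such a short-lived lock could look like on Redis, using `SET ... NX` with a TTL so the lock expires on its own if the holder dies. The key name, TTL, owner id, and client library below are illustrative assumptions, not part of the proposal:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// tryLock attempts to take a short-lived coordination lock by setting a key
// only if it does not already exist (NX), with a TTL so the lock is released
// automatically if the holder crashes.
func tryLock(ctx context.Context, rdb *redis.Client, key, owner string, ttl time.Duration) (bool, error) {
	return rdb.SetNX(ctx, key, owner, ttl).Result()
}

// unlockScript deletes the lock only if this owner still holds it, using a
// small Lua script so the check-and-delete is atomic.
var unlockScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
	return redis.call("DEL", KEYS[1])
end
return 0
`)

func unlock(ctx context.Context, rdb *redis.Client, key, owner string) error {
	return unlockScript.Run(ctx, rdb, []string{key}, owner).Err()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // illustrative address

	ok, err := tryLock(ctx, rdb, "locks/coordination", "worker-1", 30*time.Second)
	if err != nil {
		panic(err)
	}
	if ok {
		defer unlock(ctx, rdb, "locks/coordination", "worker-1")
		fmt.Println("lock acquired, doing coordinated work")
	} else {
		fmt.Println("lock held by someone else, skipping")
	}
}
```

A single Redis instance (or a single-instance etcd, as before #52) avoids the quorum-loss recovery problem above, at the cost of the lock backend itself not being highly available, which seems acceptable for short-lived locks.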