After reading kubernetes-retired/bootkube#738, I'd like to understand some of the edge cases around using self-hosted etcd for the cluster, especially with respect to HA and disaster recovery.
If the etcd-operator is being used to manage a cluster's etcd, that cluster is fully self-hosted. But what happens when etcd itself runs into a non-recoverable error that brings down the apiserver? Since the operator needs to talk to the apiserver to reconcile etcd state (e.g. add/delete pods), it gets caught in a failure loop (sketched in the first code block below). What I'd expect to happen:
1. The operator cannot talk to the apiserver, and an etcd client reports the cluster as unhealthy/dead.
2. The operator creates a static pod from a manifest that was previously checkpointed (see the manifest sketch below). This static pod restores from a backup and carries the same labels as the managed members, so the apiserver can connect to etcd again.
3. After bootstrap, the operator creates a new seed member and begins the pivot back to a fully self-hosted cluster.
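To make the failure loop concrete, here is a minimal Go sketch of the chicken-and-egg problem. This is only an illustration, not the operator's actual code; the `reconcileLoop` shape and the `app=etcd` selector are assumptions. The point is that every reconciliation step begins with an apiserver call, and with etcd down that call can never succeed:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// reconcileLoop is a stripped-down stand-in for the operator's reconcile
// cycle: even *observing* the etcd pods requires a working apiserver.
func reconcileLoop(client kubernetes.Interface) {
	for {
		pods, err := client.CoreV1().Pods("kube-system").List(
			context.TODO(), metav1.ListOptions{LabelSelector: "app=etcd"})
		if err != nil {
			// The apiserver is down *because* etcd is down, so the operator
			// can detect the outage but cannot act on it through the API.
			fmt.Printf("cannot reconcile, apiserver unreachable: %v\n", err)
			time.Sleep(5 * time.Second)
			continue // the failure loop described above
		}
		_ = pods // ...diff desired vs. running members, add/delete pods...
		time.Sleep(5 * time.Second)
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	reconcileLoop(kubernetes.NewForConfigOrDie(cfg))
}
```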
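And here is a hedged sketch of step 2's checkpointed static pod, again only an illustration: the manifest directory, image, labels, and restore command are assumptions, not the operator's real checkpointing code. The key idea is that the kubelet starts pods from this directory on its own, with no apiserver involved:

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

// writeRecoveryManifest serializes a single-member recovery pod into the
// kubelet's static pod directory, which the kubelet watches directly.
func writeRecoveryManifest() error {
	pod := corev1.Pod{
		TypeMeta: metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "etcd-recovery",
			Namespace: "kube-system",
			// Same labels as the operator-managed members, so the existing
			// etcd service selects this pod and the apiserver can reach it.
			Labels: map[string]string{"app": "etcd"},
		},
		Spec: corev1.PodSpec{
			HostNetwork: true,
			Containers: []corev1.Container{{
				Name:  "etcd",
				Image: "quay.io/coreos/etcd:v3.2.13", // assumed image
				// Restore the checkpointed backup, then serve (sketch).
				Command: []string{"/bin/sh", "-c",
					"ETCDCTL_API=3 etcdctl snapshot restore /var/etcd/backup.db --data-dir=/var/etcd/data && " +
						"exec etcd --data-dir=/var/etcd/data"},
				VolumeMounts: []corev1.VolumeMount{{Name: "etcd-data", MountPath: "/var/etcd"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "etcd-data",
				VolumeSource: corev1.VolumeSource{
					HostPath: &corev1.HostPathVolumeSource{Path: "/var/etcd"},
				},
			}},
		},
	}
	out, err := yaml.Marshal(&pod)
	if err != nil {
		return err
	}
	// The kubelet picks this up without talking to the apiserver.
	return os.WriteFile("/etc/kubernetes/manifests/etcd-recovery.yaml", out, 0o644)
}
```

Once this pod is serving and the apiserver is reachable again, the operator can create a fresh seed member and pivot back to fully self-hosted etcd pods (step 3).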
A few questions:
- Is this the best way to recover from this kind of disaster, or is there a better one?
- Is it the responsibility of the etcd-operator to maintain control plane stability? The scenario above assumes the etcd cluster is backing a Kubernetes cluster, whereas right now no such assumption is made (all etcd clusters are generic).
- What kinds of failures would bring down an etcd cluster in the first place?
/cc @aaronlevy I know you've thought a lot about this, so it would be great to hear your opinions.