This repository was archived by the owner on Mar 28, 2020. It is now read-only.

How can the operator recover from self-hosted cluster disasters? #1559

@jamiehannaford


After reading kubernetes-retired/bootkube#738, I'd like to understand some of the edge cases that affect using self-hosted etcd for the cluster, especially around HA and disaster recovery.

If the etcd-operator is being used to manage a cluster's etcd, that cluster is fully self-hosted. But what happens when etcd itself runs into a non-recoverable error that brings down the apiserver? Since the operator needs to talk to the apiserver to reconcile etcd state (e.g. add/delete pods), it gets caught in a failure loop. What I'd expect to happen:

  1. The operator cannot talk to the apiserver, and an etcd client reports the cluster as unhealthy/dead.
  2. The operator creates a static pod using a manifest that was previously checkpointed (see the sketch after this list). The static pod restores from a backup and uses the same label selectors, meaning the apiserver can connect to it again.
  3. After bootstrap, the operator creates a new seed member and begins the pivot back to self-hosted etcd.
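
For concreteness, here's a rough Go sketch of how that flow might look from the operator's side. This is purely illustrative: the function, the checkpoint path, and the static pod handling are all hypothetical, not anything the etcd-operator does today.

```go
package recovery

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// recoverFromTotalFailure sketches steps 1-3 above. All names and paths here
// are hypothetical placeholders, not part of the etcd-operator's actual API.
func recoverFromTotalFailure(ctx context.Context, endpoints []string, checkpointPath, staticPodDir string) error {
	// Step 1: probe etcd directly with an etcd client, since the apiserver
	// (which is backed by this same etcd cluster) may already be down.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err == nil {
		defer cli.Close()
		probeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		_, probeErr := cli.Get(probeCtx, "health-probe")
		cancel()
		if probeErr == nil {
			return nil // cluster is still reachable; nothing to recover
		}
	}

	// Step 2: drop a previously checkpointed static pod manifest onto the node.
	// The kubelet runs static pods without needing the apiserver; the pod
	// restores etcd from the latest backup and carries the same labels as the
	// old members, so the apiserver can reconnect to it once it comes back up.
	manifest, err := os.ReadFile(checkpointPath)
	if err != nil {
		return fmt.Errorf("reading checkpointed manifest: %w", err)
	}
	target := filepath.Join(staticPodDir, "etcd-recovery.yaml")
	if err := os.WriteFile(target, manifest, 0600); err != nil {
		return fmt.Errorf("writing static pod manifest: %w", err)
	}

	// Step 3: once the apiserver answers again, the operator would create a
	// new seed member and pivot back to self-hosted etcd (not shown here).
	return nil
}
```

The important property is that step 2 depends only on files already on the node (the checkpointed manifest and the backup), since the apiserver can't be consulted at that point.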

A few questions:

  • Is this the best way to recover from this kind of disaster, or is there a better one?
  • Is it the responsibility of the etcd-operator to maintain control plane stability? The flow above assumes the etcd cluster is backing a Kubernetes cluster, whereas right now no such assumption is made (all etcd clusters are treated as generic).
  • What kinds of failures would bring down an etcd cluster in the first place?

/cc @aaronlevy I know you've thought a lot about this, so would be great to hear your opinions.
