This repository was archived by the owner on Mar 28, 2020. It is now read-only.

How can the operator recover from self-hosted cluster disasters? #1559

@jamiehannaford


After reading kubernetes-retired/bootkube#738, I'd like to understand some of the edge cases that affect using self-hosted etcd for the cluster, especially around HA and disaster recovery.

If the etcd-operator is being used to manage a cluster's etcd, that cluster is fully self-hosted. But what happens when etcd itself runs into a non-recoverable error that brings down the apiserver? Since the operator needs to talk to the apiserver to reconcile etcd state (e.g. add/delete pods), it gets caught in a failure loop. What I'd expect to happen:

  1. The operator cannot talk to the apiserver, and an etcd client reports the cluster as unhealthy/dead.
  2. The operator creates a static pod using a manifest that was previously checkpointed (see the sketch after this list). The static pod restores from a backup and uses the same label selectors, meaning the apiserver can connect to it again.
  3. After bootstrap, the operator creates a new seed member and begins the pivot back to self-hosted etcd.
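
For concreteness, here's a rough Go sketch of how that flow might look from the operator's side. This is purely illustrative: the function, the checkpoint path, and the static pod handling are all hypothetical, not anything the etcd-operator does today.

```go
package recovery

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// recoverFromTotalFailure sketches steps 1-3 above. All names and paths here
// are hypothetical placeholders, not part of the etcd-operator's actual API.
func recoverFromTotalFailure(ctx context.Context, endpoints []string, checkpointPath, staticPodDir string) error {
	// Step 1: probe etcd directly with an etcd client, since the apiserver
	// (which is backed by this same etcd cluster) may already be down.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err == nil {
		defer cli.Close()
		probeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		_, probeErr := cli.Get(probeCtx, "health-probe")
		cancel()
		if probeErr == nil {
			return nil // cluster is still reachable; nothing to recover
		}
	}

	// Step 2: drop a previously checkpointed static pod manifest onto the node.
	// The kubelet runs static pods without needing the apiserver; the pod
	// restores etcd from the latest backup and carries the same labels as the
	// old members, so the apiserver can reconnect to it once it comes back up.
	manifest, err := os.ReadFile(checkpointPath)
	if err != nil {
		return fmt.Errorf("reading checkpointed manifest: %w", err)
	}
	target := filepath.Join(staticPodDir, "etcd-recovery.yaml")
	if err := os.WriteFile(target, manifest, 0600); err != nil {
		return fmt.Errorf("writing static pod manifest: %w", err)
	}

	// Step 3: once the apiserver answers again, the operator would create a
	// new seed member and pivot back to self-hosted etcd (not shown here).
	return nil
}
```

The important property is that step 2 depends only on files already on the node (the checkpointed manifest and the backup), since the apiserver can't be consulted at that point.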

A few questions:

  • Is this the best way to recover from this kind of disaster, or is there a better one?
  • Is it the responsibility of the etcd-operator to maintain control plane stability? The flow above assumes the etcd cluster is backing a Kubernetes cluster, whereas right now no such assumption is made (all etcd clusters are treated as generic).
  • What kinds of failures would bring down an etcd cluster in the first place?

/cc @aaronlevy I know you've thought a lot about this, so would be great to hear your opinions.
