Kubernetes gone bust. Now what?

January 17, 2021

We've been operating a few Kubernetes clusters. Someone trips over, falls on a keyboard, and deletes several services. We need to (quickly!) get those back online.

We have several options to get things back to how they were:

  • we have everything in version control - pipelines or GitOps reconcilers will take care of it;
  • restore an etcd backup - all Kubernetes objects are stored in etcd, so periodically backing up the etcd cluster data can be a lifesaver in disaster scenarios;
  • use specific Kubernetes backup tools - for example Velero.

A tool like Velero is great since it backs up Kubernetes objects and also instructs your cloud provider to make backups of PersistentVolumes. That said, it has a ramp-up and we need something now. Backing up our etcd cluster is always a safe bet and there are ways of doing that.

For a while now I've been a fan of Earliest Testable/Usable/Lovable as an "opposition" to MVP.


With this in mind, what we want is a fast way to have a safety net (skate) in case something goes wrong. Fortunately, etcd comes equipped with built-in snapshot capabilities.

Backup etcd

We need to identify a few things from the etcd deployment in order to make a backup.

- command:
  - etcd
  - --advertise-client-urls=
  - --cert-file=/etc/kubernetes/pki/etcd/server.crt
  - --client-cert-auth=true
  - --data-dir=/var/lib/etcd
  - --initial-advertise-peer-urls=
  - --initial-cluster=backup-control-plane=
  - --key-file=/etc/kubernetes/pki/etcd/server.key
  - --listen-client-urls=,
  - --listen-metrics-urls=
  - --listen-peer-urls=
  - --name=backup-control-plane
  - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
  - --peer-client-cert-auth=true
  - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
  - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
  - --snapshot-count=10000
  - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
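If the manifest is available on disk, the values can be read out of it rather than copied by hand. A minimal sketch, assuming a kubeadm-style static pod manifest at /etc/kubernetes/manifests/etcd.yaml and flags written one per line as in the listing above (the default path and the etcd_flag helper name are assumptions made for illustration):

```shell
# Print the value of a single etcd flag found in a manifest file.
# Assumption: flags appear as "- --<flag>=<value>", one per line.
# The default manifest path is the usual kubeadm location, not taken from the post.
etcd_flag() {
  local flag="$1"
  local manifest="${2:-/etc/kubernetes/manifests/etcd.yaml}"
  # Keep only the text after "--<flag>=" and stop at the first match.
  sed -n "s/^[[:space:]]*-[[:space:]]*--${flag}=\(.*\)\$/\1/p" "$manifest" | head -n 1
}
```

For example, `etcd_flag trusted-ca-file` would print the path to hand to --cacert, and `etcd_flag cert-file` does not get confused by --peer-cert-file because the match is anchored on the full flag name.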

Armed with the advertise-client-urls, cert-file, key-file and trusted-ca-file values, we can take a snapshot:

ETCDCTL_API=3 etcdctl --endpoints \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" \
snapshot save snapshotdb
{"level":"info","ts":1610913776.2521563,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"snapshotdb.part"}
{"level":"info","ts":"2021-01-17T20:02:56.256Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1610913776.2563014,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":""}
{"level":"info","ts":"2021-01-17T20:02:56.273Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1610913776.2887816,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"","size":"3.6 MB","took":0.036583317}
Snapshot saved at snapshotdb

To be safe, we can verify that the snapshot is OK:

ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshotdb
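+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 9b193bf0 |     1996 |       2009 |     2.7 MB |
+----------+----------+------------+------------+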
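Since the value of a snapshot comes from taking it periodically, the save and status commands above could be wrapped in a small function and run from cron or a similar scheduler. This is only a sketch: the backup_etcd name, the BACKUP_DIR location and the file naming are assumptions, and ETCD_ENDPOINT has to be set by the caller to the advertise-client-urls value of the cluster.

```shell
# Take a timestamped etcd snapshot and check that etcdctl can read it back.
# Assumptions: ETCD_ENDPOINT is provided by the caller; BACKUP_DIR and the
# snapshot-<timestamp>.db naming are illustrative choices.
backup_etcd() {
  local dir="${BACKUP_DIR:-/var/backups/etcd}"
  local snap="${dir}/snapshot-$(date +%Y%m%dT%H%M%S).db"

  if [ -z "$ETCD_ENDPOINT" ]; then
    echo "ETCD_ENDPOINT is not set" >&2
    return 1
  fi
  mkdir -p "$dir" || return 1

  # Same command as above, with etcdctl's own output sent to stderr.
  ETCDCTL_API=3 etcdctl --endpoints="$ETCD_ENDPOINT" \
    --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
    --cert="/etc/kubernetes/pki/etcd/server.crt" \
    --key="/etc/kubernetes/pki/etcd/server.key" \
    snapshot save "$snap" >&2 || return 1

  # Stop here if etcdctl reports an error for the file it just wrote.
  ETCDCTL_API=3 etcdctl --write-out=table snapshot status "$snap" >&2 || return 1

  # Only the snapshot path goes to stdout, so callers can capture it.
  echo "$snap"
}
```

Keeping the path as the only thing on stdout makes it easy to chain, for instance to copy the file off the control-plane node afterwards.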

Restore etcd

kube-apiserver uses etcd to store and retrieve information and, as such, we need to stop it first. This will depend on how you have kube-apiserver configured. Next, we restore etcd:

ETCDCTL_API=3 etcdctl snapshot restore snapshotdb --data-dir="/var/lib/etcd-restore"
{"level":"info","ts":1610913810.5761065,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}
{"level":"info","ts":1610913810.599168,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":7655}
{"level":"info","ts":1610913810.60404,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1610913810.6153672,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}
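How kube-apiserver gets stopped before the restore depends on the setup. As one possibility, assuming a kubeadm-style control plane where kube-apiserver runs as a kubelet static pod, moving its manifest out of the watched directory stops it and moving it back starts it again. The directory defaults and both function names below are assumptions for illustration:

```shell
# Assumption: kube-apiserver is a static pod defined in MANIFESTS_DIR (kubeadm layout).
# PARKED_DIR is a hypothetical holding directory outside the watched one.
MANIFESTS_DIR="${MANIFESTS_DIR:-/etc/kubernetes/manifests}"
PARKED_DIR="${PARKED_DIR:-/etc/kubernetes/manifests-parked}"

# Move the manifest away; the kubelet stops the pod once the file is gone.
stop_apiserver() {
  mkdir -p "$PARKED_DIR" && mv "$MANIFESTS_DIR/kube-apiserver.yaml" "$PARKED_DIR/"
}

# Move the manifest back; the kubelet recreates the pod from it.
start_apiserver() {
  mv "$PARKED_DIR/kube-apiserver.yaml" "$MANIFESTS_DIR/"
}
```

The kubelet picks up the change on its own, so it can take a few seconds before the pod actually disappears or comes back.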

We need to tell etcd to use this data folder by pointing its etcd-data volume at it and, once etcd is up and running, bring kube-apiserver back online:

- hostPath:
    path: /var/lib/etcd-restore
    type: DirectoryOrCreate
  name: etcd-data
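The volume change can also be scripted instead of edited by hand. A rough sketch, assuming the etcd manifest lives at the usual kubeadm path and that the hostPath line currently reads "path: /var/lib/etcd" (the data-dir shown earlier); the use_restored_data_dir name is made up for this example:

```shell
# Point the etcd-data hostPath at the restored directory.
# Assumptions: kubeadm manifest location; the volume line is exactly
# "path: /var/lib/etcd". Lines such as "mountPath: ..." are left untouched.
use_restored_data_dir() {
  local manifest="${1:-/etc/kubernetes/manifests/etcd.yaml}"
  local new_dir="${2:-/var/lib/etcd-restore}"
  local tmp
  # Work in a temp file outside the manifests directory, since the kubelet
  # may try to load stray files it finds next to the real manifests.
  tmp="$(mktemp)" || return 1
  sed "s|^\([[:space:]]*path:[[:space:]]*\)/var/lib/etcd[[:space:]]*\$|\1${new_dir}|" \
    "$manifest" > "$tmp" && cat "$tmp" > "$manifest"
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

Only the host side of the volume changes here; the path etcd sees inside the container stays the same, so the --data-dir flag does not need to be touched.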

Although this looks a bit clunky, it's an easy way (skate again) to ensure a safety net in case of disaster while buying time to work on a more capable solution (scooter -> bicycle -> motorcycle -> car). It might even come to the point where, for example, the bicycle is good enough.

Ricardo Castro


Software Engineering, DevOps, SRE, Taekwondo and Metal



© 2021