HA Control-Plane Node Failover
Purpose
An ETCD cluster will not accept additional control-plane nodes while it is in an unhealthy state. This situation is typically seen when one of the three control-plane nodes goes down and ETCD reports the cluster as unhealthy. This document walks through removing the faulted control-plane node from the cluster and then joining a new control-plane node in its place.
In this scenario, we are going to remove a faulted control-plane node called devops230.ef.com from the cluster. If you try to add another control-plane node without first removing the NotReady node, you will see the following error messages in the logs.
time="2022-11-20T17:08:24+05:00" level=fatal msg="ETCD join failed: etcdserver: unhealthy cluster"
k3s.service: Main process exited, code=exited, status=1/FAILURE
k3s.service: Failed with result 'exit-code'.
Failed to start Lightweight Kubernetes.
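These messages come from the k3s systemd unit on the node that is attempting to join. One way to follow them live while troubleshooting (the unit name assumes a default k3s server install):
# Follow the k3s server logs on the joining node
journalctl -u k3s -f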
# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
devops230.ef.com NotReady control-plane,etcd,master 42m v1.24.7+k3s1 192.168.2.230 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops231.ef.com Ready control-plane,etcd,master 38m v1.24.7+k3s1 192.168.2.231 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops232.ef.com Ready control-plane,etcd,master 39m v1.24.7+k3s1 192.168.2.232 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops233.ef.com Ready <none> 37m v1.24.7+k3s1 192.168.2.233 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops234.ef.com Ready <none> 39m v1.24.7+k3s1 192.168.2.234 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops230.ef.com is currently down and the cluster is unable to sync with it. As a result, new control-plane nodes cannot be added.
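A quick way to confirm the node's Ready condition from the API (the node name is from this example):
kubectl get node devops230.ef.com -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'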
ETCD Version
Export all variables required to connect to the ETCD server embedded in the K3s server.
export ETCDCTL_ENDPOINTS='https://127.0.0.1:2379'
export ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt'
export ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt'
export ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key'
export ETCDCTL_API=3
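If etcdctl or curl later fails to connect, first confirm that these certificate files actually exist on the node (paths assume a default K3s server install):
ls -l /var/lib/rancher/k3s/server/tls/etcd/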
On a surviving control-plane node, get the deployed ETCD server version (run as root so the TLS files are readable):
curl -L --cacert "$ETCDCTL_CACERT" --cert "$ETCDCTL_CERT" --key "$ETCDCTL_KEY" https://127.0.0.1:2379/version
Sample output:
{"etcdserver":"3.5.3","etcdcluster":"3.5.0"}
In this case, the ETCD server version is 3.5.3. Use the code snippet below, changing ETCD_VER to the version returned by the previous command.
ETCD_VER=v3.5.3
# choose either URL
GOOGLE_URL=https://storage.googleapis.com/etcd
GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
DOWNLOAD_URL=${GOOGLE_URL}
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
rm -rf /tmp/etcd-download-test && mkdir -p /tmp/etcd-download-test
curl -L ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
/tmp/etcd-download-test/etcd --version
/tmp/etcd-download-test/etcdctl version
/tmp/etcd-download-test/etcdutl version
cp /tmp/etcd-download-test/{etcd,etcdctl,etcdutl} /usr/local/bin/
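After the copy, a quick sanity check that the etcdctl now on the PATH matches the cluster version:
etcdctl version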
Get the list of Members
etcdctl member list
The sample output will look like
318950680619d022, started, devops232.ef.com-5fc7a280, https://192.168.2.232:2380, https://192.168.2.232:2379, false
69619e9cd4388b05, started, devops230.ef.com-0638e58e, https://192.168.2.230:2380, https://192.168.2.230:2379, false
9ed2c4ae5a856ded, started, devops231.ef.com-cdfe7233, https://192.168.2.231:2380, https://192.168.2.231:2379, false
As we are going to remove the faulted node "devops230.ef.com" from the cluster, note the member ID returned in the first column of the above output and remove it from the ETCD cluster:
etcdctl member remove 69619e9cd4388b05
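If you prefer to look up the member ID programmatically instead of copying it by hand, a minimal sketch assuming the faulted hostname appears in the member name as in the output above:
# Take the first column of the member-list line that matches the faulted host
FAULTED_ID=$(etcdctl member list | awk -F', ' '/devops230.ef.com/{print $1}')
etcdctl member remove "$FAULTED_ID"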
Once the node is removed from ETCD, drain the workloads off the node in the K3s cluster using
kubectl drain <node_name> --delete-emptydir-data --ignore-daemonsets
and then delete the node by running
kubectl delete node <node_name>
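If the faulted host later comes back online and you intend to reuse it, wipe its stale k3s and ETCD state first so it does not try to rejoin with the old member identity. A sketch assuming the default k3s install location:
# Run on the faulted node (devops230.ef.com) once it is reachable again
/usr/local/bin/k3s-uninstall.sh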
Get the list of nodes in the cluster
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
devops231.ef.com Ready control-plane,etcd,master 76m v1.24.7+k3s1
devops232.ef.com Ready control-plane,etcd,master 78m v1.24.7+k3s1
devops233.ef.com Ready <none> 76m v1.24.7+k3s1
devops234.ef.com Ready <none> 78m v1.24.7+k3s1
Add the new Control-Plane Node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.24.7+k3s1 K3S_TOKEN=<TOKEN> sh -s server --server https://<SAN-IP>:6443 --disable traefik,local-storage --tls-san <SAN-IP>
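Here <TOKEN> is the cluster join token and <SAN-IP> is the address already used as the TLS SAN for the existing servers. The token can be read on any existing control-plane node from its default location (path assumes a standard k3s install):
# On an existing control-plane node, e.g. devops231.ef.com
cat /var/lib/rancher/k3s/server/node-token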
Get the nodes in the cluster
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
devops231.ef.com Ready control-plane,etcd,master 76m v1.24.7+k3s1
devops232.ef.com Ready control-plane,etcd,master 78m v1.24.7+k3s1
devops233.ef.com Ready <none> 76m v1.24.7+k3s1
devops234.ef.com Ready <none> 78m v1.24.7+k3s1
devops243.ef.com Ready control-plane,etcd,master 91s v1.24.7+k3s1
Wait for some time to let the ETCD server sync its data, then check the component status.
# kubectl get cs
NAME STATUS MESSAGE ERROR
etcd-0 Healthy {"health":"true","reason":""}
scheduler Healthy ok
controller-manager Healthy ok
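Optionally, confirm that ETCD itself reports all members healthy again by reusing the ETCDCTL_* variables exported earlier (run on one of the control-plane nodes):
etcdctl member list
etcdctl endpoint health --cluster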