HA Control-Plane Node Failover
Purpose
An ETCD cluster will not accept additional control-plane nodes while it is in an unhealthy state. This situation is typically seen when one of the three control-plane nodes goes down and ETCD reports the cluster as unhealthy. This document walks through removing the faulted control-plane node from the cluster and then joining a new control-plane node in its place.
In this scenario, we are going to remove a faulted control-plane node called devops230.ef.com from the cluster. If you try to add another control-plane node without first removing the NotReady node, you will see the following error messages in the logs.
time="2022-11-20T17:08:24+05:00" level=fatal msg="ETCD join failed: etcdserver: unhealthy cluster"
k3s.service: Main process exited, code=exited, status=1/FAILURE
k3s.service: Failed with result 'exit-code'.
Failed to start Lightweight Kubernetes.
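These messages come from the k3s systemd unit on the node that is attempting to join. One way to follow them live while troubleshooting (the unit name assumes a default k3s server install):
# Follow the k3s server logs on the joining node
journalctl -u k3s -f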
# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
devops230.ef.com NotReady control-plane,etcd,master 42m v1.24.7+k3s1 192.168.2.230 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops231.ef.com Ready control-plane,etcd,master 38m v1.24.7+k3s1 192.168.2.231 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops232.ef.com Ready control-plane,etcd,master 39m v1.24.7+k3s1 192.168.2.232 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops233.ef.com Ready <none> 37m v1.24.7+k3s1 192.168.2.233 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops234.ef.com Ready <none> 39m v1.24.7+k3s1 192.168.2.234 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.25.1.el8_4.x86_64 containerd://1.6.8-k3s1
devops230.ef.com is currently down and the cluster is unable to sync with it. As a result, new control-plane nodes cannot be added.
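A quick way to confirm the node's Ready condition from the API (the node name is from this example):
kubectl get node devops230.ef.com -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'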
ETCD Version
Export all variables required to connect to the ETCD server embedded in the K3s server.
export ETCDCTL_ENDPOINTS='https://127.0.0.1:2379'
export ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt'
export ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt'
export ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key'
export ETCDCTL_API=3
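If etcdctl or curl later fails to connect, first confirm that these certificate files actually exist on the node (paths assume a default K3s server install):
ls -l /var/lib/rancher/k3s/server/tls/etcd/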
On a surviving control-plane node, get the deployed ETCD server version (run as root so the TLS files are readable):
curl -L --cacert "$ETCDCTL_CACERT" --cert "$ETCDCTL_CERT" --key "$ETCDCTL_KEY" https://127.0.0.1:2379/version
Sample output:
{"etcdserver":"3.5.3","etcdcluster":"3.5.0"}
In this case, the ETCD server version is 3.5.3. Use the code snippet below, changing ETCD_VER to the version returned by the previous command.
ETCD_VER=v3.5.3
# choose either URL
GOOGLE_URL=https://storage.googleapis.com/etcd
GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
DOWNLOAD_URL=${GOOGLE_URL}
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
rm -rf /tmp/etcd-download-test && mkdir -p /tmp/etcd-download-test
curl -L ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
/tmp/etcd-download-test/etcd --version
/tmp/etcd-download-test/etcdctl version
/tmp/etcd-download-test/etcdutl version
cp /tmp/etcd-download-test/{etcd,etcdctl,etcdutl} /usr/local/bin/
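After the copy, a quick sanity check that the etcdctl now on the PATH matches the cluster version:
etcdctl version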
Get the list of Members
etcdctl member list
The sample output will look like
318950680619d022, started, devops232.ef.com-5fc7a280, https://192.168.2.232:2380, https://192.168.2.232:2379, false
69619e9cd4388b05, started, devops230.ef.com-0638e58e, https://192.168.2.230:2380, https://192.168.2.230:2379, false
9ed2c4ae5a856ded, started, devops231.ef.com-cdfe7233, https://192.168.2.231:2380, https://192.168.2.231:2379, false
As we are going to remove the faulted node "devops230.ef.com" from the cluster, note the member ID returned in the first column of the above output and remove it from the ETCD cluster:
etcdctl member remove 69619e9cd4388b05
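If you prefer to look up the member ID programmatically instead of copying it by hand, a minimal sketch assuming the faulted hostname appears in the member name as in the output above:
# Take the first column of the member-list line that matches the faulted host
FAULTED_ID=$(etcdctl member list | awk -F', ' '/devops230.ef.com/{print $1}')
etcdctl member remove "$FAULTED_ID"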
Once the node is removed from ETCD, drain the workloads off the node in the K3s cluster using
kubectl drain <node_name> --delete-emptydir-data --ignore-daemonsets
and then delete the node by running
kubectl delete node <node_name>
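If the faulted host later comes back online and you intend to reuse it, wipe its stale k3s and ETCD state first so it does not try to rejoin with the old member identity. A sketch assuming the default k3s install location:
# Run on the faulted node (devops230.ef.com) once it is reachable again
/usr/local/bin/k3s-uninstall.sh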
Get the list of nodes in the cluster
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
devops231.ef.com Ready control-plane,etcd,master 76m v1.24.7+k3s1
devops232.ef.com Ready control-plane,etcd,master 78m v1.24.7+k3s1
devops233.ef.com Ready <none> 76m v1.24.7+k3s1
devops234.ef.com Ready <none> 78m v1.24.7+k3s1
Add the new Control-Plane Node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.24.7+k3s1 K3S_TOKEN=<TOKEN> sh -s server --server https://<SAN-IP>:6443 --disable traefik,local-storage --tls-san <SAN-IP>
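Here <TOKEN> is the cluster join token and <SAN-IP> is the address already used as the TLS SAN for the existing servers. The token can be read on any existing control-plane node from its default location (path assumes a standard k3s install):
# On an existing control-plane node, e.g. devops231.ef.com
cat /var/lib/rancher/k3s/server/node-token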
Get the nodes in the cluster
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
devops231.ef.com Ready control-plane,etcd,master 76m v1.24.7+k3s1
devops232.ef.com Ready control-plane,etcd,master 78m v1.24.7+k3s1
devops233.ef.com Ready <none> 76m v1.24.7+k3s1
devops234.ef.com Ready <none> 78m v1.24.7+k3s1
devops243.ef.com Ready control-plane,etcd,master 91s v1.24.7+k3s1
Wait for some time to let the ETCD server sync its data, then check the component status.
# kubectl get cs
NAME STATUS MESSAGE ERROR
etcd-0 Healthy {"health":"true","reason":""}
scheduler Healthy ok
controller-manager Healthy ok
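Optionally, confirm that ETCD itself reports all members healthy again by reusing the ETCDCTL_* variables exported earlier (run on one of the control-plane nodes):
etcdctl member list
etcdctl endpoint health --cluster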