Replace a failed control plane node RKE2

NOTE: If a majority of control plane nodes have failed permanently (for example, their instances have been terminated), etcd cannot regain quorum, and a new etcd cluster must be created.
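
The majority condition in the note comes down to simple arithmetic: an etcd cluster of n members needs floor(n/2) + 1 healthy members to keep quorum. The following is a minimal sketch of that check; `has_quorum` is a hypothetical helper name, not part of RKE2 or etcd.

```shell
# has_quorum TOTAL FAILED
# Hypothetical helper: succeeds (exit 0) when the surviving members still
# form a majority, i.e. TOTAL - FAILED >= floor(TOTAL / 2) + 1.
has_quorum() {
  total=$1
  failed=$2
  quorum=$(( total / 2 + 1 ))
  [ $(( total - failed )) -ge "$quorum" ]
}
```

For a three-node control plane, `has_quorum 3 1` succeeds, so this procedure applies; `has_quorum 3 2` fails, which is the case the note describes.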

Confirm the failed control plane node is not communicating with other nodes.

kubectl get nodes --output=custom-columns="NAME":".metadata.name","READY":".status.conditions[?(@.type==\"Ready\")].status"

NAME                      READY
cluster-control-plane-0   True
cluster-control-plane-1   Unknown
cluster-control-plane-2   True

If a node’s READY column shows anything other than True (for example, False or Unknown), the node is not ready. In the above example, the cluster-control-plane-1 node is not ready.
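
When the cluster has many nodes, the same check can be scripted. The sketch below assumes the two-column NAME/READY output shown above; `not_ready_nodes` is a hypothetical helper name.

```shell
# not_ready_nodes: read the two-column NAME/READY table on stdin and print
# the name of every node whose READY value is not "True".
# The header row (line 1) is skipped.
not_ready_nodes() {
  awk 'NR > 1 && NF >= 2 && $2 != "True" { print $1 }'
}
```

To use it, pipe the `kubectl get nodes` command from this step into `not_ready_nodes`. With the example output it prints only cluster-control-plane-1.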

Permanently remove the failed node.
For example, if the node is an AWS EC2 instance, use the AWS CLI or Console to terminate the instance.

Identify an etcd member ready to accept etcd API requests.

kubectl -n kube-system get pod --selector=tier=control-plane,component=etcd --output=custom-columns="NAME":".metadata.name","READY":".status.conditions[?(@.type==\"Ready\")].status"

NAME                           READY
etcd-cluster-control-plane-0   True
etcd-cluster-control-plane-1   False
etcd-cluster-control-plane-2   True

In the above example, both etcd-cluster-control-plane-0 and etcd-cluster-control-plane-2 are ready. Choose one and note (or copy) its name.
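
The choice of a ready etcd pod can also be scripted. This is a minimal sketch that assumes the NAME/READY output format shown above; `first_ready_pod` is a hypothetical helper name.

```shell
# first_ready_pod: read the two-column NAME/READY table on stdin and print
# the name of the first pod whose READY value is "True".
# The header row (line 1) is skipped.
first_ready_pod() {
  awk 'NR > 1 && $2 == "True" && !seen { print $1; seen = 1 }'
}
```

Piping the `kubectl -n kube-system get pod` command from this step into `first_ready_pod` and capturing the result with `$( ... )` gives a value suitable for READY_ETCD_MEMBER in the following steps.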

Identify the failed etcd member.

Find the ID of the etcd member for the failed control plane node.

READY_ETCD_MEMBER="<name of etcd member from previous step>"
ETCDCTL="ETCDCTL_API=3 etcdctl --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --endpoints=https://127.0.0.1:2379"
kubectl -n kube-system exec -it "$READY_ETCD_MEMBER" -- /bin/sh -c "$ETCDCTL member list"

1d021ffdd096a804, started, cluster-control-plane-1, https://172.17.0.6:2380, https://172.17.0.6:2379, false
40fd14fa28910cab, started, cluster-control-plane-0, https://172.17.0.4:2380, https://172.17.0.4:2379, false
87651970646a8073, started, cluster-control-plane-2, https://172.17.0.5:2380, https://172.17.0.5:2379, false

In the above example, the failed control plane node is cluster-control-plane-1, so its etcd member ID is 1d021ffdd096a804. Note, or copy, this ID.
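
Copying the ID by hand is error-prone, so the lookup can be scripted. The sketch below assumes the default comma-separated `member list` output shown above, where the ID is the first field and the member name is the third; `etcd_member_id` is a hypothetical helper name.

```shell
# etcd_member_id NODE_NAME: read `etcdctl member list` output on stdin and
# print the ID (first field) of the member whose name (third field) equals
# NODE_NAME. Fields are separated by a comma followed by a space.
etcd_member_id() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

Piping the `member list` command from this step into `etcd_member_id cluster-control-plane-1` prints 1d021ffdd096a804 for the example output.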

Remove the failed etcd member.

READY_ETCD_MEMBER="<name of etcd member from previous steps>"
ETCD_ID_TO_REMOVE="<etcd member ID from previous step>"
ETCDCTL="ETCDCTL_API=3 etcdctl --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --endpoints=https://127.0.0.1:2379"
kubectl -n kube-system exec -it "$READY_ETCD_MEMBER" -- /bin/sh -c "$ETCDCTL member remove $ETCD_ID_TO_REMOVE"

Member 1d021ffdd096a804 removed from cluster a6ea9ad1b116d02f
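
To confirm the removal, the `member list` command from the earlier step can be run again and checked for the removed ID. This is a minimal sketch under the same output-format assumption; `member_removed` is a hypothetical helper name.

```shell
# member_removed ID: read `etcdctl member list` output on stdin.
# Exits 0 if no line has ID as its first field, and 1 if the member is
# still listed.
member_removed() {
  awk -F', ' -v id="$1" 'BEGIN { found = 0 } $1 == id { found = 1 } END { exit found }'
}
```

Piping a fresh `member list` into `member_removed "$ETCD_ID_TO_REMOVE"` should succeed once the member is gone.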

Delete the failed Node from the Kubernetes API.

```
kubectl delete node cluster-control-plane-1
```

Create a new control plane node to replace the failed node.