Replace a failed control plane node (RKE2)
Confirm the failed control plane node is not communicating with other nodes.
kubectl get nodes --output=custom-columns="NAME":".metadata.name","READY":".status.conditions[?(@.type==\"Ready\")].status"
NAME                      READY
cluster-control-plane-0   True
cluster-control-plane-1   Unknown
cluster-control-plane-2   True
If the node’s READY column does not say True, then the node is not ready. In the above example, the cluster-control-plane-1 node is not ready.
Permanently remove the failed node.
For example, if the node is an AWS EC2 instance, use the AWS CLI or Console to terminate the instance.
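Before terminating anything, it can help to confirm exactly which nodes are not ready. The following is a minimal sketch, not part of the original procedure: `not_ready_nodes` is a hypothetical helper that reads the two-column NAME/READY output of the node listing command above on standard input and prints every name whose READY value is anything other than True.

```shell
# Hypothetical helper (not part of kubectl or RKE2): print the names of nodes
# whose READY column is not "True". Expects the two-column NAME/READY output
# of the node listing command on standard input.
not_ready_nodes() {
  # Skip the header row and blank lines, then print column 1 wherever
  # column 2 is anything other than "True" (for example "Unknown" or "False").
  awk 'NR > 1 && NF >= 2 && $2 != "True" { print $1 }'
}

# Usage (assumes kubectl is configured for the affected cluster):
#   kubectl get nodes --output=custom-columns="NAME":".metadata.name","READY":".status.conditions[?(@.type==\"Ready\")].status" | not_ready_nodes
```

With the sample output shown earlier, the helper prints only cluster-control-plane-1.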
Identify an etcd member ready to accept etcd API requests.
kubectl -n kube-system get pod --selector=tier=control-plane,component=etcd --output=custom-columns="NAME":".metadata.name","READY":".status.conditions[?(@.type==\"Ready\")].status"
NAME                           READY
etcd-cluster-control-plane-0   True
etcd-cluster-control-plane-1   False
etcd-cluster-control-plane-2   True
In the above example, both etcd-cluster-control-plane-0 and etcd-cluster-control-plane-2 are ready. Choose one and note (or copy) its name.
Identify the failed etcd member.
Find the ID of the etcd member for the failed control plane node.
READY_ETCD_MEMBER="<name of etcd member from previous step>"
ETCDCTL="ETCDCTL_API=3 etcdctl --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --endpoints=https://127.0.0.1:2379"
kubectl -n kube-system exec -it "$READY_ETCD_MEMBER" -- /bin/sh -c "$ETCDCTL member list"
1d021ffdd096a804, started, cluster-control-plane-1, https://172.17.0.6:2380, https://172.17.0.6:2379, false
40fd14fa28910cab, started, cluster-control-plane-0, https://172.17.0.4:2380, https://172.17.0.4:2379, false
87651970646a8073, started, cluster-control-plane-2, https://172.17.0.5:2380, https://172.17.0.5:2379, false
In the above example, the failed control plane node is cluster-control-plane-1, so the etcd member ID is 1d021ffdd096a804. Note (or copy) this ID.
Remove the failed etcd member.
READY_ETCD_MEMBER="<name of etcd member from previous steps>"
ETCD_ID_TO_REMOVE="<etcd member ID from previous step>"
ETCDCTL="ETCDCTL_API=3 etcdctl --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --endpoints=https://127.0.0.1:2379"
kubectl -n kube-system exec -it "$READY_ETCD_MEMBER" -- /bin/sh -c "$ETCDCTL member remove $ETCD_ID_TO_REMOVE"
Member 1d021ffdd096a804 removed from cluster a6ea9ad1b116d02f
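Copying the member ID by hand is easy to get wrong, so the lookup can be scripted. The sketch below is an illustration rather than part of the documented procedure: `etcd_member_id` is a hypothetical helper that reads `member list` output in the comma-separated format shown above and prints the ID of the member whose name exactly matches its argument. It assumes the etcd member name equals the node name, as in the example; if member names in your cluster differ from node names, the exact match returns nothing and the name must be adjusted.

```shell
# Hypothetical helper: print the etcd member ID (field 1) for the member whose
# name (field 3) exactly matches the first argument. Reads "member list"
# output, with fields separated by ", ", on standard input.
etcd_member_id() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage sketch (assumes READY_ETCD_MEMBER and ETCDCTL are set as in the steps
# above; the -it flags are dropped because the output is piped, not read
# interactively):
#   ETCD_ID_TO_REMOVE=$(kubectl -n kube-system exec "$READY_ETCD_MEMBER" -- \
#     /bin/sh -c "$ETCDCTL member list" | etcd_member_id cluster-control-plane-1)
#   echo "$ETCD_ID_TO_REMOVE"   # inspect the value before running member remove
```

Printing the ID before passing it to `member remove` keeps a manual check in the loop, which matters because removing the wrong member can cost the cluster its quorum.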
Delete the failed Node from the Kubernetes API.
kubectl delete node cluster-control-plane-1
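Before creating the replacement, it may be worth confirming that the failed node no longer appears in either the node list or the etcd member list. The following sketch assumes output in the formats shown above; `name_absent` is a hypothetical helper, not a kubectl or etcdctl feature. It succeeds (exit status 0) only when no field in its input exactly equals the given name.

```shell
# Hypothetical helper: exit 0 if no field on standard input exactly equals the
# first argument, and exit 1 otherwise. Fields may be separated by commas
# and/or spaces, so it accepts both "kubectl get nodes" style output and
# "etcdctl member list" output.
name_absent() {
  awk -F'[, ]+' -v name="$1" '
    { for (i = 1; i <= NF; i++) if ($i == name) found = 1 }
    END { exit (found ? 1 : 0) }
  '
}

# Usage sketch (assumes READY_ETCD_MEMBER and ETCDCTL are set as in the steps
# above):
#   kubectl get nodes --output=custom-columns="NAME":".metadata.name" \
#     | name_absent cluster-control-plane-1 && echo "node removed"
#   kubectl -n kube-system exec "$READY_ETCD_MEMBER" -- /bin/sh -c "$ETCDCTL member list" \
#     | name_absent cluster-control-plane-1 && echo "etcd member removed"
```

Because the comparison is an exact field match, a pod name such as etcd-cluster-control-plane-1 does not count as a match for cluster-control-plane-1.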
Create a new control plane node to replace the failed node.