
Replace a failed control plane node RKE2


Replace a failed control plane node

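NOTE: This procedure assumes that a majority of control plane nodes are still healthy. If a majority of control plane nodes have failed permanently, for example because their instances have been terminated, etcd has lost quorum and a new etcd cluster must be created.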
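
As a rough check of the condition in the note, the sketch below applies the usual etcd quorum rule, floor(n/2) + 1. The function name `has_quorum` is hypothetical and is not part of RKE2 or etcd; it only illustrates the arithmetic.

```shell
# Hypothetical helper: succeeds when the healthy members still form a quorum.
# Usage: has_quorum TOTAL_MEMBERS HEALTHY_MEMBERS
has_quorum() {
  total="$1"
  healthy="$2"
  # etcd needs floor(total / 2) + 1 members to commit writes.
  quorum=$(( total / 2 + 1 ))
  [ "$healthy" -ge "$quorum" ]
}
```

For a three-node control plane with one failed node, `has_quorum 3 2` succeeds, so the replacement steps below apply. With two failed nodes, `has_quorum 3 1` fails, which is the case the note describes.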

  1. Confirm the failed control plane node is not communicating with other nodes.

    kubectl get nodes --output=custom-columns="NAME":".metadata.name","READY":".status.conditions[?(@.type==\"Ready\")].status"
    


    NAME                      READY
    cluster-control-plane-0   True
    cluster-control-plane-1   Unknown
    cluster-control-plane-2   True
    


    If a node’s READY column shows anything other than True (for example False or Unknown), the node is not ready. In the above example, the cluster-control-plane-1 node is not ready.
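
    When the cluster has many nodes, the table can be filtered instead of read by eye. The following is a minimal sketch, assuming the two-column NAME/READY output shown above; `not_ready_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the NAME/READY table on stdin and prints the
# name of every row whose READY value is anything other than "True".
# The header row and blank lines are skipped.
not_ready_nodes() {
  awk 'NR > 1 && NF >= 2 && $2 != "True" { print $1 }'
}
```

    Pipe the output of the `kubectl get nodes` command from this step into `not_ready_nodes` to list only the nodes that are not ready.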

  2. Permanently remove the failed node.

    For example, if the node is an AWS EC2 instance, use the AWS CLI or Console to terminate the instance.

  3. Identify an etcd member ready to accept etcd API requests.

    kubectl -n kube-system get pod --selector=tier=control-plane,component=etcd --output=custom-columns="NAME":".metadata.name","READY":".status.conditions[?(@.type==\"Ready\")].status"
    


    NAME                           READY
    etcd-cluster-control-plane-0   True
    etcd-cluster-control-plane-1   False
    etcd-cluster-control-plane-2   True
    


    In the above example, both etcd-cluster-control-plane-0 and etcd-cluster-control-plane-2 are ready. Choose one and note (or copy) its name.
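
    Choosing a ready member can also be scripted. The sketch below assumes the same NAME/READY table as above and simply takes the first row marked True; `first_ready_etcd_pod` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the NAME/READY table on stdin and prints the
# name of the first pod whose READY value is exactly "True".
first_ready_etcd_pod() {
  awk 'NR > 1 && $2 == "True" { print $1; exit }'
}
```

    Piping the `kubectl -n kube-system get pod` command from this step into `first_ready_etcd_pod` yields a name that can be assigned to READY_ETCD_MEMBER in the next steps.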

  4. Identify the failed etcd member.

    Find the ID of the etcd member for the failed control plane node.

    READY_ETCD_MEMBER="<name of etcd member from previous step>"
    ETCDCTL="ETCDCTL_API=3 etcdctl --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --endpoints=https://127.0.0.1:2379"
    kubectl -n kube-system exec -it "$READY_ETCD_MEMBER" -- /bin/sh -c "$ETCDCTL member list"
    


    1d021ffdd096a804, started, cluster-control-plane-1, https://172.17.0.6:2380, https://172.17.0.6:2379, false
    40fd14fa28910cab, started, cluster-control-plane-0, https://172.17.0.4:2380, https://172.17.0.4:2379, false
    87651970646a8073, started, cluster-control-plane-2, https://172.17.0.5:2380, https://172.17.0.5:2379, false
    


    In the above example, the failed control plane node is cluster-control-plane-1, so the ID of its etcd member is 1d021ffdd096a804. Note (or copy) this ID.
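
    The lookup can be sketched as a small filter over the `member list` output. This assumes the comma-separated format shown above, with the member ID in the first field and the node name in the third; `etcd_member_id` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `etcdctl member list` output on stdin and prints
# the ID of the member whose name (third field) equals the given node name.
# Usage: etcd_member_id NODE_NAME
etcd_member_id() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

    For example, feeding the output above to `etcd_member_id cluster-control-plane-1` prints 1d021ffdd096a804. When piping the output of `kubectl exec`, it is probably simplest to drop the `-it` flags, since no interactive terminal is needed.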

  5. Remove the failed etcd member.

    READY_ETCD_MEMBER="<name of etcd member from previous steps>"
    ETCD_ID_TO_REMOVE="<etcd member ID from previous step>"
    ETCDCTL="ETCDCTL_API=3 etcdctl --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --endpoints=https://127.0.0.1:2379"
    kubectl -n kube-system exec -it "$READY_ETCD_MEMBER" -- /bin/sh -c "$ETCDCTL member remove $ETCD_ID_TO_REMOVE"
    


    Member 1d021ffdd096a804 removed from cluster a6ea9ad1b116d02f
    


  6. Delete the failed Node from the Kubernetes API.

    kubectl delete node cluster-control-plane-1
    


  7. Create a new control plane node to replace the failed node.
