CX Worker HA Deployment
This Kubernetes deployment requires a single Control Plane node, while the workload is distributed across multiple Worker nodes spanning sites. The solution continues to work even if the Control Plane is down; however, Control Plane functions such as rescheduling of the workload cannot be performed during a Control Plane outage.
The CX solution will suffer a partial or full failure when the Control Plane and any Worker node(s) are down simultaneously.
Suitable when
Only two physical sites are available in the infrastructure for the Kubernetes deployment.
A MySQL cluster that this Kubernetes deployment could use for its state management is not available in the customer infrastructure.
High availability of the workload is needed.
A temporary outage of the Control Plane is acceptable; high availability of the Control Plane is not needed.
Recommendations
Provide cloud-native storage, such as NFS, vSAN, or any CNCF-certified cloud-native storage, accessible at both sites.
Take a snapshot backup of the Control Plane to a neutral S3-compatible storage, as sketched below. In case of a failure of the site hosting the Control Plane node, the Control Plane may then be restored on any available node.
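As a sketch of such a backup (assuming the built-in RKE2 etcd snapshot tooling and an S3-compatible endpoint of your own; the endpoint, bucket, and credentials below are placeholders), an on-demand snapshot can be pushed to S3 from the Control Plane node:

```bash
# Sketch only: endpoint, bucket and keys are placeholder values for your own S3-compatible storage.
rke2 etcd-snapshot save \
  --s3 \
  --s3-endpoint s3.example.com \
  --s3-bucket rke2-control-plane-backups \
  --s3-access-key <ACCESS_KEY> \
  --s3-secret-key <SECRET_KEY>
```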
Limitations
If the Control Plane fails to auto-restart, manual intervention may be needed to fix, restart, or restore it.
If the site hosting the Control Plane and one or more Worker nodes is down, the solution may be partially or completely down.
Workloads from a failed Worker node cannot be rescheduled until the Control Plane node is up again.
What-if Scenarios
See the What if scenarios below to learn more about the behaviour of this cluster.
| What if | Then |
|---|---|
| A POD is down | Some features of the application may fail to work. |
| A component/service running inside the POD is down | This will affect the interworking of the application; it might not be able to query or save data. |
| A component/service running inside the POD is up | The application will start working normally. |
| A worker node is down | This will cause around a 5-minute downtime for some features within the application, as some pods move over to another working node. |
| A worker node is restored | No effect. |
| The control-plane node is down | No configuration changes can be made; you will need to wait for the control-plane node to come back up to make any system changes. |
| The control-plane node is restored | Configurations can now be made. |
| A worker node is down while the control-plane is down | The application may fail to work. |
Installation Steps
Step-1 Install RKE2 Control-plane
Install the RKE2 control-plane as described in RKE2 Control-plane Deployment.
Step-2 Install Kube-VIP
A multi-worker cluster requires a floating IP shared among all workers. Kube-VIP provides this floating IP, which ensures availability of the workload during a temporary absence of the Control Plane.
Kube-VIP is not needed if you are creating a High Availability deployment using DNS and are just adding worker nodes.
Kube-VIP Requirements
ARP – when using ARP (Layer 2), Kube-VIP uses leader election. Other modes such as BGP, Routing Table, and Wireguard can also be used. To understand how Kube-VIP works, refer to the Kube-vip architecture.
All worker nodes must have the same interface name. The interface names can be viewed by running 'ip a s' in a terminal (see the verification sketch after this list).
All worker nodes, including the VIP, should be on the same subnet for the VIP configuration.
ARP is allowed on the worker nodes' subnet.
The VIP has an FQDN assigned (used for CX).
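The following is a minimal sketch, not part of the official procedure, for verifying these requirements on each worker node; the interface name eth0 and the FQDN cx-vip.example.com are placeholder values.

```bash
# Placeholder values: replace eth0 and cx-vip.example.com with your own interface and VIP FQDN.
ip a s                           # interface name must be identical on every worker node
ip -o -4 addr show dev eth0      # node IP must be in the same subnet as the planned VIP
getent hosts cx-vip.example.com  # the VIP FQDN must resolve to the chosen VIP address
```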
Decide the IP and the interface on all nodes for Kube-VIP and set these as environment variables. This step must be completed before deploying any other additional nodes in the cluster (both CP and Workers).
```bash
export VIP=<Virtual-IP>
export INTERFACE=<Interface>
```
Import the RBAC manifest for Kube-VIP
```bash
curl https://kube-vip.io/manifests/rbac.yaml > /var/lib/rancher/rke2/server/manifests/kube-vip-rbac.yaml
```
Fetch the kube-vip image
```bash
/var/lib/rancher/rke2/bin/crictl -r "unix:///run/k3s/containerd/containerd.sock" pull ghcr.io/kube-vip/kube-vip:latest
```
Deploy the Kube-VIP
```bash
CONTAINERD_ADDRESS=/run/k3s/containerd/containerd.sock ctr -n k8s.io run \
  --rm \
  --net-host \
  ghcr.io/kube-vip/kube-vip:latest vip \
  /kube-vip manifest daemonset \
    --arp \
    --interface $INTERFACE \
    --address $VIP \
    --controlplane \
    --leaderElection \
    --services \
    --inCluster | tee /var/lib/rancher/rke2/server/manifests/kube-vip.yaml
```
Wait for Kube-VIP to complete bootstrapping.
```bash
kubectl rollout status daemonset kube-vip-ds -n kube-system --timeout=650s
```
Once the condition is met, you can check that the daemonset created by Kube-VIP is running 1 pod.
```bash
kubectl get ds -n kube-system kube-vip-ds
```
Once the cluster has more control-plane nodes added, the count will be equal to the total number of CP nodes.
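As an optional sanity check (a sketch using the variables exported earlier, not an additional required step), you can confirm that the VIP is bound on the current leader and reachable on the subnet:

```bash
# $INTERFACE and $VIP were exported at the start of this step.
ip addr show dev $INTERFACE | grep "$VIP"   # the VIP should appear on the leader node
ping -c 3 "$VIP"                            # the VIP should answer from any node on the subnet
```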
Step-3 Get Control-plane token
On the control-plane node, run the following command to get the control-plane token to join worker(s) with this control-plane.
```bash
cat /var/lib/rancher/rke2/server/node-token
# It will display the node-token as something like the following
K10e2bfc647bbf0839a7997cdcbee8754b3cd841e85e4250686161893f2b139c7d8::server:a342ef5189711287fb48f05c05346b89
```
Step-4 Add Worker(s)
On each worker node,
Run the following command to install the RKE2 agent on the worker.
```bash
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
```
Enable the rke2-agent service by using the following command.

```bash
systemctl enable rke2-agent.service
```
Create a directory by running the following command.

```bash
mkdir -p /etc/rancher/rke2/
```
Add/edit /etc/rancher/rke2/config.yaml and update the following fields.

<Control-Plane-IP> – this is the IP for the control-plane node.
<Control-Plane-TOKEN> – this is the token which can be extracted from the first control-plane by running cat /var/lib/rancher/rke2/server/node-token

```yaml
server: https://<Control-Plane-IP>:9345
token: <Control-Plane-TOKEN>
tls-san:
  - <FQDN>
write-kubeconfig-mode: "0644"
etcd-expose-metrics: true
```
Start the service by using the following command.

```bash
systemctl start rke2-agent.service
```
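If the worker does not join the cluster, the agent can be inspected with standard systemd tooling; this is a troubleshooting sketch rather than a required step.

```bash
systemctl status rke2-agent.service    # confirm the agent service is active
journalctl -u rke2-agent.service -f    # follow the agent logs while it registers with the control-plane
```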
Step-5 Verify
On the control-plane node, run the following command to verify that the worker(s) have been added.
```bash
kubectl get nodes -o wide
```
Next Steps
Choose Storage
Use cloud-native storage for a Worker HA setup. For available storage options, see Storage Solution - Getting Started.
For a multi-node (Worker HA) setup, you can use local storage with node affinity. However, this imposes a restriction on worker nodes: a workload will have to be provisioned on the same node where it was initially set up, as illustrated below.
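To illustrate that restriction, the following is a hypothetical local PersistentVolume pinned to a single worker; the path, capacity, storage class name, and node name are placeholders and not part of the CX deployment.

```bash
# Hypothetical example: a local PV that can only be consumed by pods scheduled on worker-node-1.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/vol1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1
EOF
```

Any workload bound to such a volume can only ever run on worker-node-1, which is the limitation referred to above.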
Setup CX on Kubernetes
To deploy Expertflow CX on this cluster, see CX Deployment on Kubernetes.