Geographical Deployment of Expertflow CX with Redundancy
This document describes the steps for a Geo-cluster (multi-region) deployment of the Expertflow CX solution. Its main purpose is to provide a redundant site for disaster recovery (DR) alongside the primary site.
Current Deployment Scenario
The Geo-cluster deployment has been tested on a cluster of 3 control plane (master) nodes and 3 worker nodes:
3 Control Plane nodes
3 Worker nodes
cStor for replicated storage on worker nodes, based on block storage
System Requirements
The system requirements for Geo-cluster solution using Kubernetes RKE2 distribution are:
RAM (GB) | CPU | Disk | Minimum Nodes |
---|---|---|---|
16 | 8 | 150 GB per node, plus 100 GB of additional unformatted block storage for each worker node | |
Storage Setup - cStor
cStor is the recommended resilient storage for the Geo-cluster solution.
cStor creates its storage pools (cStor Pools) from raw (unformatted) block devices attached to the Kubernetes worker nodes. The devices can be either directly attached devices (SSD/HDD) or cloud volumes (GPD, EBS).
Deployment Steps for cStor
Deploy cStor on the first control plane node.
In our scenario, cStor is deployed using Helm, which manages Kubernetes applications (see the Helm documentation for details). Helm deploys the components on all the added nodes automatically.
helm repo add openebs https://openebs.github.io/charts
helm repo update
helm uninstall rke2-snapshot-controller rke2-snapshot-controller-crd -n kube-system
helm install openebs --namespace openebs openebs/openebs --set cstor.enabled=true --create-namespace
To verify that pods are up and running, use the following command:
kubectl get pod -n openebs
Next, each of the nodes needs a block device (the drive must not contain a file system when it is attached). To verify the presence of available block storage, use the following command:
kubectl get bd -n openebs
Sample Output:-
NAME NODENAME SIZE CLAIMSTATE STATUS AGE
blockdevice-01afcdbe3a9c9e3b281c7133b2af1b68 worker-node-3 21474836480 Unclaimed Active 2m10s
blockdevice-10ad9f484c299597ed1e126d7b857967 worker-node-1 21474836480 Unclaimed Active 2m17s
blockdevice-3ec130dc1aa932eb4c5af1db4d73ea1b worker-node-2 21474836480 Unclaimed Active 2m12s
The above command shows the node name and its block device. Make sure that all the worker nodes have block devices present as this is what will be used when deploying a replicated storage pool.
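As a quick sanity check before building the pool, the listing can be filtered for nodes that still own an unclaimed, active device. The following is only a sketch: `nodes_with_free_bd` is a hypothetical helper (not part of OpenEBS), it assumes the column headers shown in the sample output above, and it runs against that sample so that no cluster access is needed.

```shell
#!/bin/sh
# Hypothetical helper: print each node that owns at least one block device
# that is both Unclaimed and Active. Reads `kubectl get bd` output on stdin.
nodes_with_free_bd() {
  awk '
    # Map header names to column positions so the column order may vary.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["CLAIMSTATE"]) == "Unclaimed" && $(col["STATUS"]) == "Active" {
      print $(col["NODENAME"])
    }
  ' | sort -u
}

# Sample listing copied from this guide.
sample_bd='NAME NODENAME SIZE CLAIMSTATE STATUS AGE
blockdevice-01afcdbe3a9c9e3b281c7133b2af1b68 worker-node-3 21474836480 Unclaimed Active 2m10s
blockdevice-10ad9f484c299597ed1e126d7b857967 worker-node-1 21474836480 Unclaimed Active 2m17s
blockdevice-3ec130dc1aa932eb4c5af1db4d73ea1b worker-node-2 21474836480 Unclaimed Active 2m12s'

printf '%s\n' "$sample_bd" | nodes_with_free_bd
```

On a live cluster the same function could be fed with `kubectl get bd -n openebs | nodes_with_free_bd`; every worker node should appear in the result.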
Creation of Storage Pool
Use the above block devices to create a storage pool. Create a new file called cspc.yaml and modify its contents as below:
apiVersion: cstor.openebs.io/v1
kind: CStorPoolCluster
metadata:
  name: cstor-disk-pool
  namespace: openebs
spec:
  pools:
    - nodeSelector:
        kubernetes.io/hostname: "worker-node-1"
      dataRaidGroups:
        - blockDevices:
            - blockDeviceName: "blockdevice-10ad9f484c299597ed1e126d7b857967"
      poolConfig:
        dataRaidGroupType: "stripe"
    - nodeSelector:
        kubernetes.io/hostname: "worker-node-2"
      dataRaidGroups:
        - blockDevices:
            - blockDeviceName: "blockdevice-3ec130dc1aa932eb4c5af1db4d73ea1b"
      poolConfig:
        dataRaidGroupType: "stripe"
    - nodeSelector:
        kubernetes.io/hostname: "worker-node-3"
      dataRaidGroups:
        - blockDevices:
            - blockDeviceName: "blockdevice-01afcdbe3a9c9e3b281c7133b2af1b68"
      poolConfig:
        dataRaidGroupType: "stripe"
In the above manifest, each block device name must belong to the node listed in the same pool entry. In nodeSelector, add the hostname of that node.
To get the nodeSelector value for each host, run the following command:
kubectl get node --show-labels
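Writing the pool entries by hand is error-prone when there are many nodes. The sketch below shows one possible way to generate the `pools` entries from "hostname block-device-name" pairs. `cspc_pools` is a hypothetical helper, and the stripe layout with one device per node simply mirrors the manifest in this guide.

```shell
#!/bin/sh
# Hypothetical helper: print one CSPC pool entry per
# "<hostname> <blockdevice-name>" line read from stdin.
cspc_pools() {
  while read -r node bd; do
    # Skip empty lines.
    [ -n "$node" ] || continue
    cat <<EOF
    - nodeSelector:
        kubernetes.io/hostname: "$node"
      dataRaidGroups:
        - blockDevices:
            - blockDeviceName: "$bd"
      poolConfig:
        dataRaidGroupType: "stripe"
EOF
  done
}

# Demo with one pair taken from the sample output in this guide.
printf '%s %s\n' worker-node-1 blockdevice-10ad9f484c299597ed1e126d7b857967 | cspc_pools
```

On a live cluster the pairs could come from `kubectl get bd -n openebs --no-headers | awk '{print $2, $1}'`, and the generated entries would be pasted under `spec.pools` in cspc.yaml after review.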
Edit the hostname and the block device name for each node in the above cspc.yaml file. Once done, run the following command:
kubectl apply -f cspc.yaml
To verify that all the block devices are part of the storage pool, run the following command. This usually takes around 3-5 minutes.
kubectl get cspc -n openebs
Sample output:-
NAME HEALTHYINSTANCES PROVISIONEDINSTANCES DESIREDINSTANCES AGE
cstor-disk-pool 3 3 3 2m2s
Now verify each block device has its pool online with the following command:
kubectl get cspi -n openebs
Sample output:-
NAME HOSTNAME ALLOCATED FREE CAPACITY STATUS AGE
cstor-disk-pool-vn92 worker-node-1 60k 9900M 9900M ONLINE 2m17s
cstor-disk-pool-al65 worker-node-2 60k 9900M 9900M ONLINE 2m17s
cstor-disk-pool-y7pn worker-node-3 60k 9900M 9900M ONLINE 2m17s
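The storage class created in the next step sets a replicaCount that must not exceed the number of pool instances. A minimal sketch of that check follows, assuming the `kubectl get cspi` output has a STATUS column as in the sample above; `online_pool_count` and `replica_count_ok` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: count pool instances whose STATUS is ONLINE.
# Reads `kubectl get cspi` output on stdin.
online_pool_count() {
  awk '
    # Locate the STATUS column from the header line.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "STATUS") s = i; next }
    s && $s == "ONLINE" { n++ }
    END { print n + 0 }
  '
}

# Succeeds when the desired replicaCount ($1) fits the online pools ($2).
replica_count_ok() {
  [ "$1" -le "$2" ]
}

# Sample listing copied from this guide.
sample_cspi='NAME HOSTNAME ALLOCATED FREE CAPACITY STATUS AGE
cstor-disk-pool-vn92 worker-node-1 60k 9900M 9900M ONLINE 2m17s
cstor-disk-pool-al65 worker-node-2 60k 9900M 9900M ONLINE 2m17s
cstor-disk-pool-y7pn worker-node-3 60k 9900M 9900M ONLINE 2m17s'

pools="$(printf '%s\n' "$sample_cspi" | online_pool_count)"
if replica_count_ok 3 "$pools"; then
  echo "replicaCount 3 fits $pools online pool(s)"
fi
```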
Once the above is verified, create a storage class.
Create a file named cstor-csi-disk.yaml and paste the below contents into it.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: cstor-csi-disk
provisioner: cstor.csi.openebs.io
allowVolumeExpansion: true
parameters:
  cas-type: cstor
  # cstorPoolCluster should have the name of the CSPC
  cstorPoolCluster: cstor-disk-pool
  # replicaCount should be <= no. of CSPI created in the selected CSPC
  replicaCount: "3"
After copying the above contents into cstor-csi-disk.yaml, apply it using the following command:
kubectl apply -f cstor-csi-disk.yaml
Following the above command, you will have a cStor storage class. You can verify it by using the following command:
kubectl get sc
The output of the command shows the three possible storage classes - one for cStor and the other two for the local provisioner:
cstor-csi-disk cstor.csi.openebs.io Delete Immediate true 34h
openebs-device openebs.io/local Delete WaitForFirstConsumer false 34h
openebs-hostpath (default) openebs.io/local Delete WaitForFirstConsumer false 34h
OpenEBS for Local Storage
Deploying OpenEBS enables localhost storage as target devices. We have deployed all the components using OpenEBS local path storage.
Make the storage class the default. The command below uses openebs-hostpath; replace it with your storage class name if it differs:
kubectl patch storageclass openebs-hostpath -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Clone the Expertflow CX Repository
Start with cloning the repository from GitLab.
git clone -b <branch-name> https://efcx:RecRpsuH34yqp56YRFUb@gitlab.expertflow.com/cim/cim-solution.git
Replace <branch-name> with your desired release branch name.
Node Affinity-Based Deployment
To ensure that the pods of the CX solution components are bound to site A as long as it is available, we have applied weighted Node Affinity to the CX components, which ensures that pods are spun up on site A when they are first deployed. If site A becomes unavailable, the pods are shifted from site A to site B.
For the replicas of the StatefulSet components, we have applied Pod anti-affinity, which ensures that the replicas are spun up on site B and that a primary pod and its replica pod never run on the same node while both sites are available (i.e. MongoDB's replica pod will not be in site A if site A is also hosting MongoDB's primary pod).
Tainting Control Plane Nodes
By default, a control plane node can manage application workloads as well. This is okay for a lighter workload (~50 concurrent conversations) and CX Single Node Deployment. But, for a higher workload or a multi-cluster setup, all control plane nodes should be tainted to schedule control-plane pods only.
First, get the nodes to identify which are control-plane/master nodes.
kubectl get nodes
Then, to taint the master nodes, use the following command for each master node, replacing <nodename> with the name of the node.
kubectl taint nodes <nodename> node-role.kubernetes.io/master:NoSchedule
Once done, allow the RKE2 Ingress to spin up on the control plane nodes as well.
kubectl patch ds rke2-ingress-nginx-controller -n kube-system --type json -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}]}]'
Expertflow CX Internal Components
Step 1: Change the directory
Change to the directory to locate all the deployment yaml files.
cd ../..
Step 2: Blueprint for Node Affinity on CX components
We use Node Affinity to keep the workload on the primary site at the start; the workload shifts to another site only in case of downtime.
To apply Node Affinity, we first need to label our worker nodes.
First, get the worker nodes by using the following command.
kubectl get nodes --show-labels
Output will be similar to this:-
NAME STATUS ROLES AGE VERSION LABELS
vm3 Ready control-plane,etcd,master 37d v1.24.7+k3s1 ..node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io...
vm05 Ready <none> 37d v1.24.7+k3s1 ..egress.k3s.io/cluster=true,env=cti,kubernetes.io/arch=amd64..
vm1 Ready control-plane,etcd,master 37d v1.24.7+k3s1 ..egress.k3s.io/cluster=true,env=cim,kubernetes.io/arch=amd64..
Now label the primary site worker nodes with the site=primary label:
kubectl label nodes <node name> site=primary
Example:-
kubectl label nodes vm05 site=primary
Use the above command to label all the primary site worker nodes. You can label your secondary site worker nodes in the same way if needed.
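When there are several worker nodes, the label commands can be derived from the node listing. The sketch below is illustrative only: `worker_nodes` is a hypothetical helper that treats every node with `<none>` in the ROLES column as a worker, and it prints the commands instead of running them so that the list can be reduced to the primary-site nodes first.

```shell
#!/bin/sh
# Hypothetical helper: print the names of nodes whose ROLES column is <none>,
# i.e. plain worker nodes. Reads `kubectl get nodes` output on stdin.
worker_nodes() {
  awk 'NR > 1 && $3 == "<none>" { print $1 }'
}

# Trimmed copy of the sample listing from this guide (LABELS column omitted).
sample_nodes='NAME STATUS ROLES AGE VERSION
vm3 Ready control-plane,etcd,master 37d v1.24.7+k3s1
vm05 Ready <none> 37d v1.24.7+k3s1
vm1 Ready control-plane,etcd,master 37d v1.24.7+k3s1'

# Print (rather than run) one label command per worker node.
printf '%s\n' "$sample_nodes" | worker_nodes | while read -r node; do
  echo "kubectl label nodes $node site=primary"
done
```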
Step 3: Blueprint for applying Affinity on the Deployments.
Once labeling is completed for the primary site worker nodes, verify if the affinity block has been added in the pod deployment yaml files to allow the pods to always start spinning up in the primary site.
Yaml files can be found in the following directory.
cd cim/Deployments
Open any of the yaml files; each should contain the node affinity block that is shown (commented out) at the end of the following example.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    ef.service: ef-agent-manager
    ef: expertflow
  name: ef-agent-manager
  namespace: expertflow
spec:
  replicas: 1
  selector:
    matchLabels:
      ef.service: ef-agent-manager
  strategy: {}
  template:
    metadata:
      labels:
        ef.service: ef-agent-manager
        ef: expertflow
    spec:
      imagePullSecrets:
        - name: expertflow-reg-cred
      # affinity:
      #   nodeAffinity:
      #     preferredDuringSchedulingIgnoredDuringExecution:
      #       - weight: 50
      #         preference:
      #           matchExpressions:
      #             - key: site
      #               operator: In
      #               values:
      #                 - worker-a
These are the exact lines that need to be uncommented in all the deployment yaml files for node affinity to work. Change the entry under "values" to the label you assigned to the worker nodes in Step 2. In the above example, "worker-a" is changed to "primary", or to whatever label you have assigned to your worker nodes:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
            - key: site
              operator: In
              values:
                - primary
Step 4: Disabling Init Containers in Deployment files.
The init containers also need to be disabled in the deployment files in the same directory:
cd cim/Deployments
First, disable the init containers in the ef-conversation-controller deployment yaml.
vi ef-conversation-controller-deployment.yaml
Make sure that all of the lines below, which define the init container, are commented out.
## initContainers:
## - name: wait-for
## image: ghcr.io/patrickdappollonio/wait-for:latest
## imagePullPolicy: IfNotPresent
## env:
## - name: MONGO_HOST
## value: "mongo-mongodb.ef-external.svc.cluster.local:27017"
## - name: REDIS_HOST
## value: "redis-master.ef-external.svc.cluster.local:6379"
## command:
## - /wait-for
## args:
## - --host="$(MONGO_HOST)"
## - --host="$(REDIS_HOST)"
## - --verbose
Once this is done, comment out the same lines in the routing engine deployment yaml.
vi ef-routing-engine-deployment.yaml
Again, comment out the following lines.
## initContainers:
## - name: wait-for
## image: ghcr.io/patrickdappollonio/wait-for:latest
## imagePullPolicy: IfNotPresent
## env:
## - name: MONGO_HOST
## value: "mongo-mongodb.ef-external.svc.cluster.local:27017"
## - name: REDIS_HOST
## value: "redis-master.ef-external.svc.cluster.local:6379"
## command:
## - /wait-for
## args:
## - --host="$(MONGO_HOST)"
## - --host="$(REDIS_HOST)"
## - --verbose
The following scheme is used when applying node affinity, pod anti-affinity and NodeSelector to the components.
Components | Node Affinity | Pod anti affinity | NodeSelector |
---|---|---|---|
CIM Solution Components (CX Components) | ✓ | ||
Postgres (master) | ✓ | ||
Postgres (replica) | |||
Mongodb (replica and master) | ✓ | ||
Redis (master) | ✓ | ||
Redis (replica) | ✓ | ||
Keycloak | ✓ |
Step 5: Install Rancher (Optional)
Rancher is a web UI for managing Kubernetes clusters.
To deploy the Rancher web UI, add the Helm repository.
Install cert-manager, which is required by Rancher.
After installation, wait at least 30 seconds for cert-manager to start.
helm upgrade --install=true cert-manager \
--wait=true \
--timeout=10m0s \
--debug \
--namespace cert-manager \
--create-namespace \
--version v1.10.0 \
--values=external/cert-manager/values.yaml \
external/cert-manager
Use the following command to see if all cert-manager pods are up and running.
kubectl get pods -n cert-manager
Deploy Rancher using the Helm chart.
helm upgrade --install=true --wait=true --timeout=10m0s --debug rancher --namespace cattle-system --create-namespace --values=external/rancher/values.yaml external/rancher
By default, Rancher is not accessible from outside the cluster. To make it accessible, change the service type from ClusterIP to NodePort:
kubectl -n cattle-system patch svc rancher -p '{"spec": {"type": "NodePort"}}'
Get the Rancher Service port by using the following command:
kubectl -n cattle-system get svc rancher -o go-template='{{(index .spec.ports 1).nodePort}}';echo;
Now you can access the Rancher web UI at any-node-ip-of-cluster:PORT-from-above-command.
The default username/password is admin/ExpertflowRNCR
Step 6: Create Namespace
All Expertflow components are deployed in a separate Kubernetes namespace called 'expertflow'.
Create the namespace by running the following command on the master node.
kubectl create namespace expertflow
All external components are deployed in the ef-external namespace. Run the following command on the master node.
kubectl create namespace ef-external
Step 7: Image Pull Secret
For expertflow namespace, use the following command:
kubectl apply -f pre-deployment/registryCredits/ef-imagePullSecret-expertflow.yaml
Run the following command for ef-external namespace:
kubectl apply -f pre-deployment/registryCredits/ef-imagePullSecret-ef-external.yaml
Step 8: Update the FQDN
Decide the FQDN to be used in your solution and replace <FQDN> in the following command with your actual FQDN:
sed -i 's/devops[0-9]*.ef.com/<FQDN>/g' cim/ConfigMaps/* pre-deployment/grafana/* pre-deployment/keycloak/* cim/Ingresses/traefik/*
When deploying the solution on a single control plane node, use the FQDN of the master node. For a multi-control-plane setup, use the VIP or the FQDN associated with the VIP.
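The substitution can be tried on a scratch file before touching the repository. This is a sketch under stated assumptions: `replace_fqdn` is a hypothetical wrapper, `cx.example.com` and the `KEYCLOAK_HOST` line are made-up sample values, GNU sed is assumed for `-i`, and the dots in the pattern are escaped here so that they match literal dots only.

```shell
#!/bin/sh
# Hypothetical wrapper: replace the placeholder host devops<digits>.ef.com
# with the real FQDN. $1 = new FQDN, remaining arguments = files to edit
# in place (GNU sed -i). The body runs in a subshell so that the helper
# variable does not leak into the caller.
replace_fqdn() (
  fqdn="$1"
  shift
  sed -i "s/devops[0-9]*\.ef\.com/${fqdn}/g" "$@"
)

# Demo on a throwaway file with a made-up config line.
demo_file="$(mktemp)"
printf 'KEYCLOAK_HOST: https://devops243.ef.com/auth\n' > "$demo_file"
replace_fqdn cx.example.com "$demo_file"
cat "$demo_file"
```

Afterwards, a search such as `grep -rl 'devops[0-9]*\.ef\.com' cim/ pre-deployment/` can be used to confirm that no placeholder hosts remain.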
Expertflow CX External Components
Following are the required external components that need to be deployed with Expertflow CX:
Blueprint for Pod anti-affinity on Replica Pods For Helm Based Deployments (Optional)
These methods apply to Helm-based deployments, i.e. all the external components, and ensure that their replica pods do not spin up on the same node when the chart does not already guarantee this. The changes can be made in any Helm values file that contains the values shown below.
1. To apply pod anti-affinity, first label the pods so that they can be segregated by label. To label pods, add the following to the values file:
labels:
  app: store
2. Once the pods have been assigned a label, set pod anti-affinity by changing the following values in the Helm values file:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - store
        topologyKey: "kubernetes.io/hostname"
In the above snippet, topologyKey names the node label that defines the scope of the anti-affinity rule. With kubernetes.io/hostname, no two pods carrying the label app with the value store are scheduled on the same node.
Deploying External Components:
1. PostgreSQL
To deploy PostgreSQL in high availability, we use Pgpool, which provides automated failover and keeps the database available if a master pod on any node is affected. The deployment can be done directly from the Helm chart; the only change needed is the number of replica pods to deploy, which must be set under both postgresql and pgpool.
To change the number of replicas, edit the following values in the values.yaml file, which can be opened using the command below.
vi external/bitnami/postgresql-ha/values.yaml
postgresql:
  image:
    registry: docker.io
    repository: bitnami/postgresql-repmgr
    tag: 15.3.0-debian-11-r2
    digest: ""
    ## Specify a imagePullPolicy. Defaults to 'Always' if image tag is 'latest', else set to 'IfNotPresent'
    ## ref: https://kubernetes.io/docs/user-guide/images/#pre-pulling-images
    ##
    pullPolicy: IfNotPresent
    ## Optionally specify an array of imagePullSecrets.
    ## Secrets must be manually created in the namespace.
    ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
    ## Example:
    ## pullSecrets:
    ##   - myRegistryKeySecretName
    ##
    pullSecrets: []
    ## Set to true if you would like to see extra information on logs
    ##
    debug: false
  ## @param postgresql.labels Labels to add to the StatefulSet. Evaluated as template
  ##
  labels: {}
  replicaCount: 3
--------------------------------------------------------------------------------------------------------------------------------------------------------
pgpool:
  ## Bitnami Pgpool image
  ## ref: https://hub.docker.com/r/bitnami/pgpool/tags/
  ## @param pgpool.image.registry Pgpool image registry
  ## @param pgpool.image.repository Pgpool image repository
  ## @param pgpool.image.tag Pgpool image tag
  ## @param pgpool.image.digest Pgpool image digest in the way sha256:aa.... Please note this parameter, if set, will override the tag
  ## @param pgpool.image.pullPolicy Pgpool image pull policy
  ## @param pgpool.image.pullSecrets Specify docker-registry secret names as an array
  ## @param pgpool.image.debug Specify if debug logs should be enabled
  ##
  image:
    registry: docker.io
    repository: bitnami/pgpool
    tag: 4.4.2-debian-11-r33
    digest: ""
    ## Specify a imagePullPolicy. Defaults to 'Always' if image tag is 'latest', else set to 'IfNotPresent'
    ## ref: https://kubernetes.io/docs/user-guide/images/#pre-pulling-images
  replicaCount: 3
By default, the value is set to 3 to match the number of nodes; it can be changed as needed.
PostgreSQL is deployed as a central datastore for both LicenseManager and Keycloak.
Create configmap for PostgreSQL to load the LicenseManager database and create keycloak_db:
kubectl -n ef-external create configmap ef-postgresql-license-manager-cm --from-file=./pre-deployment/licensemanager/licensemanager.sql
The Helm command for PostgreSQL on clusters is given below:
helm upgrade --install=true --wait=true --timeout=10m0s --debug --namespace=ef-external --values=external/bitnami/postgresql-ha/values.yaml ef-postgresql external/bitnami/postgresql-ha
2. Keycloak
Since Keycloak does not offer high availability by itself, we provide it with the external PostgreSQL deployed above, replacing its internal database with the externally deployed one.
The following changes need to be made in Keycloak's Helm values-ha.yaml file, which can be found in the following location:
cd external/bitnami/keycloak/
postgresql:
  enabled: false
  auth:
    username: bn_keycloak
    password: ""
    database: bitnami_keycloak
    existingSecret: ""
  architecture: standalone
## External PostgreSQL configuration
## All of these values are only used when postgresql.enabled is set to false
## @param externalDatabase.host Database host
## @param externalDatabase.port Database port number
## @param externalDatabase.user Non-root username for Keycloak
## @param externalDatabase.password Password for the non-root username for Keycloak
## @param externalDatabase.database Keycloak database name
## @param externalDatabase.existingSecret Name of an existing secret resource containing the database credentials
## @param externalDatabase.existingSecretPasswordKey Name of an existing secret key containing the database credentials
## EXPERTFLOW
externalDatabase:
  host: "ef-postgresql-postgresql-ha-pgpool.ef-external.svc.cluster.local"
  port: 5432
  user: sa
  database: keycloak_db
  password: "Expertflow123"
  existingSecret: ""
  existingSecretPasswordKey: ""
On the master node, create a global configmap for Keycloak. Change the hostname and other parameters in the ef-keycloak-configmap.yaml file before applying this command:
kubectl apply -f pre-deployment/keycloak/ef-keycloak-configmap.yaml
The Helm command for Keycloak is given below:
helm upgrade --install=true --wait=true --timeout=10m0s --debug --namespace=ef-external --values=external/bitnami/keycloak/values-ha.yaml keycloak external/bitnami/keycloak/
3. Mongo DB
To enable high availability for MongoDB, make the following changes in MongoDB's Helm values file: set the arbiter's enabled flag to true and apply affinity as shown below.
Set replicaCount according to the available worker nodes, enable replicaSetHostnames, and set the appropriate replicaSetName as shown below.
The Helm values file values-ha.yaml is located at
cd external/bitnami/mongodb/
arbiter:
  affinity: {}
  annotations: {}
  args: []
  command: []
  configuration: ""
  containerPorts:
    mongodb: 27017
  containerSecurityContext:
    enabled: true
    runAsNonRoot: true
    runAsUser: 1001
  customLivenessProbe: {}
  customReadinessProbe: {}
  customStartupProbe: {}
  enabled: true
-------------------------------------------------------------------------------------
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - store
        topologyKey: kubernetes.io/hostname
annotations: {}
--------------------------------------------------------------------------------------
replicaCount: 3
replicaSetConfigurationSettings:
  configuration:
    catchUpTimeoutMillis: 30000
    chainingAllowed: false
    electionTimeoutMillis: 10000
    heartbeatIntervalMillis: 2000
    heartbeatTimeoutSecs: 20
  enabled: true
replicaSetHostnames: true
replicaSetName: expertflow
The Helm command for MongoDB is given below:
helm upgrade --install=true --wait=true --timeout=10m0s --debug --namespace=ef-external --values=external/bitnami/mongodb/values-ha.yaml mongo external/bitnami/mongodb/
4. MinIO
To deploy MinIO in high availability, make the following changes to the Helm values file for MinIO. Set mode to distributed. Set replicaCount as needed, but it must be an even number greater than or equal to 4; zones should be 1, and drivesPerNode should be 1 as well. Apply affinity as shown below: first set the pod label, then use the same value in the affinity block.
Helm file values-ha.yaml can be located at
cd external/bitnami/minio/
clientImage:
  registry: docker.io
  repository: bitnami/minio-client
  tag: 2022.12.13-debian-11-r0
  digest: ""
## @param mode MinIO® server mode (`standalone` or `distributed`)
## ref: https://docs.minio.io/docs/distributed-minio-quickstart-guide
##
mode: distributed
--------------------------------------------------------------------------------------
statefulset:
  ## @param statefulset.updateStrategy.type StatefulSet strategy type
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies
  ## e.g:
  ## updateStrategy:
  ##   type: RollingUpdate
  ##   rollingUpdate:
  ##     maxSurge: 25%
  ##     maxUnavailable: 25%
  ##
  updateStrategy:
    type: RollingUpdate
  ## @param statefulset.podManagementPolicy StatefulSet controller supports relax its ordering guarantees while preserving its uniqueness and identity guarantees. There are two valid pod management policies: OrderedReady and Parallel
  ## ref: https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#pod-management-policy
  ##
  podManagementPolicy: Parallel
  ## @param statefulset.replicaCount Number of pods per zone (only for MinIO® distributed mode). Should be even and `>= 4`
  ##
  replicaCount: 4
  zones: 1
  ## @param statefulset.drivesPerNode Number of drives attached to every node (only for MinIO® distributed mode)
  ##
  drivesPerNode: 1
-------------------------------------------------------------------------------------------------------------------------
podLabels:
  app: minio
nodeAffinityPreset:
  ## @param nodeAffinityPreset.type Node affinity preset type. Ignored if `affinity` is set. Allowed values: `soft` or `hard`
  ##
  type: ""
  ## @param nodeAffinityPreset.key Node label key to match. Ignored if `affinity` is set.
  ## E.g.
  ## key: "kubernetes.io/e2e-az-name"
  ##
  key: ""
  ## @param nodeAffinityPreset.values Node label values to match. Ignored if `affinity` is set.
  ## E.g.
  ## values:
  ##   - e2e-az1
  ##   - e2e-az2
  ##
  values: []
## @param affinity Affinity for pod assignment. Evaluated as a template.
## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
## Note: podAffinityPreset, podAntiAffinityPreset, and nodeAffinityPreset will be ignored when it's set
##
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - minio
        topologyKey: "kubernetes.io/hostname"
helm upgrade --install=true --wait=true --timeout=10m0s --debug --namespace=ef-external --values=external/bitnami/minio/values-ha.yaml minio external/bitnami/minio/
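The replicaCount rule for distributed mode (even and at least 4) is easy to get wrong when sizing for a different number of nodes. A minimal sketch of a pre-flight check is shown below; `valid_minio_replicas` is a hypothetical helper and only encodes the rule stated in this guide.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the replica count ($1) is an even
# number greater than or equal to 4, as required for distributed mode.
valid_minio_replicas() {
  [ "$1" -ge 4 ] && [ $(($1 % 2)) -eq 0 ]
}

# Try a few candidate values.
for n in 3 4 5 6; do
  if valid_minio_replicas "$n"; then
    echo "replicaCount=$n: ok"
  else
    echo "replicaCount=$n: invalid"
  fi
done
```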
5. Redis Sentinel
To provide high availability for Redis, we deploy Redis Sentinel, a high-availability solution designed to enhance the reliability and fault tolerance of Redis. At its core, Redis Sentinel enables a robust Redis deployment consisting of multiple Redis instances and Sentinel nodes. The Sentinel nodes constantly monitor the health of the Redis instances and automatically detect failures or performance degradation. Upon detecting an issue, Sentinel orchestrates the failover process, promoting a standby Redis instance to become the new master.
To Enable Sentinel and Set the Number of Replicas
To enable Sentinel, edit the Helm values file for Redis with the following changes.
The Helm values file values-ha.yaml is located at
cd external/bitnami/redis/
sentinel:
  ## @param sentinel.enabled Use Redis® Sentinel on Redis® pods.
  ## IMPORTANT: this will disable the master and replicas services and
  ## create a single Redis® service exposing both the Redis and Sentinel ports
  ##
  enabled: true
  ## Bitnami Redis® Sentinel image version
Set the enabled flag to true. This allows a replica to become the master if one of the pods on a node is affected.
To set the number of replicas, change the following value in the Redis Helm values file.
replica:
  ## @param replica.replicaCount Number of Redis® replicas to deploy
  ##
  replicaCount: 3
  ## @param replica.configuration Configuration for Redis® replicas nodes
  ## ref: https://redis.io/topics/config
  ##
Set the number of replicas as needed, based on the number of nodes.
helm upgrade --install=true --wait=true --timeout=10m0s --debug --namespace=ef-external --values=external/bitnami/redis/values-ha.yaml redis-ha external/bitnami/redis/
Setup Realtime Reports
Expertflow CX uses Grafana for business and solution monitoring. Business monitoring dashboards are embedded inside AgentDesk that provide real-time statistics for both agents and supervisors.
See Setup Grafana for embedded dashboards for details.
Setup Historical Reports
Expertflow CX uses Apache Superset for historical reports.
Deploying Stateful Components
To address ActiveMQ's availability, we provide it with cloud-replicated storage for its data, which solves the high-availability challenge. The cStor storage deployed above provides this replicated storage. The connection strings must also be updated if they are not already. The changes made to its deployment yaml are listed below.
Changes in Spec: Env
The file to be edited, ef-amq-statefulset-ha.yaml, is located at cim/StatefulSet/
vi cim/StatefulSet/ef-amq-statefulset-ha.yaml
Here we provide the connection details for Redis as well as PostgreSQL.
env:
  - name: REDIS_HOST
    value: redis-master.ef-external.svc.cluster.local
  - name: REDIS_PORT
    value: "6379"
  - name: REDIS_PASSWORD
    value: Expertflow123
  - name: REDIS_SSL_ENABLED
    value: "false"
  - name: REDIS_MAX_ACTIVE
    value: "100"
  - name: REDIS_MAX_IDLE
    value: "100"
  - name: REDIS_MAX_WAIT
    value: "-1"
  - name: REDIS_MIN_IDLE
    value: "50"
  - name: REDIS_TIMEOUT
    value: "2000"
  - name: REDIS_SENTINEL_ENABLE
    value: "true"
  - name: REDIS_SENTINEL_MASTER
    value: "expertflow"
  - name: REDIS_SENTINEL_NODES
    value: "redis-ha-node-0.redis-ha-headless.ef-external.svc.cluster.local:26379,redis-ha-node-1.redis-ha-headless.ef-external.svc.cluster.local:26379,redis-ha-node-2.redis-ha-headless.ef-external.svc.cluster.local:26379"
  - name: REDIS_SENTINEL_PASSWORD
    value: "Expertflow123"
  - name: DB_URL
    value: ef-postgresql-postgresql-ha-pgpool.ef-external.svc
  - name: DB_USER
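If the Redis replica count differs from 3, the REDIS_SENTINEL_NODES value has to list one host per node. The sketch below builds that list; `sentinel_nodes` is a hypothetical helper, and the host name pattern and port 26379 are inferred from the value shown above rather than taken from the chart documentation.

```shell
#!/bin/sh
# Hypothetical helper: build the comma-separated Sentinel node list.
# $1 = Helm release name, $2 = namespace, $3 = number of Redis nodes.
# The body runs in a subshell so that the loop variables do not leak.
sentinel_nodes() (
  out=""
  i=0
  while [ "$i" -lt "$3" ]; do
    # Prefix a comma for every host except the first one.
    out="${out}${out:+,}${1}-node-${i}.${1}-headless.${2}.svc.cluster.local:26379"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
)

# Reproduce the three-node value used in this guide.
sentinel_nodes redis-ha ef-external 3
```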
StatefulSet
ActiveMQ should be deployed before all other solution components. To deploy ActiveMQ as a StatefulSet, run:
kubectl apply -f cim/StatefulSet/ef-amq-statefulset-ha.yaml
Wait for the AMQ StatefulSet
kubectl wait pods ef-amq-0 -n ef-external --for condition=Ready --timeout=600s
Deploying CX Components
ConfigMaps
Conversation Manager ConfigMaps
If you need to change the default training, please update the corresponding files.
kubectl -n expertflow create configmap ef-conversation-controller-actions-cm --from-file=pre-deployment/conversation-Controller/actions
kubectl -n expertflow create configmap ef-conversation-controller-actions-pycache-cm --from-file=pre-deployment/conversation-Controller/__pycache__
kubectl -n expertflow create configmap ef-conversation-controller-actions-utils-cm --from-file=pre-deployment/conversation-Controller/utils
Unified Agent ConfigMaps
Translations for the unified agent are applicable in HC-4.1 and later releases.
kubectl -n expertflow create configmap ef-app-translations-cm --from-file=pre-deployment/app-translations/unified-agent/i18n
Some ConfigMap values need to be uncommented to enable HA for: 1. Redis Sentinel 2. MongoDB
Edit the connection env ConfigMap file in cim/ConfigMaps
vi cim/ConfigMaps/ef-connection-env-configmap.yaml
Enable the Redis Sentinel flag, comment out the single-host MongoDB entry, and uncomment the multi-host MongoDB entry as shown in the example below.
##MONGODB_HOST: mongodb://mongo-mongodb.ef-external.svc.cluster.local
MONGODB_HOST: mongodb://mongo-mongodb-0.mongo-mongodb-headless.ef-external.svc.cluster.local:27017,mongo-mongodb-1.mongo-mongodb-headless.ef-external.svc.cluster.local:27017,mongo-mongodb-2.mongo-mongodb-headless.ef-external.svc.cluster.local:27017/?replicaSet=expertflow&tls=false&ssl=false&retrywrites=true&w=majority
-------------------------------------------------------------------------------------------
REDIS_SENTINEL_ENABLE: "true"
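The multi-host MONGODB_HOST value follows a regular pattern, so it can be generated when the replica count changes. The sketch below assumes the pod and headless-service naming seen in the example above; `mongo_uri` is a hypothetical helper, and the query options are copied unchanged from that example.

```shell
#!/bin/sh
# Hypothetical helper: build the multi-host MongoDB connection string.
# $1 = Helm release name, $2 = namespace, $3 = replica count,
# $4 = replica set name. Runs in a subshell to keep variables local.
mongo_uri() (
  hosts=""
  i=0
  while [ "$i" -lt "$3" ]; do
    # Prefix a comma for every host except the first one.
    hosts="${hosts}${hosts:+,}${1}-mongodb-${i}.${1}-mongodb-headless.${2}.svc.cluster.local:27017"
    i=$((i + 1))
  done
  printf 'mongodb://%s/?replicaSet=%s&tls=false&ssl=false&retrywrites=true&w=majority\n' "$hosts" "$4"
)

# Reproduce the three-replica value used in this guide.
mongo_uri mongo ef-external 3 expertflow
```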
Now make changes to License Manager ConfigMaps
vi cim/ConfigMaps/ef-license-manager-configmap.yaml
Uncomment the postgres-ha DB_URL as mentioned below and comment out the simple postgres DB URL.
#DB_URL: jdbc:postgresql://ef-postgresql.ef-external.svc.cluster.local:5432/licenseManager
DB_URL: jdbc:postgresql://ef-postgresql-postgresql-ha-pgpool.ef-external.svc.cluster.local:5432/licenseManager
Apply all the ConfigMaps in the ConfigMaps folder using
kubectl apply -f cim/ConfigMaps/
Services
Create services for all EF component deployments
kubectl apply -f cim/Services/
Services must be created before Deployments
Deployments
Apply all the Deployment manifests
kubectl apply -f cim/Deployments/
Team Announcement CronJob
Team announcement cron job is applicable in HC-4.2 and later releases.
kubectl apply -f pre-deployment/team-announcement/
Import your own certificates
Now generate a secret with the certificate files. The server.key and server.crt files must be available on the machine in the pre-deployment/certificates directory.
For the expertflow namespace:
kubectl -n expertflow create secret tls ef-ingress-tls-secret \
--key pre-deployment/certificates/server.key \
--cert pre-deployment/certificates/server.crt
And for the ef-external namespace:
kubectl -n ef-external create secret tls ef-ingress-tls-secret \
--key pre-deployment/certificates/server.key \
--cert pre-deployment/certificates/server.crt
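A certificate and key that do not belong together are a common cause of failures when creating a TLS secret. As a hedged sketch (the helper name `cert_matches_key` is hypothetical, and openssl is assumed to be installed), the public key embedded in the certificate can be compared with the one derived from the private key before the secret is created:

```shell
# Return 0 only if the certificate ($1) and private key ($2) share the same
# public key. Returns non-zero if either file cannot be read.
cert_matches_key() {
  cert_pub="$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null)" || return 1
  key_pub="$(openssl pkey -in "$2" -pubout 2>/dev/null)" || return 1
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

# Example call with the paths used above:
# cert_matches_key pre-deployment/certificates/server.crt pre-deployment/certificates/server.key && echo "pair OK"
```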
Import your own certificates for RKE
Generate a self-signed certificate and create the secrets with the following commands.
Replace <FQDN> with your actual FQDN before running them.
openssl req -x509 \
-newkey rsa:4096 \
-sha256 \
-days 3650 \
-nodes \
-keyout <FQDN>.key \
-out <FQDN>.crt \
-subj "/CN=<FQDN>" \
-addext "subjectAltName=DNS:www.<FQDN>,DNS:<FQDN>"
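Once the certificate is generated, its subject and subjectAltName entries can be inspected to confirm that the FQDN was substituted correctly. A small sketch, assuming OpenSSL 1.1.1 or later (the same minimum the `-addext` option needs); the helper name `cert_summary` is hypothetical:

```shell
# Print the subject and the subjectAltName extension of a certificate file.
cert_summary() {
  openssl x509 -in "$1" -noout -subject -ext subjectAltName
}

# Example call, with <FQDN> replaced by the value used above:
# cert_summary <FQDN>.crt
```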
For the expertflow namespace:
kubectl -n expertflow create secret tls ef-ingress-tls-secret --key <FQDN>.key --cert <FQDN>.crt
And for the ef-external namespace:
kubectl -n ef-external create secret tls ef-ingress-tls-secret --key <FQDN>.key --cert <FQDN>.crt
Ingress
For K3s-based deployments using the Traefik Ingress Controller
Apply the Ingress Routes.
kubectl apply -f cim/Ingresses/traefik/
For RKE2-based Ingresses using Ingress-Nginx Controller
Decide on the FQDN to be used in your solution and replace <FQDN> in the command below with your actual FQDN:
sed -i 's/devops[0-9]*\.ef\.com/<FQDN>/g' cim/Ingresses/nginx/*
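Because `sed -i` rewrites the manifests in place, it can help to preview the substitution first. A minimal sketch on a sample line, using `cx.example.com` as a stand-in FQDN (a hypothetical value):

```shell
# Stand-in value; replace it with the real FQDN.
FQDN="cx.example.com"

# Without -i, sed leaves files untouched and writes the result to stdout.
printf 'host: devops123.ef.com\n' | sed "s/devops[0-9]*\.ef\.com/${FQDN}/g"

# To list the lines the real command would change (no files are modified):
# grep -n 'devops[0-9]*\.ef\.com' cim/Ingresses/nginx/*
```

The sample line is printed as `host: cx.example.com`, with the placeholder host replaced.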
Apply the Ingress Routes.
kubectl apply -f cim/Ingresses/nginx/
Channel Manager Icons Bootstrapping
Once all Expertflow service pods are up and running, execute these steps so that the media channel icons render correctly.
Run the minio-helper pod using:
kubectl apply -f scripts/minio-helper.yaml
Wait for the pod to start before copying the media icons from the external folder into the helper pod:
kubectl -n ef-external --timeout=90s wait --for=condition=ready pod minio-helper
Wait for the response "pod/minio-helper condition met".
Copy the files to the minio-helper pod.
kubectl -n ef-external cp post-deployment/data/minio/bucket/default minio-helper:/tmp/
Copy the icon-helper.sh script inside the minio-helper pod
kubectl -n ef-external cp scripts/icon-helper.sh minio-helper:/tmp/
Execute the icon-helper.sh script using:
kubectl -n ef-external exec -it minio-helper -- /bin/sh /tmp/icon-helper.sh
Delete the minio-helper pod:
kubectl delete -f scripts/minio-helper.yaml
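The steps above can also be chained so that a failure stops the sequence early. The following is only a sketch of that idea: the function name `bootstrap_icons` and the `KUBECTL` override are assumptions, and `-it` is dropped from `exec` on the assumption that the script may run without a terminal attached.

```shell
# KUBECTL can be overridden (e.g. KUBECTL="echo kubectl") to print the
# commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

bootstrap_icons() {
  # Each step runs only if the previous one succeeded; on failure the
  # helper pod is left in place for inspection.
  $KUBECTL apply -f scripts/minio-helper.yaml &&
  $KUBECTL -n ef-external wait --for=condition=ready pod minio-helper --timeout=90s &&
  $KUBECTL -n ef-external cp post-deployment/data/minio/bucket/default minio-helper:/tmp/ &&
  $KUBECTL -n ef-external cp scripts/icon-helper.sh minio-helper:/tmp/ &&
  $KUBECTL -n ef-external exec minio-helper -- /bin/sh /tmp/icon-helper.sh &&
  $KUBECTL delete -f scripts/minio-helper.yaml
}

# Example call, run from the root of the deployment repository:
# bootstrap_icons
```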
Configurations
Import the default Keycloak realm for essential Keycloak resources, permissions, and authentication configurations.
If you intend to use Apache Superset for reporting, follow Configure and import historical report templates to configure the Reporting solution.
For customer channel configuration, see customer channels.
For CX-Voice component deployment, see this guide.
Chat Initiation URL
To set up the customer widget, follow this link:
https://expertflow-docs.atlassian.net/wiki/x/TgE4CQ
{FQDN}→ FQDN of Kubernetes Deployment
Once all the deployments are successfully deployed, access the components to configure the solution. Keycloak is accessible at http://{cim-fqdn}/auth and unified-admin can be accessed using http://{cim-fqdn}/unified-admin and so on.
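A quick reachability check of these endpoints can confirm that the ingress is routing correctly. A rough sketch, assuming curl is installed and using the two paths named above; the helper names `component_url` and `check_components` are hypothetical, and `-k` is used only because a self-signed certificate may be in place:

```shell
# Build the URL of a component from the deployment FQDN and its path.
component_url() {
  printf 'http://%s/%s\n' "$1" "$2"
}

# Print the HTTP status code returned for each component URL.
# $1 = FQDN of the deployment. Requires curl and network access.
check_components() {
  for path in auth unified-admin; do
    url="$(component_url "$1" "$path")"
    code="$(curl -k -s -o /dev/null -w '%{http_code}' "$url")"
    printf '%s %s\n' "$code" "$url"
  done
}

# Example call with a stand-in FQDN:
# check_components cx.example.com
```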
HA Testing Results/Remarks
Failover Testing | Strategy | Results / Changes Observed | Remarks |
---|---|---|---|
Node Failure | To achieve this, we manually forced the node to shut down. | After a node goes down, Kubernetes starts moving its pods after a 5-minute wait window. This is the default behaviour of Kubernetes. Previous node's pod | |
Node Failure: CX Components | | After the 5-minute window, these pods were moved to the DR site, and startup issues were noticed in the Routing Engine and Conversation Controller init containers. To solve this, we have disabled the init containers for now. | New init containers will be designed that support multiple endpoints for Redis and MongoDB, so that if one pod goes down they can communicate with the others. |
Node Failure: MongoDB | | If the primary pod is affected, one of the two replicas becomes primary and resumes normal operation; this has been tested successfully. An issue was noticed with the arbiter running out of memory, and MongoDB tries to bring the original primary up on another node after a few hours. Another fix applied is to remove the arbiter from the connection string so that components do not try to reach it. | |
Node Failure: Redis | | If the primary pod is affected, one of the two replicas becomes primary and resumes normal operation; this has been tested successfully. | |
Node Failure: MinIO | | Tested successfully. | |
Node Failure: Postgres | | If the primary pod is affected, one of the two replicas becomes primary and resumes normal operation; this has been tested successfully. Postgres is running in async mode. | Postgres performance tweaks include turning on async mode and disabling limits. |
Node Failure: Keycloak | | After the 5-minute window, a new pod is created which connects to the HA Postgres. If the pod is stuck in scheduling, use the kubectl delete pod --force command. | The kubectl delete pod --force command is needed if the pod is stuck. |
Node Failure: ActiveMQ | | After the 5-minute window, a new pod is created which takes over the replicated storage from cStor. The previous pod needs to be terminated manually if the new pod gets stuck in scheduling (the kubectl delete pod --force command is needed to terminate the previous one). We have moved ActiveMQ storage from local to PostgreSQL. | The kubectl delete pod --force command is needed if the pod is stuck. Storage is moved from local to Postgres. |
OpenEBS cStor | | If the node goes down, the virtual raw disk comes up with a new disk identifier address, which affects the replica pool. Physical storage would be preferred over a virtual disk. | |
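The remarks above mention `kubectl delete pod --force` for pods stuck after a node failure. As a hedged sketch of how that might look in practice (the helper names and the `KUBECTL` override are assumptions, and the pod name in the example is a placeholder), the pods can first be listed with their nodes and the stuck pod then removed:

```shell
# KUBECTL can be overridden (e.g. KUBECTL="echo kubectl") to print the
# commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# List pods together with the node each one is scheduled on, to spot pods
# still bound to the failed node. $1 = namespace.
list_pods_with_nodes() {
  $KUBECTL -n "$1" get pods -o wide
}

# Force-delete a stuck pod. $1 = namespace, $2 = pod name.
# --grace-period=0 with --force removes the pod from the API immediately,
# without waiting for confirmation from the (unreachable) node.
force_delete_pod() {
  $KUBECTL -n "$1" delete pod "$2" --grace-period=0 --force
}

# Example calls:
# list_pods_with_nodes ef-external
# force_delete_pod ef-external <pod-name>
```

Force deletion skips graceful shutdown, so it is best reserved for pods whose node is confirmed to be down.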