Geographical Deployment of Expertflow CX with Redundancy

This document describes the steps for a Geo-cluster (multi-region) deployment of the Expertflow CX solution. Its main purpose is to provide a redundant site for disaster recovery (DR) alongside the primary site.

Current Deployment Scenario

We have tested the Geo-cluster solution deployment on a collection of 3 master nodes with 3 worker nodes.

3 Control Plane nodes
3 Worker nodes
cStor for replicated storage on worker nodes, based on block storage

System Requirements

The system requirements for the Geo-cluster solution using the Kubernetes RKE2 distribution are:

RAM (GB)	CPU	DISK	Minimum Nodes
16	8	150 GB per node, plus 100 GB of additional unformatted block storage for each worker node	3 master nodes (each master deployed in a different region) and 3 worker nodes (each with 100 GB of additional raw, unformatted block storage)

Storage Setup - cStor

cStor is the recommended resilient storage for the Geo-cluster solution.

cStor uses raw (unformatted) block devices attached to the Kubernetes worker nodes to create cStor pools. The devices can be either directly attached devices (SSD/HDD) or cloud volumes (GPD, EBS).

Deployment Steps for cStor

Deploy cStor on the first control plane node.

The deployment of cStor in our scenario is done using Helm, which helps us manage Kubernetes applications. See the Helm documentation for details. Helm deploys the components on all the added nodes automatically.

CODE

helm repo add openebs https://openebs.github.io/charts
helm repo update
helm install openebs --namespace openebs openebs/openebs --set cstor.enabled=true --create-namespace

To verify that pods are up and running, use the following command:

CODE

kubectl get pod -n openebs

Now, we need block devices on each of the nodes (the drives must not contain a file system). To verify the presence of available block storage, use the following command:

CODE

kubectl get bd -n openebs

Sample output
NAME                                          NODENAME         SIZE         CLAIMSTATE  STATUS   AGE
blockdevice-01afcdbe3a9c9e3b281c7133b2af1b68  worker-node-3    21474836480   Unclaimed   Active   2m10s
blockdevice-10ad9f484c299597ed1e126d7b857967  worker-node-1    21474836480   Unclaimed   Active   2m17s
blockdevice-3ec130dc1aa932eb4c5af1db4d73ea1b  worker-node-2    21474836480   Unclaimed   Active   2m12s

The above command shows the node name and its block device. Make sure that all the worker nodes have block devices present as this is what will be used when deploying a replicated storage pool.
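The check above can also be scripted. The following is a minimal sketch, assuming the column layout shown in the sample output (node name in the second column, claim state in the fourth, status in the fifth). The function name unclaimed_bd_nodes is a hypothetical helper, not part of OpenEBS.

```shell
# Hypothetical helper: prints the nodes that have at least one
# Unclaimed and Active block device.
# Reads `kubectl get bd -n openebs` style output on stdin.
unclaimed_bd_nodes() {
  awk 'NR > 1 && $4 == "Unclaimed" && $5 == "Active" { print $2 }' | sort -u
}

# Usage against a live cluster (assumes kubectl access):
#   kubectl get bd -n openebs | unclaimed_bd_nodes
```

Every worker node intended for the storage pool should appear in the result.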

Creation of Storage Pool

Use the above block devices to create a storage pool. Create a new file called cspc.yaml and modify its content as below:

CODE

apiVersion: cstor.openebs.io/v1
kind: CStorPoolCluster
metadata:
 name: cstor-disk-pool
 namespace: openebs
spec:
 pools:
   - nodeSelector:
       kubernetes.io/hostname: "worker-node-1"
     dataRaidGroups:
       - blockDevices:
           - blockDeviceName: "blockdevice-10ad9f484c299597ed1e126d7b857967"
     poolConfig:
       dataRaidGroupType: "stripe"

   - nodeSelector:
       kubernetes.io/hostname: "worker-node-2"
     dataRaidGroups:
       - blockDevices:
           - blockDeviceName: "blockdevice-3ec130dc1aa932eb4c5af1db4d73ea1b"
     poolConfig:
       dataRaidGroupType: "stripe"

   - nodeSelector:
       kubernetes.io/hostname: "worker-node-3"
     dataRaidGroups:
       - blockDevices:
           - blockDeviceName: "blockdevice-01afcdbe3a9c9e3b281c7133b2af1b68"
     poolConfig:
       dataRaidGroupType: "stripe"

In the above manifest, each block device name corresponds to the node to which the device is attached. In nodeSelector, add the hostname of that node.

To get the nodeSelector value for each host, run the following command:

CODE

kubectl get node --show-labels

Edit the hostname and the block device name for each node in the above cspc.yaml file. Once done, run the following command:

CODE

kubectl apply -f cspc.yaml

To verify that all the block devices are part of the storage pool, run the following command. This usually takes around 3-5 minutes.

CODE

kubectl get cspc -n openebs

NAME                   HEALTHYINSTANCES   PROVISIONEDINSTANCES   DESIREDINSTANCES     AGE
cstor-disk-pool        3                  3                      3                    2m2s

Now verify each block device has its pool online with the following command:

CODE

kubectl get cspi -n openebs

NAME                  HOSTNAME             ALLOCATED   FREE    CAPACITY   STATUS   AGE
cstor-disk-pool-vn92  worker-node-1        60k         9900M    9900M     ONLINE   2m17s
cstor-disk-pool-al65  worker-node-2        60k         9900M    9900M     ONLINE   2m17s
cstor-disk-pool-y7pn  worker-node-3        60k         9900M    9900M     ONLINE   2m17s
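Because the pools can take a few minutes to come online, it may help to script this check. The sketch below is an assumption-laden helper (cspi_not_online is a hypothetical name) that finds the STATUS column by its header, so it does not depend on the exact set of columns, and prints any pool instance that is not ONLINE. It assumes no column in the output is empty.

```shell
# Hypothetical helper: prints the names of pool instances whose
# STATUS is not ONLINE.
# Reads `kubectl get cspi -n openebs` style output on stdin.
cspi_not_online() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "STATUS") col = i; next }
    $col != "ONLINE" { print $1 }
  '
}

# Usage against a live cluster (assumes kubectl access):
#   kubectl get cspi -n openebs | cspi_not_online
```

An empty result means every pool instance reports ONLINE.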

Once the above is verified, create a storage class.

Create a file named cstor-csi-disk.yaml and paste the contents below into it.

CODE

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: cstor-csi-disk
provisioner: cstor.csi.openebs.io
allowVolumeExpansion: true
parameters:
  cas-type: cstor
  # cstorPoolCluster should have the name of the CSPC
  cstorPoolCluster: cstor-disk-pool
  # replicaCount should be <= no. of CSPI created in the selected CSPC
  replicaCount: "3"

After copying the above contents into cstor-csi-disk.yaml, apply it using the following command:

CODE

kubectl apply -f cstor-csi-disk.yaml

Following the above command, you will have a cStor storage class. You can verify it by using the following command:

CODE

kubectl get sc

The output of the command shows the three possible storage classes - one for cStor and the other two for the local provisioner:

CODE

cstor-csi-disk               cstor.csi.openebs.io   Delete          Immediate              true                   34h
openebs-device               openebs.io/local       Delete          WaitForFirstConsumer   false                  34h
openebs-hostpath (default)   openebs.io/local       Delete          WaitForFirstConsumer   false                  34h
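To confirm that the cStor storage class can actually provision a volume, a small test claim can be created. The sketch below writes a standard PersistentVolumeClaim manifest that references cstor-csi-disk. The file name cstor-pvc-example.yaml, the claim name demo-cstor-pvc, and the 5Gi size are illustrative assumptions, not values from the Expertflow repository.

```shell
# Write a test PersistentVolumeClaim that uses the cstor-csi-disk storage class.
# File name, claim name and size are hypothetical and only for illustration.
cat > cstor-pvc-example.yaml <<'EOF'
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: demo-cstor-pvc
spec:
  storageClassName: cstor-csi-disk
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
EOF

# Apply and inspect on a live cluster (assumes kubectl access):
#   kubectl apply -f cstor-pvc-example.yaml
#   kubectl get pvc demo-cstor-pvc
#   kubectl delete -f cstor-pvc-example.yaml
```

Since the storage class uses Immediate volume binding, the claim should be bound without a consuming pod if provisioning works.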

OpenEBS for Local Storage

Deploying OpenEBS enables local host storage as target devices. We have deployed all the components using OpenEBS local path storage.

Make the storage class the default. The command below uses openebs-hostpath; replace it with your storage class name if it differs:

CODE

kubectl patch storageclass openebs-hostpath -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
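To confirm which storage class is marked as default after the patch, the output of kubectl get sc can be filtered. This is a minimal sketch that relies on the "(default)" marker shown next to the class name in the sample output earlier in this document; default_storage_class is a hypothetical helper name.

```shell
# Hypothetical helper: prints the storage class (or classes) marked "(default)".
# Reads `kubectl get sc` style output on stdin.
default_storage_class() {
  awk '$2 == "(default)" { print $1 }'
}

# Usage against a live cluster (assumes kubectl access):
#   kubectl get sc | default_storage_class
```

If more than one name is printed, more than one class carries the default annotation and the extra ones should be reviewed.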

Node Affinity-Based Deployment

To ensure the pods of the CX solution components are bound to site A as long as it is available, we have applied node affinity with an assigned weight to the CX components, which ensures pods are spun up on site A when they are first deployed. If site A becomes unavailable, the pods are shifted from site A to site B.

For the replicas of the StatefulSet components, we have applied pod anti-affinity, which ensures the replicas are spun up on site B and that a primary pod and its replica pod never run on the same node while both sites are available (i.e. MongoDB's replica pod will not be in site A if site A is also hosting MongoDB's primary pod).

Tainting Control Plane Nodes

By default, a control plane node can manage application workloads as well. This is okay for a lighter workload (~50 concurrent conversations) and CX Single Node Deployment. But, for a higher workload or a multi-cluster setup, all control plane nodes should be tainted to schedule control-plane pods only.

First, get the nodes to identify which are control-plane/master nodes.

CODE

kubectl get nodes

Then to taint the master nodes, use the following command for each master node.

CODE

kubectl taint nodes <nodename> node-role.kubernetes.io/master:NoSchedule
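When there are several master nodes, the taint commands can be generated from the node list instead of typed one by one. The sketch below assumes the ROLES column (third column of kubectl get nodes) contains the word master, as in the sample output shown later in this document. It only prints the commands so they can be reviewed before running; master_nodes and print_taint_commands are hypothetical helper names.

```shell
# Hypothetical helper: prints nodes whose ROLES column contains "master".
# Reads `kubectl get nodes` style output on stdin.
master_nodes() {
  awk 'NR > 1 && $3 ~ /(^|,)master(,|$)/ { print $1 }'
}

# Hypothetical helper: prints (does not run) one taint command per master node.
print_taint_commands() {
  master_nodes | while read -r node; do
    printf 'kubectl taint nodes %s node-role.kubernetes.io/master:NoSchedule\n' "$node"
  done
}

# Usage against a live cluster (assumes kubectl access):
#   kubectl get nodes | print_taint_commands
```

After reviewing the printed commands, they can be executed by piping the output to sh.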

Once done, allow the RKE2 Ingress controller to spin up on the control plane nodes as well.

CODE

kubectl patch ds rke2-ingress-nginx-controller -n kube-system --type json -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule

Expertflow CX Internal Components

Step 1: Clone the Expertflow CX Repository

Start with cloning the repository from GitLab.

BASH

git clone -b <branch-name> https://efcx:RecRpsuH34yqp56YRFUb@gitlab.expertflow.com/cim/cim-solution.git

Replace <branch-name> with your desired release branch name.

Change to the directory to locate all the deployment yaml files.

BASH

cd cim-solution/kubernetes

Step 2: Blueprint for Node Affinity on CX components

We use node affinity to keep the workload on the primary site at the start; only in case of downtime does the workload shift to another site.

To apply node affinity, we first need to label our worker nodes.

First, get the worker nodes by using the following command.

CODE

kubectl get nodes --show-labels
 
Output will be similar to this
 
NAME       STATUS   ROLES                       AGE   VERSION        LABELS
vm3     Ready    control-plane,etcd,master   37d   v1.24.7+k3s1   ..node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io...
vm05   Ready    <none>                      37d   v1.24.7+k3s1   ..egress.k3s.io/cluster=true,env=cti,kubernetes.io/arch=amd64..
vm1    Ready    control-plane,etcd,master   37d   v1.24.7+k3s1   ..egress.k3s.io/cluster=true,env=cim,kubernetes.io/arch=amd64..

Now label the worker nodes. Label the primary site worker nodes with the "primary" value:

CODE

kubectl label nodes <node name> site=primary
 
Example:
kubectl label nodes vm05 site=primary

Use the above command to label all the primary site worker nodes. You can label your secondary site worker nodes similarly if needed.

Step 3: Blueprint for applying Affinity on the Deployments.

Once labeling is completed for the primary site worker nodes, verify if the affinity block has been added in the pod deployment yaml files to allow the pods to always start spinning up in the primary site.

Yaml files can be found in the following directory.

CODE

cd /cim/Deployments

Open any of the YAML files and locate the node affinity block. In the repository it appears commented out, as in the following example; it must be uncommented in every deployment YAML file.

CODE

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    ef.service: ef-agent-manager
    ef: expertflow
  name: ef-agent-manager
  namespace: expertflow

spec:
  replicas: 1
  selector:
    matchLabels:
      ef.service: ef-agent-manager

  strategy: {}
  template:
    metadata:
      labels:
        ef.service: ef-agent-manager
        ef: expertflow
    spec:
      imagePullSecrets:
      - name: expertflow-reg-cred
#      affinity:
#        nodeAffinity:
#          preferredDuringSchedulingIgnoredDuringExecution:
#          - weight: 50
#            preference:
#              matchExpressions:
#              - key: site
#                operator: In
#                values:
#                - worker-a

These are the exact lines that need to be uncommented in all the deployment YAML files for node affinity to work. Change the entry under "values" to the label you assigned to the worker nodes in Step 2. In the above example, we change the value from "worker-a" to "primary", or to whatever label you have assigned to your worker nodes.

CODE

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            preference:
              matchExpressions:
              - key: site
                operator: In
                values:
                - primary
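Uncommenting the block in every deployment file by hand is error-prone, so the edit could be scripted. The following is a sketch only: it assumes the commented block looks like the example in this step (a leading # directly in front of the original indentation, starting at the affinity line and ending at the worker-a line) and that the label value contains no / character. enable_node_affinity is a hypothetical helper name. It prints the result to standard output rather than editing the file in place, so the change can be reviewed first.

```shell
# Hypothetical helper: uncomments the node affinity block in a deployment
# YAML file and replaces the "worker-a" value with the given site label.
#   $1 = path to the deployment YAML file
#   $2 = site label value assigned in Step 2 (e.g. primary)
# Writes the modified YAML to stdout; the input file is left untouched.
enable_node_affinity() {
  sed -e '/^# *affinity:/,/^# *- worker-a$/ s/^#//' \
      -e "s/^\( *\)- worker-a\$/\1- $2/" "$1"
}

# Example (file name as used in Step 4 of this document):
#   enable_node_affinity ef-conversation-controller-deployment.yaml primary
```

Redirecting the output to a new file and comparing it with the original (for example with diff) is a reasonable way to check the result before replacing the file.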

Step 4: Disabling Init Containers in Deployment files.

Init containers also need to be disabled in the deployment files located in the same path:

CODE

cd /cim/Deployments

First, disable the init containers in the ef-conversation-controller deployment YAML.

CODE

vi ef-conversation-controller-deployment.yaml

Make sure that all of the lines below, which define the init container, are commented out.

CODE

##      initContainers:
##      - name: wait-for
##        image: ghcr.io/patrickdappollonio/wait-for:latest
##        imagePullPolicy: IfNotPresent
##        env:
##        - name: MONGO_HOST
##          value: "mongo-mongodb.ef-external.svc.cluster.local:27017"
##        - name: REDIS_HOST
##          value: "redis-master.ef-external.svc.cluster.local:6379"
##        command:
##          - /wait-for
##        args:
##          - --host="$(MONGO_HOST)"
##          - --host="$(REDIS_HOST)"
##          - --verbose

Once this is done, comment out the same block in the routing engine deployment YAML.

CODE

vi ef-routing-engine-deployment.yaml

Again comment out the following lines.

CODE

##      initContainers:
##      - name: wait-for
##        image: ghcr.io/patrickdappollonio/wait-for:latest
##        imagePullPolicy: IfNotPresent
##        env:
##        - name: MONGO_HOST
##          value: "mongo-mongodb.ef-external.svc.cluster.local:27017"
##        - name: REDIS_HOST
##          value: "redis-master.ef-external.svc.cluster.local:6379"
##        command:
##          - /wait-for
##        args:
##          - --host="$(MONGO_HOST)"
##          - --host="$(REDIS_HOST)"
##          - --verbose

The table below shows which scheduling mechanism (node affinity, pod anti-affinity, or nodeSelector) is applied to each component.

Components	Node Affinity	Pod anti affinity	NodeSelector
CIM Solution Components (CX Components)	✓
Postgres (master)			✓
Postgres (replica)
Mongodb (replica and master)		✓
Redis (master)			✓
Redis (replica)		✓
Keycloak	✓

Step 5: Install Rancher (Optional)

Rancher is a web UI for managing Kubernetes clusters.

To deploy the Rancher web UI, add the Helm repository.

Install cert-manager, which is required by Rancher.

After installation, wait at least 30 seconds for cert-manager to start.

BASH

helm upgrade --install=true cert-manager \
--wait=true \
--timeout=10m0s \
 --debug \
--namespace cert-manager \
--create-namespace \
--version v1.10.0 \
--values=external/cert-manager/values.yaml \
external/cert-manager

Use the following command to see if all cert-manager pods are up and running.

BASH

kubectl get pods -n cert-manager

Deploy Rancher using the Helm chart.

BASH

helm upgrade --install=true --wait=true   --timeout=10m0s  --debug  rancher --namespace cattle-system --create-namespace --values=external/rancher/values.yaml  external/rancher

By default, Rancher is not accessible outside the cluster. To make it accessible, change the service type from ClusterIP to NodePort:

BASH

kubectl -n cattle-system patch svc rancher  -p '{"spec": {"type": "NodePort"}}'

Get the Rancher Service port by using the following command:

BASH

kubectl -n cattle-system  get svc rancher -o go-template='{{(index .spec.ports 1).nodePort}}';echo;

Now you can access the Rancher web UI. It will be accessible at any-node-ip-of-cluster:PORT-from-above-command.

The default username/password is admin/ExpertflowRNCR.

Step 6: Create Namespace

All Expertflow components are deployed in a separate namespace inside the Kubernetes called 'expertflow'.

Run the following command on the master node to create the namespace.

BASH

kubectl create namespace expertflow

All external components will be deployed in ef-external namespace. Run the following command on the master node.

BASH

kubectl create namespace ef-external

Step 7: Image Pull secret

For expertflow namespace, use the following command:

BASH

kubectl apply -f pre-deployment/registryCredits/ef-imagePullSecret-expertflow.yaml

Run the following command for ef-external namespace:

BASH

kubectl apply -f pre-deployment/registryCredits/ef-imagePullSecret-ef-external.yaml

Step 8: Update the FQDN

Decide the FQDN to be used in your solution and replace <FQDN> in the following command with your actual FQDN:

BASH

sed -i 's/devops[0-9]*.ef.com/<FQDN>/g' cim/ConfigMaps/* pre-deployment/grafana/*  pre-deployment/keycloak/*  cim/Ingresses/traefik/*

When deploying the solution on a single control plane node, replace <FQDN> with the FQDN of your master node. For a multi-control-plane setup, use the VIP or the FQDN associated with the VIP.
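Before running the in-place replacement across the repository, the substitution can be previewed on a sample line. The sketch below applies the same pattern to standard input, with the dots escaped so that they match only literal dots. The FQDN cx.example.com and the sample URL are hypothetical values used for illustration; replace_fqdn is a hypothetical helper name.

```shell
# Hypothetical FQDN used only for illustration.
FQDN="cx.example.com"

# Hypothetical helper: applies the FQDN substitution to stdin.
#   $1 = the FQDN to substitute in
replace_fqdn() {
  sed "s/devops[0-9]*\.ef\.com/$1/g"
}

# Preview on a sample line; prints https://cx.example.com/unified-agent
printf '%s\n' 'https://devops243.ef.com/unified-agent' | replace_fqdn "$FQDN"
```

The same expression can then be used with sed -i on the files listed in the command above.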

Expertflow CX External Components

Following are the required external components that need to be deployed with Expertflow CX:

Blueprint for Pod anti-affinity on Replica Pods For Helm Based Deployments (Optional)

These methods apply to Helm-based deployments, i.e. all the external components, so that their replica pods do not spin up on the same node when that does not already happen on its own. The changes can be made in any Helm values file that already contains the values below.

1. To apply pod anti-affinity, we first need to label pods so they can be segregated based on their labeling. To label pods, the following should be added to the value file:

CODE

  labels:
    app: store

Next, once the pod has been assigned a label, set pod anti-affinity by changing the following values in the Helm file:

CODE

  affinity:
     podAntiAffinity:
       requiredDuringSchedulingIgnoredDuringExecution:
       - labelSelector:
           matchExpressions:
           - key: app
             operator: In
             values:
             - store
         topologyKey: "kubernetes.io/hostname"

In the above code, topologyKey refers to the node label across which pod anti-affinity is applied. With kubernetes.io/hostname, no two pods carrying the label app with value store are scheduled on the same node.
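Once a component is deployed with this rule, the placement can be checked by listing each labeled pod with the node it runs on and looking for nodes that appear more than once. The sketch below is a hypothetical helper (nodes_with_multiple_pods) that expects two columns, pod name and node name, such as those produced by the custom-columns output shown in the usage comment. The label app=store and the ef-external namespace follow the examples in this document.

```shell
# Hypothetical helper: prints every node that hosts more than one of the listed pods.
# Reads two-column "NAME NODE" output (with a header line) on stdin.
nodes_with_multiple_pods() {
  awk 'NR > 1 { print $2 }' | sort | uniq -d
}

# Usage against a live cluster (assumes kubectl access):
#   kubectl get pods -n ef-external -l app=store \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
#     | nodes_with_multiple_pods
```

An empty result indicates that no node is running two pods with the selected label.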
Deploying External Components:

1. PostgreSQL

To deploy Postgres in high availability, we use Pgpool, which provides automated failover and keeps the database available if any of the master pods on a node is affected. The deployment can be done directly from the Helm chart; the only change needed is the number of replica pods to deploy. This value is set under both postgresql and pgpool.

To change the number of replicas, edit the following values in the values.yaml file, which can be opened using the command below.

CODE

vi external/bitnami/postgresql-ha/values.yaml

CODE

postgresql:

  image:
    registry: docker.io
    repository: bitnami/postgresql-repmgr
    tag: 15.3.0-debian-11-r2
    digest: ""
    ## Specify a imagePullPolicy. Defaults to 'Always' if image tag is 'latest', else set to 'IfNotPresent'
    ## ref: https://kubernetes.io/docs/user-guide/images/#pre-pulling-images
    ##
    pullPolicy: IfNotPresent
    ## Optionally specify an array of imagePullSecrets.
    ## Secrets must be manually created in the namespace.
    ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
    ## Example:
    ## pullSecrets:
    ##   - myRegistryKeySecretName
    ##
    pullSecrets: []
    ## Set to true if you would like to see extra information on logs
    ##
    debug: false
  ## @param postgresql.labels Labels to add to the StatefulSet. Evaluated as template
  ##
  labels: {}
  replicaCount: 3

--------------------------------------------------------------------------------------------------------------------------------------------------------

pgpool:
  ## Bitnami Pgpool image
  ## ref: https://hub.docker.com/r/bitnami/pgpool/tags/
  ## @param pgpool.image.registry Pgpool image registry
  ## @param pgpool.image.repository Pgpool image repository
  ## @param pgpool.image.tag Pgpool image tag
  ## @param pgpool.image.digest Pgpool image digest in the way sha256:aa.... Please note this parameter, if set, will override the tag
  ## @param pgpool.image.pullPolicy Pgpool image pull policy
  ## @param pgpool.image.pullSecrets Specify docker-registry secret names as an array
  ## @param pgpool.image.debug Specify if debug logs should be enabled
  ##
  image:
    registry: docker.io
    repository: bitnami/pgpool
    tag: 4.4.2-debian-11-r33
    digest: ""
    ## Specify a imagePullPolicy. Defaults to 'Always' if image tag is 'latest', else set to 'IfNotPresent'
    ## ref: https://kubernetes.io/docs/user-guide/images/#pre-pulling-images

  replicaCount: 3

By default the value is set to 3, based on the number of nodes. It can be changed as needed.

PostgreSQL is deployed as a central datastore for both LicenseManager and Keycloak.

Create configmap for PostgreSQL to load the LicenseManager database and create keycloak_db:

CODE

kubectl -n ef-external  create configmap ef-postgresql-license-manager-cm --from-file=./pre-deployment/licensemanager/licensemanager.sql

The Helm command for PostgreSQL on clusters is given below:

CODE

helm upgrade --install=true --wait=true  --timeout=10m0s  --debug --namespace=ef-external --values=external/bitnami/postgresql-ha/values.yaml ef-postgresql external/bitnami/postgresql-ha

2. Keycloak

Since Keycloak does not offer high availability by itself, we provide it with the external Postgres deployed above, replacing its internal database with the externally deployed Postgres.

The following changes need to be made in Keycloak's Helm values-ha.yaml file, which can be found in the following location:

CODE

cd /external/bitnami/keycloak/

CODE

postgresql:
  enabled: false
  auth:
    username: bn_keycloak
    password: ""
    database: bitnami_keycloak
    existingSecret: ""
  architecture: standalone
## External PostgreSQL configuration
## All of these values are only used when postgresql.enabled is set to false
## @param externalDatabase.host Database host
## @param externalDatabase.port Database port number
## @param externalDatabase.user Non-root username for Keycloak
## @param externalDatabase.password Password for the non-root username for Keycloak
## @param externalDatabase.database Keycloak database name
## @param externalDatabase.existingSecret Name of an existing secret resource containing the database credentials
## @param externalDatabase.existingSecretPasswordKey Name of an existing secret key containing the database credentials
## EXPERTFLOW
externalDatabase:
  host: "ef-postgresql-postgresql-ha-pgpool.ef-external.svc.cluster.local"
  port: 5432
  user: sa
  database: keycloak_db
  password: "Expertflow123"
  existingSecret: ""
  existingSecretPasswordKey: ""

On the master node, create a global configmap for Keycloak. Change the hostname and other parameters in the ef-keycloak-configmap.yaml file before applying this command:

CODE

kubectl apply -f pre-deployment/keycloak/ef-keycloak-configmap.yaml

The Helm command for Keycloak is given below:

CODE

helm upgrade --install=true --wait=true --timeout=10m0s --debug --namespace=ef-external --values=external/bitnami/keycloak/values-ha.yaml keycloak  external/bitnami/keycloak/

3. MongoDB

To enable high availability for MongoDB, make the following changes in MongoDB's Helm values file. Set the arbiter's enabled flag to true and apply affinity as shown below.

Set replicaCount according to the available worker nodes, enable replicaSetHostnames, and set the appropriate replicaSetName, as shown below.

The Helm file values-ha.yaml can be located at

CODE

cd /external/bitnami/mongodb/

CODE

arbiter:
  affinity: {}
  annotations: {}
  args: []
  command: []
  configuration: ""
  containerPorts:
    mongodb: 27017
  containerSecurityContext:
    enabled: true
    runAsNonRoot: true
    runAsUser: 1001
  customLivenessProbe: {}
  customReadinessProbe: {}
  customStartupProbe: {}
  enabled: true
-------------------------------------------------------------------------------------
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - store
      topologyKey: kubernetes.io/hostname
annotations: {}
--------------------------------------------------------------------------------------

replicaCount: 3
replicaSetConfigurationSettings:
  configuration:
    catchUpTimeoutMillis: 30000
    chainingAllowed: false
    electionTimeoutMillis: 10000
    heartbeatIntervalMillis: 2000
    heartbeatTimeoutSecs: 20
  enabled: true
replicaSetHostnames: true
replicaSetName: expertflow

The Helm command for MongoDB is given below:

CODE

helm upgrade --install=true --wait=true --timeout=10m0s  --debug --namespace=ef-external --values=external/bitnami/mongodb/values-ha.yaml mongo  external/bitnami/mongodb/

4. MinIO

To deploy MinIO in high availability, make the following changes to the Helm values file for MinIO. Set mode to distributed. Set replicaCount as needed, but it must be an even number greater than or equal to 4. Set zones to 1 and drivesPerNode to 1 as well. Apply affinity as shown below: first set the pod label, then use that value in the affinity block.

Helm file values-ha.yaml can be located at

CODE

cd /external/bitnami/minio/

CODE

clientImage:
  registry: docker.io
  repository: bitnami/minio-client
  tag: 2022.12.13-debian-11-r0
  digest: ""
## @param mode MinIO® server mode (`standalone` or `distributed`)
## ref: https://docs.minio.io/docs/distributed-minio-quickstart-guide
##
mode: distributed


--------------------------------------------------------------------------------------

statefulset:
  ## @param statefulset.updateStrategy.type StatefulSet strategy type
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies
  ## e.g:
  ## updateStrategy:
  ##  type: RollingUpdate
  ##  rollingUpdate:
  ##    maxSurge: 25%
  ##    maxUnavailable: 25%
  ##
  updateStrategy:
    type: RollingUpdate
  ## @param statefulset.podManagementPolicy StatefulSet controller supports relax its ordering guarantees while preserving its uniqueness and identity guarantees. There are two valid pod management policies: OrderedReady and Parallel
  ## ref: https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#pod-management-policy
  ##
  podManagementPolicy: Parallel
  ## @param statefulset.replicaCount Number of pods per zone (only for MinIO® distributed mode). Should be even and `>= 4`
  ##
  replicaCount: 4
  zones: 1
  ## @param statefulset.drivesPerNode Number of drives attached to every node (only for MinIO® distributed mode)
  ##
  drivesPerNode: 1
-------------------------------------------------------------------------------------------------------------------------
podLabels:
  app: minio
nodeAffinityPreset:
  ## @param nodeAffinityPreset.type Node affinity preset type. Ignored if `affinity` is set. Allowed values: `soft` or `hard`
  ##
  type: ""
  ## @param nodeAffinityPreset.key Node label key to match. Ignored if `affinity` is set.
  ## E.g.
  ## key: "kubernetes.io/e2e-az-name"
  ##
  key: ""
  ## @param nodeAffinityPreset.values Node label values to match. Ignored if `affinity` is set.
  ## E.g.
  ## values:
  ##   - e2e-az1
  ##   - e2e-az2
  ##
  values: []
## @param affinity Affinity for pod assignment. Evaluated as a template.
## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
## Note: podAffinityPreset, podAntiAffinityPreset, and nodeAffinityPreset will be ignored when it's set
##

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - minio
      topologyKey: "kubernetes.io/hostname"

CODE

helm upgrade --install=true --wait=true --timeout=10m0s --debug --namespace=ef-external --values=external/bitnami/minio/values-ha.yaml minio external/bitnami/minio/
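The replicaCount rule for distributed mode described in this section (an even number, at least 4) can be checked before running the Helm command. The sketch below is a small, self-contained validation; valid_minio_replica_count is a hypothetical helper name and is not part of the MinIO chart.

```shell
# Hypothetical helper: succeeds only if the argument is a whole number
# that is even and greater than or equal to 4.
valid_minio_replica_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge 4 ] && [ $(( $1 % 2 )) -eq 0 ]
}

# Example:
#   valid_minio_replica_count 4 && echo "replicaCount is acceptable"
```

Note that with the hostname-based anti-affinity shown above, each MinIO pod also needs its own schedulable node.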

5. Redis Sentinel

To provide high availability for Redis, we have opted to deploy Redis Sentinel, a high-availability solution designed to enhance the reliability and fault tolerance of Redis. At its core, Redis Sentinel enables the creation of a robust Redis deployment consisting of multiple Redis instances and Sentinel nodes. These Sentinel nodes constantly monitor the health of the Redis instances and automatically detect any failures or performance degradation. Upon detecting an issue, Sentinel orchestrates the failover process, promoting a standby Redis instance to become the new master.

Enable Sentinel and Set the Number of Replicas

To enable Sentinel, edit the Helm values file for Redis with the following changes.

The Helm file values-ha.yaml can be located at

CODE

cd /external/bitnami/redis/

CODE

sentinel:
  ## @param sentinel.enabled Use Redis® Sentinel on Redis® pods.
  ## IMPORTANT: this will disable the master and replicas services and
  ## create a single Redis® service exposing both the Redis and Sentinel ports
  ##
  enabled: true
  ## Bitnami Redis® Sentinel image version

Set the enabled flag to true. This allows a replica to become the master if one of the pods on a node is affected.

To set the number of replicas, change the following value in the Redis Helm values file.

CODE

replica:
  ## @param replica.replicaCount Number of Redis® replicas to deploy
  ##
  replicaCount: 3
  ## @param replica.configuration Configuration for Redis® replicas nodes
  ## ref: https://redis.io/topics/config
  ##

Set the number of replicas as needed, based on the number of nodes.

CODE

helm upgrade --install=true --wait=true --timeout=10m0s --debug --namespace=ef-external --values=external/bitnami/redis/values-ha.yaml redis-ha external/bitnami/redis/

Deploying Stateful Components

ActiveMQ does not provide high availability on its own. To work around this, we give it cloud-replicated storage to keep its data in, using the cStor storage deployed above. Its connection strings also need to be updated if they are not already. The changes made to its deployment YAML are listed below.

Changes in Spec: Env

The file to be edited ef-amq-statefulset-ha.yaml is located at cim/StatefulSet/

CODE

vi cim/StatefulSet/ef-amq-statefulset-ha.yaml

Here we will be providing it with connection details for redis as well as postgres.

CODE

         env:
           - name: REDIS_HOST
             value: redis-master.ef-external.svc.cluster.local
           - name: REDIS_PORT
             value: "6379"
           - name: REDIS_PASSWORD
             value: Expertflow123
           - name: REDIS_SSL_ENABLED
             value: "false"
           - name: REDIS_MAX_ACTIVE
             value: "100"
           - name: REDIS_MAX_IDLE
             value: "100"
           - name: REDIS_MAX_WAIT
             value: "-1"
           - name: REDIS_MIN_IDLE
             value: "50"
           - name: REDIS_TIMEOUT
             value: "2000"
           - name: REDIS_SENTINEL_ENABLE
             value: "true"
           - name : REDIS_SENTINEL_MASTER
             value: "expertflow"
           - name : REDIS_SENTINEL_NODES
             value: "redis-ha-node-0.redis-ha-headless.ef-external.svc.cluster.local:26379,redis-ha-node-1.redis-ha-headless.ef-external.svc.cluster.local:26379,redis-ha-node-2.redis-ha-headless.ef-external.svc.cluster.local:26379"
           - name : REDIS_SENTINEL_PASSWORD
             value: "Expertflow123"
           - name: DB_URL
             value: ef-postgresql-postgresql-ha-pgpool.ef-external.svc
           - name: DB_USER
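The REDIS_SENTINEL_NODES value above follows a regular pattern: one headless-service hostname per Redis pod, each with the Sentinel port. When the replica count changes, the list has to change with it. The sketch below generates the value, assuming the release name redis-ha, the namespace ef-external, and port 26379 as used in this document; sentinel_nodes is a hypothetical helper name.

```shell
# Hypothetical helper: prints a comma-separated Sentinel node list.
#   $1 = number of Redis pods (replica.replicaCount)
#   $2 = Helm release name (assumed default: redis-ha)
#   $3 = namespace (assumed default: ef-external)
sentinel_nodes() {
  count=$1
  release=${2:-redis-ha}
  ns=${3:-ef-external}
  i=0
  out=""
  while [ "$i" -lt "$count" ]; do
    host="$release-node-$i.$release-headless.$ns.svc.cluster.local:26379"
    out="${out:+$out,}$host"   # add a comma only between entries
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Example: sentinel_nodes 3
```

With a count of 3 and the default arguments, the output matches the REDIS_SENTINEL_NODES value shown above.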

StatefulSet

ActiveMQ should be deployed before all other solution components. To deploy ActiveMQ as a StatefulSet, run:

CODE

kubectl apply -f cim/StatefulSet/ef-amq-statefulset-ha.yaml

Wait for the AMQ StatefulSet to become ready:

CODE

kubectl wait pods ef-amq-0  -n ef-external   --for condition=Ready --timeout=600s

Deploying CX Components

ConfigMaps

Conversation Manager ConfigMaps

If you need to change the default training, please update the corresponding files.

CODE

kubectl -n expertflow create configmap ef-conversation-controller-actions-cm --from-file=pre-deployment/conversation-Controller/actions
kubectl -n expertflow create configmap ef-conversation-controller-actions-pycache-cm --from-file=pre-deployment/conversation-Controller/__pycache__
kubectl -n expertflow create configmap ef-conversation-controller-actions-utils-cm --from-file=pre-deployment/conversation-Controller/utils
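
The create commands above fail if a ConfigMap with the same name already exists, for example when the training files are updated and the step is repeated. A common workaround is to render the ConfigMap with a client-side dry run and pipe it into kubectl apply. The helper below is a minimal sketch of that pattern; the name apply_configmap_from_path is hypothetical.

```shell
# Sketch: create or update a ConfigMap from a file or directory.
# $1 = namespace, $2 = ConfigMap name, $3 = file or directory path
apply_configmap_from_path() {
  # Render the manifest locally, then let "apply" create or update it.
  kubectl -n "$1" create configmap "$2" --from-file="$3" --dry-run=client -o yaml |
    kubectl -n "$1" apply -f -
}
```

For example, apply_configmap_from_path expertflow ef-conversation-controller-actions-cm pre-deployment/conversation-Controller/actions can be rerun safely after the files change.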

Reporting Connector ConfigMaps

Update the fqdn, browser_language, connection_type, and database server connection parameters in the file pre-deployment/reportingConnector/reporting-connector.conf, and then deploy.

CODE

kubectl -n expertflow create configmap ef-reporting-connector-conf --from-file=pre-deployment/reportingConnector/reporting-connector.conf

kubectl -n expertflow create configmap ef-reporting-connector-cron --from-file=pre-deployment/reportingConnector/reporting-connector-cron

Unified Agent ConfigMaps

Translations for the unified agent are applicable in HC-4.1 and later releases.

CODE

kubectl -n expertflow create configmap ef-app-translations-cm --from-file=pre-deployment/app-translations/unified-agent/i18n

Some ConfigMap values need to be uncommented to enable HA:

1. Redis Sentinel
2. MongoDB

Edit the connection_env file in cim/ConfigMaps:

CODE

vi cim/ConfigMaps/ef-connection-env-configmap.yaml

Enable the Redis Sentinel flag in this file. Then comment out the single-host MongoDB entry and uncomment the multi-host MongoDB entry, as shown in the example below:

CODE

  ##MONGODB_HOST: mongodb://mongo-mongodb.ef-external.svc.cluster.local
  MONGODB_HOST: mongodb://mongo-mongodb-0.mongo-mongodb-headless.ef-external.svc.cluster.local:27017,mongo-mongodb-1.mongo-mongodb-headless.ef-external.svc.cluster.local:27017,mongo-mongodb-2.mongo-mongodb-headless.ef-external.svc.cluster.local:27017/?replicaSet=expertflow&tls=false&ssl=false&retrywrites=true&w=majority
  # ...
  REDIS_SENTINEL_ENABLE: "true"
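
Before applying the ConfigMap, it can be useful to confirm that the multi-host connection string really lists every replica set member. The sketch below splits a mongodb:// URI into its host entries using only shell parameter expansion; the function name mongo_uri_hosts is hypothetical, and the sketch does not handle mongodb+srv:// URIs.

```shell
# Sketch: print one host:port entry per line from a mongodb:// URI.
mongo_uri_hosts() (
  hosts="${1#mongodb://}"   # strip the scheme
  hosts="${hosts%%/*}"      # drop "/database?options" if present
  hosts="${hosts##*@}"      # drop optional "user:password@" credentials
  printf '%s\n' "$hosts" | tr ',' '\n'
)
```

Piping the output through grep -c . gives the number of hosts, which should be 3 for the multi-host value shown above and 1 for the single-host value.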

Now make changes to the License Manager ConfigMap:

CODE

vi cim/ConfigMaps/ef-license-manager-configmap.yaml

Uncomment the postgres-ha DB_URL as mentioned below and comment out the simple postgres DB URL.

CODE

   #DB_URL: jdbc:postgresql://ef-postgresql.ef-external.svc.cluster.local:5432/licenseManager
   DB_URL: jdbc:postgresql://ef-postgresql-postgresql-ha-pgpool.ef-external.svc.cluster.local:5432/licenseManager

Apply all the ConfigMaps in the ConfigMaps folder using:

CODE

kubectl apply -f cim/ConfigMaps/

Services

Create Services for all deployed EF components:

CODE

kubectl apply -f cim/Services/

Services must be created before Deployments.

Deployments

Apply all the Deployment manifests:

CODE

kubectl apply -f cim/Deployments/
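
After applying the manifests, it may help to wait until every Deployment reports the Available condition before moving on to the later steps. The helper below is a minimal sketch built on kubectl wait; the name wait_for_deployments and the 900s default timeout are arbitrary choices, not values taken from the product documentation.

```shell
# Sketch: wait until all Deployments in a namespace are Available.
# $1 = namespace, $2 = timeout (defaults to 900s, an arbitrary choice)
wait_for_deployments() {
  kubectl -n "$1" wait deployment --all --for=condition=Available --timeout="${2:-900s}"
}
```

For the CX components, the call would be wait_for_deployments expertflow.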

Team Announcement CronJob

Team announcement cron job is applicable in HC-4.2 and later releases.

CODE

kubectl apply -f pre-deployment/team-announcement/

Import your own certificates

Now generate a secret with the certificate files. The server.key and server.crt files must be available on the machine in the pre-deployment/certificates/ directory.

For the expertflow namespace:

CODE

kubectl -n expertflow create secret tls ef-ingress-tls-secret \
--key pre-deployment/certificates/server.key \
--cert pre-deployment/certificates/server.crt

And for the ef-external namespace:

CODE

kubectl -n ef-external create secret tls ef-ingress-tls-secret \
--key pre-deployment/certificates/server.key \
--cert pre-deployment/certificates/server.crt
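
A TLS secret built from a certificate and a key that do not belong together is a common cause of ingress errors. The sketch below compares the public key embedded in the certificate with the public key derived from the private key, using standard openssl subcommands. The function name cert_matches_key is hypothetical.

```shell
# Sketch: succeed only if the certificate and private key form a pair.
# $1 = certificate file (PEM), $2 = private key file (PEM)
cert_matches_key() (
  # Public key embedded in the certificate.
  cert_pub="$(openssl x509 -in "$1" -noout -pubkey)" || exit 1
  # Public key derived from the private key.
  key_pub="$(openssl pkey -in "$2" -pubout)" || exit 1
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
)
```

For example, cert_matches_key pre-deployment/certificates/server.crt pre-deployment/certificates/server.key returns a zero exit status when the two files match.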

Import your own certificates for RKE

Generate a self-signed certificate and then create the secrets with the following commands.

Replace <FQDN> with your actual FQDN before running these commands.

BASH

openssl req -x509 \
-newkey rsa:4096 \
-sha256 \
-days 3650 \
-nodes \
-keyout <FQDN>.key \
-out <FQDN>.crt \
-subj "/CN=<FQDN>" \
-addext "subjectAltName=DNS:www.<FQDN>,DNS:<FQDN>"

For the expertflow namespace:

BASH

kubectl -n expertflow create secret tls ef-ingress-tls-secret --key <FQDN>.key --cert <FQDN>.crt

And for the ef-external namespace:

BASH

kubectl -n ef-external create secret tls ef-ingress-tls-secret --key <FQDN>.key --cert <FQDN>.crt

Ingress

For K3s-based deployments using the Traefik Ingress Controller

Apply the Ingress Routes.

CODE

kubectl apply -f cim/Ingresses/traefik/

For RKE2-based Ingresses using Ingress-Nginx Controller

Decide on the FQDN to be used in your solution, and replace <FQDN> in the command below with your actual FQDN:

CODE

sed -i 's/devops[0-9]*\.ef\.com/<FQDN>/g' cim/Ingresses/nginx/*
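
The substitution can also be wrapped in a small helper that takes the FQDN as an argument, which avoids editing the placeholder by hand. This is a minimal sketch assuming GNU sed (for the in-place -i option) and an FQDN that contains no "/" or "&" characters; the function name set_ingress_fqdn is hypothetical. The dots in the pattern are escaped so that only literal devopsNNN.ef.com host names are replaced.

```shell
# Sketch: replace the default devopsNNN.ef.com host in ingress manifests.
# $1 = target FQDN, remaining arguments = manifest files to edit in place
set_ingress_fqdn() (
  fqdn="$1"
  shift
  # Escaped dots match literal "." only, not any character.
  sed -i "s/devops[0-9]*\.ef\.com/${fqdn}/g" "$@"
)
```

For example, set_ingress_fqdn cx.example.test cim/Ingresses/nginx/* rewrites every manifest in the folder, where cx.example.test stands in for the real FQDN.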

Apply the Ingress Routes.

BASH

kubectl apply -f cim/Ingresses/nginx/

Channel Manager Icons Bootstrapping

Once all expertflow service pods are completely up and running, execute these steps so that the media channel icons render successfully.

Run the minio-helper pod using:

CODE

kubectl apply -f scripts/minio-helper.yaml

Wait for the pod to start before copying the media icons from the external folder into the helper pod.

CODE

kubectl -n ef-external --timeout=90s wait --for=condition=ready pod minio-helper

Wait for the response "pod/minio-helper condition met".

Copy the files to the minio-helper pod.

CODE

kubectl -n ef-external cp post-deployment/data/minio/bucket/default minio-helper:/tmp/

Copy the icon-helper.sh script into the minio-helper pod.

CODE

kubectl -n ef-external cp scripts/icon-helper.sh minio-helper:/tmp/

Execute icon-helper.sh using:

CODE

kubectl -n ef-external exec -it minio-helper -- /bin/sh /tmp/icon-helper.sh

Delete the minio-helper pod:

CODE

kubectl delete -f scripts/minio-helper.yaml
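
The bootstrapping steps above can be chained so that each step only runs if the previous one succeeded, and the helper pod is removed even when a step fails. The following is a sketch of that idea, not an official script; the function name bootstrap_channel_icons is hypothetical. The -it flags are dropped from kubectl exec because the sketch assumes non-interactive use.

```shell
# Sketch: run the icon bootstrapping steps in order and always clean up.
bootstrap_channel_icons() (
  ns="ef-external"
  kubectl apply -f scripts/minio-helper.yaml &&
    kubectl -n "$ns" wait --for=condition=ready pod minio-helper --timeout=90s &&
    kubectl -n "$ns" cp post-deployment/data/minio/bucket/default minio-helper:/tmp/ &&
    kubectl -n "$ns" cp scripts/icon-helper.sh minio-helper:/tmp/ &&
    kubectl -n "$ns" exec minio-helper -- /bin/sh /tmp/icon-helper.sh
  status=$?
  # Remove the helper pod whether or not the earlier steps succeeded.
  kubectl delete -f scripts/minio-helper.yaml
  exit "$status"
)
```

The function returns the exit status of the first failing step, or zero if all steps succeeded.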

Chat Initiation URL

The web-init-widget can now call the CIM deployment from within the URL:

CODE

https://{FQDN}/web-widget/cim-web-init-widget/?customerWidgetUrl=https://{FQDN}/customer-widget&widgetIdentifier=Web&serviceIdentifier=1122&channelCustomerIdentifier=1133

For the chat history, use the following URL:

CODE

https://{FQDN}/web-widget/chat-transcript/

{FQDN} → FQDN of the Kubernetes deployment
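
The chat initiation URL above can be assembled from its parts, which makes it easier to generate links for different widgets or services. The helper below is a minimal sketch; the function name chat_init_url is hypothetical, and it assumes the parameter values need no URL encoding.

```shell
# Sketch: build the chat initiation URL from its components.
# $1 = FQDN, $2 = widgetIdentifier, $3 = serviceIdentifier,
# $4 = channelCustomerIdentifier
chat_init_url() {
  printf 'https://%s/web-widget/cim-web-init-widget/?customerWidgetUrl=https://%s/customer-widget&widgetIdentifier=%s&serviceIdentifier=%s&channelCustomerIdentifier=%s\n' \
    "$1" "$1" "$2" "$3" "$4"
}
```

Calling chat_init_url with an FQDN followed by Web, 1122 and 1133 reproduces the example URL above for that FQDN.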

Once all the components are successfully deployed, access them to configure the solution. Keycloak is accessible at http://{cim-fqdn}/auth, unified-admin at http://{cim-fqdn}/unified-admin, and so on.

HA Testing Results/Remarks

Failover Testing	Strategy	Results / Changes Observed	Remarks
Node Failure	To achieve this, we manually forced the node to shut down.	After a node goes down, Kubernetes pods start shifting after a 5-minute wait window. This is the default behaviour of Kubernetes. Previous node's pod
Node Failure CX Components		After the 5-minute window, these pods were moved to the DR site, and startup issues were noticed in the Routing Engine init and Conversation Controller init containers. To solve this, we have disabled the init containers for now.	New init containers would be designed that could have multiple endpoints for Redis and MongoDB, so that if one pod goes down they can communicate with the others.
Node Failure MongoDB		If the primary pod is affected, either of the two replicas becomes primary and resumes normal function; this has been successful. An issue was noticed with the arbiter running out of memory, and MongoDB tries to bring the original master up on another node after a few hours. Another fix applied is to remove the arbiter from the connection string so that components do not try to reach it.
Node Failure Redis		If the primary pod is affected, either of the two replicas becomes primary and resumes normal function; this has been successful.
Node Failure MinIO		Tested successfully.
Node Failure Postgres		If the primary pod is affected, either of the two replicas becomes primary and resumes normal function; this has been successful. Postgres is running in async mode.	Postgres performance tweaks include turning on async mode and disabling limits.
Node Failure Keycloak		After the 5-minute window, a new pod is created, which connects to the HA Postgres. If the pod is stuck in scheduling, use the kubectl delete pod --force command.	The kubectl delete pod --force command is needed if the pod is stuck.
Node Failure ActiveMQ		After the 5-minute window, a new pod is created, which takes over the replicated storage from cStor. The previous pod needs to be manually terminated if the new pod gets stuck in scheduling (the kubectl delete pod --force command is needed to terminate the previous one). We have moved ActiveMQ storage from local to PostgreSQL.	The kubectl delete pod --force command is needed if the pod is stuck. Storage is moved from local to Postgres.
OpenEBS cStor		If the node goes down, the virtual raw disk comes up with a new disk identifier address, which affects the replica pool. Physical storage would be preferred over a virtual disk.
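
Several rows above mention using kubectl delete pod --force when a pod is stuck after a node failure. The sketch below shows one way to list the pods still bound to a failed node and to force-delete a single stuck pod. The function names pods_on_node and force_delete_pod are hypothetical. The 5-minute window in the table matches the default 300-second toleration that Kubernetes adds to pods for not-ready and unreachable nodes.

```shell
# Sketch: list every pod scheduled on a given node.
# $1 = node name
pods_on_node() {
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}

# Sketch: force-delete one stuck pod without waiting for graceful termination.
# $1 = namespace, $2 = pod name
force_delete_pod() {
  kubectl -n "$1" delete pod "$2" --force --grace-period=0
}
```

For example, force_delete_pod ef-external ef-amq-0 removes the previous ActiveMQ pod so that its replacement can be scheduled. Force deletion skips graceful shutdown, so it is best reserved for pods on a node that is known to be down.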
