Backup and Restore Strategies and Recommendations for Expertflow CX

Backups are crucial to any organization, and Expertflow treats backups of an Expertflow CX based solution as critical. We at Expertflow take backups very seriously, and we expect the same basic hygiene from customers deploying the Expertflow CX solution. Backups must always be up to date and consistent because the data is sensitive and is constantly at risk from threats, human error or hijacking (internal or external). Prioritizing backups and protecting their integrity is a further requirement for a recoverable backup scenario. This document describes some strategies, and ultimately recommendations, for solving the problems that arise from poor management of backups.

What to Backup

Everything that is needed to restore a completely working solution. For an Expertflow CX based solution, this essentially means all of the external components listed below:

Databases such as MongoDB and PostgreSQL
Files shared with or by end users in MinIO (S3 server)
Other components that keep data outside the components mentioned above and therefore require special care

Where the data is actually stored in the Expertflow solution

All Expertflow CX components are stateless and keep their information in external components (except the Routing Engine and a few others, which rebuild their state from scratch when restarted). For example:

All components maintain their atomic state in Redis.
All CX relations, history and communication-related data are stored in MongoDB, a NoSQL database.
Keycloak and some of the EF-CX plugins use PostgreSQL as their default RDBMS backend.
All files shared by or with end users are stored internally in MinIO.
RASA-X, used for AI, saves its information in both Redis and PostgreSQL, which are managed separately.
Superset, used for BI, depends heavily on Redis and PostgreSQL for internal use.
For external reporting, Superset can consume multiple RDBMSs, which may include MSSQL-2019, MySQL and PostgreSQL (managed separately). These may or may not fall under the same backup responsibility as the EF-CX based components.

All of the above data is stored on disk according to Kubernetes StorageClass specifications, which may differ between deployments. For example, on a single-node deployment, storage may be limited to that node's disk. On a multi-node cluster, storage may be completely abstracted, and the approach used to store data is the responsibility of the Kubernetes administrator.

There is also an evolving approach in which the deployment is made on a multi-node cluster with Application Level Replication: all stateful applications such as MongoDB and MinIO use local disk instead of cloud-native storage, run multiple replicas on different nodes, and maintain replication internally at the application layer. In this scenario, Velero-based backups (described below) are not relevant, and manual backups of the application data are considered more appropriate.

Where to Backup

At least one copy of the complete solution should be available at a DR site, using any of the backup media given below.

Recommendations

LTO-9 based offline backups
Offline storage (often a cheaper solution) provided by cloud providers such as AWS, GCP etc.
SAN, NAS
S3 server, e.g. MinIO
Locally available disks/JBOD

How Often

There are broadly two categories for scheduling backups.

Streaming DR: data is continuously synchronized to a remote site. This covers DR only and lacks protection against intrusions and corruption that have already occurred but are not yet visible, such as time-bomb attacks, where attackers have already gained partial or full access to the system but have not yet corrupted the data. In that case, recovery from streaming DR will not work, because the compromised state is synchronized to the remote site as well. However, this approach works well for instant and point-in-time recovery.
Offline DR: all data is synced at regular intervals using a versioning strategy, so we can go back to the last backup that is known to work. For example, with S3-based offline backups, we can enable versioning, which gives us the ability to restore from the last known working copy of the backup. However, this approach lacks exact point-in-time recovery because the process runs on a timed schedule.
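
The "last known working copy" idea can be sketched as follows. This is a minimal illustration, not part of any Expertflow or S3 tooling: the `BackupVersion` record, its `verified` flag and the sample timestamps are all hypothetical, and it assumes each version is marked as verified only after a successful test restore.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class BackupVersion:
    version_id: str      # hypothetical identifier, e.g. an object version ID
    taken_at: datetime   # when this backup version was written
    verified: bool       # True once a test restore of this version succeeded


def last_known_good(versions: list[BackupVersion],
                    compromised_at: Optional[datetime] = None) -> Optional[BackupVersion]:
    """Return the newest verified version taken before the suspected compromise."""
    candidates = [
        v for v in versions
        if v.verified and (compromised_at is None or v.taken_at < compromised_at)
    ]
    return max(candidates, key=lambda v: v.taken_at, default=None)


# Illustrative data only.
versions = [
    BackupVersion("v1", datetime(2024, 1, 1, 0, 0), verified=True),
    BackupVersion("v2", datetime(2024, 1, 1, 6, 0), verified=True),
    BackupVersion("v3", datetime(2024, 1, 1, 12, 0), verified=False),
    BackupVersion("v4", datetime(2024, 1, 1, 18, 0), verified=True),
]

# Suppose corruption is believed to have started at 15:00.
choice = last_known_good(versions, compromised_at=datetime(2024, 1, 1, 15, 0))
print(choice.version_id if choice else "no usable backup")  # prints "v2"
```

Here `v4` is skipped because it was taken after the suspected compromise, and `v3` is skipped because it was never verified, which is why versioned backups are only as useful as their validation.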

Recommendations

Expertflow currently offers a semi-streaming DR solution, with the details given below.

A Velero-based solution is deployed with a semi-streaming mechanism:

Day 1 backup to a DR site, taken manually.
Continuous, scheduled and version-controlled backups to a remote DR site. For example:
- component level backups: only selected components are backed up, at 60-minute intervals
- namespace level backups: a complete namespace snapshot every 3-6 hours
- cluster level backups: a complete cluster snapshot two or four times a day
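
As a rough sketch, the three tiers above could be expressed as Velero schedules. The snippet only builds the command strings and does not run them. The schedule names, the namespaces `ef-external` and `expertflow`, and the label `app=mongodb` are assumptions made for illustration; substitute the values used in the actual deployment. The cron expressions pick one point from each range above (hourly, every 6 hours, twice a day).

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Tier:
    name: str                      # hypothetical schedule name
    cron: str                      # standard 5-field cron expression
    extra_args: list[str] = field(default_factory=list)


def velero_command(tier: Tier) -> str:
    """Build a shell-safe 'velero schedule create' command line for one tier."""
    args = ["velero", "schedule", "create", tier.name,
            "--schedule", tier.cron, *tier.extra_args]
    return shlex.join(args)


tiers = [
    # Component level: selected components only, every 60 minutes.
    Tier("ef-components-hourly", "0 * * * *",
         ["--include-namespaces", "ef-external", "--selector", "app=mongodb"]),
    # Namespace level: a complete namespace every 6 hours.
    Tier("ef-namespace-6h", "0 */6 * * *",
         ["--include-namespaces", "expertflow"]),
    # Cluster level: everything, twice a day (no namespace filter).
    Tier("ef-cluster-12h", "0 */12 * * *"),
]

for tier in tiers:
    print(velero_command(tier))
```

Keeping the tiers in one place makes it easier to review the overall backup load before the schedules are created on the cluster.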

The above frequencies are generic and assume that enough resources are allocated for the backup processes to run independently without affecting the performance of the EF-CX solution. If a higher backup frequency is needed, it is recommended to use a dedicated node for Velero backups (with nodeSelector or nodeAffinity).

If the operation is not 24/7, consider scheduling backups during low-workload periods for backup consistency. For continuous operations, a dedicated node set must be used, as mentioned earlier.
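
One simple way to find a low-workload period is to look at average load per hour of day and pick the quietest contiguous window. The sketch below assumes such hourly averages are already available from monitoring. The function name and the sample numbers are hypothetical and purely illustrative.

```python
def quietest_window(hourly_load: list[float], window_hours: int) -> int:
    """Return the start hour (0-23) of the contiguous window with the lowest total load.

    The window may wrap around midnight. Ties resolve to the earliest start hour.
    """
    if len(hourly_load) != 24:
        raise ValueError("expected 24 hourly load values")
    if not 1 <= window_hours <= 24:
        raise ValueError("window_hours must be between 1 and 24")

    def window_total(start: int) -> float:
        return sum(hourly_load[(start + i) % 24] for i in range(window_hours))

    return min(range(24), key=window_total)


# Illustrative averages (e.g. active conversations per hour), index = hour of day.
load = [5, 3, 2, 2, 4, 8, 20, 40, 60, 70, 75, 80,
        78, 76, 70, 65, 60, 50, 40, 30, 20, 15, 10, 7]

start = quietest_window(load, window_hours=3)
print(f"schedule the backup window to start at {start:02d}:00")  # prints 01:00
```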

Validation of the Backups

Simulate DR recovery by restoring the full backup.
Restore the data on a test-bed system and explore the functionality and recovery possibilities.
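
For file-based data (for example, files exported from object storage), part of the test-bed check can be automated by comparing checksums of the original and restored trees. The following is a minimal sketch under that assumption. The function names are hypothetical, and it reads each file fully into memory, so it suits small samples rather than very large files.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_manifests(expected: dict[str, str],
                      restored: dict[str, str]) -> dict[str, list[str]]:
    """Report files that are missing, unexpected, or changed after a restore."""
    common = expected.keys() & restored.keys()
    return {
        "missing": sorted(expected.keys() - restored.keys()),
        "unexpected": sorted(restored.keys() - expected.keys()),
        "changed": sorted(k for k in common if expected[k] != restored[k]),
    }


# Small self-contained demonstration using temporary directories.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    Path(src, "a.txt").write_text("original")
    Path(src, "b.txt").write_text("kept")
    Path(dst, "a.txt").write_text("corrupted")   # restored copy differs
    # b.txt is deliberately absent from the restored tree.
    report = compare_manifests(checksum_manifest(Path(src)),
                               checksum_manifest(Path(dst)))

print(report)
```

An empty report (all three lists empty) means the restored files match the originals. It does not replace functional testing of the restored solution.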