Failover Scenarios

System Failover
Component Failover

System Failover

Normally, the High Availability (HA) setup is configured to forward and serve all user traffic from the Primary system. HAProxy, serving as a load balancer and aggregator in front of both Primary and Secondary, handles failover by automatically shifting traffic to the Secondary system when the Primary system goes down. This traffic shift requires no manual effort. However, by design several services (MRE, Chat Server, and Communication Server) are explicitly stopped on the Secondary system, so when the Primary location fails, the Secondary system has to be prepared for traffic. This is accomplished by running a set of procedures on the Secondary.

All the HA-related scripts mentioned below are located in the <deployment-path>/HAScripts folder.

Renounce scripts are to be run on the machine that you want to shut down.
Takeover scripts are to be run on the machine that you want to be active.

Scenario

Expected

Actions

Primary Down

All traffic should be routed and served at the Secondary System

Execute these scripts on the Secondary system to make it the active (Primary) system:

1) takeover.sh
2) takeover_reporting.sh

Primary Recovers

Secondary System should be renounced in favor of Primary

After the Primary system recovers, follow these procedures to make it active.

On Secondary

1) renounce.sh
2) renounce_reporting.sh

On Primary

Either restart the complete solution using

1) efutils all up

or run

1) takeover.sh
2) takeover_reporting.sh
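The takeover and renounce sequences listed above could be wrapped in a small helper so that the scripts always run in the stated order and the sequence stops at the first failure. This is only a sketch under assumptions: the function name run_ha_scripts and the HA_SCRIPTS_DIR variable are hypothetical and not part of the product; only the script names come from this page.

```shell
#!/usr/bin/env bash
# Sketch: run a list of HA scripts in order, stopping at the first failure.
# run_ha_scripts and HA_SCRIPTS_DIR are hypothetical names used for illustration;
# point HA_SCRIPTS_DIR at <deployment-path>/HAScripts on your machine.

run_ha_scripts() {
  local dir="$1"; shift
  local script
  for script in "$@"; do
    # Refuse to continue if an expected script is not present.
    if [ ! -f "$dir/$script" ]; then
      echo "missing: $dir/$script" >&2
      return 1
    fi
    echo "running: $script"
    # Stop the sequence as soon as one script reports a failure.
    bash "$dir/$script" || { echo "failed: $script" >&2; return 1; }
  done
}

# Example, on the Secondary when the Primary is down:
#   run_ha_scripts "$HA_SCRIPTS_DIR" takeover.sh takeover_reporting.sh
# Example, on the Secondary after the Primary has recovered:
#   run_ha_scripts "$HA_SCRIPTS_DIR" renounce.sh renounce_reporting.sh
```

Stopping at the first failure avoids running the reporting script against a system whose main takeover or renounce step did not complete.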


Component Failover

MongoDB

The scenarios below assume that MongoDB is set up as a 3-node replica set, with the Arbiter running on a separate VM.

No.	If	Expected	Result
1	Primary Mongo instance is down		Failover occurs; R/W operations are served by the Secondary (now elected as Primary).
2	Arbiter is down		The election is halted. Primary continues R/W operations. Secondary serves READONLY operations only.
3	Secondary Mongo instance is down		R/W operations continue from Primary. Once Secondary recovers, it synchronizes data from Primary.
4	Both Primary and Arbiter Mongo instances are down		Secondary serves in READONLY mode.
5	Network link of Secondary is down		Primary continues R/W operations with Secondary disconnected. Secondary will be in READONLY mode.
	Network link of Secondary is restored		Upon link restoration, Secondary automatically synchronizes data from Primary.
6	Network link of Primary is down		Secondary is elected as Primary.
	Network link of Primary is restored		When the link is restored, the Primary with id '0' is given precedence in the election, and Secondary steps down.
8	Election time	10 seconds	The new election takes place within 10 seconds.
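The results in the table follow from the replica-set election rule: a Primary can only be elected or retained while a majority of the voting members can reach each other. The sketch below illustrates that arithmetic for the assumed 3-member set (Primary, Secondary, Arbiter). The function names are hypothetical, and the code does not query MongoDB; it only models the majority check.

```shell
#!/usr/bin/env bash
# Sketch: model the majority rule behind the MongoDB failover table.
# has_majority and describe_replica_set are hypothetical helper names.

# has_majority VOTING_MEMBERS REACHABLE_MEMBERS
# Succeeds when the reachable members form a strict majority of the voters.
has_majority() {
  local total="$1" up="$2"
  local needed=$(( total / 2 + 1 ))
  [ "$up" -ge "$needed" ]
}

# describe_replica_set VOTING_MEMBERS REACHABLE_MEMBERS
# Prints the mode a group of mutually reachable members can operate in.
describe_replica_set() {
  local total="$1" up="$2"
  if has_majority "$total" "$up"; then
    echo "primary available: read/write"
  else
    echo "no primary: read-only"
  fi
}

# Example: Primary and Arbiter both down leaves 1 of 3 members reachable.
#   describe_replica_set 3 1
# Example: only the Arbiter is down, so 2 of 3 members remain reachable.
#   describe_replica_set 3 2
```

With three voting members, any two that can reach each other keep or elect a Primary, which is why an isolated Secondary (one member on its own) stays in READONLY mode.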

In case any of the following Hybrid Chat components fail on primary, the complete VM should be stopped to

Customer Channel Manager
Communication Server
Media Routing Engine
Chat-Server

Active MQ Failover Scenarios

No.	Scenario	Behavior
1	AMQ-1 is down while SITE-A is active	AMQ-2 will take over and all client requests will be processed by the same SITE-A instance because of its higher consumer priority.
2	Both AMQ-1 and SITE-A are down	AMQ-2 and SITE-B will start receiving requests. SITE-B will acquire all agents' XMPP subscriptions and will start processing requests.
3	SITE-A restores	Connector-2 will continue to process requests until connectivity between the Client application and Connector-2 is lost or Connector-2 is down.
4	The link between Connector-1 and Connector-2 is down	Both connector instances serve requests independently.
5	AMQ-1 restores while SITE-A is still down and AMQ-2 is also down	The client will send a request to AMQ-1. AMQ-1 will send the request to SITE-B because SITE-A is down. Request Flow: Client-App-1 → AMQ-1 → SITE-B. Response Flow: SITE-B → AMQ-1 → Client-App-1.
6	SITE-A is down while both AMQ instances are active	The request flow will be the same as in no. 5.
7	SITE-B is down while both AMQ instances are active	AMQ-2 requests will be redirected to AMQ-1 and GC-1 will handle all requests. Request Flow: Client-App-2 → AMQ-2 → AMQ-1 → SITE-A. Response Flow: SITE-A → AMQ-1 → AMQ-2 → Client-App-2.

Load Balancer

HAProxy is deployed as a single instance and is therefore a single point of failure. You may use any load balancer other than HAProxy. However, if High Availability is required at the load balancer level, a cluster of HAProxy-based load balancers must be configured manually.

Rasa Bot Failover

Rasa Bot failover is beyond the scope of HC HA support.

SQL Server Failover

SQL Server failover is beyond the scope of HC HA support. The customer needs to handle SQL Server failover.

Facebook Connector Failover Scenarios

Scenario	Expected Behaviour
CCM is unreachable for Facebook	When CCM is not accessible to Facebook for any reason, Facebook retries a finite number of times and marks it as dead afterward.
Restore link between Facebook and CCM	When the link is restored, the deployment engineer must re-register the CCM webhook in Facebook to receive messages from Facebook. CCM can, however, send enqueued messages to Facebook even without re-registration of the webhook.
Facebook is unreachable for CCM	Facebook messages will continue to arrive via the CCM webhook. However, CCM retries indefinitely until the message is delivered.

Viber Connector Failover Scenarios

Scenario	Expected Behaviour
CCM is unreachable for Viber	When CCM is not accessible to Viber for any reason, Viber retries a finite number of times and marks it as dead afterward.
Restore link between Viber and CCM	When the link is restored, the deployment engineer must re-register the CCM webhook in Viber to receive messages from Viber. CCM can, however, send enqueued messages to Viber even without re-registration of the webhook.
Viber is unreachable for CCM	Viber messages will continue to arrive via the CCM webhook. However, CCM retries indefinitely until the message is delivered.
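The two retry behaviours described in the Facebook and Viber tables (a finite number of retries on the channel side, indefinite retries on the CCM side) can be illustrated with a generic retry loop. This is a sketch of the general pattern only, not the actual implementation in CCM, Facebook, or Viber; the function name retry and its parameters are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: generic retry loop illustrating finite and indefinite retries.
# This is NOT the CCM, Facebook, or Viber implementation; retry is a
# hypothetical helper used only to show the two behaviours.

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# MAX_ATTEMPTS=0 means retry indefinitely until COMMAND succeeds.
retry() {
  local max="$1" delay="$2"; shift 2
  local attempt=1
  until "$@"; do
    # With a finite limit, give up once the limit is reached.
    if [ "$max" -gt 0 ] && [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# Example (deliver_message is a placeholder for any delivery command):
#   retry 5 2 deliver_message   # finite: at most 5 attempts, 2 seconds apart
#   retry 0 2 deliver_message   # indefinite: keeps trying until success
```

A finite limit is what leads to the webhook being marked as dead, which is why re-registration is needed after the link is restored, while indefinite retries mean enqueued outbound messages are eventually delivered.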