Failover Scenarios - High Availability
Failover scenarios
- System Failover
- Component Failover
System Failover
Normally, the High Availability (HA) setup forwards all user traffic to the Primary system, which serves it. HAProxy, acting as a load balancer and aggregator in front of both the Primary and Secondary systems, handles failover: if the Primary system goes down, it automatically shifts the traffic to the Secondary system without any manual intervention. By design, several services (MRE, Chat Server, and Communication Server) are explicitly stopped on the Secondary system, so when the Primary location fails, the Secondary system must first be prepared to take traffic. This is done by running a set of procedures on the Secondary.
All the HA-related scripts mentioned below are located in the <deployment-path>/HAScripts folder.
- Renounce scripts are run on the machine that you want to shut down.
- Takeover scripts are run on the machine that you want to make active.
Scenario | Expected | Actions |
---|---|---|
Primary Down | All traffic should be routed to and served by the Secondary system | Run the following script on the Secondary system to make it the active (Primary) system: 1) takeover.sh |
Primary Recovers | The Secondary system should be renounced in favor of the Primary | After the Primary system recovers, follow this procedure to make it active. On Secondary: 1) renounce.sh. On Primary, either restart the complete solution using 1) efutils all up, or run 1) takeover.sh 2) takeover_reporting.sh |
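
The order of operations in the table above can be sketched as a small lookup. This is only an illustration: the function name `failover_steps` and the `full_restart` flag are hypothetical, and the sketch only lists the commands named in the table; it does not run them.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (machine, command)


def failover_steps(scenario: str, full_restart: bool = False) -> List[Step]:
    """Return the ordered (machine, command) pairs for a scenario.

    `scenario` and `full_restart` are hypothetical names used for this sketch.
    """
    if scenario == "primary_down":
        # Only the Secondary needs action: it takes over the traffic.
        return [("secondary", "takeover.sh")]
    if scenario == "primary_recovers":
        # The Secondary steps back first, then the Primary is brought up.
        steps = [("secondary", "renounce.sh")]
        if full_restart:
            steps.append(("primary", "efutils all up"))
        else:
            steps.append(("primary", "takeover.sh"))
            steps.append(("primary", "takeover_reporting.sh"))
        return steps
    raise ValueError(f"unknown scenario: {scenario}")


if __name__ == "__main__":
    for machine, command in failover_steps("primary_recovers"):
        print(f"{machine}: {command}")
```

Listing the steps as data keeps the ordering explicit: renounce.sh on the Secondary always comes before anything is started on the Primary.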
Component Failover
MongoDB
The scenarios below assume that MongoDB is set up as a 3-node replica set with the Arbiter running on a separate VM.
No. | If | Expected Result |
---|---|---|
1 | Primary Mongo instance is down | Failover occurs; R/W operations are served by the Secondary (now elected as Primary). |
2 | Arbiter is down | The election is halted. Primary continues R/W operations. Secondary serves READONLY. |
3 | Secondary Mongo instance is down | R/W operations continue from Primary. Once Secondary recovers, it synchronizes data from Primary. |
4 | Both Primary and Arbiter Mongo instances are down | Secondary serves in READONLY mode. |
5 | Network link of Secondary is down | Primary continues the R/W operations with Secondary disconnected. Secondary will be in READONLY mode. |
 | Network link of Secondary is restored | Upon link restoration, Secondary automatically synchronizes data from Primary. |
6 | Network link of Primary is down | Secondary is elected as Primary. |
7 | Network link of Primary is restored | When the link is restored, the Primary with id '0' is given precedence in the election, and Secondary steps down. |
8 | Election time | 10 seconds. A new election takes place within 10 seconds. |
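
Most rows in the table above follow from one rule: a replica-set member can hold the Primary role only while it can reach a majority of the voting members. The sketch below is a simplified model of that rule for the three members assumed here. The member labels and the function name `writable_member` are illustrative, and the model does not query MongoDB or cover election timing.

```python
from typing import Optional, Set

# The three voting members assumed in this section.
VOTING_MEMBERS = {"primary", "secondary", "arbiter"}
# Data-bearing members, with the original Primary preferred in elections.
DATA_MEMBERS = ("primary", "secondary")


def writable_member(reachable: Set[str]) -> Optional[str]:
    """Return the member that can accept writes, or None if there is none.

    `reachable` is the group of members that can still see each other.
    """
    majority = len(VOTING_MEMBERS) // 2 + 1  # 2 out of 3
    if len(reachable & VOTING_MEMBERS) < majority:
        return None  # no majority: remaining data members are read-only
    for member in DATA_MEMBERS:
        if member in reachable:
            return member
    return None


if __name__ == "__main__":
    print(writable_member({"secondary", "arbiter"}))  # scenario 1
    print(writable_member({"secondary"}))             # scenario 4
```

For example, with the Primary down the Secondary and Arbiter still form a majority, so the Secondary can take writes; with both the Primary and the Arbiter down, the Secondary alone cannot.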
In case any of the following Hybrid Chat components fail on primary, the complete VM should be stopped to
- Customer Channel Manager
- Communication Server
- Media Routing Engine
- Chat-Server
ActiveMQ Failover Scenarios
No. | Scenario | Behavior |
---|---|---|
1 | AMQ-1 is down while SITE-A is active | AMQ-2 will take over, and all client requests will be processed by the same SITE-A instance because of its higher consumer priority. |
2 | Both AMQ-1 and SITE-A are down | AMQ-2 and SITE-B will start receiving requests. SITE-B will acquire all agents' XMPP subscriptions and will start processing requests. |
3 | SITE-A restores | Connector-2 will continue to process requests until connectivity between the Client application and Connector-2 is lost or Connector-2 is down. |
4 | The link between Connector-1 and Connector-2 is down | Both connector instances serve requests independently. |
5 | AMQ-1 restores while SITE-A is still down and AMQ-2 is also down | The client will send a request to AMQ-1. AMQ-1 will send the request to SITE-B because SITE-A is down. Request Flow: Client-App-1 → AMQ-1 → SITE-B. Response Flow: SITE-B → AMQ-1 → Client-App-1 |
6 | SITE-A is down while both AMQ instances are active | The request flow will be the same as in scenario 5. |
7 | SITE-B is down while both AMQ instances are active | AMQ-2 requests will be redirected to AMQ-1, and GC-1 will handle all requests. Request Flow: Client-App-2 → AMQ-2 → AMQ-1 → SITE-A. Response Flow: SITE-A → AMQ-1 → AMQ-2 → Client-App-2 |
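
Scenarios 2, 3, 6 and 7 suggest that the site currently processing requests keeps doing so until it fails, and that a restored site does not automatically take the work back. The sketch below models only that behavior. It assumes that Connector-2 in scenario 3 refers to the SITE-B instance, the class name `SiteSelector` is hypothetical, and broker (AMQ) failures and client connectivity loss are not modeled.

```python
from typing import Optional, Set


class SiteSelector:
    """Tracks which site processes requests (simplified, hypothetical model)."""

    def __init__(self) -> None:
        self.up: Set[str] = {"SITE-A", "SITE-B"}
        # SITE-A starts as the active site (higher consumer priority).
        self.active: Optional[str] = "SITE-A"

    def site_down(self, site: str) -> None:
        self.up.discard(site)
        if self.active == site:
            # Hand over to a surviving site, or to nobody if none is left.
            self.active = min(self.up) if self.up else None

    def site_restored(self, site: str) -> None:
        self.up.add(site)
        if self.active is None:
            # Nothing was processing requests, so the restored site starts.
            self.active = site


if __name__ == "__main__":
    selector = SiteSelector()
    selector.site_down("SITE-A")      # scenario 2 / 6
    selector.site_restored("SITE-A")  # scenario 3
    print(selector.active)
```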
Load Balancer
HAProxy is deployed as a single instance and is therefore a single point of failure. You may use a load balancer other than HAProxy. However, if High Availability is required at the load balancer level, a cluster of HAProxy-based load balancers must be configured manually.
Rasa Bot Failover
Rasa Bot failover is beyond the scope of HC HA support.
SQL Server Failover
SQL Server failover is beyond the scope of HC HA support. The customer needs to handle SQL Server failover.
Facebook Connector Failover Scenarios
Scenario | Expected Behaviour |
---|---|
CCM is unreachable for Facebook | When CCM is not accessible to Facebook for any reason, Facebook retries for a finite number of times and marks it as dead afterward. |
Restore link between Facebook and CCM | When the link is restored, the deployment engineer must re-register the CCM web-hook in Facebook to receive messages from Facebook. CCM can, however, send enqueued messages to Facebook even without re-registration of the web-hook. |
Facebook is unreachable for CCM | Facebook messages will continue to arrive via the CCM web-hook. However, CCM makes indefinite retries until the message is delivered. |
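
The table above describes two retry policies: a finite number of attempts (the Facebook side) and indefinite retries until delivery (the CCM side). The sketch below shows both with one helper. The function name `deliver_with_retries` and its parameters are hypothetical and do not reflect how CCM or Facebook implement retries.

```python
import time
from typing import Callable, Optional


def deliver_with_retries(send: Callable[[], bool],
                         max_attempts: Optional[int] = None,
                         delay_seconds: float = 0.0) -> bool:
    """Call `send` until it returns True.

    max_attempts=None retries indefinitely; an integer gives up after
    that many attempts and returns False.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if send():
            return True
        time.sleep(delay_seconds)  # wait before the next attempt
    return False


if __name__ == "__main__":
    outcomes = iter([False, False, True])  # fails twice, then succeeds
    print(deliver_with_retries(lambda: next(outcomes)))
```

With a finite `max_attempts`, a `False` return corresponds to the sender giving up; with the default, the call returns only once delivery succeeds.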
Viber Connector Failover Scenarios
Scenario | Expected Behaviour |
---|---|
CCM is unreachable for Viber | When CCM is not accessible to Viber for any reason, Viber retries a finite number of times and marks it as dead afterward. |
Restore link between Viber and CCM | When the link is restored, the deployment engineer must re-register the CCM web-hook in Viber to receive messages from Viber. CCM can, however, send enqueued messages to Viber even without re-registration of the web-hook. |
Viber is unreachable for CCM | Viber messages will continue to arrive via the CCM web-hook. However, CCM makes indefinite retries until the message is delivered. |