Prometheus/Grafana Alerts

Alert Manager

Alert management is also available in the Expertflow Monitoring Solution, where customers can configure their alerts through different channels. Two channels are provided initially, and alerting can be extended to other platforms:

Email
Webhooks (Google Chat, or any other webhook-supported platform)
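
As a rough illustration of the webhook channel, the sketch below builds a plain-text message of the kind a Google Chat incoming webhook accepts (a JSON body with a "text" field) and posts it. The function names, the alert fields, and the webhook URL passed by the caller are placeholders assumed for illustration; they are not part of the Expertflow Monitoring Solution.

```python
import json
import urllib.request


def build_chat_payload(alert_name, severity, summary):
    """Return the JSON body for a simple Google Chat text message."""
    text = f"[{severity.upper()}] {alert_name}: {summary}"
    return {"text": text}


def post_to_webhook(webhook_url, payload, timeout=10):
    """POST the payload to a webhook URL supplied by the caller (hypothetical)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Build a message only; nothing is sent until post_to_webhook is called.
payload = build_chat_payload(
    "HostHighMemory", "warning", "Memory usage above threshold"
)
```

In a real deployment the notification would normally be sent by AlertManager or Grafana itself; a helper like this is mainly useful for checking that a webhook URL works.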

Dashboards

A dashboard is a combination of multiple panels, where each panel plots a unique key value delivered by a given service. Placing multiple panels inside one dashboard lets the end user monitor them simultaneously.

Currently, several types of built-in dashboards are available by default, and end users can add or edit panels as required. Some of these dashboards provide extensive detail on many aspects of the system being monitored.

The following are the dashboard types:

Dashboard Type	Description
Business Dashboards	Give an overview of business-related graphs, which may involve costing and capacity calculation.
Technical Dashboards	Give an overview of the capacity and functionality of the components providing basic services in Expertflow Hybrid-Chat.

Alerting

Alert management is also available in the Expertflow Monitoring Solution, where customers can configure their alerts through different channels. Two channels are provided initially, Email and Google Chat, and alerting can be extended to other platforms such as OpsGenie and PagerDuty. A complete list of these channels is available on the Grafana website.

By default, Expertflow enables standard alerts for the memory and CPU usage of every component. Alerts are also enabled for service-down conditions, which are triggered when a specific service is down and becomes unresponsive.

The components responsible for alerting are:

AlertManager
Grafana (set up only in Graph panels)

Service-related alerts are maintained at the system level using the open-source component ‘AlertManager’, which is part of the Expertflow Monitoring Solution. The Prometheus solution must be restarted whenever the AlertManager configuration is changed.

Alerting via Grafana panels is also available for raising alerts on individual statistics; it requires a supported panel type, i.e. Graph. When a value crosses the threshold configured in the Graph panel, an alert is generated on the configured channel, such as Email or Slack.

Alerts set up in AlertManager

The thresholds/conditions on which the alerts are set up in AlertManager are the following:

Sr. No	Threshold Condition
1	If any target service defined for metrics scraping is down.
2	If any container is down on the Hybrid Chat node(s).
3	If CPU usage on the host exceeds the threshold value.
4	If memory usage on the host exceeds the threshold value.
5	If the host file system storage crosses the threshold value.
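
Condition 1 can be illustrated with the `up` metric that Prometheus records for every scrape target (1 when the last scrape succeeded, 0 when it failed). The sketch below parses a response shaped like the output of the Prometheus instant-query HTTP API for the query `up` and lists the targets that are down. The sample response, job names, and instance addresses are hand-written for illustration and are not taken from an actual Expertflow deployment.

```python
def down_targets(query_response):
    """List (job, instance) pairs whose `up` sample is 0."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for sample in query_response["data"]["result"]:
        # Each instant-vector sample carries [timestamp, "value-as-string"].
        _timestamp, value = sample["value"]
        if float(value) == 0.0:
            labels = sample["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return down


# Hypothetical response to the query `up` (names are illustrative only).
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "job": "node", "instance": "hc-node:9100"},
                "value": [1700000000.0, "1"],
            },
            {
                "metric": {"__name__": "up", "job": "cadvisor", "instance": "hc-node:8080"},
                "value": [1700000000.0, "0"],
            },
        ],
    },
}

targets_down = down_targets(sample_response)
```

Here `targets_down` contains only the `cadvisor` target, because it is the only sample with the value 0.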

Alerts set up in Grafana

Technical Alerts

The thresholds/conditions on which the technical alerts are set up in Grafana are the following:

Sr. No.	Threshold Condition
1	If a container is using more CPU than the threshold value defined.
2	If a container is using more memory than the threshold value defined.
3	If the host file system storage crosses the threshold value.
4	If memory usage on the host is exceeding the threshold value.

Business Alerts

These alerts are set up on specific Graph panels in the Grafana dashboard.

Sr. No.	Threshold Condition
1	If the Inbound Process Duration is exceeding the threshold value.
2	If the Outbound Process Duration is exceeding the threshold value.

Sample Alerting Threshold

Sample alert thresholds in Grafana:

Alerts at Hybrid Chat (Primary & Secondary):

When host memory is exceeding the threshold value.	Alert Threshold: 7 GB
When host file system storage crosses the threshold value.	Alert Threshold: 28 GB
If any container is using more CPU resources than the specified value.	Alert is raised.
If any container is using more memory resources than 3 GB.	Alert is raised.

Alerts at RASA (Primary & Secondary):

When host memory is exceeding the threshold value.	Alert Threshold: 6 GB
When host file system storage crosses the threshold value.	Alert Threshold: 14 GB
If any container is using more CPU resources than the specified value.	Alert is raised.
If any container is using more memory resources than 3 GB.	Alert is raised.

Alerts at Arbitrator:

When host memory is exceeding the threshold value.	Alert Threshold: 6 GB
When host file system storage crosses the threshold value.	Alert Threshold: 14 GB
If any container is using more CPU resources than the specified value.	Alert is raised.
If any container is using more memory resources than 2 GB.	Alert is raised.
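
The memory thresholds in the sample tables above can be expressed as a small lookup and check. This is only a sketch of the comparison logic, not how Grafana evaluates alerts internally; the function name, the container names, and the usage figures in the example are assumed for illustration.

```python
# Memory thresholds copied from the sample tables above (values in GB).
THRESHOLDS_GB = {
    "Hybrid Chat": {"host_memory": 7, "container_memory": 3},
    "RASA": {"host_memory": 6, "container_memory": 3},
    "Arbitrator": {"host_memory": 6, "container_memory": 2},
}


def memory_alerts(node, host_memory_gb, container_memory_gb):
    """Return alert messages for one node.

    container_memory_gb maps a container name to its memory usage in GB.
    """
    limits = THRESHOLDS_GB[node]
    alerts = []
    if host_memory_gb > limits["host_memory"]:
        alerts.append(
            f"{node}: host memory {host_memory_gb} GB exceeds "
            f"{limits['host_memory']} GB"
        )
    # Sort container names so the output order is stable.
    for name, used in sorted(container_memory_gb.items()):
        if used > limits["container_memory"]:
            alerts.append(
                f"{node}: container {name} uses {used} GB, "
                f"limit {limits['container_memory']} GB"
            )
    return alerts


# Hypothetical readings for the Arbitrator node.
alerts = memory_alerts("Arbitrator", 6.5, {"arbitrator": 2.4, "redis": 0.3})
```

With these readings, two alerts are produced: one for host memory and one for the `arbitrator` container.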

Sample Alerts in AlertManager

If any container is down.	Alert is raised
If any of the target services defined for metrics scraping is down.	Alert is raised
If any of the nodes is under high load (for the past 1 minute).	Alert is raised
If any of the nodes (including HC, RASA, and Arbitrator) has memory utilization of more than 85 % of the total memory.	Alert is raised
If any of the nodes (including HC, RASA, and Arbitrator) has storage utilization of more than 85 % of the total storage.	Alert is raised
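
The 85 % rule for memory and storage amounts to a simple ratio check. The sketch below shows the arithmetic, assuming used and total are given in the same unit; the function names and the example figures are illustrative and do not reflect the actual AlertManager rule definitions.

```python
UTILIZATION_LIMIT_PERCENT = 85.0


def utilization_percent(used, total):
    """Return used/total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def over_limit(used, total, limit=UTILIZATION_LIMIT_PERCENT):
    """True when utilization is strictly above the limit."""
    return utilization_percent(used, total) > limit


# Example: 7 GB used out of 8 GB is 87.5 %, which is above the 85 % limit.
high = over_limit(7, 8)
# Example: 6 GB used out of 8 GB is 75 %, which is below the limit.
normal = over_limit(6, 8)
```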