By Guy Fighel
This post is part 1 of 3 in our series, ‘The Art of Structuring Alerts’
The consequence of letting alert levels spin out of control
There is a well-documented case of alert fatigue from an institution in a completely different field. In 2013, the Boston Medical Center was experiencing a higher level of deaths due to mistakes in medical processes followed during hospital stays and visits. Their investigation traced a number of those deaths to the desensitization of nurses to alerts. Hospitals use monitoring and alert systems similar to those in technical operations environments, and in a hospital environment nurses are the equivalent of our NOC engineers or SREs, depending on their level of experience. The nurses’ stations were being constantly bombarded by alerts, many of them duplicates or of low urgency. Naturally, with little ability to sort a very high volume of alerts by importance or to eliminate unnecessary ones, nurses started to “suppress” most of them. Important alerts related to true medical emergencies were lost in the mix and ignored, with dire consequences. The only way to resolve this problem was to put in place a system of intelligent alert management. That system was able to focus only on what mattered, and the medical issues went away. We believe that this case and the lessons learned from the fix are directly relevant to the world of technical operations.
How did we get so many alerts?
Alerts are an essential part of the site reliability engineer’s toolkit. If implemented effectively, they directly reduce the mean-time-to-remediation (MTTR) of production issues. However, too many alerts can overwhelm SREs and have the opposite effect: they start functioning as red herrings and become more of a distraction.
Alert volume spikes whenever a monitoring application is introduced. When a team installs or internally develops an application performance management (APM) system, it will typically set a number of alerts throughout that system so that it can react faster to specific changes in the environment. The same process is repeated for the tools that monitor networking, infrastructure, security, databases, ISPs, CDNs, cloud services, users, revenue metrics and so on. Each monitoring application is configured individually to generate alerts focused on the part of the system it tracks. Typically, alerts are repeated every few minutes until acknowledged. That’s a lot of alerts in and of itself, and they all relate to the same system. What happens when a problem in one part of the system reverberates through other parts, creating a chain reaction? All of a sudden, alerts related to the same issue are fired by all the individual tools, each monitoring only a portion of the system, creating a lot of noise and effectively masking the real cause of the issue.
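To make the "repeated every few minutes until acknowledged" problem concrete, here is a minimal sketch of suppressing repeats of the same alert. The `Alert` fields, the `deduplicate` function and the 10-minute window are assumptions chosen for illustration, not the behavior of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All field names are hypothetical; real tools expose different schemas.
    source: str       # monitoring tool that fired the alert
    check: str        # name of the check or rule
    host: str         # affected host or service
    timestamp: float  # seconds since some reference point


def deduplicate(alerts, repeat_window=600.0):
    """Keep an alert only if no alert with the same (source, check, host)
    was kept during the previous `repeat_window` seconds."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.check, alert.host)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= repeat_window:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


if __name__ == "__main__":
    alerts = [
        Alert("apm", "low_apdex", "web-1", 0.0),
        Alert("apm", "low_apdex", "web-1", 300.0),   # repeat, suppressed
        Alert("apm", "low_apdex", "web-1", 600.0),   # window elapsed, kept
        Alert("cloudwatch", "high_cpu", "web-1", 300.0),
    ]
    for alert in deduplicate(alerts):
        print(alert)
```

Deduplication like this only removes exact repeats from one tool; it does nothing about different tools reporting the same underlying problem.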
Here is an example: let’s say that you have a cluster running on AWS or Google Cloud that is part of an autoscaling group. The application sits behind a load balancer and receives external HTTP requests. Suddenly, your application receives much more traffic and the load rises. Assuming you have set your scaling policy correctly, the application will scale and no issues will come of it. But what if, due to human error, you didn’t set the scaling policy correctly? What will happen?
- You will start getting Low Apdex alerts from your APM system
- You will start getting CPU/Memory alerts from your hosts
- You will receive CloudWatch alerts from the Load Balancer with lots of 5xx responses
- If you have set an external health check on your app, such as Pingdom checks, you will receive alerts from that system on increased load time or failed responses
- Your application logs might fill up with errors as your application cannot keep up with the load and the error rate rises. This means that your free disk space is going to shrink, so another alert will come from the hosts whose disks are filling up
- Since your system will write many more logs, you could also receive Disk I/O alerts
- Any custom health checks or custom monitoring checks will fail since the application is no longer responding, so you get even more alerts on custom check failures
- If the application is connected to a DB or a cache layer, and that layer is also not scalable, pushing more requests without any circuit breakers in place will push the load onto the DB layer. In that case, more alerts will be received on slow queries or high load
All of that alert activity is generated not because of a load issue, but because someone didn’t configure the scaling policy correctly…
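The chain reaction above can be tamed somewhat by grouping alerts that arrive close together in time into a single incident. The following is a minimal sketch under stated assumptions: the `Alert` tuple, the `group_into_incidents` function, the sample timestamps and the 120-second gap are all hypothetical, and time proximity is only a rough stand-in for real correlation.

```python
from typing import NamedTuple


class Alert(NamedTuple):
    # Hypothetical minimal alert record for illustration.
    timestamp: float  # seconds since some reference point
    source: str       # tool that fired the alert
    message: str


def group_into_incidents(alerts, gap=120.0):
    """Group alerts into incidents. A new incident starts whenever more than
    `gap` seconds pass between two consecutive alerts."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if incidents and alert.timestamp - incidents[-1][-1].timestamp <= gap:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


if __name__ == "__main__":
    alerts = [
        Alert(0.0, "apm", "Low Apdex"),
        Alert(30.0, "hosts", "High CPU"),
        Alert(45.0, "cloudwatch", "Load balancer error rate"),
        Alert(90.0, "pingdom", "Load time increase"),
        Alert(200.0, "hosts", "Disk space low"),
        Alert(3600.0, "db", "Unrelated slow query"),
    ]
    for number, incident in enumerate(group_into_incidents(alerts), start=1):
        sources = ", ".join(a.source for a in incident)
        print(f"incident {number}: {len(incident)} alerts from {sources}")
```

Note that in the scenario above the misconfigured scaling policy itself never fires an alert, so grouping reduces the noise but does not by itself reveal the cause.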
Alerts also have a tendency to creep in over time. A one-time production issue may lead someone to design an alert specifically to prevent that issue from happening again, based on a threshold or context that made sense for the issue that had just happened. After a while, though, the issue never comes back because it was a one-time event, but the alert is still there, flagging conditions that are perfectly normal. It has become a false positive, and it creates confusion because nobody remembers what it was supposed to flag.
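One way to catch this kind of creep is to periodically review which alert rules fire often but rarely lead to any action. The sketch below assumes you can export a history of firings tagged with whether someone acted on them; the `find_noisy_rules` function, the rule names and the thresholds are hypothetical.

```python
from collections import defaultdict


def find_noisy_rules(firings, min_firings=5, max_actionable_ratio=0.1):
    """Return rules that fired at least `min_firings` times while at most
    `max_actionable_ratio` of those firings led to action.

    `firings` is an iterable of (rule_name, was_actionable) pairs.
    The result is a list of (rule_name, total_firings, actionable_ratio),
    ordered from most to least frequent.
    """
    totals = defaultdict(int)
    actionable = defaultdict(int)
    for rule, was_actionable in firings:
        totals[rule] += 1
        if was_actionable:
            actionable[rule] += 1

    noisy = []
    for rule, total in totals.items():
        ratio = actionable[rule] / total
        if total >= min_firings and ratio <= max_actionable_ratio:
            noisy.append((rule, total, ratio))
    return sorted(noisy, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    history = (
        [("queue_depth_gt_100", False)] * 20   # legacy alert, never acted on
        + [("disk_full", True)] * 3            # rare but always actionable
        + [("high_cpu", True)] * 4
        + [("high_cpu", False)] * 4
    )
    for rule, total, ratio in find_noisy_rules(history):
        print(f"{rule}: fired {total} times, {ratio:.0%} actionable")
```

Rules that show up in such a report are candidates for review, not automatic deletion: someone still has to decide whether the threshold is stale or the alert is simply being ignored.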
Impact of alert overload
Fortunately, alert overload in technical operations will not cause deaths as it did in the case of the Boston Medical Center. But it has a very real, direct economic impact.
Too much noise makes the system more complex to maintain. Typically, the MTTR increases because the root cause of issues is harder to find, and the mean-time-to-failure (MTTF) decreases because teams spend more time in reactive mode than on proactive work. Those two metrics directly affect system availability and performance, and for every minute of downtime there is typically a direct loss in revenue.
Alert fatigue is also a very real thing. We will discuss in a future post the human impact on teams’ morale and stress, and by extension the impact on the company.
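The link between these two metrics and availability can be illustrated with the standard steady-state formula, availability = MTTF / (MTTF + MTTR), which treats MTTR as the mean time to restore service. The sketch below is illustrative only: the function names and the MTTF and MTTR figures are hypothetical, not measurements from any real system.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def yearly_downtime_hours(mttf_hours, mttr_hours):
    """Expected hours of downtime per year implied by the availability."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical team with low alert noise: rare failures, fast recovery.
    quiet = (720.0, 1.0)
    # Hypothetical team under alert overload: more failures, slower recovery.
    noisy = (360.0, 4.0)

    for label, (mttf, mttr) in (("quiet", quiet), ("noisy", noisy)):
        print(
            f"{label}: availability {availability(mttf, mttr):.4%}, "
            f"about {yearly_downtime_hours(mttf, mttr):.1f} hours down per year"
        )
```

Multiplying the yearly downtime by your own revenue per hour gives a rough estimate of what alert noise costs, under the assumption that the longer MTTR and shorter MTTF really are caused by the noise.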
How to deal with the issue of alert overload?
An intelligent approach to the design of an alert system is the first step towards getting false positives back under control. Different machine learning classification techniques might help with this problem, but there is no real silver bullet here. Many good statistical models and approaches exist, and they all give you a toolset for generating rules based on understanding and learning from your system’s output. At the end of the day, algorithms and approaches like machine learning and deep learning are all quite important, but they are only tools in a box. These tools, human-curated, determine outcomes and can be used to train the system to generate a more accurate picture.
One of the key areas that machine intelligence based alert systems focus on is precisely assessing and classifying events. That means flagging events as critical, but also having the flexibility to handle non-critical events that, if dealt with early, can be prevented from becoming critical. All of this, of course, needs to be done within an environment of superb UX that gives a readily digestible visualization of events and also improves response times through well-composed alert notifications. An important aspect of machine intelligence generated alerts is that they are flexible and can continue to be tweaked for even more precision. This is especially important because production environments are not static entities.
Taken together, machine intelligence driven smart alert systems will increase your ability to reduce the noise. Reduced noise allows people to focus on the root cause, saving time and creating much greater work satisfaction and control, not just in DevOps but throughout the whole organization.
Fighting Fire with Machine Intelligence
We have come to a juncture in technology development where we need a better way of managing and controlling alert systems. Fault management, the identification of and response to abnormal conditions, is a major component of the human’s role as supervisory controller of dynamic systems. A machine intelligence toolset is a leap forward in alert system design and configuration. It is a way to truly dampen the noise while prioritizing the important alerts. Without it, as our production environments become ever more distributed and complex, we may find that we are drowning in false positives.
In the next post in this series, we’ll start to take a look at the art of structuring alerts and how to test them out.