This is part two of our three-part series on creating a robust and effective alerting system for DevOps teams. The first post looked at the burden of false positives and how to avoid them. This article goes a step further and looks at how to structure alerts and test the system design.
As a baseline, pages and alarms should only sound for events that are urgent, important, and actionable. By verifying the importance of each alert, you can eliminate alert fatigue and ensure that every page is quickly investigated, acted upon, or fine-tuned. The result is increased uptime and, ultimately, a happier on-call team.
It’s All About Timing and Optimization: 5 Tips to Alert Success
An effective monitoring, metrics, and alert system is one of the fundamental tools of an efficient DevOps operation. When you are working with small, iterative, and often fast-to-production releases, alerts on problems become a key requirement for maintaining the production environment. Alert systems are the heartbeat of the entire operation; without them, downtime will persist. Designing an alert system to be optimal, with minimal false positives, is the key to that effectiveness.
Here are my top 5 tips for ‘effective alerts by design’ (EAbD):
#1 The mindfulness of alerts: When an alert is pushed out to an email system or third party platform it can end up being missed. Instead of just passing the alert off, keeping it within workflow control will make sure it stays visible and has a valid lifecycle.
#2 Defcon 5 – Keeping it subcritical: Managing alerts will give you the control you need to focus on the important ones and not go off chasing unicorns. Write sub-critical rules for your system. These will be specific to your production environment; an example might be that your database is close to capacity. The rules can also be prioritized, and alerts sent based on that priority. It means you don’t have to react to sub-critical events at 2 am.
#3 Fixing the symptom, not the cause: Keeping alerts consistent, even when the underlying architecture changes, is the art of the possible if you page on the symptoms, rather than the cause. It is easier to capture problems using user-facing symptoms or other dependable services.
#4 Keep it Simple Simon (KISS): Create scope-aware alerts. This will allow you to combine variables, so that instead of two or more alerts for one object, you get a single alert. For example, if alerts on disk usage are split into forecast and usage levels, combine the two into a single alert.
#5 Putting false alarms to good use: When you do get false alarms, put them to work by using them as a basis for tightening up the alert condition or removing it from the paging list.
When designing your effective alert system, using an expressive language rather than simple object/value UI widgets is key: it provides more flexibility and reduces errors.
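As a rough illustration of tips #2 and #4, the sketch below expresses a scope-aware disk rule in code. It combines the usage level and the forecast into a single alert per disk, and only pages when the condition is already critical. The names (`DiskSample`, `evaluate_disk`), the 90% limit, and the 24-hour forecast horizon are all assumptions made for this example, not values from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskSample:
    host: str
    mount: str
    used_fraction: float      # current usage, 0.0 to 1.0
    growth_per_hour: float    # recent growth in used fraction per hour


def evaluate_disk(sample: DiskSample,
                  usage_limit: float = 0.90,      # assumed threshold
                  forecast_hours: float = 24.0    # assumed horizon
                  ) -> Optional[dict]:
    """Return at most one alert per disk, combining usage and forecast."""
    reasons = []
    if sample.used_fraction >= usage_limit:
        reasons.append(f"usage at {sample.used_fraction:.0%}")

    if sample.growth_per_hour > 0:
        hours_to_full = (1.0 - sample.used_fraction) / sample.growth_per_hour
        if hours_to_full <= forecast_hours:
            reasons.append(f"forecast full in {hours_to_full:.1f}h")

    if not reasons:
        return None

    # Page only when the disk is already over the limit; a forecast on
    # its own is treated as sub-critical and routed to a ticket instead.
    severity = "page" if sample.used_fraction >= usage_limit else "ticket"
    return {
        "scope": f"{sample.host}:{sample.mount}",
        "severity": severity,
        "summary": "; ".join(reasons),
    }


if __name__ == "__main__":
    print(evaluate_disk(DiskSample("db1", "/data", 0.95, 0.01)))
    print(evaluate_disk(DiskSample("db1", "/data", 0.50, 0.05)))
```

The first sample yields one page that carries both reasons, while the second yields a ticket that can wait until morning.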
Extending the Structure of Alerts
The art (and science) of alerts extends to their structure too. Human-curated alert rules should be the baseline on which your structure depends. Designing the structure of an alert comes down to some basic prerequisites, including:
- Natural grouping of environmental components
- The correct aggregation
- The correct information attached to the alert to give the most detail
- Combining metrics to simplify alerts, whilst maintaining maximum detail
- Using Boolean conditions such as negative events (look for things that are NOT happening, which might lead to a problem)
- Avoiding fixed alerts – give yourself flexibility and build in historical context to provide predictive analysis
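The last two points can be sketched in a few lines. The example below shows one possible way to detect a negative event (an expected signal that has not arrived) and to derive a threshold from historical values rather than hard-coding one. The function names and the three-sigma default are assumptions for illustration; a real system might use seasonality-aware baselines instead.

```python
from statistics import mean, stdev
from typing import Sequence


def is_silent(last_seen_ts: float, now_ts: float, max_gap_seconds: float) -> bool:
    """Negative event: True when an expected signal has NOT arrived in time."""
    return (now_ts - last_seen_ts) > max_gap_seconds


def exceeds_baseline(history: Sequence[float], current: float,
                     sigmas: float = 3.0) -> bool:
    """Compare the current value with a threshold derived from history."""
    if len(history) < 2:
        return False  # not enough historical context to judge
    threshold = mean(history) + sigmas * stdev(history)
    return current > threshold


if __name__ == "__main__":
    # A heartbeat last seen 700 seconds ago, with a 600 second allowance
    print(is_silent(last_seen_ts=1000, now_ts=1700, max_gap_seconds=600))
    # Requests per second compared with the recent baseline
    print(exceeds_baseline([100, 102, 98, 101, 99], current=120))
```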
Fine Tuning Alerts
Like all good ideas, your efficient alert system needs to be tested. Simulation is a good place to start: create simulation rules based on previous events. The goal is to reduce the noise, because less noise means the alerts that remain are more likely to be relevant. Simulation and noise reduction are not one-off events. You need to keep carrying out these exercises, fine-tuning your alerts until you have the most meaningful ones. And of course, reviews should be periodic, as environments change. I also suggest making it a weekly habit, before the weekend starts, to review what the false positive ratio was during the previous week. Spending an hour on tuning before the weekend can save you and your team a great headache during the on-call weekend shift.
Similarly, paging events should be reviewed, including those ignored by administrators – this data can help you to refine rules to prevent false positives.
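A simulation of this kind can be as simple as replaying a candidate rule over labelled past events and counting what it would have done. The sketch below is a minimal version of that idea; `PastEvent`, `simulate`, and the sample data are hypothetical, and the `was_incident` label is assumed to come from your own post-incident or paging review.

```python
from typing import Callable, Iterable, NamedTuple


class PastEvent(NamedTuple):
    value: float         # metric value recorded at the time
    was_incident: bool   # label added during review


def simulate(rule: Callable[[float], bool], events: Iterable[PastEvent]) -> dict:
    """Replay a rule over past events and summarize how noisy it would be."""
    fired = true_positives = missed = 0
    for event in events:
        if rule(event.value):
            fired += 1
            if event.was_incident:
                true_positives += 1
        elif event.was_incident:
            missed += 1  # a real incident the rule would not have caught
    false_positives = fired - true_positives
    return {
        "fired": fired,
        "false_positives": false_positives,
        "missed": missed,
        "false_positive_ratio": false_positives / fired if fired else 0.0,
    }


if __name__ == "__main__":
    last_week = [
        PastEvent(95, True), PastEvent(85, False), PastEvent(70, False),
        PastEvent(92, True), PastEvent(81, False),
    ]
    print(simulate(lambda v: v > 80, last_week))  # looser candidate rule
    print(simulate(lambda v: v > 90, last_week))  # tighter candidate rule
```

Comparing the two summaries side by side shows whether tightening a rule removes noise without missing real incidents, which is the trade-off the weekly review is meant to check.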
Fine Tuning Top Tips
Here are some tips for fine tuning your alerts so you can make sure they’re spot on, and as effective as possible.
#1 A Rule of rules: Alerts that are less than 50% accurate are broken; rules with a false positive rate at or below a 10% threshold are OK to go.
#2 A page too far: Get rid of extraneous paging events. If a page fires and investigation shows nothing wrong, adjust the rule.
#3 The rise of the machine: Machine learning is perfectly placed to optimize alerts. Use human-curated rules, enhanced with machine learning algorithms, to create rules and fine-tune alerts.
#4 Repeat business: Take regular events, such as backups, into account when fine-tuning rules. If you have known maintenance going on, suppress alerts associated with that.
#5 Keeping Control: Set metrics for the on-call team, such as the number of pages, and limit them to a set amount through review, differentiating between originating events and the events triggered by them.
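Tips #1 and #4 lend themselves to a small sketch. The code below shows one possible reading of the accuracy thresholds, where accuracy is taken to mean the share of fired pages that turned out to be actionable, plus a simple check for suppressing alerts during known maintenance. The function names, the "needs-tuning" middle band, and the window format are assumptions made for this example.

```python
from datetime import datetime
from typing import Iterable, Tuple


def classify_rule(pages_fired: int, pages_actionable: int) -> str:
    """Grade a rule from its paging history (one reading of tip #1)."""
    if pages_fired == 0:
        return "no-data"
    false_pages = pages_fired - pages_actionable
    if pages_actionable * 2 < pages_fired:    # less than 50% accurate
        return "broken"
    if false_pages * 10 <= pages_fired:       # at most 10% false positives
        return "ok"
    return "needs-tuning"                     # assumed middle band


def in_maintenance(ts: datetime,
                   windows: Iterable[Tuple[datetime, datetime]]) -> bool:
    """True when ts falls inside a known maintenance or backup window."""
    return any(start <= ts < end for start, end in windows)


if __name__ == "__main__":
    print(classify_rule(pages_fired=20, pages_actionable=8))
    print(classify_rule(pages_fired=20, pages_actionable=19))

    backup_window = [(datetime(2018, 7, 21, 2, 0), datetime(2018, 7, 21, 3, 0))]
    fired_at = datetime(2018, 7, 21, 2, 30)
    if in_maintenance(fired_at, backup_window):
        print("suppressed: inside a known maintenance window")
```

Integer comparisons are used for the thresholds so that boundary cases, such as exactly 10% false positives, are not affected by floating-point rounding.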
Why It Pays to Structure Alerts Properly
The art of creating effective alert systems is down to using an intelligent approach. Keeping things simple, combining variables, and dampening down noise, coupled with prudent and mindful testing, will naturally result in improved alerts. Adding into that mix machine learning based on human curation will allow you to develop an optimized alert system that works for you, rather than against you.
In our final post in the series, we will look at improving alert response times. Response times are part of the critical path to a robust production system and are one of the key factors in an efficient support operation. Slow response times result in downtime, which ends up in lost time, money, and often customers too. Stay tuned.