Guy Fighel

Co-Founder & CTO at SignifAI
Guy Fighel is the Co-founder and CTO of SignifAI. He's accumulated 20 years of experience in system & software architecture, applied AI & ML and DevOps practices and has been involved in leading the development of highly scalable, global software solutions in international companies, such as Tango and Vonage. He has also overseen the development of more than 20 patents.

In our previous two posts on the art of designing an effective alert system, we discussed how to reduce false positives and how to create structured alerts that can be optimized through intelligent test protocols. In this final post of the series, we will look at improving alert response times.

Response times are part of the critical path to a robust production system and are one of the key factors in an efficient support operation. Slow responses prolong downtime, which costs time, money, and often customers too.

There are a number of methods that, when combined, help you improve response times. Here we’ll take a look at some top tips for improving response times through an intelligent approach to notifications.

Writing Effective Alerts – Things to Consider

Our first method to reduce response times is to generate well-written alerts. If your alert notifications give your support team a clear understanding of an issue, the team can prioritize important alerts and focus on them. Here are my top tips for writing alert notifications that your teams will respond to:

Tip #1 Clear and present danger: When creating your alert notification template, make sure you write a very clear description so that your team, even bleary-eyed in the middle of the night, can understand the importance level.

Tip #2 In summary…: Set up a focused summary of all the possible cause-based rules. It needs to be precise and brief – enough to allow the reader to skim and spot if the cause is already a known issue.

Tip #3 Do the work for them: If you know that certain alerts will require certain information to resolve the issue, add that information to the body of the alert for reference. Links to your favorite knowledge management system or runbook will be highly appreciated by the on-call person.

Tip #4 No such thing as too much info: Add supporting information, such as internal wiki links and earlier ticket responses, that is pertinent to the notification.
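
To make the four tips above concrete, here is a minimal sketch of an alert notification template in Python. It is only an illustration: the `AlertNotification` class, its field names, and the `render` function are hypothetical and not taken from any particular alerting tool, and the URL in the usage note is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class AlertNotification:
    """Hypothetical alert payload; the field names are illustrative only."""
    title: str      # Tip #1: a clear description of what is wrong
    severity: str   # Tip #1: the importance level, stated up front
    summary: str    # Tip #2: a brief summary of the likely cause
    resolution_hints: list[str] = field(default_factory=list)  # Tip #3
    links: list[str] = field(default_factory=list)  # Tips #3 and #4


def render(alert: AlertNotification) -> str:
    """Render the alert as plain text, most important information first."""
    lines = [
        f"[{alert.severity.upper()}] {alert.title}",
        f"Summary: {alert.summary}",
    ]
    if alert.resolution_hints:
        lines.append("What you will need:")
        lines.extend(f"  - {hint}" for hint in alert.resolution_hints)
    if alert.links:
        lines.append("References:")
        lines.extend(f"  - {link}" for link in alert.links)
    return "\n".join(lines)
```

For example, `render(AlertNotification("Disk usage above 90% on db-1", "critical", "Known cause: log rotation stalled", links=["https://wiki.example.com/runbooks/disk-full"]))` puts the severity and title on the first line, so the on-call person sees the importance level before anything else.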

How to Use Policies to Improve Alert Response Times

Preparing notifications using pre-completed information is one way to improve response times. Another is to apply policy settings to notification generation. My top four tips for policies that help to cut response times are:

Tip #1 Be negative: Allow your teams to negatively acknowledge alerts, that is, to explicitly decline an alert they cannot handle. This creates a domino effect that moves the alert on to the next person.

Tip #2 Be a thoughtful fighter: The fighter pilot John Boyd developed a framework for thinking known as OODA, or ‘observe, orient, decide and act’. It is a process-based method of thinking about a problem, and your on-call engineers can apply it effectively when responding to alerts.

Tip #3 Aggregation for information: Aggregate related incoming incidents into a single incident thread. This makes notifications clearer and easier to relate to one another.

Tip #4 Silence of the logs: Silence redundant alarms and log them.
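
The policies in Tips #1, #3 and #4 can be sketched in a few lines of Python. Treat this as an assumption-laden illustration, not a reference implementation: the `Escalation` and `IncidentThreads` classes are hypothetical, alerts are assumed to be related when they share a service name, and an alarm is assumed to be redundant when the same message repeats for that service.

```python
from collections import defaultdict


class Escalation:
    """Walks an on-call chain; a negative acknowledgment moves to the next person."""

    def __init__(self, chain: list[str]):
        if not chain:
            raise ValueError("escalation chain must not be empty")
        self.chain = chain
        self.position = 0

    @property
    def current(self) -> str:
        return self.chain[self.position]

    def nack(self) -> str:
        """The current responder declines; hand the alert to the next person.

        Assumption: the alert stays with the last person once the chain ends.
        """
        if self.position < len(self.chain) - 1:
            self.position += 1
        return self.current


class IncidentThreads:
    """Groups related alerts into one thread and silences exact repeats."""

    def __init__(self) -> None:
        self.threads: dict[str, list[str]] = defaultdict(list)
        self.silenced_log: list[tuple[str, str]] = []

    def ingest(self, service: str, message: str) -> bool:
        """Return True if the alert should notify, False if it was silenced."""
        if message in self.threads[service]:
            # Redundant alarm: do not notify again, but keep a record of it.
            self.silenced_log.append((service, message))
            return False
        self.threads[service].append(message)
        return True
```

With a chain of `["alice", "bob"]`, a call to `nack()` while Alice holds the alert hands it to Bob. Ingesting the same message twice for one service notifies once and records the repeat in `silenced_log`.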

Protocols and Response Time Optimization

Clear outlines and sound policies improve response times, and you can improve them further by choosing the right channel protocol to carry those alerts. The channels used for communicating alerts vary in their immediacy. Channel protocol choices, ordered from least to most responsive (and annoying), are:

  • Dashboard
  • Email
  • Chat (Slack/HipChat etc.)
  • SMS/Page
  • Phone call

The protocol you choose during configuration affects the response time. Good testing helps you balance the level of response and the response time you need against the burden and annoyance of a false alarm. Test against the history of the alert and start with a more passive protocol for communicating it. If the alert ‘behaves’ and doesn’t generate false positives, it can move up the protocol ladder to the next channel level.
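
Moving an alert up the protocol ladder based on its history might look like the following Python sketch. The `next_channel` function, the minimum sample size, and the 5% false-positive threshold are assumptions chosen for illustration; tune them against your own alert history.

```python
# Channels from least to most immediate, mirroring the list above.
CHANNEL_LADDER = ["dashboard", "email", "chat", "sms_page", "phone_call"]


def next_channel(
    current: str,
    history: list[bool],
    min_samples: int = 20,                  # assumed evidence threshold
    max_false_positive_rate: float = 0.05,  # assumed tolerance
) -> str:
    """Return the channel an alert should use after reviewing its history.

    `history` holds one entry per past firing: True if it was a false positive.
    """
    index = CHANNEL_LADDER.index(current)
    if len(history) < min_samples:
        return current  # not enough evidence yet, so stay on this channel
    false_positive_rate = sum(history) / len(history)
    at_top = index == len(CHANNEL_LADDER) - 1
    if false_positive_rate <= max_false_positive_rate and not at_top:
        return CHANNEL_LADDER[index + 1]  # the alert 'behaves': promote it
    return current
```

An alert on email with twenty clean firings would move up to chat, while one with too few firings or too many false positives stays where it is.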

Alert Taxonomy

Taxonomy is all about classification, and classification is all about making things easier to understand, allowing us to see patterns and shared characteristics.

Alerts can be classified into several areas:

  1. Severity levels
  2. Alert states
  3. Alert notification criticality
  4. A miscellaneous sub-set that classifies unactionable alerts

A coherent alert taxonomy gives you a framework for applying policies and protocols, so that your overall alert system is well designed and effective, with response times that optimize system uptime. It ensures teams respond to the right alert at the right time.
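
One way to express such a taxonomy in code is with enumerations, sketched below in Python. The four classification areas come from the list above, but the specific values (`INFO`, `OPEN`, `HIGH` and so on), the `ClassifiedAlert` class, and the `should_notify` rule are all hypothetical examples, not a prescribed scheme.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):  # illustrative severity levels
    INFO = 1
    WARNING = 2
    CRITICAL = 3


class State(Enum):  # illustrative alert states
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


class Criticality(Enum):  # illustrative notification criticality
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class ClassifiedAlert:
    name: str
    severity: Severity
    state: State
    criticality: Criticality
    actionable: bool = True  # False marks the unactionable sub-set


def should_notify(alert: ClassifiedAlert) -> bool:
    """Example policy: notify only for open, actionable, serious alerts."""
    return (
        alert.actionable
        and alert.state is State.OPEN
        and alert.severity.value >= Severity.WARNING.value
        and alert.criticality is Criticality.HIGH
    )
```

Once every alert carries these classifications, a policy such as `should_notify` becomes a simple rule over the taxonomy instead of a per-alert special case.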

The three posts in our series have been written as a basis for you to build an effective alert system. The art (and science) of creating a robust, effective and efficient alert system is fundamental to a successful DevOps operation. These posts are a collection of our knowledge and experience in working within a highly scaled production environment with expectations of optimized uptime. Using these posts as a basis for your own optimization exercises will give you insight into the areas that are the most important in the alert system design process. We believe that a well-designed alert system will give you a well-oiled reliability engineering team, resulting in improved uptimes, lower management costs, and increased team member retention. What’s not to like about the art of structuring alerts?