Guy Fighel

Co-Founder & CTO at SignifAI
Guy Fighel is the Co-founder and CTO of SignifAI. He's accumulated 18+ years of experience in system & software architecture and DevOps practices and has been involved in leading the development of highly scalable, global mobile software solutions in international companies, such as Tango and Vonage. He has also overseen the development of more than 16 patents.
In an earlier post this week, we introduced SignifAI Decisions – the most flexible, intuitive correlation engine for SRE and DevOps teams. Beyond activating automatically generated logic and building basic Decisions to reduce alert noise, there is a whole host of features that help create stronger correlations and give you deeper insight into your production system. Let’s explore a few of them…

Consolidate events with anomaly detection
In the advanced mode of SignifAI’s Decision builder, you can specify a timeframe and minimum number of incidents for a Decision. The timeframe is the maximum amount of time allowed between incoming incidents for them to be correlated by the logic you specify in the builder. Specifying a shorter timeframe for broader and more generic logic and a longer timeframe for very narrow logic can help ensure the accuracy and relevance of your correlations.

You can also specify the minimum number of incidents that need to match the Decision logic before being correlated. This is useful in cases where a large number of incoming incidents, or a spike in the average incident volume, would indicate a correlation – for example, in the event of a datacenter outage, when several outage or unresponsive incidents could be created for totally separate applications that are all dependent on the same datacenter.
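
To make the two settings concrete, here is a minimal Python sketch of how a timeframe and a minimum incident count could work together. It is an illustration under stated assumptions, not SignifAI's implementation: the function name `correlate_by_window` and the sample timestamps are hypothetical, and the sketch assumes the incidents have already matched the rest of the Decision logic.

```python
from datetime import datetime, timedelta

def correlate_by_window(timestamps, max_gap, min_incidents):
    """Group incident timestamps into bursts (hypothetical helper).

    Consecutive incidents stay in the same group while the gap between
    them is at most max_gap. Groups with fewer than min_incidents
    members are dropped, so isolated incidents are not correlated.
    """
    groups = []
    current = []
    for ts in sorted(timestamps):
        # A gap larger than the timeframe closes the current group.
        if current and ts - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(ts)
    if current:
        groups.append(current)
    return [g for g in groups if len(g) >= min_incidents]

# Made-up arrival times, in seconds after an arbitrary start.
base = datetime(2018, 1, 1, 12, 0, 0)
incident_times = [base + timedelta(seconds=s) for s in (0, 20, 45, 70, 600, 1500, 1510)]

# Timeframe of 60 seconds, at least 3 incidents per correlation.
correlated = correlate_by_window(incident_times, timedelta(seconds=60), 3)
print([len(group) for group in correlated])
```

With these sample values, only the first burst of four incidents is correlated. The lone incident at 600 seconds and the pair at 1500 seconds fall short of the minimum count.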

Change Issue priority based on a correlation event
Individual incidents usually have a priority specified by their source – for example, outage events may be classified as “critical,” while minor blips in performance could be “medium.” However, when several incidents are correlated, the meaning of the event could change. In cases like these, you may want the priority of the correlated Issue to be automatically updated in order to reflect this insight. For example, a spike in low-priority incidents could mean an underlying, higher-priority problem. In the Decision builder, you can specify the priority of Issues that are correlated based on your logic – by default, SignifAI uses the highest priority among the incoming incidents.
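
As a rough sketch of that behavior, the Python snippet below picks the highest priority among the correlated incidents unless an override is given. The priority scale in `PRIORITY_ORDER` and the function `issue_priority` are assumptions made for illustration. They are not part of SignifAI's API.

```python
# Assumed priority scale, ordered from least to most severe.
PRIORITY_ORDER = ["low", "medium", "high", "critical"]

def issue_priority(incident_priorities, override=None):
    """Return the priority for a correlated Issue (hypothetical helper).

    If the Decision specifies a priority, use it. Otherwise fall back to
    the highest priority found among the incoming incidents.
    """
    if override is not None:
        return override
    return max(incident_priorities, key=PRIORITY_ORDER.index)

# Default behavior: the most severe incident wins.
default_priority = issue_priority(["low", "medium", "low"])

# A spike of low-priority incidents escalated by the Decision itself.
escalated_priority = issue_priority(["low", "low", "low"], override="critical")

print(default_priority, escalated_priority)
```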

Specify the right measure of similarity for your use case
A similarity algorithm that works perfectly for short strings, like healthcheck names or event sources, could be counterproductive for comparisons between longer blocks of text (e.g., descriptions). The SignifAI Decision engine gives you the flexibility to choose the right measure of similarity for your logic, with many options for different use cases – check out this blog post to find out more about each one.

Compare entire incidents with multiple algorithm options, including machine learning
With SignifAI Decisions, you can create broad logic at the entire incident scale that works seamlessly with your attribute-specific rules. Choose between two correlation metrics:
  1. Jaccard similarity – you can compare the similarity of entire incidents based on keyword matches. Read more about the implementation of Jaccard distance here.
  2. Categorical clustering – this option uses a machine learning algorithm to continuously determine clusters of similar events among all incoming data. In order for incidents to be correlated based on this operator, they must satisfy two conditions: 1) They must be in the same cluster, and 2) The distance between each incident and the center of the cluster must be less than the threshold distance.
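
For the first option, here is a minimal Python sketch of Jaccard similarity applied to whole incidents. It assumes an incident is a flat dictionary and that "keywords" are lowercase alphanumeric tokens taken from every attribute value. Both are simplifications for illustration, and the helper names and sample incidents are hypothetical.

```python
import re

def keywords(incident):
    """Flatten every attribute value of an incident into a set of lowercase keywords."""
    text = " ".join(str(value) for value in incident.values())
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(a, b):
    """Size of the shared keyword set divided by the size of the combined set."""
    ka, kb = keywords(a), keywords(b)
    if not ka and not kb:
        return 1.0  # two empty incidents are treated as identical
    return len(ka & kb) / len(ka | kb)

incident_a = {"source": "nagios", "host": "web-01", "description": "disk usage critical"}
incident_b = {"source": "nagios", "host": "web-02", "description": "disk usage warning"}

similarity = jaccard_similarity(incident_a, incident_b)
print(similarity)        # 4 shared keywords out of 8 distinct ones -> 0.5
print(1.0 - similarity)  # the corresponding Jaccard distance
```

Jaccard distance is simply one minus the similarity, so a Decision could express its threshold either way.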

Let’s break those down a little further. The algorithm continuously categorizes incidents into clusters based on an intelligent combination of attribute values and time series data. The center of each cluster is the incident which represents the mode (highest frequency of common attributes) of all the values in the cluster, and the distance from the center indicates how similar each incident is to that mode. In this visualization, each color represents a different cluster:

The threshold you set in the rule builder is normalized to represent the distance between an individual incident and the center of its cluster. A high threshold requires a short distance to the center, so only incidents that closely resemble the cluster’s mode (high correlation) are matched.

In the following example, let’s assume incidents 1-4 match all of your attribute-specific logic (e.g., source equals nagios). Using categorical clustering, if you set a threshold of 95%, no incidents would be correlated. A threshold of 75% catches incidents 1 and 2, and a threshold of 0% catches incidents 1, 2, and 3. Incident 4 would never be correlated, since it’s not in the same cluster.
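
The two conditions can be sketched in a few lines of Python. This sketch does not implement the clustering itself. It assumes the clustering step has already assigned each incident a cluster label and a normalized similarity to its cluster center. The similarity values below are invented so that the output mirrors the example above, and `correlate_by_cluster` is a hypothetical name.

```python
from collections import defaultdict

def correlate_by_cluster(incidents, threshold):
    """Group incident ids that share a cluster and meet the similarity threshold.

    A group needs at least two members, since a single incident has
    nothing to be correlated with.
    """
    groups = defaultdict(list)
    for incident in incidents:
        if incident["similarity"] >= threshold:
            groups[incident["cluster"]].append(incident["id"])
    return [ids for ids in groups.values() if len(ids) >= 2]

# Hypothetical clustering output: incidents 1-3 share a cluster, incident 4 does not.
incidents = [
    {"id": 1, "cluster": "A", "similarity": 0.90},
    {"id": 2, "cluster": "A", "similarity": 0.80},
    {"id": 3, "cluster": "A", "similarity": 0.40},
    {"id": 4, "cluster": "B", "similarity": 0.85},
]

for threshold in (0.95, 0.75, 0.0):
    print(threshold, correlate_by_cluster(incidents, threshold))
```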

Categorical clustering is a great approach for incidents that may look fairly different at the attribute-specific level, but are related at the entire incident/time series data level. For example, incidents 2 and 3 may not be correlated by a high-threshold Jaccard distance, but could be correlated with categorical clustering.
You can learn more about categorical clustering algorithms here.

Stay tuned for another post about deep clustering, another comparison algorithm that’s currently in private beta.

Automatically compare custom or dynamic attributes, no tagging required
With subtree logic in the SignifAI Decision builder, you don’t need to know the exact name of each event attribute – you can build flexible logic based on just a prefix string, like aws. No need to spend hours tagging individual attributes, or updating individual Decisions if your event schema changes – it just works.
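
Here is one way prefix-based matching could look, as a minimal Python sketch. It assumes incidents are flat dictionaries with dotted attribute names, and it treats two incidents as a match when they share at least one attribute under the prefix and agree on every shared one. That matching rule, the helper names, and the sample attributes are all assumptions for illustration.

```python
def subtree(incident, prefix):
    """Return only the attributes whose names start with the given prefix."""
    return {key: value for key, value in incident.items() if key.startswith(prefix)}

def subtree_matches(a, b, prefix):
    """True when both incidents share at least one prefixed attribute
    and agree on the value of every attribute they share."""
    sub_a, sub_b = subtree(a, prefix), subtree(b, prefix)
    shared = sub_a.keys() & sub_b.keys()
    return bool(shared) and all(sub_a[key] == sub_b[key] for key in shared)

incident_a = {"source": "cloudwatch", "aws.region": "us-east-1", "aws.instance_type": "m4.large"}
incident_b = {"source": "nagios", "aws.region": "us-east-1", "aws.az": "us-east-1a"}
incident_c = {"source": "nagios", "aws.region": "eu-west-1"}

print(subtree_matches(incident_a, incident_b, "aws."))  # same region
print(subtree_matches(incident_a, incident_c, "aws."))  # different region
```

Because the logic only looks at the prefix, a new attribute such as `aws.az` appearing in the event schema needs no change to the rule.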

Leverage automatic NLP classification to determine alert symptoms
SignifAI’s NLP classifier runs machine learning algorithms over all of your incoming data in order to determine the best-matching predefined classes for each event, and then exposes those classes in the Decision engine to provide context for stronger correlations.
The classes and subclasses that SignifAI automatically identifies are “symptoms” of your production system, with the highest-level classes defined by the “4 golden signals” described in the Google SRE book, plus one more (availability):
  • errors
  • load
  • latency
  • saturation
  • availability
…and because the machine learning engine is trained on an SRE/DevOps incident dataset, you don’t need to do any tagging or configuration work to use these classes. They’re available in the rule builder or in Suggested Decisions automatically.

Now that you’ve had a taste of the power and flexibility of Decisions, why not take it for a spin? Your team could be benefiting from automatic correlations and dramatic noise reduction in only a few minutes – sign up for your free trial here.