Automatic correlations can keep SRE and DevOps teams focused on what’s important by reducing noise and providing important context to incoming events. In order to parse through and determine correlations among millions of events, the SignifAI Decision engine supports a suite of algorithms to fit any use case or custom logic. Today, let’s go back to the basics and break down seven of these similarity measures.

1) Levenshtein distance

The Levenshtein distance, also known as edit distance, between two strings is the minimum number of single-character edits to get from one string to the other. Allowed edit operations are deletion, insertion, and substitution. Some examples:
  • number/bumble: 3 (number → bumber → bumblr → bumble)
  • trying/lying: 2 (trying → rying → lying)
  • strong/through: 4 (strong → trong → throng → throug → through)
Some common applications of Levenshtein distance include spelling checkers, computational biology, and speech recognition. The default similarity threshold for new SignifAI Decisions is an edit distance of 3 – you can change this in the Advanced mode of the Decision builder.
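
To make this concrete, here's a minimal sketch of the classic dynamic-programming computation (an illustration only, not the Decision engine's actual implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Build the edit-distance table one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Running this on the examples above gives `levenshtein("number", "bumble") == 3`, `levenshtein("trying", "lying") == 2`, and `levenshtein("strong", "through") == 4`.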

2) Jaro-Winkler distance
This metric uses a scale of 0-1 to indicate the similarity between two strings, where 0 is no similarity (0 matching characters between strings) and 1 is an exact match.
Jaro-Winkler similarity takes into account:
  • matching: two characters that are the same and positioned no farther apart than half the length of the longer string, minus one. 
  • transpositions: matching characters that are in different sequence order in the strings.
  • prefix scale: the Jaro-Winkler distance is adjusted favorably if strings match from the beginning (a prefix is up to 4 characters). 
This is a useful metric for cases where identical prefixes are a strong indication of correlation.
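
For the curious, a compact illustrative implementation (using the conventional prefix scaling factor of 0.1; not SignifAI's production code) might look like this:

```python
def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity with the Winkler common-prefix boost
    (0 = no similarity, 1 = exact match)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Two characters "match" if they are equal and no farther apart
    # than half the length of the longer string, minus one.
    window = max(len(s1), len(s2)) // 2 - 1
    flags1 = [False] * len(s1)
    flags2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not flags2[j] and s2[j] == c:
                flags1[i] = flags2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    m1 = [c for c, f in zip(s1, flags1) if f]
    m2 = [c for c, f in zip(s2, flags2) if f]
    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3
    # Winkler boost: reward a shared prefix of up to 4 characters.
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)
```

For the textbook pair "martha"/"marhta" this yields roughly 0.96 – high similarity, boosted by the shared "mar" prefix.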

3) Longest common subsequence distance
Longest common subsequence distance (LCS) is a variation of Levenshtein distance with a more limited set of allowed edit operations. The LCS distance between two strings is the number of single-character insertions or deletions required to change one string into the other. 

LCS is most useful in situations where the characters that belong to both strings are the most important – in other words, if there’s a lot of “garbage” or noisy characters in your strings, LCS is a useful metric, since it concentrates on the shared characters to determine similarity. 
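
Because substitution isn't allowed, the distance works out to len(a) + len(b) − 2 × |LCS(a, b)|. A short illustrative sketch:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # Dynamic programming over prefixes, one row at a time.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb
                        else max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]

def lcs_distance(a: str, b: str) -> int:
    """Insertions + deletions (no substitutions) to turn a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)
```

Note that `lcs_distance("trying", "lying")` is 3 rather than the Levenshtein distance of 2, because the substitution r→l now costs a deletion plus an insertion.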

4) Jaccard distance
Jaccard distance is based on the Jaccard similarity coefficient (also called the Jaccard index), one of the simplest measures of similarity to understand – the index, denoted as a percentage (0 being completely dissimilar; 100 being identical), is calculated with the following formula:
(# of characters in both sets) / (# of characters in either set) * 100

In other words, the Jaccard index is the number of shared characters divided by the total number of characters (shared and un-shared). In the SignifAI Decision builder, you can use Jaccard similarity to compare entire incidents and set a similarity threshold (10 – 99%). 

Jaccard distance is very easy to interpret and especially useful in cases with large data sets – for example, in comparing the similarity between two entire incidents (as opposed to one attribute). It is less effective for small data sets or situations with missing data.
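
As a sketch, here's the index computed over the character sets of two strings (the same idea applies to word sets or whole incident attributes):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Shared characters divided by total distinct characters,
    expressed as a percentage (0-100)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 100.0  # two empty strings: treat as identical
    return len(set_a & set_b) / len(set_a | set_b) * 100
```

For example, "flowers" and "florets" share 6 distinct characters out of 8 total, giving a Jaccard similarity of 75%.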

5) Hamming distance
One of the similarity measures SignifAI uses in the Decisions engine is the Hamming distance. A simpler version of “edit distance” metrics like Levenshtein distance, the Hamming distance between two strings is the number of substitutions required to turn one string into the other.

Hamming distance is computed by counting the number of positions with different characters between the two strings. For example, in the strings below, the Hamming distance is 2 – the fourth characters (w/r) and the sixth characters (r/t) differ:

flowers/florets
Hamming distance requires the compared strings to be of equal length. This is a useful similarity metric for situations where the difference between two strings may be due to typos, or where you want to compare two attributes with known lengths (e.g. an incident ID, or an instance host or pod name that follows a specific convention). 
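
The computation itself is a one-liner (shown here purely for illustration):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which the characters differ.
    Only defined for strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))
```

As in the example above, `hamming("flowers", "florets")` returns 2.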

6) Cosine distance
Cosine distance is another similarity metric the SignifAI Decision engine uses to correlate related issues. It’s most commonly used to compare large blocks of text (for example, incident descriptions) and provides an easy visualization of similarity.

To obtain the cosine distance between two text blocks, a vector is calculated for each block to represent the count of each unique word in the block. The cosine of the angle between the two resulting vectors is the similarity score: 1 means the vectors point in the same direction, 0 means the blocks share no words at all. For example, take these two strings:
It is not length of life, but depth of life.
Depth of life does not depend on length.

Here are the word counts for these sentences:
it 1 0
is 1 0
not 1 1
length 1 1
of 2 1
life 2 1
but 1 0
depth 1 1
does 0 1
depend 0 1
on 0 1

…and here are those counts represented as vectors:
[1, 1, 1, 1, 2, 2, 1, 1, 0, 0, 0]
[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]

The cosine similarity of these vectors is about 0.66 (an angle of roughly 49 degrees). A cosine similarity of 1 (an angle of 0 degrees) indicates the highest similarity. 

As you can see, this metric is less useful in situations where small character differences between words should be ignored, e.g. typos – “nagios” and “nagois” would be treated as completely different words. 

Also, cosine distance ignores word order in the text blocks – for example, these two sentences have a cosine similarity of 1 even though they read completely differently:
this is the first of the sentences, it is unscrambled
the sentences is unscrambled, the first of this it is
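
Both properties are easy to see in a sketch of the computation (the naive punctuation stripping here is just for the example):

```python
import math
from collections import Counter

def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    # Build word-count vectors (case-folded, basic punctuation removed).
    words1 = Counter(text1.lower().replace(",", "").replace(".", "").split())
    words2 = Counter(text2.lower().replace(",", "").replace(".", "").split())
    # Dot product over the shared words, divided by the vector magnitudes.
    dot = sum(words1[w] * words2[w] for w in words1.keys() & words2.keys())
    norm1 = math.sqrt(sum(c * c for c in words1.values()))
    norm2 = math.sqrt(sum(c * c for c in words2.values()))
    return dot / (norm1 * norm2)
```

The two "length of life" sentences score about 0.66, while the scrambled pair above scores exactly 1.0 because their word counts are identical.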

7) Fuzzy score
One of the similarity metrics that the SignifAI Decision engine can use to correlate related incidents is fuzzy score. Between two strings, a high fuzzy score indicates high similarity.

The fuzzy score algorithm works by allocating “points” for character matches between strings:
  • One point for each matching character
  • Two bonus points for each consecutive match (a matching character that immediately follows the previous match)
Example: SignifAI / sgnfai
s: 1
g: 1
n: 1
f: 1
a: 1
i: 1
gn: 2
fa: 2
ai: 2
= 12 points

Fuzzy score is most useful for relatively short strings. 
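
A small sketch of this scoring scheme, matching query characters in order against the term (an illustration of the idea, not SignifAI's exact implementation):

```python
def fuzzy_score(term: str, query: str) -> int:
    """One point per query character found, in order, inside term,
    plus a two-point bonus when a match immediately follows the
    previous one."""
    term, query = term.lower(), query.lower()
    score = 0
    term_index = 0
    prev_match = -2  # index in term of the previous match (-2 = none yet)
    for q in query:
        while term_index < len(term):
            if term[term_index] == q:
                score += 1
                if prev_match + 1 == term_index:
                    score += 2  # consecutive-match bonus
                prev_match = term_index
                term_index += 1
                break
            term_index += 1
    return score
```

Running the example above, `fuzzy_score("SignifAI", "sgnfai")` returns 12: six matched characters plus three consecutive-match bonuses (gn, fa, ai).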

In addition to the similarity metrics described here, the SignifAI Decision engine also uses advanced machine learning techniques like automatic NLP classification, categorical and deep learning clustering, and more to correlate incidents – stay tuned for a future post to learn more.

With the Decision builder, the flexibility and power of these tools is at your fingertips, in an intuitive, beautiful interface. Want to check it out for yourself? Sign up for your FREE TRIAL here.

Annika Garbers

Product Manager at SignifAI
Annika works with the Product team at SignifAI. Her background is in project management and process improvement for DevOps and SRE teams of all sizes.