This post originally appeared on DevOps.com.
“Predictive analytics” refers to the many techniques that are used to analyze data in order to make predictions about the future. These techniques include data mining, statistics, modeling, machine learning and artificial intelligence. Being able to predict technical and business outcomes is one of the main promises of “big data.” Indeed, many businesses have invested heavily in infrastructure in order to anticipate key performance indicators such as demand, pricing or maintenance. The investment required is substantial but, when it works, it yields a very positive ROI.
In this post, we’ll examine why implementing predictive analytics in a Site Reliability Engineering (SRE) context requires a significant investment, both in setting up the right data sets and in applying the right logic to that data.
SRE should, in theory, be a prime target area for predictive analytics. Why? Because companies have widely adopted monitoring tools over the past decade and, in doing so, have amassed extensive system health and behavior data that is ripe for analysis. SRE teams have also become better at making sense of this data as data science tools have grown more accessible, and basic data science aptitude is increasingly becoming a desired skill when hiring engineers. Finally, the SRE team is responsible for a mission-critical KPI: system availability. In digitally dependent businesses, this single metric is directly tied to customer SLAs. SRE teams therefore have a strong incentive to implement predictive analytics: the ability to prevent downtime by addressing potential issues before they hit the production environment is hugely valuable. More system availability translates into more revenue, higher customer satisfaction and lower costs of problem mitigation.
However, the reality is that an SRE’s job is still mostly focused on diagnosis, mitigation and “fixing,” as opposed to more proactive tasks. This is in large part because predictive analytics applied to SRE/DevOps has proven difficult to implement and perform in a disciplined manner. It therefore remains, by and large, an aspirational goal at most companies.
Up next, let’s explore the four challenges that need to be overcome to implement a high-quality predictive analytics practice in an SRE context:
- Data collection
- Data quality
- Data volume
- Model usability
Challenge #1: Data Collection
First off, the working data sets need to fit what you are trying to predict. Getting the right data set is challenging here because the SRE team’s mission is very broad: ensuring the reliability of the entire system. The difficulty lies in the fact that there is no such thing as “full system” data readily available; all the relevant data is distributed across siloed or overlapping repositories and monitoring tools. Centralizing the data requires either a full-system agent (which takes a lot of work to set up) or a multitude of API integrations, each of which must be mastered. An additional obstacle lies in the inconsistent nature of events, time-series data and logs: these disparate data types cannot be analyzed together unless they are first transformed into a uniform data set.
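As a rough illustration of that last point, here is a minimal sketch of normalizing metrics, alert events and parsed log lines into one shared record shape. The schema and field names (`source`, `timestamp`, `entity`, `signal`, `value`) are purely illustrative assumptions, not a standard:

```python
def normalize(source, raw):
    """Map a raw record from one monitoring source into a uniform schema.
    The schema here is a hypothetical example, not an industry standard."""
    if source == "metrics":  # time-series sample: (epoch_seconds, host, metric, value)
        ts, host, metric, value = raw
        return {"source": source, "timestamp": ts, "entity": host,
                "signal": metric, "value": float(value)}
    if source == "events":  # alert event: dict with its own key names
        return {"source": source, "timestamp": raw["time"], "entity": raw["host"],
                "signal": raw["alert"], "value": 1.0}
    if source == "logs":  # parsed log line: (epoch_seconds, host, severity)
        ts, host, severity = raw
        return {"source": source, "timestamp": ts, "entity": host,
                "signal": "log." + severity, "value": 1.0}
    raise ValueError("unknown source: " + source)

records = [
    normalize("metrics", (1700000000, "web-1", "cpu.util", 0.93)),
    normalize("events", {"time": 1700000012, "host": "web-1", "alert": "high_cpu"}),
    normalize("logs", (1700000020, "web-1", "error")),
]
# All three sources now share one schema, so they can be analyzed
# together -- for example, merged into a single per-entity timeline.
timeline = sorted(records, key=lambda r: r["timestamp"])
```

Once every source lands in the same shape, downstream analysis no longer needs to care whether a signal originated as a metric, an alert or a log line.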
Challenge #2: Data Quality
SRE environments emit lots of data, some of which makes no sense for the purpose of the analysis. Whichever collection method is used, the information tracked often reflects some level of bias on the part of whoever set up the tracking mechanism in the first place. This leads to capturing irrelevant data and inadvertently leaving out important information; in other words, the data is often noisy and lacking at the same time. The key to successful modeling lies in selecting prediction variables whose data is observed before the predicted event (in this case, a failure) happens. Defining the variables that lead to the determination of a sub-system failure is critical and requires specific domain knowledge. It is very hard to define and extract a pure data set sanitized of noise, and the more data you leverage, the higher the chances that you are actually “feeding the beast” with irrelevant information. Too much irrelevant data leads to inaccurate models and false positives.
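The “observed before the event” requirement can be enforced mechanically. Below is a small sketch, with made-up record shapes and signal names, that keeps only observations from a window strictly before the failure timestamp, so the label never leaks into the features:

```python
def features_before(observations, failure_ts, window):
    """Keep only observations in [failure_ts - window, failure_ts).
    Anything at or after the failure would leak the outcome into the
    features. Timestamps are epoch seconds; field names are illustrative."""
    return [o for o in observations
            if failure_ts - window <= o["timestamp"] < failure_ts]

obs = [
    {"timestamp": 990, "signal": "cpu.util", "value": 0.91},
    {"timestamp": 999, "signal": "queue.depth", "value": 250},
    {"timestamp": 1000, "signal": "http.5xx", "value": 30},  # coincides with the failure
]
usable = features_before(obs, failure_ts=1000, window=60)
# Only the first two observations survive: the 5xx spike happens at the
# failure itself, so training on it would just restate the label.
```

A model trained on leaked features looks impressively accurate in backtests and then predicts nothing in production, which is one concrete way “noisy and lacking” data produces false confidence.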
Challenge #3: Data Volume
Here you can have two different types of challenges: too little data and too much.
Not enough data
You typically need at least 100 instances of what you are trying to predict, and probably at least 100 counter-examples where it did not happen, to train a model. This assumes the data set is clean and directly relevant to a very narrow, specifically defined problem. In the case of SRE, the scope of issues that can affect a system is such that the required data set is very substantial: likely several months’ worth of a wide variety of system behavior measurements. For example, if you gather data only through passive collection, think about the lag between t-zero, when you start collecting data, and the time when you would have results. During that time you have a bootstrapping problem, i.e., your model cannot yet be trained to predict accurately what you want.
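That rule of thumb is easy to turn into a pre-training guard. This sketch (the threshold constant is the article’s heuristic, not a statistical law) refuses to train until both classes have enough examples:

```python
from collections import Counter

MIN_EXAMPLES_PER_CLASS = 100  # heuristic from the text, not a hard law

def ready_to_train(labels):
    """Return True only when both classes are sufficiently represented.
    Labels: 1 = a failure occurred, 0 = a healthy period."""
    counts = Counter(labels)
    return all(counts.get(c, 0) >= MIN_EXAMPLES_PER_CLASS for c in (0, 1))

# Bootstrapping phase: plenty of healthy windows, but only 12 failures so far.
ready_to_train([0] * 150 + [1] * 12)   # False -- keep collecting
ready_to_train([0] * 150 + [1] * 100)  # True  -- enough of each class
```

Failures are the rare class in any healthy system, so the failure side of this check, not the healthy side, is almost always what keeps you waiting at t-zero.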
Too much data
There are two practical points to make around the notion of “too much data.”
- Lack of standardization: Processing large volumes of data for analysis remains a hard engineering challenge. There is no standardization for data pipelines, every method has a limit on how many inputs it can ingest per second, and ingest rates often end up becoming bottlenecks in large enterprise environments.
- Volume does not equal accuracy: There is a point beyond which you get diminishing returns; statistical theory shows that past a certain point, feeding more data into a predictive analytics model will not yield meaningfully more accurate results.
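One way to see the diminishing returns (an illustration I am adding, not the article’s own derivation) is through the standard error of an estimated rate, which shrinks only with the square root of the sample size:

```python
import math

def std_error(p, n):
    """Standard error of a rate p estimated from n samples: sqrt(p(1-p)/n).
    Halving the uncertainty requires 4x the data, hence diminishing returns."""
    return math.sqrt(p * (1 - p) / n)

p = 0.01  # assumed failure rate, purely illustrative
errors = {n: std_error(p, n) for n in (1_000, 10_000, 100_000, 1_000_000)}
# Each 10x increase in data shrinks the error only by sqrt(10) ~ 3.16x,
# so the jump from 100k to 1M samples buys far less than 1k to 10k did.
```

Model accuracy is not the same quantity as a rate estimate’s standard error, but the square-root behavior is representative: data volume pays off sublinearly, while ingest costs grow linearly or worse.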
Overall, getting the quality and quantity of the training data right is the best investment you can make in building your predictive capabilities.
Challenge #4: Model Usability
From a pure “mechanics” standpoint, the analytical concepts and methods exist to capture the complex thinking required to predict the behavior of an enterprise’s particular combination of technology, information and infrastructure. Predictive software relies heavily on advanced algorithms and methodologies such as logistic regression, time-series analysis and decision trees. The problem is that the workings of the kinds of systems SREs deal with are very hard to understand in the first place, let alone predict. It is not uncommon for SREs themselves to not entirely understand how a system works if they were not the ones who initially set it up. In addition, every company’s system is different.
When data scientists design predictive models without years of experience in an SRE capacity across multiple environments, i.e., without leveraging subject-matter expertise, it rarely works well. Often the models they create lack usability, even after a validation process meant to confirm their quality. A modeler will split the available data into a training set, on which the model is fit, and a holdout set, against which its predictions are compared with actual outcomes. It takes multiple iterations to produce a model with high overall predictive accuracy, and nine times out of ten the result is overfitting, i.e., the model is perfect for the training data but fails to generalize. A subject-matter expert in the domain of technical operations can cut the iteration cycles and get to a predictive model that works with an acceptable level of risk.
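The split-and-overfit pattern can be shown in a few lines. This toy sketch (deliberately simplified, with made-up numbers: the feature could be CPU load, label 1 meaning a failure followed) uses a 1-nearest-neighbor predictor, which memorizes its training set and therefore overfits by construction:

```python
def one_nn_predict(train, x):
    """1-nearest-neighbor: copy the label of the closest training point.
    The model memorizes the training set -- a classic overfitter."""
    return min(train, key=lambda t: abs(t[0] - x))[1]

def accuracy(train, eval_set):
    """Fraction of (feature, label) pairs the memorizing model gets right."""
    return sum(one_nn_predict(train, x) == y for x, y in eval_set) / len(eval_set)

# Toy data as (feature, label) pairs; (0.25, 1) is a mislabeled noise point.
train = [(0.10, 0), (0.20, 0), (0.30, 0), (0.80, 1), (0.90, 1), (0.25, 1)]
test = [(0.14, 0), (0.27, 0), (0.85, 1)]

train_acc = accuracy(train, train)  # 1.0: every training point is its own neighbor
test_acc = accuracy(train, test)    # 2/3: the memorized noise point misleads it
```

A validation score computed on the training data alone would report the perfect 1.0 and hide the problem; only the holdout comparison exposes that the model learned the noise. A domain expert would flag the `(0.25, 1)` point as implausible before it ever reached the model.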
In short, implementing predictive analytics in an SRE context takes a significant investment, both in setting up the right data set and in applying the right logic. The team at SignifAI has worked on solving these various challenges over many years; stay tuned for a future blog post that will provide some pointers on how to achieve the best prediction results.