By JP Emelie Marcos
I wrote about how to achieve 360-degree monitoring in a previous post. My article implied that full coverage was the goal for effective monitoring. In reality, it’s only an intermediate step: you will rarely feel like you are done once you have all the tools required to monitor every component of your system. Depending on your environment and your team, you will likely run into a number of common limitations and still want more. In this post, I want to share the different types of limitations that teams experience with their monitoring tools, as well as the next steps they typically consider.
Types of limitations teams experience with TechOps monitoring tools:
1. Siloed data
In order to measure everything, you will implement a number of monitoring applications, each covering a different portion of the system. Your data will then be distributed across various databases and tools. If you have an issue in production, you will find yourself jumping frantically from one application to another to gather the nuggets of information relevant to the issue at hand, which is a time-consuming, resource-intensive process that is prone to human error. Scattered data sources are an impediment to getting fast insights that inform an actionable decision, which makes the data a lot less useful to the teams on the front line.
2. Alert noise
Each monitoring tool comes with the ability to set up alerts. Over time, you end up in a noisy environment, with many more superfluous alerts than your team can handle. Identifying the relevant signal among all that noise is like finding a needle in a haystack. We wrote about the art of setting up alerts, which is worth re-reading to mitigate this risk, but at a high level the noise is driven by:
- The level of logic possible when setting up alerts is limited, and rudimentary alerts end up generating more events than originally intended.
- Different people, or different teams, set up alerts with intentions very specific to their particular environment, but the alerts themselves are pushed to a much broader audience, creating a disruptive, spammy stream of alerts for everyone else.
- The circumstances that led to an alert change (e.g. an alert set up to flag future issues similar to one that you just had), but nobody has the time to revisit the current set of alerts, so a bunch of them continue to generate events because their triggers are too broad (e.g. a threshold-based alert).
The resulting high level of noise reduces the value of the information pushed overall and puts the teams in a reactive mode, constantly fighting relevant or irrelevant symptoms as opposed to the root cause of an issue.
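To make the first driver concrete, here is a minimal sketch of how a rudimentary threshold alert generates more events than intended, compared with one that requires a sustained breach. The function names, the CPU samples and the 90% threshold are all hypothetical and do not come from any particular monitoring tool.

```python
from typing import List


def naive_threshold_alerts(samples: List[float], threshold: float) -> List[int]:
    """Fire one event for every sample above the threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def sustained_threshold_alerts(
    samples: List[float], threshold: float, min_consecutive: int
) -> List[int]:
    """Fire once per breach, only after min_consecutive samples in a row exceed the threshold."""
    events = []
    run = 0  # length of the current streak of samples above the threshold
    for i, value in enumerate(samples):
        if value > threshold:
            run += 1
            if run == min_consecutive:
                events.append(i)
        else:
            run = 0
    return events


# Hypothetical CPU utilization samples (percent), one per minute.
cpu = [40, 95, 42, 91, 93, 96, 97, 45, 92, 41]

naive = naive_threshold_alerts(cpu, 90)
sustained = sustained_threshold_alerts(cpu, 90, 3)

print(len(naive), "events from the naive rule")
print(len(sustained), "event(s) from the sustained rule")
```

On this sample data the naive rule fires six times, including on two isolated one-minute spikes, while the sustained rule fires once. Whether three consecutive samples is the right window depends entirely on the metric and the environment.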
3. Inefficient root cause analysis
All the tools that a company implements will, to put it cynically, dump a massive amount of data into someone’s lap. They are all “self service” and rely on the assumption that someone, or everyone, on your team will be able to do the analysis in a few simple clicks. This data democratization sounds ideal, but it’s not sustainable as the volume of data and noise, as well as the speed at which your environment changes, continues to increase. Not every tech ops team has budding data scientists on staff. In reality, it takes engineers a large amount of effort to identify the relevant data for any given issue, gather it into one place, make sense of it and identify the appropriate action required to fix it, not to mention the effort needed for documentation. I have yet to find a technical operations team that feels it is appropriately staffed. The more data you have, the bigger the “data analysis” load you impose upon your team, and your company may or may not have the ability to absorb it.
4. Effort required to capture knowledge
Ideally, after every major issue, you should take the time to incorporate what you learned from it into your knowledge base, whether in the form of a new runbook, a post-mortem on the wiki, or notes inside a chat ops channel or a paging tool. All of these knowledge management processes take time, which your team has a lot less of because of the time-consuming process of getting to the root cause of an issue through normal monitoring tools, and because it is constantly on the receiving end of a flood of false-positive alerts. In addition, monitoring tools do not allow for fast, system-wide, comprehensive documentation: when they include documentation capabilities at all, they are focused on their particular scope. In that context, managers have a hard time getting traction with engineers on manual documentation initiatives, and you end up stuck reinventing the wheel every time.
So what is the next step beyond your traditional monitoring tools?
Here are four functionalities that teams typically still need:
1. A single pane of glass
It’s odd that companies go through all that effort to deploy a number of monitoring tools, only to turn around and want everything in one place. If your team is accountable to an executive, or to other parts of the organization, who only need a higher-level view of your environment, dumping thousands of data points across twelve screens in their lap does not help them get that quick snapshot view, so you will want to consolidate and summarize. Oftentimes, the teams themselves express the need to avoid looking at multiple screens because it becomes overwhelming. Aggregating your monitoring data into one place is tough. It requires a lot of thinking in terms of prioritization and knowledge representation (how do you make it easily digestible?). The companies that we have seen attempt it are typically very large, and they devote a number of developers to this type of project for a long period of time, which, obviously, not a lot of businesses have the ability to do.
2. Ability to correlate data across tools
This is a universal need for every team. In 90% of cases, the analysis required to diagnose an issue or to work on prevention requires the ability to correlate data coming from any part of your system. Besides the fact that the data is distributed across siloed repositories, it is also a particularly difficult challenge when the data is inconsistent: how do you correlate log data with time series data and events? Very experienced tech ops engineers are able to do it, although it still takes them time and there is a risk of human error. The problem is that it is hard to recruit experienced tech ops engineers. Correlation is typically a critical part of diagnosing issues, but it’s really hard to do when the relevant data is not easily accessible and is presented in inconsistent formats.
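As an illustration of what correlating log data with time series data can look like once both are available in one place, here is a minimal sketch that finds latency spikes and pulls the log lines recorded within a time window around each spike. The data, the 500 ms threshold, the 60-second window and the `logs_near` helper are all hypothetical; real tools expose this data through their own APIs and formats, which is precisely what makes the problem hard.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def logs_near(event_time, log_entries, window):
    """Return log entries whose timestamp is within +/- window of event_time.

    log_entries must be a list of (timestamp, message) sorted by timestamp.
    """
    times = [t for t, _ in log_entries]
    lo = bisect_left(times, event_time - window)
    hi = bisect_right(times, event_time + window)
    return log_entries[lo:hi]


# Hypothetical time series: (timestamp, request latency in ms), one per minute.
base = datetime(2017, 1, 1, 12, 0, 0)
latency = [
    (base + timedelta(minutes=0), 120),
    (base + timedelta(minutes=1), 130),
    (base + timedelta(minutes=2), 900),
    (base + timedelta(minutes=3), 125),
]

# Hypothetical log entries from a different tool, sorted by timestamp.
logs = [
    (base + timedelta(seconds=30), "deploy started"),
    (base + timedelta(minutes=1, seconds=50), "db connection pool exhausted"),
    (base + timedelta(minutes=10), "cron job finished"),
]

# Treat any sample above 500 ms as a spike, then look for nearby log lines.
spikes = [(t, v) for t, v in latency if v > 500]
correlated = {
    t: [msg for _, msg in logs_near(t, logs, timedelta(seconds=60))]
    for t, _ in spikes
}

for spike_time, messages in correlated.items():
    print(spike_time.isoformat(), messages)
```

The sketch only works because both sources already share a clock and a timestamp type. In practice, most of the effort goes into getting the data into that consistent shape.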
3. Performing predictive analysis
Teams almost never get the chance to do predictive analytics: it’s hard to find the time, and prevention in general is not as ingrained in the culture of tech ops as “fixing stuff” is. Yet the most common aspirational goal of DevOps professionals is to spend more and more time being proactive, addressing issues before they have any negative impact. Prediction has the same need for data consistency across the board as the diagnostic process. The monitoring tools that are starting to provide some anomaly detection only use the small subset of data they are able to track, and that data has to be available in a consistent format, so although it’s useful, the level of insight you get is limited.
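For a sense of what even a very simple form of prediction involves, here is a minimal sketch that fits a straight line to disk usage samples and estimates how long until a limit is reached. The data and the function names are hypothetical, and a linear trend is an assumption chosen for illustration; it is not how any particular monitoring product does anomaly detection or forecasting.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hours_until(limit: float, xs: List[float], ys: List[float]) -> Optional[float]:
    """Estimate hours from the last sample until the fitted line reaches limit.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - xs[-1]


# Hypothetical samples: hours since the first measurement, disk usage in percent.
hours = [0, 1, 2, 3, 4]
disk_pct = [50, 52, 54, 56, 58]

remaining = hours_until(90, hours, disk_pct)
print("estimated hours until 90% full:", remaining)
```

With usage growing two points per hour from 58%, the estimate comes out to 16 hours. Real metrics are noisier and rarely linear, so treat this as a way to frame the problem rather than as a forecasting method.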
4. Extracting more value from existing knowledge bases
Most of the knowledge in your tech ops organization (lessons learned, best practices, fixes, etc.) is in the heads of your tech ops team members. A small amount of it gets captured through documentation, and what little is documented typically sits in a static repository, like your internal wiki. In that context, it’s up to someone on your team to think of manually running a search and reading up on the results in order to apply that knowledge to a current problem. The leverage you get from your collective learning and your history is very limited in that configuration. What would be great, instead, is for this knowledge base to be a lot more extensive by being generated automatically on a continuous basis, without requiring a ton of effort from your team. It would also be great if that knowledge were brought up in front of the user automatically at the right time, in a relevant context. Let’s say that you have an issue in production now that is similar to something you experienced a year ago: it would be great if the diagnosis that was made then, together with the fix and the names of the team members who worked on it, appeared automatically in front of your team. That’s how you start getting leverage from your knowledge base, and a lot of professionals are demanding these types of capabilities.
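The idea of surfacing a past incident when a similar one occurs can be sketched with a very simple text similarity measure. The example below compares the words in a new alert with the summaries of past incidents using Jaccard similarity. The `Incident` record, the sample history, the responder names and the 0.3 cutoff are all hypothetical, and a real system would likely need a much richer notion of similarity than shared words.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Incident:
    """Hypothetical record of a past incident in the knowledge base."""
    summary: str
    diagnosis: str
    fix: str
    responders: List[str]


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_incidents(
    alert_text: str, history: List[Incident], min_score: float = 0.3
) -> List[Tuple[float, Incident]]:
    """Return past incidents scoring at least min_score, best match first."""
    alert_tokens = tokens(alert_text)
    scored = [(jaccard(alert_tokens, tokens(i.summary)), i) for i in history]
    matches = [(score, i) for score, i in scored if score >= min_score]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)


history = [
    Incident(
        summary="checkout api latency high after deploy",
        diagnosis="connection pool too small",
        fix="raised pool size and restarted",
        responders=["alice", "bob"],
    ),
    Incident(
        summary="disk full on log server",
        diagnosis="log rotation disabled",
        fix="re-enabled rotation",
        responders=["carol"],
    ),
]

matches = similar_incidents("High latency on checkout API", history)
for score, incident in matches:
    print(round(score, 2), incident.diagnosis, incident.fix, incident.responders)
```

Here only the checkout incident clears the cutoff, so its diagnosis, fix and responders are what would be shown to the on-call engineer. The hard part the post describes, building and maintaining that history automatically, is outside the scope of this sketch.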
In summary, your monitoring tools provide you with an aggregated view of a massive data set, but they each focus on a portion of the system and it’s up to the user to figure out what to do with the data. Those tools also create a lot of noise. What’s missing is the ability to perform diagnostic and predictive analysis across the full set of system health and behavior data, as well as the ability to get leverage from the knowledge base. In an upcoming post, I will write about how to complement your monitoring tools in order to overcome their inherent limitations.
Stay tuned and thanks for reading.