In the first half of 2016, SignifAI ran a six-month survey of DevOps decision makers. We wanted to take the pulse of the current state of affairs in a larger sample of environments than the ones we were familiar with.
The broader sample would also let us validate our own experience of managing technical infrastructure, which pointed to a large degree of pain in running technical operations. In fact, few professionals feel that they ever achieve their desired level of efficiency and stability. Where there’s pain there’s opportunity, despite the impressive range of open source and commercial technology already available in this area.
In the end, the survey did bear out our own personal experience in the area.
Excerpts from the Ops Decision Makers Survey
Pain point 1:
Tens of millions of people across the planet babysit Operations. Yet more than 80% of our respondents, most of them from well-capitalized companies, agree that they are constantly short-handed.
There are many factors that explain these results – the first one is that Technical Operations still mostly rely on people. As I mentioned in a previous post, the level of manual processes is extremely high. These results are slightly biased by the fact that most of the survey respondents are based in the Bay Area, where there is a ‘war over talent’. However, this result also speaks to the level of complexity in operations, which has increased over time; further results elucidate this fact.
Pain point 2:
Running technical operations is increasingly complex. The massive penetration of tech ops monitoring solutions in the enterprise should make life easier. Surely having plenty of data on the system is better than not having enough? However, our survey shows that this is not the case. In our previous environments, the number of ‘real time’, highly aggregated metrics available and displayed to folks running operations exceeded, on average, one thousand at any point in time. This level of data is too noisy to use. People start suppressing most of it and revert to sampling – it is an ironic twist in the era of Big Data where we are supposed to analyze 100% of the data available.
To put it simply, there are just too many distractions: too many alerts, too many threads in ChatOps to pay attention to, too many graphs and dashboards, too many emails, too many pages. Most of what is shown is superfluous, but it is very hard to distinguish the signal from the noise. Getting to an answer fast is most needed when there are issues in production. In these cases, the problems of prioritization and higher-level analysis become more acute.
Monitoring solutions provide data first and foremost. Most attempt to help the user get to an answer by providing greater visualization capabilities, as well as a wide range of mathematical functions. But we are still relying on human capability. Different professionals come with different levels of capability and experience. The result is a wide range of efficiency from one company to the next, depending on the quality of the team. Even within a given company, the ability to respond to and fix issues may vary widely from shift to shift.
Monitoring solutions are great tools but as one of our respondents puts it, “we all rely on human eyes”. It is worth noting that this quote came from one of the ‘multi-unicorns’ or ‘dragons’ as they’ve been termed; these companies can afford the best tech money can buy.
There are many outcomes from this. The first is the level of downtime that each environment experiences. We estimate that over $10B in lost revenue is directly attributable to downtime each year. So although most things work most of the time, there is still real margin for improvement.
Pain point 3:
Teams are in reactive mode too often.
Respondents should have understood, since we were using the drastic terms of ‘fire fighting’, that we were asking about edge cases; this mode of operation should be the exception rather than the rule. However, in over 60% of cases this is exactly how it feels. Obviously, this is all subjective but it points to the human cost of not being in control. Reactive system administration does not foster passion. In fact, it generates high stress levels. People tend to leave environments where it happens too often.
Overall, we found that our own answers would have matched the majority of the results; the situation we had experienced seems relatively common. The good news is that there are solutions to this problem, and companies are taking many approaches to resolve it.
As this blog unfolds, we will continue to discuss the results of the survey and look at solutions. So please, keep reading.