The world of SRE and DevOps teams is all about fast responses: being able to quickly diagnose and resolve a problem can mean thousands of dollars or clicks for your company. Common KPIs used to measure the effectiveness of an SRE team are Mean Time to Detect (MTTD), Mean Time to Understand (MTTU), and Mean Time to Respond (MTTR). Imagine the SRE cycle as a circle whose area represents your total cost.
Traditional approaches to improving operations efficiency include hiring more engineers, configuring more tools, and training your existing engineers to better understand your system. However, each of these options addresses only one of the key KPIs, and in some cases it can make the others worse.
As the complexity of production systems grows with the introduction of more tools and new technology, DevOps and SRE teams need a more sustainable solution for incident management. That’s where event intelligence and automatic correlations come in.
Every step of the SRE process, and each corresponding KPI, is closely tied to the others. So why not use a tool that takes advantage of this relationship to improve all three together? With SignifAI Decisions, an AI and machine learning platform that automatically discovers correlations in event data across your full stack, each small improvement to one step in the cycle positively impacts the others. Let’s check out an example:
With SignifAI Decisions, you can create customized logic based on your knowledge of your production system. In this example, a spike in the volume of low-priority incidents for an application indicates a larger underlying problem. The priority of the automatically correlated Issue increases, so the problem is flagged sooner and your MTTD drops.
When an SRE receives a notification about this Issue and checks it out in SignifAI, they’ll immediately notice some relationships between the events. In addition to the context from the Issue correlation, SignifAI uses an automatic NLP classifier to categorize your events based on five critical symptoms of SRE incidents. A single glance at the Issue reveals that the problem is related to health check availability, and your MTTU is shorter than ever.
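To make the idea of a volume-spike rule concrete, here is a minimal Python sketch. It is not SignifAI's Decisions syntax or API; the `Event` class, the `escalated_priority` function, the five-minute window, and the threshold of 10 events are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    app: str          # application the event belongs to
    priority: str     # "low" or "high" in this sketch
    timestamp: float  # seconds since some reference point


def escalated_priority(events, app, now, window_s=300, spike_threshold=10):
    """Return "high" when low-priority events for `app` spike, else "low".

    A "spike" here simply means at least `spike_threshold` low-priority
    events within the last `window_s` seconds (an assumed definition).
    """
    recent_low = sum(
        1
        for e in events
        if e.app == app
        and e.priority == "low"
        and 0 <= now - e.timestamp <= window_s
    )
    return "high" if recent_low >= spike_threshold else "low"


# Twelve low-priority events for one app within two minutes.
events = [Event("checkout", "low", float(t)) for t in range(0, 120, 10)]
print(escalated_priority(events, "checkout", now=120.0))
```

A fixed threshold is the simplest possible choice. A real rule would more likely compare the current volume against a baseline for that application.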
Finally, SignifAI provides smart Recommendations based on previous Issues, so your on-call SRE can see the process other team members have used in the past to solve the problem. Related incoming events will continue to be correlated into the parent Issue, reducing notification noise and distractions for SREs. This easily accessible historical context and increased focus will decrease your MTTR, meaning minimized production impact for your customers.
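The noise reduction from correlating events into a parent Issue can be sketched in a few lines of Python. This is a simplified illustration and not SignifAI's correlation engine: it groups events by a fixed key (application plus symptom) where the product uses machine learning, and the `Issue` class, `correlate` function, and event fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    key: tuple
    events: list = field(default_factory=list)


def correlate(events, key_fn):
    """Group events into Issues and notify only when a new Issue opens.

    Returns the Issues by key and the list of keys that triggered a
    notification. Later events with the same key join the existing
    Issue silently.
    """
    issues = {}
    notifications = []
    for event in events:
        key = key_fn(event)
        if key not in issues:
            issues[key] = Issue(key)
            notifications.append(key)  # first event for this key: notify once
        issues[key].events.append(event)
    return issues, notifications


# Hypothetical incoming events, represented as plain dictionaries.
incoming = [
    {"app": "checkout", "symptom": "availability", "msg": "health check failed"},
    {"app": "checkout", "symptom": "availability", "msg": "health check failed"},
    {"app": "checkout", "symptom": "availability", "msg": "instance unreachable"},
    {"app": "search", "symptom": "latency", "msg": "p99 above limit"},
]
issues, notifications = correlate(incoming, lambda e: (e["app"], e["symptom"]))
print(f"{len(incoming)} events, {len(notifications)} notifications")
```

In this sketch the on-call SRE gets one notification per Issue, not one per event, which is the effect on focus described above.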
So, to recap – customizing Decisions, as well as activating Suggested Decisions created by SignifAI’s machine learning engine, will lead to faster and smarter detection. The enriched context of correlated Issues will lead to faster understanding, and reduced notification noise from these correlations will allow your team to focus, leading to faster resolution.