Latest posts by Guy Fighel
- Turning the page: news & reflections - February 6, 2019
- Decisions: a deeper look - July 20, 2018
- Introducing Decisions: smart correlation for busy ops teams - July 16, 2018
With the start of 2019 comes an exciting announcement: SignifAI is joining New Relic! Our teams are pumped to work together on our shared vision of bringing machine intelligence to DevOps and SRE teams. You can learn more about the acquisition, what’s coming next, and New Relic’s perspective in this blog post.
As we’ve been taking the first steps of this new journey and working with the New Relic team on the direction for our team and technology, I’ve been reflecting on how SignifAI has evolved throughout the past three years. Here are some thoughts about the past, the future, and what we’ve learned…
SignifAI – The Beginning
The idea for SignifAI originated from experiences my ops team at a prior company had as we scaled our infrastructure to keep up with a quickly growing product. We found ourselves searching for a tool that could cut through the noise of a complex monitoring stack and show us only the things that were most important, in a way that felt as natural and insightful as though we’d done all the manual investigation ourselves. At that time (almost 6 years ago), machine learning was a buzzword used only in academia, the concept of AIOps was nonexistent, and the industry’s shiny new slogans were BigData and Hadoop.
I remember spending days searching for solutions. Even though I had a fairly large budget, I couldn’t find any platform or tool that did what I had envisioned. I was frustrated. How could that be? I started deeper technical research into areas I believed could help solve our problem, and along the way did a massive amount of reading, research, and hands-on experimentation. I learned a lot about data pipelines and ingestion, real-time analysis, MapReduce jobs, distributed systems, clustering, and many different algorithms. I focused mainly on solutions from completely different domains, to learn from them and see whether they could apply to operations.
After tons of research and experimentation with real, high-volume production data, plenty of failure, and plenty of learning, we started to build the first version of SignifAI. In the beginning, it was a core expert system integrated into an automated pipeline, with some logic, some automation, and some minimal ML on events and time-series data. By early 2014, we had a core system (completely command-line based, with a bunch of config files) finding correlations in our monitoring data and showing us only the most important issues. After experiencing that technical success, I had an internal conviction: we had achieved something great here, and we could take the same ideas and create a SaaS-based platform to help other teams.
I wasn’t kidding myself: I knew there was still a lot to experiment with and tons of work to do to build a real product. But that spark of belief (the passion for solving a large and difficult problem for other teams in a relatively generic way, and the proof points we had achieved) was what motivated me to rebuild it as an independent product. SignifAI was born with a strong purpose: to empower other SRE and DevOps teams with powerful technology and an open-platform approach. We were (and still are) on a mission to change the way digital businesses analyze and understand the uptime, reliability, and availability of their technical production environments. Delivered as a SaaS-based machine intelligence platform and connected to your existing tools and workflows, SignifAI helps Site Reliability Engineers meet their day-to-day service level objectives.
As SignifAI grew, we developed features that shaped the core value of the platform: integrations for over 60 sensors and our innovative Active Collection approach (as opposed to a webhook-only, passive approach); the Control Center, a single pane of glass for monitoring across your full stack; Teams, which gave our enterprise users more fine-grained control; and Decisions, which let teams better understand and customize the logic that drives correlated Issues in the platform. We launched special offerings for teams using Prometheus and OpenShift. We released Chewie, a streamlined version of the platform that plugs directly into teams’ existing incident management services.
Through this journey, we learned more than we could have imagined about the quickly-changing world of Site Reliability Engineering and the elements of a successful product.
What We’ve Learned
In the past three years, we’ve talked to hundreds of DevOps and SRE teams about their experiences: the stress of being on-call, struggles to maintain an infrastructure stack that constantly becomes more complex, victories in proactive problem-solving, and hopes for the future of IT operations. There have been many lessons along the way. Here are three of them:
Attack shared frustration
Every team we worked with is different in the tools they use, their authority and responsibility structure, and the way they measure success, but they all share the same vision: a streamlined, highly automated system that minimizes stressful and costly issues and enables engineers to do proactive, sustainable, creative work.
The greatest challenge we had in developing the SignifAI platform was designing features that were useful and accessible to many kinds of teams, everything from traditional NOC/ops to Google-esque, highly automated SRE environments. Our best successes came from prioritizing solutions to problems that were shared by every team, regardless of size or level of sophistication: easy-to-integrate sensors that didn’t take hours to set up, simple links between Issues and communication/collaboration tools, a combined view for alerts from multiple tools, and so on. SignifAI’s most-loved features worked towards the shared SRE vision. That might seem obvious, but it was not. We had to make tons of tradeoffs in order to generalize as much as possible and arrive at a working solution that met the needs of the majority of our users and the industry. Remember, this is a pretty fragmented market with a variety of solutions: third-party, open-source, and custom-built internal tools, each with different flavors and technical requirements.
Prioritize accessibility and understanding
All of the sophisticated tech in the world isn’t useful to our users unless they can see, understand, and tailor the system’s logic to fit their own environment. We learned that it was imperative to make sure every decision SignifAI made, from correlating and categorizing important Issues to suppressing or delaying noisy ones, was easy for users to understand and give feedback on. When we developed the Decision engine, we kept this core idea in mind: machine learning is only as useful as it is accessible. We see other solutions making the mistake of treating the algorithms and models as smarter than the person on call. Pretty early on, we made it one of our core values to view our platform as an augmented team member and an extension of humans, not a replacement. We have also learned that the use case we are solving for simply cannot tolerate time-consuming model training, and that we could not expect first responders to spend an on-call shift on training and adjustments.
Don’t be afraid to adapt and change
Like any startup, SignifAI went through a series of changes as we worked to find a suite of features that served our customers’ needs. One of the major lessons we learned from working with customers is that for teams who already have established workflows around incident management, triaging, and on-call collaboration, introducing a new tool that sits “in the middle” of the stack can be incredibly difficult and requires a lot of trust and training. Our core platform, which uses sensors to connect to monitoring tools and pushes correlated Issues to a triaging tool, enabled us to gather the most (and most useful) data from customers’ systems without needing tons of configuration. However, the workflow change introduced by adding the platform didn’t work for every team. That’s why we introduced Chewie, a solution that plugs directly into teams’ existing incident management tools.
Adapting and introducing new ideas can feel terrifying, like giving up on a dream – but the most powerful lesson I’ve learned throughout SignifAI’s journey has been that adaptation is the greatest opportunity to create something better than you could have imagined. Instead of treating changes as “letting go,” treat them as chances for creative freedom and a more open mind. As long as you’re still working in service of your big-picture vision, changes are healthy and could represent a breakthrough for your product.
Joining the New Relic team is incredibly exciting. I can’t wait to see what comes from our shared ambition of creating a platform that increases automation, truly understands problems, infers reasoning, and suggests solutions using Applied Intelligence. There is a lot to write about the decision we made, the reasons we chose to join New Relic, and what has changed over time, but one thing is for sure: our belief and vision remain the same. When I picture the future of IT ops, I hope for more solutions like SignifAI and New Relic that help reduce the anxiety of being on-call and empower SREs to do their best work every day. We are committed to pushing ourselves and our shared product to get there as fast as possible.