We write a great deal about monitoring, so we are often asked how we monitor our own AI and machine learning platform. At SignifAI, we adopted a microservices architecture because it provides a number of benefits, including speed, operational flexibility, and the ability to do continuous delivery. Our infrastructure is resilient and scalable, which ultimately enables us to deliver a better end-user experience. That is the view from a development and deployment standpoint. But while microservices help our development effort run better, we found that monitoring them differs from monitoring monolithic systems in a few ways:
- Complexity: Splitting an application’s functionality into many separate components makes the program more complex on the back end. Each piece runs separately and is controlled separately, so it must be monitored separately as well as in combination with the rest. It is also complex to monitor services across different regions and availability zones, and across multiple versions running in parallel.
- Understanding data flows: Visualizing data and its relationships across a variety of microservices is far more challenging than doing so in a single service. The data is much more disaggregated, so it is harder to create the kind of high-level visual summary that is possible in the monolithic case.
- Scale: Microservices tend to generate a higher volume of service alerts from a larger number of origin points. They also require more containers and more machines to run a single application. In general, there is a lot more data to understand and respond to.
We still believe that the advantages of microservices far outweigh the challenges that come along with monitoring them. Here are a few of our own internal guidelines for monitoring our microservices environment:
One of the typical benefits of microservices is being able to focus on the application layer, which makes finding and fixing problems easier. In theory, the infrastructure is considered less critical because it should be able to heal itself. In reality, a team is unlikely to avoid some concentration in the way the application layer components are hosted, which leaves a certain level of exposure to infrastructure failures. If physical servers go down, everything they host goes down with them and, depending on what they host, that failure may have a visible impact. Therefore, we like to keep an eye on both the infrastructure and application layers as a precautionary measure.
Standardize processes and nomenclature
It sounds obvious, yet we learned it through the mistakes we made when we started. Things like timestamps, styles, languages, and definitions must be consistent across projects. Otherwise, you will experience a variety of failures that affect the mesh of interdependencies you are creating across microservices.
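To make the point concrete, here is a minimal sketch of what normalizing timestamps and field names at the edge of each service could look like. It is an illustration, not our production code: the alias table, the function names, and the choice of UTC ISO 8601 as the canonical format are all assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical alias table: maps field names used by individual services
# to one canonical name. The entries are illustrative only.
FIELD_ALIASES = {
    "ts": "timestamp",
    "time": "timestamp",
    "svc": "service",
    "service_name": "service",
}


def normalize_timestamp(value):
    """Return an ISO 8601 string in UTC for an epoch number or ISO 8601 string."""
    if isinstance(value, (int, float)):
        # Assumption: numeric values are Unix epoch seconds.
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:
            # Assumption: timestamps without an offset are already in UTC.
            dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()


def normalize_event(event):
    """Rename known aliases to canonical field names and normalize the timestamp."""
    normalized = {FIELD_ALIASES.get(key, key): value for key, value in event.items()}
    if "timestamp" in normalized:
        normalized["timestamp"] = normalize_timestamp(normalized["timestamp"])
    return normalized
```

With something like this in place, an event emitted as `{"ts": 0, "svc": "billing"}` and one emitted as `{"time": "1970-01-01T00:00:00", "service_name": "billing"}` end up in the same shape, so they can be compared and correlated downstream.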
Focus on the metrics that matter
We focus on monitoring the following metrics:
- Every single API
- Instance status
These should be self-explanatory. For instance, measuring latency, or how fast a specific microservice is operating, is important for anticipating whether a dependent service will slow down as a result, which could in turn trigger a further reaction, and so on. Measuring saturation indicates whether a specific component of the system will be lost at some point, reducing the overall available resources.
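As a rough illustration of the two measurements mentioned above, the sketch below computes a latency percentile and a saturation ratio from raw numbers. The function names and the nearest-rank percentile method are choices made for this example, not a description of how SignifAI computes them.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def saturation(used, capacity):
    """Fraction of a resource currently in use (1.0 means fully saturated)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


# Example: request latencies in milliseconds for one service.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
p95_ms = percentile(latencies_ms, 95)
connection_pool_saturation = saturation(used=75, capacity=100)
```

Tracking a high percentile rather than the average makes the slow tail visible, and the slow tail is usually what a dependent service feels first.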
We centralize everything on SignifAI. That is probably the most important step in our case. In general, you need to find a way to centralize the data across microservices and make sense of it. We prefer to get the analysis done at scale, fast, and without logic constraints, so naturally we dogfood our own product. Here are some of the benefits we achieve:
- Tailored correlations: Everyone on our team can set up correlations specific to their particular focus and sensitivity, with complete flexibility in how a correlation is expressed, the scope of data it relies on (which part of the system it pulls from, and also the type of data: time series data, logs, and events), and the number of variables and conditions it requires.
- High signal-to-noise ratio: Thanks to the correlations above, along with a range of specialized algorithms that eliminate redundant or completely irrelevant alerts, our team sees mostly the things that matter to us, which amount to a very low volume of information. That way we are a lot more focused and have more time left to work on development.
- Active inspection: There are so many metrics and events that it is impossible to track all of them manually. With our service, we get alerted, often ahead of time, to issues in things we are not actively tracking, whether because we do not track the metric or because we did not set up the right event. In other words, we are able to find behaviors with no previous knowledge of them.
- Incident management and collaboration: SignifAI allows people on our team to keep collaborating across a variety of other apps in a more efficient way, because we only send issues to notification systems based on priorities and after applying specific logic to different types of data sets.
- Leverage from tribal and documented knowledge: We don’t have to write long postmortems or runbooks. We capture the essence of an issue in just a few clicks and comments on the spot, and that knowledge is ready to be correlated with future issues so it can be surfaced in front of us in the right context. We do not need to spend time manually searching past incident documentation, and there is no risk that we forget or fail to find the relevant information.
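For readers who want a feel for the noise-reduction and priority-based routing ideas in the list above, here is a deliberately simple stand-in. It is not SignifAI's algorithm: the alert fields (`service`, `check`, `priority`), the function names, and the convention that a lower priority number is more urgent are all assumptions made for this sketch.

```python
def deduplicate(alerts):
    """Keep the first alert for each (service, check) pair, preserving order."""
    seen = set()
    unique = []
    for alert in alerts:
        key = (alert["service"], alert["check"])
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def select_for_notification(alerts, max_priority):
    """Return alerts urgent enough to notify on.

    Assumption: priority 1 is the most urgent, so an alert qualifies
    when its priority number is less than or equal to max_priority.
    """
    return [alert for alert in alerts if alert["priority"] <= max_priority]


# Example input: three raw alerts, two of which describe the same problem.
raw_alerts = [
    {"service": "billing", "check": "latency", "priority": 1},
    {"service": "billing", "check": "latency", "priority": 1},
    {"service": "search", "check": "disk", "priority": 3},
]
to_notify = select_for_notification(deduplicate(raw_alerts), max_priority=2)
```

In this example only the single billing latency alert reaches the notification system: the duplicate is dropped and the lower-priority disk alert is held back.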
It’s true that microservices add a layer of complexity to the monitoring process, but the tradeoff is worth it. Choosing this type of architecture helped us design SignifAI in a way that would work for us and, by extension, for anyone using microservices. Through process discipline, a focus on the metrics that matter, and centralization of the data on our own service, we have a complete understanding of how things run, what is causing incidents, and how to fix them.
Monitoring microservices takes some getting used to, but we’ve been able to make it work here and so can you!