Last week, along with about 600 other SREs, SaaS, big data, and platform engineers, the SignifAI team converged on Portland for the annual Monitorama 2017 conference. Aside from an unplanned change of venue on Day 2 due to a power outage, the conference went off without a hitch, with plenty of interesting talks for novice and expert SREs alike.
This year’s Monitorama was especially exciting for us because we used it as an opportunity to announce the general availability of our machine intelligence platform for DevOps. You can learn more about how machine intelligence can help you perform root cause analysis, predict downtime, and focus only on critical issues in our announcement blog post, which contains all the details.
Aside from our announcement, it was great to connect with attendees from a variety of companies, all working on solving some very interesting infrastructure and application problems.
You can watch the archived streams from the conference here:
“The Vasa Redux” by Pete Cheslock
Closing the first day, we were treated to an awesome talk about failure. To illustrate a number of concepts around common sources of failure in modern software engineering projects, Pete chose a real-life historical example from a completely different field: the construction and destruction of a Swedish warship, the Vasa, built between 1626 and 1628. The ship sank on its maiden voyage, about 1,300 meters from the harbor, after encountering winds only slightly stronger than a breeze. The facts are fascinating, and Pete goes into great detail, seizing every opportunity for humor while drawing parallels with problems in today’s production systems: bad project blueprints, poor documentation, poor project management, feature creep, lack of measurement and communication, and the list goes on. Both the quality of the material and the delivery made for a very enjoyable talk, which we recommend you check out when you get the chance, either online or at a future DevOps event.
“Anomalies Do Not Equal Alerts” by Dr. Betsy Nichols
Dr. Nichols gave a very solid talk on alert false positives and how enriching data with context extracts more meaning from it. Context is additional information that makes the original data set easier to understand and leverage, such as the timing of code deployments, migrations, tags, and associations of values with metrics or servers. Simple message, well explained, and a good reminder overall.
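One way to picture the kind of context enrichment Dr. Nichols described is to attach recent deployment events to an anomaly before deciding whether to page anyone. The sketch below is ours, not from the talk, and every name, threshold, and data shape in it is a hypothetical illustration:

```python
from datetime import datetime, timedelta

# Hypothetical context store: recent deployment timestamps per service.
DEPLOYS = {
    "checkout": [datetime(2017, 5, 22, 14, 0)],
    "search": [],
}

def enrich_anomaly(service, detected_at, window_minutes=30):
    """Attach recent-deploy context to an anomaly. An anomaly that
    coincides with a deploy becomes a 'review' item instead of a page."""
    window = timedelta(minutes=window_minutes)
    recent = [d for d in DEPLOYS.get(service, [])
              if abs(detected_at - d) <= window]
    return {
        "service": service,
        "detected_at": detected_at,
        "recent_deploys": recent,
        # With context, an anomaly right after a deploy is not an
        # automatic page -- it is routed for human review instead.
        "action": "review" if recent else "page",
    }

a = enrich_anomaly("checkout", datetime(2017, 5, 22, 14, 10))
b = enrich_anomaly("search", datetime(2017, 5, 22, 14, 10))
```

The point is not the toy logic but the shape of the record: the anomaly carries its context with it, so whoever (or whatever) triages it does not have to rediscover that a deploy just happened.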
“Tracing Production Services at Stripe” by Aditya Mukerjee
Aditya gave an energetic talk on tracing, which he defines as “full cross-sectional visibility into all aspects of the production system, complete with context and state.” He went on to describe the metrics pipeline he implemented at Stripe and open sourced as Veneur. It was a classic tip-sharing and best-practices technical session, together with useful open source code, along the same lines as many other talks at the event.
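Veneur accepts statsd-style metric datagrams, so any client that can build that wire format can feed it. A minimal sketch of constructing a tagged counter payload (the metric name, tags, and port here are illustrative assumptions, not from the talk or the Veneur docs):

```python
import socket

def statsd_counter(name, value=1, tags=None):
    """Build a statsd-style counter datagram with DogStatsD-format tags,
    e.g. b'api.requests:1|c|#env:prod,service:billing'."""
    payload = f"{name}:{value}|c"
    if tags:
        payload += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return payload.encode("utf-8")

def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local statsd-compatible listener
    (host and port are assumptions -- check your Veneur config)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (host, port))

msg = statsd_counter("api.requests", tags={"service": "billing", "env": "prod"})
```

The tags are what tie this back to the talk’s theme: each datagram carries its own context and state, rather than leaving correlation for later.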
A passionate talk about being on-call by Alice Goldfuss
Alice Goldfuss’ talk about her love-hate relationship with being on call included some funny #oncallselfies that SREs took at the precise moment they got paged, proving that SREs can continue to count on being paged at the most inopportune times. Alice elevated her talk by addressing broader issues such as the human cost of the on-call process and the culture gap between Ops/SREs and Devs. An example of her provocative ideas: “put the developers on call more often,” so they understand what it’s like. A relevant and enjoyable talk.
Companies are starting to employ artificial intelligence and machine learning
Interestingly, the majority of the conversations about the impact of machine learning and artificial intelligence on DevOps happened during the breaks and at the after-hours events. I anticipate that this will change relatively quickly, if the content of other DevOps conferences is any indication. Here’s what we learned:
- Existing solutions that claim to use machine learning actually rely on basic algorithms in a “one size fits all” approach
- Insights that aren’t informed by the expertise of the operations teams fail to reveal anything they didn’t already know
- Having lots of monitoring tools generates a lot of data that is very challenging to correlate, especially across logs, events, and metrics; doing these correlations in real time is harder still
- There is a lot of interest in addressing issues before they hit the production environment. People are aware that consistent, accurate prediction is difficult, but there remains a lot of pent-up demand for a solution that can reliably prevent production issues
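The correlation challenge in the list above can be made concrete: even the naive approach of grouping alerts from different tools into time windows gets awkward once formats, sources, and clock skew enter the picture. A toy sketch of time-window grouping (the alert shapes and field names are invented for illustration):

```python
from datetime import datetime, timedelta

# Alerts from different monitoring tools, already normalized to one shape --
# in practice that normalization is itself a large part of the problem.
alerts = [
    {"source": "logs",    "ts": datetime(2017, 5, 22, 3, 0, 5),  "msg": "500s spiking"},
    {"source": "metrics", "ts": datetime(2017, 5, 22, 3, 0, 40), "msg": "p99 latency up"},
    {"source": "events",  "ts": datetime(2017, 5, 22, 9, 15, 0), "msg": "node rebooted"},
]

def correlate(alerts, window=timedelta(minutes=2)):
    """Group alerts whose timestamps fall within `window` of the previous
    alert -- a crude stand-in for real cross-source correlation."""
    groups, current = [], []
    for a in sorted(alerts, key=lambda a: a["ts"]):
        if current and a["ts"] - current[-1]["ts"] > window:
            groups.append(current)
            current = []
        current.append(a)
    if current:
        groups.append(current)
    return groups

groups = correlate(alerts)
```

A batch pass like this is already fragile; doing the same thing continuously, in real time, across live streams of logs, events, and metrics is the hard version of the problem the attendees were describing.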
Overall, we very much enjoyed the event and made a number of great connections. We look forward to being back next year.