By JP Emelie Marcos
I really enjoyed SRECon17 in San Francisco on Monday and Tuesday this week. I think it’s becoming one of my favorite conferences. It’s packed with great content, attended by interesting professionals, and not so large that you feel you can’t make meaningful connections. Thanks to the USENIX team, and to all the volunteers, for putting the event together. For those who are unfamiliar with it, the conference is oriented toward professionals responsible for system availability and scalability, and for the deployment, development and maintenance of cloud infrastructure. I would guess that encompasses most of the job descriptions of the folks I met. Needless to say, the content is technical. The audience was a mix of hardened SRE veterans, looking to network and pick up a few new ideas and techniques, and newcomers to site reliability engineering, for instance, companies only now transitioning to the cloud who were looking for a lay of the land.
I want to share only a few remarks, but please remember there were over 40 technical sessions packed into two days, distributed along three parallel tracks. It was impossible to attend every session, and I won’t be able to do justice to all the awesome speakers who participated in the event within the following 1,000 words or so.
“Reliability When Everything Is a Platform: Why You Need to SRE Your Customers” by Dave Rensin, Google
Dave described the concept of the Customer Reliability Engineer (CRE). He explained that if you are a vendor, your SRE resources should work directly with those of the customer, and you should take the customer’s viewpoint. That means focusing not just on your own reliability metrics but on the customer’s as well. That way, when an issue arises, everyone shares the same data instantly. The CRE team complements the SRE team, helping proactively and fixing issues together with them as one team. It’s an awesome way to build trust. Obviously, it makes sense in the case of Google, which has spare resources available. I commented to Dave that not all companies have the ability to make that kind of investment, which he understood. Dave described Google’s CREs as “genetically modified SREs who are not just good at talking to machines, but who enjoy talking to people too”. Bonus points for the fact that the talk was both funny and informative.
“Principles of Chaos Engineering” by Casey Rosenthal, Netflix
Casey was not the only Netflix speaker who graced the event, and the theme was touched upon in several other sessions, for instance, “The Road to Chaos” by Nora Jones or “Breaking Things on Purpose” by Kolton Andrus. I am sure that Chaos Engineering is old news to lots of you, but I personally enjoyed furthering my understanding of it. For those who are not yet familiar with it, I recommend visiting the Netflix blog. At a high level, it means consciously introducing errors in production to ensure that a system is fault tolerant. The important point here is that they deploy Chaos Monkey to their production environment, not just in staging or QA, which is pretty daring. This is a very original way to test the limits of a system.
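To make the idea of deliberately introducing errors a bit more concrete, here is a minimal Python sketch. It is my own toy illustration, not Netflix’s tooling and not how Chaos Monkey is implemented. The names `with_chaos`, `fetch_recommendations`, `recommendations_with_fallback` and `run_experiment` are all hypothetical. The sketch wraps a dependency so that a fraction of calls fail on purpose, then checks that the caller still serves every request by falling back to a default.

```python
import random


class InjectedFault(Exception):
    """Raised on purpose, standing in for a real dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    # Hypothetical downstream dependency (assumption, for illustration only).
    return [f"title-{user_id}-{i}" for i in range(3)]


def recommendations_with_fallback(fetch, user_id):
    """Fault-tolerant caller: degrade to a static list when the fetch fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return ["popular-1", "popular-2", "popular-3"]


def run_experiment(calls=1000, failure_rate=0.2, seed=7):
    """Inject faults and count how many requests were served and degraded."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_chaos(fetch_recommendations, failure_rate, rng)
    served = 0
    degraded = 0
    for user_id in range(calls):
        result = recommendations_with_fallback(flaky_fetch, user_id)
        if result:
            served += 1
        if result[0].startswith("popular"):
            degraded += 1
    return served, degraded


if __name__ == "__main__":
    served, degraded = run_experiment()
    print(f"served={served} degraded={degraded}")
```

If the experiment reports that every request was served even though some were degraded, the fallback path works. The daring part described in the talk is running this kind of experiment against real production traffic rather than a simulation like this one.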
Sessions on Bootstrapping and Organizing your SRE team
“I’m an SRE Lead! Now What? How to Bootstrap and Organize Your SRE Team” by Ritchie Schacher and Rob Orr, IBM
“It’s the End of the World as We Know It (and I Feel Fine): Engineering for Crisis Response at Scale” by Matthew Simons, Workiva
“Changing Old Habits: Meetup’s Path to SRE” by Rich Hsieh, Meetup
What I liked about those sessions is that they catered to the large part of the audience that is interested in transitioning to Site Reliability Engineering (from a more traditional IT helpdesk structure, for instance, or from nothing at all). It’s refreshing to see practical advice and frameworks on where to start, or what to tackle in year one, as opposed to content aimed only at super advanced users. One ghastly tidbit mentioned by one of the speakers: “alert fatigue was the cause of death for one colleague”. He meant literal death, i.e. the person died. It was definitely the first time that I had heard something like that. The problem of alert fatigue is very real, and it needs to be fixed.
Sessions on Analytics
“Reducing MTTR and False Escalations: Event Correlation at LinkedIn” by Michael Kehoe, LinkedIn
“Anomaly Detection in Infrequently Occurred Patterns” by Dong Wang, Baidu
“A Practical Guide to Monitoring and Alerting with Time Series at Scale” by Jamie Wilkinson, Google
I would just point out that all three of the above-mentioned sessions focused on a specific and consistent partial data set. A narrow set of computational and analytical techniques was applied to that data set to uncover very defined types of results (or answers). Most companies that apply data science to DevOps are at similar stages: the analysis performed is relatively specialized, or narrow, and the results are confined to identifying only a small subset of potential issues. The next step beyond that is to replicate the broad level of analysis that a human expert is able to perform. This is where machine intelligence comes in.
“SRE Isn’t for Everyone, but It Could Be” – A special session on diversity and inclusion by Ashe Dryden, Diversity Advocate and Consultant
Here I have to compliment the organizers for adding a pretty unusual session to the program, especially at a tech event. The session was packed, and it generated the most questions and comments from the audience. It touched upon instilling and supporting a culture of diversity and inclusion. A lot of people openly say they are all for it, but hearing Ashe describe the subtle ways that one can make someone else feel marginalized makes you realize that it’s very easy to make mistakes unwittingly. It pays to educate oneself on the topic. As an executive, I found the session very useful and I look forward to reading more from Ashe on her website.
Feel free to use the comments below to point out the sessions you liked, including those that I have not mentioned.