By JP Emelie Marcos
When I attended SREcon earlier this year, I met the head of an IT team working at a major US airline. We started chatting, and when I asked what he was doing at the conference, he said he was just there to learn and observe, as his team was thinking about transitioning to SRE. As we talked further, I learned that this airline uses extremely old-fashioned operations methods, has a sizeable IT team devoting its time to responding to tickets, and runs its own datacenter with old hardware. I thought it was great that he was attending SREcon and exploring options to finally leave his legacy IT system behind, something that, in my opinion, every company should be considering by now.
Google introduced site reliability engineering (SRE) in 2003, and, since then, it’s evolved into an incredibly useful position that essentially blends software engineering with operations (similar to DevOps, but with a slightly different skillset). Transitioning to SRE from legacy system administration or IT processes is admittedly a shift in culture, but a really worthwhile one. SRE helps companies run their technical operations better by integrating skills across the organization and eliminating unnecessary back-and-forth. In SRE, software engineers handle interventions and also create functions to automate and scale processes, leading to less conflict and higher productivity and efficiency all around.
That discussion gave me the idea of writing a few words of advice. So if you’re thinking of transitioning your company to SRE, here are a few things to consider:
Learn how to develop
It sounds obvious, but learn at least one language. Python is a good choice; Go or Ruby would also work. The SRE philosophy is that toil (i.e., solving problems by hand) is bad, and everything that can be automated should be automated. Time spent on development means less time spent on toil. In reality, this can be a tiny bit shaky, since the world isn’t perfect: outages happen, customers get mad, data centers go down, and so on. For the most part, however, SREs can expect to be writing code around 50% of the time. The assumption here is that you start with a team that has more of a system administrator and IT background. The other way to go about it is to complement your team with members who have a software development background. Whichever way you choose, the eventual mix should be around 50% developers and 50% sysadmins.
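To make the idea of attacking toil concrete, here is a minimal Python sketch of one way to decide what to automate first: add up the time spent on each kind of manual task and rank the results. The function name rank_toil and the sample ticket data are made up for illustration; they don’t come from any particular ticketing system.

```python
from collections import defaultdict


def rank_toil(tickets):
    """Return (task, total_minutes) pairs, most time-consuming task first.

    `tickets` is an iterable of (task_name, minutes_spent) pairs.
    Ties are broken alphabetically so the ordering is stable.
    """
    totals = defaultdict(int)
    for task, minutes in tickets:
        totals[task] += minutes
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical week of manual work, for illustration only.
tickets = [
    ("restart web server", 15),
    ("rotate logs", 10),
    ("restart web server", 20),
    ("reset password", 5),
    ("rotate logs", 10),
]

ranking = rank_toil(tickets)
for task, minutes in ranking:
    print(f"{task}: {minutes} min")
```

Whatever lands at the top of the list is usually a reasonable first candidate for a script.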
Stop being only reactive
The traditional process of reacting to tickets is limiting your impact. Rather than devoting time to fixing problems manually and only when they arise, begin proactively understanding your business needs and aligning them with the goals of the rest of the company. Use your resources and talents to preempt problems from the ground up by monitoring, engineering, and automating; this way, less time has to be wasted on fixing issues one by one. An SRE should devote half of his or her time to traditional “ops” tasks, such as tickets and manual intervention, and the other half to engineering work that helps automate processes, scale systems, and develop new features. Most importantly, stop worrying that you will not be needed. When you transition, you will have removed yourself from the center of a lot of processes, so you won’t be the bottleneck anymore. You will be needed in ways that add more value to the business and are more fun for you as well: being proactive, creating new tech, and engaging more with the rest of the company.
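The half-and-half split described above is easy to track once hours are logged. The following is a small sketch, assuming you record ops hours and engineering hours per person or per team; the function names and the 0.5 default cap are illustrative choices, not a standard API.

```python
def ops_share(ops_hours, engineering_hours):
    """Return the fraction of logged time spent on ops work (0.0 to 1.0)."""
    total = ops_hours + engineering_hours
    if total == 0:
        raise ValueError("no hours logged")
    return ops_hours / total


def over_ops_budget(ops_hours, engineering_hours, cap=0.5):
    """Return True when ops work takes more than `cap` of the logged time."""
    return ops_share(ops_hours, engineering_hours) > cap


# Hypothetical week: 30 hours of tickets, 10 hours of engineering.
print(ops_share(30, 10))        # fraction of time spent on ops
print(over_ops_budget(30, 10))  # True means the team is drifting back to reactive mode
```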
Learn and understand the new paradigm of microservices
Microservice architecture is a new and popular way of building enterprise applications as a series of independent, interconnected, deployable modules that can each support a different platform, such as mobile, web, or wearables. Because SRE aims to maximize scalability, microservices are key: an individual component can scale easily and independently, without having to take all other parts of the program with it. Major companies like Amazon, Twitter, and Netflix all use microservices, and beyond them the approach is becoming the architectural norm for any digital company.
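As a rough illustration of what scaling independently means, the sketch below sizes each service on its own load instead of sizing the whole application at once. The service names, load figures, and per-replica capacities are invented for the example.

```python
import math


def replicas_needed(load_rps, capacity_rps_per_replica):
    """Return how many replicas one service needs for its own load (at least 1)."""
    return max(1, math.ceil(load_rps / capacity_rps_per_replica))


# Hypothetical services: (current requests per second, capacity of one replica).
services = {
    "checkout": (900, 200),
    "search": (2500, 500),
    "profile": (50, 300),
}

# Each service is sized separately; a spike in one does not force the others to grow.
plan = {
    name: replicas_needed(load, capacity)
    for name, (load, capacity) in services.items()
}
print(plan)
```

In a monolith, the busiest component would dictate the size of every deployment; here only the service under load gets more replicas.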
Keep track of any and all activities such as system health, behaviors, cost efficiencies, speed, and any other useful data. By automating the monitoring process, you’ll be notified more quickly when things fail and also be able to fix problems faster when they arise because all the relevant information is already on hand and searchable. I wrote more about what I call 360° monitoring here and about what a godsend it was for our company when we started using it.
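Here is a minimal sketch of what an automated check might look like, assuming metrics arrive as a simple name-to-value mapping. The metric names and thresholds are placeholders, and a real setup would pull values from your monitoring system and send alerts somewhere more useful than standard output.

```python
def check_metrics(metrics, thresholds):
    """Return a list of alert messages for missing or out-of-range metrics.

    `metrics` maps metric names to their latest values.
    `thresholds` maps metric names to the maximum acceptable value.
    """
    alerts = []
    for name, limit in sorted(thresholds.items()):
        value = metrics.get(name)
        if value is None:
            # A metric that stops reporting is itself worth an alert.
            alerts.append(f"{name}: no data")
        elif value > limit:
            alerts.append(f"{name}: {value} exceeds {limit}")
    return alerts


# Hypothetical thresholds covering health, speed, and cost.
thresholds = {
    "error_rate_pct": 1.0,
    "p95_latency_ms": 500,
    "monthly_cost_usd": 10000,
}

# Hypothetical latest sample; the cost metric is missing on purpose.
sample = {"error_rate_pct": 2.5, "p95_latency_ms": 320}

alerts = check_metrics(sample, thresholds)
for alert in alerts:
    print(alert)
```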
Don’t do it alone
Find another company that’s going through the same transition to be your ally. Agree to go through the process together, communicating with each other throughout. Run the goals and targets that you set for yourself past them first to make sure that you adopt an appropriate pace of transformation, neither overshooting nor going too slowly. This partner can also be a consultant or a company that has already completed the shift. By working through the process with someone else, you are more likely to power through the challenges, and you will learn from each other and be able to avoid pitfalls and mistakes.
Don’t stay tied down by past choices, but rather commit to transitioning to the current state of the art. When you read the section on microservices, I bet some of you thought, “Maybe not right now; that’s too much at this stage.” In reality, that’s where you’ll increase your chances of success. The amount of personal and company effort required to acquire the new skills is tremendous, so shoot for the transformational returns. Making this change is guaranteed to be a good thing in the long run.
In the end, SRE is guaranteed to make your company run more smoothly. By combining operations with software engineering, the two fields’ incentives can be more easily aligned. You’ll encounter fewer conflicts when applications don’t need to be sent back and forth between ops and engineers before a launch; with SRE, everything works in sync. Moreover, as users continue expanding the way they use the Internet, SRE remains flexible and malleable enough to suit a variety of needs, setting your company up for success.
Thanks for reading and follow me on Twitter @SignifAICEO