Welcome to another post in our ongoing series where we pose a dozen questions to an engineering, DevOps and/or site reliability team leader to get a better understanding of how DevOps is being implemented in the real world and the challenges that still remain.

About David Henke

David has over 35 years of experience in the software/Internet field as a senior manager, architect, and programmer. He managed both Engineering and Operations at LinkedIn, Yahoo! and AltaVista. He’s also the founder of two successful software companies that were both acquired.

How do you define DevOps?

“I prefer Site Reliability Engineering, because DevOps personnel are ALL engineers. At LinkedIn, our teams sat directly with the software engineers, product folks, and QA to build services. The SREs (site reliability engineers) built a sophisticated set of tools and processes to run LinkedIn with millions of metrics per second.”

DevOps has been on the scene for about a decade. What are some of the promises that are still not being fulfilled? Why?

“A poor signal-to-noise ratio is a common complaint about existing metrics and alerting systems. For example, if 4000 network alerts actually reduce to a single network failure, it would be great to alert and remediate on that one failure. This is one of the reasons I like tools like SignifAI to reduce alert noise.”
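
To make the 4000-alerts-to-one-failure idea concrete, here is a minimal Python sketch of alert correlation. It is an illustration only, not how SignifAI or any LinkedIn tool works: the `Alert` class, its `upstream` field, and the device name `core-switch-1` are all hypothetical, and it assumes each alert already records the upstream device its host depends on.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str      # device that raised the alert
    upstream: str  # hypothetical field: the switch or router the host depends on
    message: str


def correlate(alerts):
    """Group alerts by shared upstream device so each group becomes one incident."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.upstream].append(alert)
    return dict(groups)


# 4000 hosts behind the same (hypothetical) switch all report the same symptom.
alerts = [Alert(f"host-{i}", "core-switch-1", "unreachable") for i in range(4000)]
incidents = correlate(alerts)
print(len(alerts), "alerts ->", len(incidents), "incident(s)")
```

Real systems would infer the grouping key from topology, timing, or learned patterns rather than a ready-made field, but the goal is the same: page someone once per failure, not once per symptom.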

How big were the teams responsible for system availability?

“Well over 50.”

In your experience, was there a clear demarcation between those who developed software, tested software and those charged with running it in production? Was there any overlap in responsibilities?

“Once again, while there was a clear demarcation between software developers, product, QA, and SRE, the teams sit and operate as ONE team. We win together and on occasion lose together, but always as a team.”

What are some hallmarks of companies who are successfully adopting DevOps?

“‘What gets measured gets fixed!’ Companies that are constantly integrating tools and processes to honor this are doing DevOps well. Getting to the root cause and shortening the mean-time-to-detect, mean-time-to-debug, and mean-time-to-remediate is always top of mind. Automation is key.”
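
As a rough illustration of measuring those three intervals, the Python sketch below computes them from incident timestamps. The `Incident` fields and the sample data are hypothetical, and the interval boundaries are an assumption: detect runs from fault start to first alert, debug from alert to identified root cause, and remediate from root cause to restored service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when an alert fired
    diagnosed: datetime  # when the root cause was identified
    resolved: datetime   # when service was restored


def mean_minutes(deltas):
    """Average a series of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents):
    """Return mean time to detect, debug, and remediate, in minutes."""
    return {
        "mean_time_to_detect": mean_minutes(i.detected - i.started for i in incidents),
        "mean_time_to_debug": mean_minutes(i.diagnosed - i.detected for i in incidents),
        "mean_time_to_remediate": mean_minutes(i.resolved - i.diagnosed for i in incidents),
    }


# Made-up sample incidents, purely for illustration.
t0 = datetime(2024, 1, 1, 12, 0)
m = lambda n: timedelta(minutes=n)
history = [
    Incident(t0, t0 + m(5), t0 + m(20), t0 + m(30)),
    Incident(t0, t0 + m(15), t0 + m(40), t0 + m(60)),
]
report = summarize(history)
print(report)
```

Tracking the three numbers separately shows where the time actually goes, which is what tells a team whether to invest in better alerting, better diagnostics, or more automated remediation.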

To what degree are companies leveraging microservices and containers? What unique challenges do they pose?

“LinkedIn has decoupled most of its systems so that it can isolate problems and fail gracefully. The problem with most companies is that they do not architect in this fashion up front, and eventually have to pay off the technical debt.”
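
One common way to "fail gracefully" between decoupled services is to wrap a remote call with a cheap fallback, so a failing dependency degrades one feature instead of the whole page. The Python sketch below shows that general pattern under stated assumptions; it is not a description of LinkedIn's architecture, and every function name in it is hypothetical.

```python
def with_fallback(primary, fallback):
    """Return a callable that uses `fallback` whenever `primary` raises."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade instead of propagating the failure to the caller.
            return fallback(*args, **kwargs)
    return call


def fetch_recommendations(user_id):
    # Stand-in for a call to a separate service that is currently down.
    raise TimeoutError("recommendation service unavailable")


def default_recommendations(user_id):
    # Safe default: render the page without recommendations.
    return []


get_recommendations = with_fallback(fetch_recommendations, default_recommendations)
page = {"profile": "user-42", "recommendations": get_recommendations("user-42")}
print(page)
```

A production version would typically add timeouts, logging, and a circuit breaker so that a struggling service is not called repeatedly, but the isolation principle is the same.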

Assuming LinkedIn leverages a mix of homegrown, open source, on-premises and SaaS tools, what were your criteria for deciding whether to build or buy a tool?

“LinkedIn does use a mix of homegrown, open source, and 3rd party tools. We agreed to write our tools in Python, and this makes for a very robust development environment. Our SREs are engineers/programmers. The criterion for me is simple: if it is critical for us and no tool can be bought, we will build it ourselves. However, I do not believe most companies are in a position to do this.”

What are some of the tools your company employs when it comes to development, continuous integration, configuration, deployment, monitoring, incident management and collaboration?

  • Build Pipeline: IntelliJ, PyCharm, PCX, Gerrit/SVN, ORCA, LDS, UpDog and Artifactory
  • Deploy Pipeline: LID, SaltStack and LPS
  • Monitoring, Alerting and Remediation: inGraphs, AutoAlerts, IRIS and Nurse
  • Misc: CRT, MEGA and lid-client

If you want more details on the above, check out the “Everyday is Monday in Operations” series by David and Benjamin Purgason on the LinkedIn Engineering blog.

What are the more likely benefits you see in applying AI and machine learning to DevOps and site reliability?

“Again, reducing the complexity and time to debug and remediate is key. Collecting information from multiple sources and coming up with answers in a timely fashion will reduce outage times dramatically.”

Which scripting languages do you find to be the most important for a DevOps engineer to master and why?

“Python for programming (powerful, robust). UNIX/shell scripting is always required.”

What have you found to be the ideal personality traits and temperament for someone who has to be on-call, understanding that outages can often come with big consequences?

“Cool, calm, collected. Don’t assume anything. Work swiftly, but deliberately.”

What advice or lessons learned do you have for DevOps or site reliability teams that have to support a company that is growing very fast?

“Communication, communication, communication. Training, training, training. Daily stand-up. Culture of site-up.”

Bonus Question: What’s some advice you can give a DevOps engineer or SRE to prepare for interviews?

“Know your stuff. Understand that technical skill is necessary but not sufficient. You need to be part of the team. You need to be able to learn.”

What’s next?