Before jumping into this first post, I’d like to introduce myself: I am JP Marcos, and together with my co-founder, Guy Fighel, we are excited to introduce ‘Square Root’, a TechOps- and DevOps-focused blog associated with our company, SignifAI. We will share opinions, ideas, discussions, tips, and advice across these disciplines. The posts will mostly revolve around technical issues and technologies – which we expect to be the most exciting part of the blog – but we might also touch on trends, culture, management practices, and anything newsworthy related to technical operations.
The Contributors Behind Square Root
The Square Root blog will have a number of contributors, myself included. Our collective careers in ops involve working in companies such as Vonage, Cloudera, Cisco, Google, and others. We have amassed a lot of experience and have scaled large infrastructures. Square Root will probably borrow a lot from our personal experience. Before we start geeking out, I’ll fill you in on a personal story, specifically on what got me interested in the world of ops, DevOps, SRE, reliability engineering and production engineering.
My First Downtime
Unlike my fellow contributors, I came into technical operations late in my career. I have held multiple C-level positions and, for a while, had observed the role of technical operations in a company from a certain distance. Initially, I didn’t pay much attention to the Tech Ops team. I had my plate full and tended to hang out more with the developers focused on building the product. However, I realized pretty quickly that the entire business was dependent on the Tech Ops team and everything they manage.
In October 2010, I joined Tango, then a start-up of ten or so people. The early days were exciting; we grew quickly and were able to build a user base of millions in just a few weeks. But that massive growth spurt came with major periods of downtime. One day, or rather night, in November 2010, the system went down yet again, mid-afternoon. By 7pm, team members still had no idea what the cause was, so the extended ops team was gearing up for an all-nighter. The developers, however, were calling it a day and packing up. Their view was that it wasn’t their fault; it was an ops issue. They were, in fact, mildly annoyed that the system wasn’t fixed yet. It was late in the day, but I decided to stick around to see what would happen. I also kept notes, to make sure I could analyze the situation after the event, some of which I will share here. Reading my notes now, I realize that the things which struck me as abnormal the first time I dealt with an outage are things that, through repetition, came to seem normal (or that I stopped noticing).
Notes from the night of the outage
“Why do these guys seem almost happy to stick around the whole night fixing things?”
Hindsight answer: running operations is simply their job and responsibility, and an all-nighter is also a kind of badge of honor in the life of an ‘ops guy’. The view is that ops is supposed to have bad moments like that: “tech ops done the traditional way mostly sucks”.
“Why are there so many assumptions on the whiteboard for the possible source of the issue?”
Hindsight answer: too much information makes it hard to pinpoint something specifically relevant out of a sea of redundant data points from multiple sources.
“Why did these guys let the developers go home since most of their assumptions revolve around issues in the code?”
Hindsight answer: the classic turf war between devs and ops was already there. This is a huge topic that has started to change, and we’ll return to it when we write about DevOps.
“Why does it take so long to go through logs?”
Hindsight answer: they have no idea what they are looking for; they are hoping to stumble upon something.
“Massive New York pizzas late at night??”
Hindsight answer: the whole thing is a manually intensive process, and people need some sort of motivation (i.e. the pizza) to operate.
The outcome of the whole ordeal was, you guessed it, to reboot something as dawn was approaching. The team had long stopped being productive. The fix was temporary; the hope was that it would hold long enough until the ops team was back in action, at around 2-3 pm, and could work with the developers. I’m sure many of you reading this who work in any area of software delivery will have gone through something similar.
Business depends on Ops
There were several obvious consequences of the situation I described above, and similar scenarios repeat them:
Downtime is painful
As an exec, I may not have fixed databases or parts of the system directly, but I was drawn into that area because downtime hurts a business in a huge way. The pain points are felt across revenue, user growth, and usage and impact a business’ bottom line.
Buffers, buffers, everywhere
Also, if downtime happens too often, your VP of Ops will tend to add buffers everywhere, creating an excess of capacity which is not financially efficient.
If you leave me now
Then there is the HR cost. When things are managed as described above, for too long, you tend to churn a huge percentage of your ops folks. The result is you need to replace them with other professionals who are a lot more expensive because you are under time pressure to hire, and have no choice.
In addition, you suffer a major knowledge loss in the process. Unlike developers, whose code stays if they leave, ops folks tend to keep a ton of knowledge in their heads, which makes knowledge transfer tricky. You also end up with a pocket of ops people with a poor work life at your company who have a really easy way to voice their discontent (check out Glassdoor), making your recruiting effort even more difficult.
Things have changed dramatically for the better since that night. The company reached a four-nines (99.99% availability) SLA, and we’ll discuss how that happened in this blog.
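To put “four nines” in perspective, the downtime budget implied by an availability SLA is easy to work out. A minimal sketch (the function and variable names here are just illustrative):

```python
# Downtime budget implied by an availability SLA ("the nines").
# Four nines (99.99%) leaves roughly 52.6 minutes of downtime per year.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an SLA of `nines` nines."""
    availability = 1 - 10 ** -nines
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines: {downtime_budget_minutes(n):.1f} min/year")
```

Each extra nine shrinks the yearly downtime budget by a factor of ten, which is why the jump from frequent all-nighters to four nines is such a big deal operationally.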
But the questions from my first outage night highlight some of the broad themes we will cover in Square Root: the need for DevOps and the benefits of augmenting DevOps with machine intelligence; automation, monitoring, root cause analysis, preventive analysis, HA/SLA improvement, and incident and problem management; plus others such as capacity planning, public and private cloud computing, micro-services, continuous integration and delivery, operational excellence, and of course new technologies and vendors related to all of the above.
The industry, capabilities, and the working day (and night) of DevOps are evolving at a mind-blowing pace. This is thanks to the great work of companies and projects like AWS, Google Cloud, Azure, Grafana, InfluxDB, Prometheus, Elasticsearch, Graphite, Datadog, New Relic, and AppDynamics, and conferences like Monitorama and Velocity … and many more. We very much look forward to joining their ranks.
So let’s get started
In short, we are deeply interested in most things Tech Ops and DevOps. We are super excited to be in the middle of a massive transformation, driven by the imperative of velocity and scale. We are just as interested in the technology changes as the cultural transformations that are (or should be) taking place as well.
My full-time job is to build a company called SignifAI, which is currently in stealth mode. SignifAI will address some of the themes mentioned above. Along with my co-founder and CTO, Guy Fighel, I will selectively invite folks to join our beta program, together with prominent thought leaders already working with us. I’ll share additional details in a few months.
Thanks for reading, reacting, contributing and sharing.