Site Reliability Can No Longer Be An Afterthought | Robert Ross – FireHydrant

Guest: Robert Ross
Company: FireHydrant
Show: Let’s Talk

Swapnil Bhartiya chats with Robert Ross, CEO and Co-Founder of FireHydrant. 

FireHydrant is a purpose-built tool for reliability. With FireHydrant, businesses can better manage and learn from incidents, work with their legal and marketing teams, and all the way through customer service. This is a tool for the entire reliability lifecycle.

Reliability has become an important business metric because when there are problems, it’s more than just a few systems down within a company; it’s about the brand and image of the company.

According to Ross, reliability is not an engineering metric, but a business metric. He says, “The engineers may own the reliability and they might be responsible for the reliability, but really there are so many stakeholders that come into play when there is, in fact, an outage.” Among those stakeholders, Ross includes the marketing and legal teams, as well as public relations. He offers the example of Slack, which had to issue $8.2 million in refunds for breaching its SLAs. This reliability issue required the legal team to get involved, and then marketing had to step in to smooth things over.

One problem Ross points out is when a business insists a single team take ownership over a reliability issue. When that one team cannot be completely responsible for the reliability issue, it’s unfair to hold that team at fault. Ross states, “You can’t really have the marketing team restarting servers to bring the site back to life. That’s not realistic. It’s not a world that we can live in. So it does tend to lie on the engineers.”

Ross then brings chaos engineering into the reliability discussion when he says, “I think chaos is a great word and underused in our space.” He continues, “Another way to think of it is, and this is all in our minds right now, it’s almost like a vaccine, right? You’re injecting something into a system and you’re causing a reaction and you’re basically building up an immunity to things.” That immunity, according to Ross, is similar to building resilience into a system. He adds, “And that’s where the resilience engineering is part of that chaos; you’re really just trying to inject something in a controlled environment where you can maybe pull it back very quickly and see the effects of it.”

The future of reliability, according to Ross, is not to exist as an afterthought. Ross says, “Typically, in the past, in my experience, incidents are really the catalyst for change, maybe better monitoring, better alerting policies, better runbooks, whatever it may be. And now what we’re seeing in the market, in enterprises and companies of all sizes, is that they are beginning to think about reliability from the start.” Because businesses are moving into a more service ownership model, they’ll have to accept the “you build it, you own it” mantra. He also states that they’re seeing people adding SLOs and SLAs before anything is released. To that end, FireHydrant wants to help people be reliable from the start.

Summary for this interview/discussion was written by Jack Wallen


Here is the edited transcript of the interview.

Swapnil Bhartiya: This is your host Swapnil Bhartiya and welcome to another episode of Let’s Talk. Today my guest is Robert Ross, CEO and Co-Founder of FireHydrant. Robert, it’s great to have you on the show. Let’s start with some basics. What is FireHydrant all about?

Robert Ross: First of all, thanks so much for having me. So FireHydrant is a reliability tool. We’re building a tool that allows people to be more reliable across the entire organization. So managing incidents, learning from them, being able to work with the legal team, the marketing team, all the way through customer success. So we’re building the one tool for the entire reliability life cycle for every company of all sizes.

Swapnil Bhartiya: When you look at reliability, do you look at it as a business metric or as an engineering metric? And what is the reasoning behind preferring one over the other?

Robert Ross: When we have reliability problems, it’s much more from the angle of, the company is down. So that’s why we’re building the tools. And we like to say and almost preach that reliability is not an engineering metric. The engineers may own the reliability and they might be responsible for reliability, but really there are so many stakeholders that come into play when there is, in fact, an outage. For really bad outages, you may have the marketing team, the legal team, and the public relations team get involved, and it just is a big, big thing where everyone at the company actually becomes a stakeholder very quickly. In 2018, even Slack actually had to issue $8.2 million in refunds because of breaching their SLAs (Service-Level Agreements). And that was the legal department getting involved, and marketing smoothing over everything. And it was a whole thing that they had to undergo. And that’s really why we like to say reliability is not an engineering metric, it very much is a business metric.

Swapnil Bhartiya: With this whole cloud movement, we are trying to break silos and create new ones: DevOps, DevSecOps, SREs, and all those things. So does the responsibility for reliability remain with a small team or individual, or does it go across the organization?

Robert Ross: You have to have some level of ownership over reliability within one team. If you try to own reliability, especially when it’s completely out of control of certain parts of the organization, then that’s unfair to that organization. You can’t really have the marketing team restarting servers to bring the site back to life. That’s not realistic. It’s not a world that we can live in. So it does tend to lie on the engineers. They’re usually the ones that are going to be on call, right? And they’re going to be the ones that are woken up. And what I see in companies, though, as they become more complex and build more complex systems, is that we’re seeing much more of a service ownership model where it’s a “you build it, you own it” model.

And that means that the engineers that are writing the code are actually the ones being put on call. They’re the ones getting woken up in the middle of the night to respond to these incidents. And one thing that we also see is people starting to shift away from computer vitals being the things that wake people up. And what I mean by a computer vital is, the CPU is very high right now. Maybe the memory is being consumed at an irregular rate. Disk space, as the last one, is filling up pretty quickly. And the problem with measuring and waking up people on computer vitals is that it’s a very sure-fire way to burn people out because the website could very well still be operating.

The analogy I use is that if I go outside right now and I sprint down my block, my heart is going to start beating faster. That’s just what it does. It’s designed to do that. And the same thing happens with computers. Going back to another example, like Shopify, if they receive a ton of traffic, or if we receive a ton of traffic, our CPU is going to be spiked up. It’s going to be high and that’s not a bad thing. So what we’re pushing for in the space and trying to educate more people on is that you should be alerting on symptoms, not on vitals. So for example, if my heart rate is up, that’s fine, but if my heart rate is up and I pass out and I fall, that’s when I should probably wake somebody up. And that’s a pretty big difference that we see.
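As a rough illustration of the "symptoms, not vitals" idea Ross describes, here is a minimal Python sketch. It is not FireHydrant's implementation; the `Snapshot` fields, the thresholds, and the `should_page` function are all hypothetical names chosen for this example. The point it shows is that high CPU alone never pages anyone, while user-facing failure does.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One monitoring sample; every field name here is hypothetical."""
    cpu_percent: float      # a "vital"
    memory_percent: float   # a "vital"
    error_rate: float       # a "symptom": fraction of failed requests
    p95_latency_ms: float   # a "symptom": what users actually feel


# Illustrative thresholds only, not recommendations.
MAX_ERROR_RATE = 0.05
MAX_P95_LATENCY_MS = 2000.0


def should_page(s: Snapshot) -> bool:
    """Page a human only when users are affected, never on vitals alone."""
    return s.error_rate > MAX_ERROR_RATE or s.p95_latency_ms > MAX_P95_LATENCY_MS


# Same high CPU in both cases; only the second one shows a symptom.
busy_but_healthy = Snapshot(cpu_percent=97.0, memory_percent=80.0,
                            error_rate=0.001, p95_latency_ms=300.0)
busy_and_failing = Snapshot(cpu_percent=97.0, memory_percent=80.0,
                            error_rate=0.20, p95_latency_ms=300.0)

print(should_page(busy_but_healthy))  # False
print(should_page(busy_and_failing))  # True
```

The vitals are still recorded in the snapshot, so they remain available for dashboards and diagnosis; they just do not wake anyone up on their own.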

Swapnil Bhartiya: When we talk about reliability, one thing is, of course, that you take action or react when something happens, but you also try to prepare. That’s how you ensure reliability, right? And that’s where, in some cases, chaos engineering comes into play, though some companies don’t like the term chaos engineering, so they actually use the term reliability because the goal is the same, right? To test your systems. The testing is not totally chaotic, though; it’s very well planned. And that also brings a lot of people from within the company together. You cannot prepare for things like this pandemic, but there are a lot of things that you can prepare for. So can you talk about what role chaos engineering plays there, or how you build reliability within systems?

Robert Ross: Chaos engineering. I personally like the term, mostly because I think chaos is a great word and underused in our space, but I think another way to think of it is, and this is all in our minds right now, it’s almost like a vaccine, right? You’re injecting something into a system and you’re causing a reaction and you’re basically building up an immunity to things. So by injecting in something that’s going to force your CPU up high, artificially, you can actually build resilience against that. Now you can see, well, these systems begin to fail when our CPU is very, very high, and now you can start to build resilience, antibodies, like computer antibodies against that. And we can say, well, when this begins to happen, we actually can build in something that will maybe horizontally scale the system.

And that’s where resilience engineering is part of that chaos; you’re really just trying to inject something in a controlled environment where you can maybe pull it back very quickly and see the effects of it. And you don’t just have to do this for computer systems. And I think that’s one of the things, whenever you add the word engineering into it and infrastructure engineering, whatever it is, we think about computers. And chaos engineering, resilience engineering can be, actually, applied to much more than just injecting high CPU, network latency, whatever it may be. You can actually do it for processes too. We call them fire drills at FireHydrant, where we can just start an incident and see what happens. And we see, oh, well, people didn’t know how to update the status page or they didn’t have a login to update the status page.

That was one experiment that we ran at FireHydrant and very quickly revealed that, oh, nobody knows how to update the status page. Another one that we do in real life, all the time, is fire drills. We did them as kids. We do them maybe at our office buildings, where we artificially pretend that there is a fire to practice getting out of the building or going to a designated safe zone as fast as possible. That’s chaos engineering too. And we need to start thinking more about how we can add this to our process when we’re releasing new features, new anything, really. How can we exercise that functionality through some level of artificial planning? And it’s remarkable how much you can actually catch in that process.
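The controlled injection Ross describes, where a fault is introduced and can be pulled back quickly, might be sketched in Python roughly as follows. This is a simulation under stated assumptions: `FakeService` and `inject_latency` are hypothetical names, and latency is modeled as a number rather than real delay. A real experiment would target an actual dependency through purpose-built tooling.

```python
from contextlib import contextmanager


class FakeService:
    """Hypothetical stand-in for a real dependency; latency is simulated."""

    def __init__(self, base_latency_ms: int = 10) -> None:
        self.base_latency_ms = base_latency_ms
        self.injected_latency_ms = 0

    def handle(self) -> int:
        """Return the simulated latency of one request, in milliseconds."""
        return self.base_latency_ms + self.injected_latency_ms


@contextmanager
def inject_latency(service: FakeService, extra_ms: int):
    """Add artificial latency and always remove it on exit, even on error."""
    previous = service.injected_latency_ms
    service.injected_latency_ms = extra_ms
    try:
        yield service
    finally:
        service.injected_latency_ms = previous  # pull the fault back


svc = FakeService()
with inject_latency(svc, 250):
    during = svc.handle()  # observe the system while the fault is active
after = svc.handle()       # the fault is gone once the block exits

print(during, after)  # 260 10
```

Putting the rollback in a `finally` clause is the design choice that matters here: the fault is removed even if the observation code raises, which keeps the experiment controlled.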

Swapnil Bhartiya: Let’s talk about the future of reliability and what it means for FireHydrant and the enterprise. So, first of all, if I ask you, looking at how companies are moving towards digital transformation, how they are embracing cloud, what role do you think reliability as a process or practice is going to play in future? And what does it mean for your company? And what does it mean for enterprises in general?

Robert Ross: What we’re beginning to see is that reliability is no longer an afterthought. It’s becoming a very proactive thought. And chaos engineering, we were just talking about that, that’s a proactive thought about reliability. Typically, in my experience, incidents are really the catalyst to start change into maybe better monitoring, better alerting policies, better runbooks, whatever it may be. And now what we’re seeing in the market and in enterprises and, really, companies of all sizes is that they are beginning to think about reliability from the start. And not the start, like the first line of code, we’re talking all the way at the inception of a feature. And we’re thinking about, well, what are the SLOs for this feature? What is the threshold at which we are providing an acceptable level of functionality to our customers?
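To make the idea of an SLO threshold concrete, here is a minimal Python sketch of one common way to express it: the fraction of successful requests compared against a target. The function names and the 99.9% target are assumptions for illustration, not something stated in the interview.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; treat no traffic as fully available."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(good_requests: int, total_requests: int, target: float) -> bool:
    """True when measured availability is at or above the SLO target."""
    return availability(good_requests, total_requests) >= target


TARGET = 0.999  # hypothetical SLO: 99.9% of requests succeed

print(meets_slo(999_500, 1_000_000, TARGET))  # True
print(meets_slo(998_000, 1_000_000, TARGET))  # False
```

Deciding on `TARGET` at the inception of a feature, before any code ships, is the kind of up-front thinking Ross is describing.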

And what that means is, people are moving much more to the service ownership model, and that entails the reliability, you build it, you own it. And then we’re also starting to see people adding SLOs from the very beginning and SLAs from the very beginning, before anything is released. And another thing that we’re starting to see is gradual rollouts too, which is another tactic for reliability. So things around maybe, feature flags, release flags, whatever that might be. And what that means for us as a business is that we’re planning, we’re building our functionality to not only be the reactive portion of an incident management process. We’ve built that. We’re beginning to think before the incident. How can we build the tools that allow people to think about reliability before reliability is a problem, right?
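One common way to implement the gradual rollouts mentioned above is a percentage-based feature flag. The Python sketch below is an assumption-laden illustration, not a description of any particular product: `in_rollout`, the flag name, and the user IDs are hypothetical. It hashes the flag and user together so each user gets a stable answer, and raising the percentage only ever adds users.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent


users = [f"user-{i}" for i in range(1000)]

# Widen exposure step by step; the exact counts depend on the hash.
for percent in (0, 10, 50, 100):
    enabled = sum(in_rollout("new-checkout", u, percent) for u in users)
    print(f"{percent:>3}% target: {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the flag and the user, a user who sees the feature at 10% still sees it at 50%, and setting the percentage back to 0 turns the feature off for everyone.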

You want to be in shape. You don’t want to try to fix a problem. You want to just not have the problem. And so that’s one of the things that we’re starting to see and what we’re planning for in our product is let’s help people be reliable from the start. Our vision for our company is to envision a world where all software’s reliable. And that’s a lofty goal, right? We’re on a software podcast right now and I think everyone’s going, “Right, what does that mean?” But we really do envision a world where all software can be reliable, as reliable as the electricity and water that come from our faucets.

And enterprises are feeling the same way. They are beginning to think that reliability is something that builds customer trust and customer trust is lost in buckets and gained in drops. So reliability is one of the most important things that these enterprises are now beginning to focus on and at the same pace that we are. And I think that for the next number of years and maybe forever, is that reliability is not going to be an afterthought. It’s going to be one of the first thoughts when releasing a feature to all of our wonderful customers out there in the world.

Swapnil Bhartiya: Robert, thank you so much for sitting down today to talk about reliability and its different aspects, since it’s more than just technology, and also for sharing the story of FireHydrant and the kind of future we are moving into, where reliability is going to play a very critical role. You’re absolutely right that it’s no longer going to be an afterthought. Just the way we have started looking at security as part of the developer’s pipeline, the same thing is going to happen with reliability. So thanks for sharing those insights and I would love to have you back on the show. Thank you.

Robert Ross: Thanks for having me.