SLAs And Four Nines Are Not Enough For High Availability In The Cloud

Guest: Dave Bermingham (LinkedIn, Twitter)
Company: SIOS Technology (Twitter)
Show: Let’s Talk

When most people think of high availability, they set four nines, or less than five minutes of downtime per month, as the baseline. But according to Dave Bermingham, Senior Technical Evangelist at SIOS Technology, high availability is more than that.
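The arithmetic behind that baseline is easy to check. Below is a minimal Python sketch, assuming a 30-day month; the function name `downtime_minutes` is purely illustrative and does not come from the show.

```python
def downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, for a given availability over a period."""
    total_minutes = period_days * 24 * 60  # 43,200 minutes in an assumed 30-day month
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    levels = [
        ("two nines", 99.0),
        ("three nines", 99.9),
        ("four nines", 99.99),
        ("five nines", 99.999),
    ]
    for label, pct in levels:
        print(f"{label} ({pct}%): {downtime_minutes(pct):.2f} minutes per 30-day month")
```

Under that assumption, four nines works out to roughly 4.3 minutes a month, which is where the "less than five minutes" figure comes from.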

“When we look at the big picture of any service that needs to be available, there are many links in that chain. We call it the chain of high availability,” says Bermingham. “So you have to look at the big picture.”

He argues that a count of nines is a measurement you might be judged against, but guaranteeing a given level of nines is almost impossible, because so many points in the availability chain can be a single point of failure. Four nines is certainly a great number to be judged against and to strive for, but four nines for a database server alone doesn’t mean much.

Even with cloud SLAs (service level agreements), one can’t rest assured: most cloud providers offer four nines on compute, which is only one part of the availability chain (along with network, storage, and the hops between). Bermingham warns, “There’s a million points of failure. So, trying to think that my cloud provider offers four nines so I’m covered, you’re kind of fooling yourself there. You have to look at the big picture and do what you can to identify those points of failure, to minimize the potential points of failure and to have a recovery plan, should something happen.”

When considering high availability and disaster recovery (HA/DR), Bermingham believes the thing that causes the most visible downtime is human error. He suggests restricting authorization and access to systems to reduce the potential points of failure. “You should only give access to those who absolutely need access to it and you should also ensure that they are highly trained and that you have all the things in place to help minimize potential oops.”

Another important tip Bermingham offers is to make sure your storage is highly available. On that point, he says, “You’re never going to have more availability than the weakest link in that chain.” Other tips include having the ability to rapidly recover from corruption events and making sure you don’t have nefarious people breaking into your network.

The summary of the show was written by Jack Wallen.

Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya and welcome to TFiR Let’s Talk. And today we have with us, once again, Dave Bermingham, Senior Technical Evangelist at SIOS Technology. Dave, it’s great to have you back on the show.

Dave Bermingham: I’m glad to be here. Thanks for having me, once again.

Swapnil Bhartiya: Today’s topic is something which is close to my heart as well, which is eliminating weak links in the chain of high availability. What I’ve seen is that whenever we talk about application high availability, most of the time, the focus shifts toward those nines: two nines, three nines, four nines, five nines. But if I ask you, is this really the factor that we should be looking at?

Dave Bermingham: When you’re talking about high availability, people think about four nines being the baseline for high availability, about less than five minutes of downtime per month. And you could be trapped in that, if you think, my SQL Server is highly available, four nines of availability, that’s great. But when you look at the big picture of any service that needs to be available, there are many, many, many links in that chain; we call it the chain of high availability. So you have to look at the big picture. So counting on nines is really a measurement that you might be judged against, but really trying to guarantee a level of nines is almost impossible, because there are so many points in that availability chain that can be a single point of failure. So it’s a great number to be judged against and to strive for, but overall it doesn’t mean a lot to have just four nines for my database server. There are so many other things involved in ensuring the application is highly available.

Swapnil Bhartiya: Right. So if I have a cloud SLA, which is often four nines, that doesn’t mean I should just rest assured that, hey, I have four nines, I don’t have to worry about anything.

Dave Bermingham: Yeah. I mean, all those SLAs, you’ve got to read the fine print. Most of the cloud providers will give you four nines on compute. Compute is just one part of the availability chain. So you have compute, you have network, you have storage, and all the hops in between. And what are my applications? What are the endpoints of my applications? There’s a million points of failure. So, trying to think that my cloud provider offers four nines so I’m covered, you’re really kind of fooling yourself there. You have to look at the big picture and do what you can to identify those points of failure and to minimize the potential points of failure and to have a recovery plan, should something happen.

Swapnil Bhartiya: As you mentioned, it could be storage, it could be network. Let’s just go a bit deeper into the weeds and look at the whole infrastructure, which starts with servers. As much as we’d like to talk about function as a service and everything else, the fact is there are servers running somewhere for everything. As you explained, there are other things too. What are the things that are important? What are the things people should consider?

Dave Bermingham: Well, beyond servers and the storage and the networking and all the components you put together, especially when we’re talking about cloud, you put all these components together. One of the things that really, in my opinion, has caused the most downtime or most visible downtime that we’ve seen recently has been human error, right? The big Facebook outage, I think everyone is familiar with that. If you read the postmortem of that outage, someone uploaded something to their routers that basically took them offline and actually brought their internal DNS servers offline. So they couldn’t even easily fix the problem because they were locked out of their own systems. They couldn’t reach their own remote data centers. The security system physically locked them out of the room that they needed to get into to fix the problem.

That’s a big part of it, is making sure that you have limited access, only the people that absolutely need access to these systems have the access, make sure they’re highly trained, highly skilled, and have things in place to help minimize potential oops. Things to make sure that you don’t have those accidents occurring. AWS just had this big outage. We don’t have the postmortem yet on what happened there. But if I were a betting man, I would say someone goofed somewhere along the way, because if Amazon and Facebook who have all the resources in the world to make sure they have highly redundant systems, if they can still have outages, I’m guessing it’s most likely some kind of user error.

Swapnil Bhartiya: One point I do want to touch on is storage, since you also mentioned it earlier. How much importance should be given to storage in a high availability conversation? Because the fact is applications do go down and come back. It’s the data that can make or break things. Talk about the importance of storage.

Dave Bermingham: I mean, storage needs to be highly available, just like your compute. If your compute is 99.99, but your storage is only 99.9, well guess what? Your application is only 99.9. You’re never going to have more availability than the weakest link in that chain. And the more weak links you have the lower that number goes. So making sure your storage is highly available is just as important as making sure your compute is highly available. But the other thing is making sure that your storage is recoverable. If you have data corruption, but the storage is highly available, you still don’t have availability. So you need to have backup plans in place, data protection plans in place, so that you have the ability to rapidly recover from corruption events. Or, the other side of the coin, besides human error, you’re talking about security. Making sure that you don’t have nefarious people breaking into your network and doing the crypto lock and asking for tons of money to unlock your data. You have to consider all of that. That’s all part of the availability plan.
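The weakest-link point can be made concrete with a little arithmetic. Here is a minimal Python sketch, assuming the components sit in series (all must be up) and fail independently, which is a simplifying assumption not stated in the conversation. Under that model the combined availability is the product of the individual figures, so it lands slightly below the weakest link and drops further with each link added. The helper name `serial_availability` and the extra network figure are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Combined availability of components that must all be up.

    Assumes independent failures, so the individual probabilities multiply.
    """
    return prod(availabilities)


# Figures from the conversation: 99.99% compute, 99.9% storage.
chain = {"compute": 0.9999, "storage": 0.999}
combined = serial_availability(chain.values())
print(f"weakest link: {min(chain.values()):.4%}")
print(f"combined:     {combined:.4%}")

# Adding another link (a hypothetical 99.99% network path) lowers the total again.
chain["network"] = 0.9999
print(f"with network: {serial_availability(chain.values()):.4%}")
```

With these inputs the combined figure comes out at roughly 99.89%, just under the 99.9% storage number, which fits the remark that more weak links push the number lower.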

Swapnil Bhartiya: Looking at all these weak points, we cannot even cover all of them in this one discussion. But let’s look at the solution part of it. Do you have any quick playbook that you can share on where people should start, so that even if they cannot eliminate all those weak points, at least they know about them and can do whatever they can to address them?

Dave Bermingham: That all starts with your business continuity plan. You have to identify where are potential problems that could disrupt your business, whether it’s things we’ve talked about from infrastructure availability, to natural disasters, to anything. You have to use your imagination. What could possibly happen? You start with your business continuity plan, that drives your disaster recovery plan, which is more of the playbook for the IT infrastructure people. We know these systems need to be highly available. What are we going to do to make them highly available? That kind of drives everything you need to do. And like I said earlier, then you look at not only the servers and the storage. I bought redundant everything, we’re good, but what about the application? What are you doing to make sure that that’s highly available? Are you doing clustering or what? What’s going on there?

And then all the components from the beginning to the end, any failure along the way, identifying them, doing what you can. And then again, training your employees, making sure you have access control so that not everyone can do anything they want. And then one of the most important things is the communication plan. So if something goes wrong, how am I communicating with my employees? You can’t assume email is available. Can’t assume even if you use some software as a service, instant messenger type thing, that might not be available. So the employees need to know, hey, if X, Y, and Z are down, what’s my fourth layer of communication? That all needs to be written in your disaster recovery plan.

Swapnil Bhartiya: Dave, thank you so much for not only sharing what the potential weak points are, but also sharing some tips and suggestions on things that people can do. Thank you for sharing your insights. And I look forward to talking to you again soon. Thank you.

Dave Bermingham: Thank you.
