Can You Really Tame Cloud Complexity? Yes, Says Rob Hirschfeld

Guest: Rob Hirschfeld
Company: RackN
Show: Let’s Talk

Can complexity, within the cloud-native and container spaces, be tamed? Rob Hirschfeld, co-founder and CEO of RackN, makes it very clear it has to be. On this subject, he starts by saying, “We keep adding more and more things. And when we do that, we’ve added to the complexity budget of our overall system.” Hirschfeld continues, “So when we think about how we build things in cloud-native spaces, we have to look at ways to manage the complexity in a comprehensive way.”

RackN takes the science of managing and taming complexity and builds it into its products and into infrastructure pipelines designed to tame complexity. For Hirschfeld, the key is “managing through that complexity in a repeatable way.”

RackN even has a list of items that serve as a way to tame the complexity of infrastructure. Hirschfeld offers up that list with, “It’s focusing on intents, and using that as your abstraction boundary. It’s keeping tools inside their lane and using things where they excel, and then, finally, avoiding a single source of truth and the challenges that come from having multiple sources of truth.”

Hirschfeld also brings up the concept of coupling. “We always joke about DNS being the root cause of every problem. And that’s because every system is coupled to DNS in some way because it needs to look up names and find other machines.” Because of this, Hirschfeld believes we need to identify places where we have introduced coupling between systems. To that, Hirschfeld adds, “It doesn’t mean that we should eliminate coupling, but it means that we need to identify what’s coupled.” Hirschfeld clarifies by saying, “It means we have to do it very deliberately and be able to track how those systems are interconnected together. So by being deliberate in how we’ve coupled systems together means that we can manage the complexity.”

Another item on the list is intent, about which Hirschfeld says, “What we find is that intent is actually a really good abstraction boundary. You build an API that abstracts the actual thing you’re building. But that ends up being incredibly fragile, and it exposes a lot of complexity in how your systems get built. And what we found is that if you can express the goal of what you’re trying to build, then when you build the abstractions, you can actually abstract away those different approaches.”

The summary of the show is written by Jack Wallen.

[expander_maker]

Swapnil Bhartiya: Hi. This is your host, Swapnil Bhartiya, and welcome to Let’s Talk About Infrastructure as Code, our new show. My special guest for this show is, once again, Rob Hirschfeld, co-founder and CEO of RackN. Rob, it’s, once again, great to have you on the show.

Rob Hirschfeld: It’s a pleasure to be here and talk about something so critical to helping people build infrastructure, and maintain it.

Swapnil Bhartiya: Last week was KubeCon. And whenever we talk about KubeCon, Kubernetes, cloud, one thing that we all talk about is complexity. It’s a busy space. It’s not that the complexity has been created on purpose. If you do look at… In fact, there are so many things, there are so many moving parts. It’s just like a car, right? So, today we want to talk about this complexity and just go back to the analogy of a car, or a motherboard or a CPU. You are not going to take things away. Those pieces, those moving parts, will be there. What we can do is make it simple for users. You’re not eliminating complexity, you’re making it easier. So, in a way, we are looking at how to tame complexity. So, if I ask you, number one is, can you tame it? And if you can tame it, are there ways to tame it, or do we just have to deal with it?

Rob Hirschfeld: Definitely we have to tame it. The challenge that we’ve had with complexity is that we keep adding more and more things. I’ve actually been talking about something called a Jevons paradox of complexity, because we’ve made it so that it’s so easy to add incredibly complex functionality that we don’t realize, we just hit an API.

And when we do that, we’ve added to the complexity budget, if you will, of our overall system. So, when we think about how we build things in cloud-native ways, or really any infrastructure, a car, anything like that, we have to look at ways to manage the complexity in a comprehensive way, right?

Very methodically understanding where we’re taking on complexity. What do we do to manage it? Because we’re not going to eliminate it. We’re not going to throw everything out and go back to a bicycle, right? We’ve got to have those improvements and efficiencies that come from the systems that we’re building.

Swapnil Bhartiya: Now, one of the goals or agendas of this show is also to find some solutions. Not exactly a playbook, but share something like that. So, if I ask you, as you were saying, there are ways. Do you have a playbook that, hey, this is how you tame it? And if you do, what is it?

Rob Hirschfeld: So, this is part of the science that’s been going into building Digital Rebar with RackN. When we look at the space of data center and infrastructure management, it decomposes into very well-understood concepts, where we can actually take the science of managing and taming complexity, and actually build it into products, and actually build infrastructure pipelines that are designed to tame complexity without actually having customers give up multiple vendors, or multiple clouds, or different operating systems. And that’s the key here: managing through that complexity in a repeatable way.

Swapnil Bhartiya: Do you have a list of items that, hey, this is the way you tame your infrastructure? If yes, what are those items?

Rob Hirschfeld: We really do, and it’s important to think through these four ways that we have consistently been able to tame complexity. And they’re pretty simple. It’s things like looking for coupling between systems.

It’s focusing on intents, and using that as your abstraction boundary. It’s keeping tools inside their lane and using things where they excel, and then finally avoiding a single source of truth and the challenges that come from having multiple sources of truth.

Swapnil Bhartiya: Excellent. Now, once again, let’s break these down and go a bit deeper into each. Let’s just pick coupling. What exactly do you mean by, look for coupling? What do you mean by that? And how do you plan to decouple things if they’re coupled?

Rob Hirschfeld: So, coupling is often confused with complexity. So, a system can be very complex but be self-contained and easy to manage. Coupling comes in when a system has dependencies that rely on other systems, and in a lot of cases, ones that you might not even know about.

We always joke about, DNS is the root cause of every problem. And that’s because every system is coupled to DNS in some ways, because it needs to look up names and find other machines.

So, when we look at complexity, the first thing we need to do is identify places where we have introduced coupling between systems. And it doesn’t mean that we should eliminate coupling, but it means that we need to identify what’s coupled.

It means we have to do it very deliberately and be able to track how those systems are interconnected together. So, by being deliberate in how we’ve coupled systems together means that we can manage the complexity.

Once again, we’re not eliminating it, we’re just managing it so that coupling becomes a source of risk, a source of management challenge, a source of dependency graphs, and those are all things that contribute to complexity.
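The idea of deliberately tracking which systems are coupled can be pictured as a small dependency graph. The sketch below is only an illustration of that idea, not RackN's or Digital Rebar's implementation: the system names in `DEPENDS_ON` and the `dependents_of` helper are hypothetical, chosen to mirror the DNS example.

```python
from collections import defaultdict

# Hypothetical inventory: each system maps to the systems it depends on.
DEPENDS_ON = {
    "web": ["api", "dns"],
    "api": ["database", "dns"],
    "database": ["dns"],
    "dns": [],
}


def dependents_of(target, depends_on):
    """Return every system that directly or transitively depends on target."""
    # Invert the edges so we can walk from a system to the systems coupled to it.
    reverse = defaultdict(set)
    for system, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(system)

    seen = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for dependent in reverse[current]:
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen


if __name__ == "__main__":
    # Everything in this toy inventory is coupled to DNS.
    print(sorted(dependents_of("dns", DEPENDS_ON)))
```

Writing the couplings down like this does not remove them. It only makes the blast radius of a failure visible before it happens.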

Swapnil Bhartiya: Right. And I think the next point is about focusing on intent, which is more or less like you should know what you’re trying to actually achieve versus getting into the bandwagon of, hey, everybody is using that XYZ technology. So, let’s just talk about focusing on the intent part.

Rob Hirschfeld: Focusing on intents is something that, happily, has gotten more common as people look at the way Kubernetes is built, where you ask for your objective. But you code that, in Kubernetes-speak, into YAML.

What we find is that intent is actually a really good abstraction boundary. So, in the past we’ve looked at building a layer-cake of different types of abstractions, and those abstractions are the APIs of, “Oh, I need a VM, I need a physical machine, I need a network.”

And you build an API that abstracts the actual thing you’re building. But that ends up being incredibly fragile, and it exposes a lot of complexity in how your systems get built. And what we found is that if you can express the goal of what you’re trying to build, then when you build the abstractions, you can actually abstract away those different approaches.

And then it becomes a much more consistent way to interface with the system. So, instead of asking for a machine with this much RAM, and this much CPU, and this much disk… It’s very specific because you’re dealing with all these different abstractions.

If your intent is to have a machine of a certain profile type, then your request is very simple and it doesn’t matter if it’s Amazon, Google, Microsoft, bare-metal, VMware, right? All those changes can be abstracted away because you’ve consolidated down to, “I need a machine that matches this profile.”

So, moving from the specifics of what you need into the intent of what you’re trying to build, really changes the way you build infrastructure and automation in remarkable ways.
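The difference between asking for specifics and asking for a profile can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Digital Rebar's API: the `PROFILES` catalog, the provider labels (`cloud-a`, `cloud-b`, `bare-metal`), the size strings, and the `resolve` function are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical catalog: one intent (profile) maps to provider-specific details.
# The size strings are placeholders, not real instance types.
PROFILES = {
    "small-worker": {
        "cloud-a": {"size": "a.small"},
        "cloud-b": {"size": "b-standard-1"},
        "bare-metal": {"min_ram_gb": 8, "min_cpus": 2},
    },
}


@dataclass(frozen=True)
class MachineRequest:
    """The caller states only the intent: a machine matching this profile."""
    profile: str


def resolve(request, provider):
    """Translate an intent into the details one provider needs."""
    try:
        return PROFILES[request.profile][provider]
    except KeyError:
        raise ValueError(
            f"no mapping for profile {request.profile!r} on {provider!r}"
        ) from None


if __name__ == "__main__":
    request = MachineRequest(profile="small-worker")
    # The same request works against every provider in the catalog.
    for provider in ("cloud-a", "cloud-b", "bare-metal"):
        print(provider, resolve(request, provider))
```

The request itself never changes when the target changes. Only the catalog behind the abstraction boundary does.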

Swapnil Bhartiya: Right. I think this goes back to the very simple point in our lives also, that you focus on what you are trying to achieve versus, “Hey, I need that tool, that tool, that tool,” because then it spirals into a totally different black hole.

Rob Hirschfeld: It’s a classic case, right? If you ask somebody for a very specific thing, they might deliver it to you and it might not be what you actually need to get your job done, when you could have just said, “I need help moving my house,” instead of help carrying a box, right? Still have to carry the box, but you’re asking for the right type of help. And it’s exactly the same thing with infrastructure.

Swapnil Bhartiya: Yeah. But then after that, then you look at the whole pile and you’re like, “All I wanted was this.” Anyway, back to the third point that you mentioned, which was to keep tools inside their lanes. This is something interesting because we do see that a lot of tools can do a great many things or, once again, it goes back to the previous point you made, dependencies. So, I want to understand, what exactly do you mean by that? And can it actually be achieved in the cloud-native world?

Rob Hirschfeld: It is a really serious problem. We have a tendency, especially because it’s so hard to get tools approved or get buy-in, that once we have something that works, we keep expanding, inflating it like a balloon and pushing it out of what it does well.

And I see that happening with configuration or provisioning tools. I’ve seen it happen with whole platforms where we are like, “Oh, this platform for container management is amazing. I should use it for VMs and I should use it for physical.”

And at some point the abstraction from what that tool did well doesn’t match. So, one of the places where we get into complexity is, and it’s going to sound counterintuitive, if we try to reduce the number of tools by making the tools do things they’re not good at, we actually make it more complex.

We’re pushing things into areas where they’re fragile, or where we have to do strange actions to fit the schema in this. And you actually can manage complexity. Remember, not eliminate, but manage complexity by keeping things in the place where they’re strong.

And that allows you to then focus on working within the parameters of a system that does exactly what you need it to do, and then adding another system next to it that does what it does well.

Now, we do have a separate problem, which is building an infrastructure pipeline where we connect all those pieces together. It’s totally necessary to manage complexity, but it’s easier to do it if each tool is working within its own area of expertise and not having to be nudged into behaviors that it doesn’t do naturally.

Swapnil Bhartiya: And that is a very smooth transition to the last point, which is about a single source of truth which, once again, goes back to the point of every tool talking to everybody there. Once again, you do need it. It’s an API-driven world, but why do you have to contain them? And the second point is, how do you achieve that?

Rob Hirschfeld: It’s a really serious challenge because most tools are built on the assumption that they are in charge of their own destiny, right? They understand everything they need to do, and you’re going to provide them with all the information up front.

They’re going to collect that state and then make things go. My favorite example of this is Terraform, which builds a state file and if something changes outside that state file, it assumes that was a mistake and it will go try to fix it.

Swapnil Bhartiya: Right.

Rob Hirschfeld: Or if you corrupt that state file, you’re in really serious trouble because that’s the only place to get the information that was in that state file. And with one system it’s okay, but when we’ve built all of our tools this way, it really becomes a challenge to try to make them work together and integrate how pieces work.

And we see this happen once you go cloud to cloud, or hardware type to hardware type, or network to network. The fact that you have to negotiate through all of these sources of truth and then if there’s something that changed outside of them, becomes a real problem.

The challenge with this is that there will be no single source of truth for your infrastructure. And while it sounds like that is more complex, acknowledging it radically increases the manageability of the system, because then you’re building tools or you’re looking for tools that expect to have updates outside of their domain.

So, they expect to take data in incrementally. They expect to hand data off incrementally. And that ability to be flexible about what the truth is and understand that any one silo doesn’t have all the answers, which they never do, it actually improves the manageability of the system.

And then you can say, “Well, wait a second. I’m going to build in a request to get the information I don’t have, and I will build in a way to pass off the information that I’ve learned.”

And the more you can share those pieces together, the more resilient your system is, and the more it helps with decoupling. So, if your systems are able to hand data back and forth automatically, then the coupling of those systems is reduced.

So, all of these components that we’re talking about actually fit together in a really important way, and single source of truth is possibly the anti-pattern that we see in a lot of tools and systems that we really have to work around if we’re going to manage complexity.
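The point about tools that expect to take data in and hand it off incrementally can be illustrated with a record that no single tool owns. The sketch below is an assumption-laden toy, not how Terraform or Digital Rebar store state: the `MachineRecord` class, its `merge` and `missing` methods, and the source names are hypothetical.

```python
class MachineRecord:
    """A record assembled incrementally from several tools; no tool owns it all."""

    def __init__(self, name):
        self.name = name
        # field name -> (value, source that last supplied it)
        self.fields = {}

    def merge(self, source, partial):
        """Accept a partial update from one source; return the fields that changed."""
        changed = {}
        for key, value in partial.items():
            previous = self.fields.get(key)
            if previous is None or previous[0] != value:
                self.fields[key] = (value, source)
                changed[key] = value
        return changed

    def missing(self, required):
        """List required fields nobody has supplied yet, so they can be requested."""
        return [key for key in required if key not in self.fields]


if __name__ == "__main__":
    record = MachineRecord("node-1")
    # Hypothetical sources, each contributing only what it knows.
    record.merge("provisioner", {"ip": "10.0.0.5"})
    record.merge("inventory", {"owner": "team-a", "ip": "10.0.0.5"})
    print(record.missing(["ip", "owner", "rack"]))
```

An update arriving from outside one tool's domain is treated here as new information to merge, not as a mistake to revert, and the `missing` list is the hook for requesting data the record does not have yet.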

Swapnil Bhartiya: Oh, thanks for explaining this point. If I ask you to now just summarize everything. Also, one more thing is that whatever things we talked about, it’s more about culture. I expect people, part of the process. Part of it. It’s not the tools you’re talking about. And that’s where another piece of complexity is: people have to do things, and sometimes they have platforms or there are solutions that make it easy. So, can you also talk about how people can really achieve this taming so that you don’t have to go and talk to everybody, “Hey, don’t do this, don’t do that”?

Rob Hirschfeld: You’re entirely right. Everything we’re talking about plays out in people and process just as much as it does in the technology, and we built our product, Digital Rebar, specifically to address taming complexity in infrastructure, crossing all these silos, and helping teams work together to build an end-to-end solution.

And what we find in doing that is that it really is the idea of an infrastructure pipeline. So, very much in this DevOps process model of connecting things together, letting teams share information, letting things move smoothly between different silos and different expertise.

And that is true as much in building the people and the teams, and getting them to work together, but it also is something that the technologies have to do. For us, it’s effectively a missing glue layer that allows all of the tools to work in concert and get things done.

Swapnil Bhartiya: Rob, once again, thank you for taking the time for this second episode of the series, about taming the infrastructure. As usual, I look forward to our next episode. Thank you.

Rob Hirschfeld: Thank you.

[/expander_maker]
