Guest: Rob Hirschfeld (LinkedIn, Twitter)
Company: RackN (Twitter)
Show: Let’s Talk
It could be a very complex and herculean task for organizations to rebuild their data centers from scratch, more so as it involves optimizing the entire computing infrastructure as one – compute, storage and network resources – to support the growing business needs. As challenging and inconceivable as it may appear, in reality, there are lots of cases where companies may have to repave their data centers. As Rob Hirschfeld, CEO and Co-Founder of RackN, puts it, “The examples of actually needing to be able to, what our largest customer calls, repave their data centers are very real and very immediate. And as you inferred, it’s a hugely challenging problem.”
There are, for instance, financial services customers that have regulatory requirements and also telcos that need help in building up cell site automation. Hirschfeld goes on to talk about the oddities of repaving a data center: “You were saying about being able to bring up their infrastructure in a new location in a week, and not being able to do it cost them millions of dollars a year, but you don’t have to go very far to look at cases like Colonial Pipeline with a ransomware attack or some of the edge cases that we’re talking about, where people have real infrastructure they’re trying to bring up and make it work with no technicians or people involved.”
Interestingly, the hardware side of rebuilding data centers starts for customers even before the hardware arrives on site. Here’s how Hirschfeld explains it: “One of the things that’s been very important for our customers and important in this repaving mission is that we import shipping manifests. So the information about what the system is going to be is actually encoded in the system even before the machines are installed on site. And if you’re thinking about repaving a site or just normal onboarding servers, both are critical because the primary challenge to deliver here is not whether or not you can get the servers, although that’s a critical thing, but did the server ship correctly, or do they have the right configurations?”
So how does RackN approach this problem? Because it is an infrastructure-as-code pipeline challenge, RackN helps customers establish a standard pattern that works across all of their data centers and infrastructure, and that pattern is then put through a test process. “They’ll validate if they have new servers or new hardware types coming in, and work with us in the lab before the servers are even purchased. And they’ll verify that those systems are going; they’ll qualify alternate suppliers so they can make sure that if they can’t get one server vendor, they can switch to another server vendor dynamically,” says Hirschfeld.
One of their flagship tools is Digital Rebar, which creates reusable, standardized processes for platform and infrastructure teams, enabling both self-management and control at scale. “Digital Rebar is designed as the seed of a data center. So we will literally be in a situation where we are the first thing that’s installed and there can be no other dependencies,” adds Hirschfeld.
One of the interesting things about infrastructure-as-code pipelines is that there’s a whole bunch of technical pieces getting things to work end-to-end, but fundamentally it breaks back down to helping teams collaborate. To sum it up in Hirschfeld’s words, “Repaving a data center is about that collaboration, making all those teams do that work together and then rehearse it, repeat it and standardize it. And that’s really where the value’s coming from.”
Topics covered in this interview include:
- Are there really cases where companies need to ‘repave’ their whole data center? If yes, what are those?
- How do companies approach it realistically? Discussing the hardware side of repaving data centers.
- At what stage does RackN get involved? How does the company help speed things up?
- Hirschfeld gives a comparison in terms of how Digital Rebar helps customers, especially in saving time.
- Who all gets involved in rebuilding a data center? How do the teams work together?
- There may also be cases where organizations don’t have to rebuild their whole data centers. There might be a few components they might have to add, some regulatory lead or so. How much is the RackN Digital Rebar capable of handling that?
The summary of the show is written by Monika Chauhan
Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya and we have today with us, once again, Rob Hirschfeld, CEO and co-founder of RackN. And Rob, first of all, happy new year.
Rob Hirschfeld: Happy new year. It’s going to be an exciting year, already off with a bang.
Swapnil Bhartiya: Right. It’s a new year. We hope that this pandemic will be gone. It’ll not be gone, but we will try to resume our lives. I don’t know what that means anymore, but it does mean a lot of things in person. And when we do talk about in person, that kind of aligns very well with the topic of today’s discussion, which is about rebuilding data centers as fast as possible if needed. But when we look at data centers, we are not looking at setting up a simple machine. We have to roll in racks, storage, networking. It’s a big herculean task, but are there any cases where companies or organizations might have to just rebuild everything, for whatever reason it could be? Of course, I don’t know. It could be regulations. There are so many reasons that they might need to. So what I want to understand from you is: in reality, are there cases where companies may have to rebuild their data center? And if yes, what are those?
Rob Hirschfeld: Oh, there’s a lot of cases where it happens. For financial services customers like we have, they actually have regulatory requirements. You were saying about being able to bring up their infrastructure in a new location in a week, and not being able to do it cost them millions of dollars a year, but you don’t have to go very far to look at cases like Colonial Pipeline with a ransomware attack or some of the edge cases that we’re talking about, where people have real infrastructure they’re trying to bring up and make work with no technicians or people involved. So like telco use cases where we’re helping telcos build up cell site automation. The examples of actually needing to be able to, what our largest customer calls, repave their data centers are very real and very immediate. And as you inferred, it’s a hugely challenging problem.
Swapnil Bhartiya: When we do talk about rebuilding or repaving, it is a huge challenge. So how do they approach it realistically? I mean, the hardware side of the equation can be easy because they already have the hardware they need, or they also have to bring in new hardware because sometimes a lot of… Let’s talk about just Windows 11. It needs a TPM or whatever it needs. So sometimes you do need specific hardware as well. So let’s start with the hardware side of the story.
Rob Hirschfeld: Happy to. So the hardware story, interestingly enough, starts for our customers even before the hardware arrives on site. So one of the things that’s been very important for our customers and important in this repaving mission is that we import shipping manifests. So the information about what the system is going to be is actually encoded in the system even before the machines are installed on site. And if you’re thinking about repaving a site or just normal onboarding servers, both are critical because the number one challenge in time to deliver here is not whether or not you can get the servers, although that’s a critical thing, but did the servers ship correctly, or do they have the right configurations? Do I know that they’re put in the right spots? Any one of those issues can cause the whole system to fall apart, right? We’ve had cases where somebody put in the wrong gateway address and made all of their out-of-band management inaccessible because they couldn’t talk to their out-of-band management servers.
And the thing about being able to check and qualify against a manifest means that you can very quickly identify those types of problems. And the repaving is powerful, both from that immediate boot up and when you find a problem: if somebody put in a wrong IP address, you can reset it and then repave it really quickly. Matter of fact, one of the things that’s most powerful here is that once we’ve built repaving capabilities for customers, for onboarding or regular infrastructure, they actually then use that same process for ongoing day-two maintenance. They just rebuild racks of servers when they’re ready to do a conformance check or reset the environments. It’s incredibly powerful.
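The manifest check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Digital Rebar’s API or data model: the `ServerSpec` fields, the serial numbers, and the `check_against_manifest` helper are all hypothetical. The idea is simply to compare what the shipping manifest promised with what was actually discovered on site and to report every mismatch with a reason.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Expected or observed facts about one server (hypothetical fields)."""
    serial: str
    rack_slot: str
    model: str
    bmc_gateway: str  # gateway used by the out-of-band management interface


def check_against_manifest(manifest, discovered):
    """Return a list of readable problems; an empty list means the site conforms."""
    problems = []
    found = {server.serial: server for server in discovered}
    for expected in manifest:
        actual = found.pop(expected.serial, None)
        if actual is None:
            problems.append(f"{expected.serial}: in manifest but not discovered")
            continue
        for field in ("rack_slot", "model", "bmc_gateway"):
            want, got = getattr(expected, field), getattr(actual, field)
            if want != got:
                problems.append(
                    f"{expected.serial}: {field} expected {want}, found {got}"
                )
    # Anything left over was discovered on site but never appeared in the manifest.
    for serial in found:
        problems.append(f"{serial}: discovered but not in manifest")
    return problems


# Made-up example data: the second server was given the wrong gateway address.
manifest = [
    ServerSpec("SN001", "r1-u10", "model-a", "10.0.0.1"),
    ServerSpec("SN002", "r1-u12", "model-a", "10.0.0.1"),
]
discovered = [
    ServerSpec("SN001", "r1-u10", "model-a", "10.0.0.1"),
    ServerSpec("SN002", "r1-u12", "model-a", "10.0.1.1"),
]

for problem in check_against_manifest(manifest, discovered):
    print(problem)
```

Because the check returns specific reasons rather than a pass or fail flag, a wrong gateway like the one in the example shows up as a named field on a named server, which is what makes a fast reset and repave practical.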
Swapnil Bhartiya: And as you said, it actually starts before the hardware is procured or shipped. At what stage does RackN get involved? Of course, I am assuming that it starts from the very beginning, but I just want to understand how the processes work and how you actually help companies with repaving it.
Rob Hirschfeld: So the thing about the way RackN approaches this problem: a lot of people would think of us as provisioning, PXE boot, install an OS, and sort of get that process going. But the way we approach the problem, and the reason we start even earlier than that, is because it’s really an infrastructure-as-code pipeline challenge. And so what we’ll do is we’ll help customers establish a standard pattern that works for all of the data centers, for all their infrastructures. And then they’ll put that through a test process. They’ll validate if they have new servers or new hardware types coming in, they’ll work with us in the lab before the servers are even purchased. And they’ll verify that those systems are going. They’ll qualify alternate suppliers so they can make sure that if they can’t get one server vendor, they can switch to another server vendor dynamically.
All these things are critical both to that first-day bring-up process and to ongoing supply chain challenges. But from there, Digital Rebar is designed as the seed of a data center. So we will literally be in a situation where we are the first thing that’s installed and there can be no other dependencies. So you literally have to be able to say, “Install a very small footprint, completely self-contained system, every piece of automation, every binary, every requirement has to be basically installable on a USB drive.” That’s the requirement. And so you can show up with a USB drive, plug it into a server, let the system build up that data center seed. And then from there, every other server comes in, runs through an automated process, builds clusters, right? It just works completely end to end. And it’s amazing to watch this operate at scale because our customers will literally seed new data centers, in some cases remotely.
So it’s become pretty common to us. Sometimes a little scary because they’ll provision new data centers from across the globe. And we get calls that say, “Hey, we’re having trouble provisioning this new Singapore data center.” We’re like, “Where are you provisioning it from?” And they’re like, “Well, from New Jersey,” and we’re like, “There’s a lot of latency involved in that.” They’re like, “Yeah, but it’s working except this one task times out.” That’s the type of bootstrapping environment that you get into. And then when you translate that into what you can do from an edge perspective, where you don’t want any infrastructure on site and you need to do something completely remote, our telco customers are like that, where they have all these points of presence. They need to install and run servers, but the control plane for that is stretched. So they’ll do it from a regional center rather than from on site. They don’t want to put the extra gear there.
Swapnil Bhartiya: Can you also kind of give me a comparison in terms of how Digital Rebar helps customers, especially in saving time and of course, as you mentioned, it depends on where they are and they will be doing it. So it doesn’t matter if the customer is always right. It doesn’t matter where they’re running it, how they’re running it and if you could, as I said earlier, the first point is to compare how much time it takes to get them ready with the seed and of course when the hardware is in place, they can replicate it.
Rob Hirschfeld: Yeah. I’m glad you asked that. The results are dramatic. The processes that we’re replacing for our customers were literally weeks of time. And what’s really impressive here is that in a lot of cases, gear sits unused in a loading dock [inaudible 00:08:18] until the teams are available to do the different operations. And there’s a lot of time where the individual tests don’t take very long, but connecting everything together is incredibly time-consuming. That’s where the delays are. So what we’ve been able to do is, one, optimize the individual steps. So we’re really good at making small steps go faster and running steps in parallel that normally would be serialized, like getting credentials while we do other work. But the big savings come from just eliminating all of these manual steps. And so you can be confident that when you turn on a server and plug it in, it’s going to go through an automated process and it’s going to run through the diagnostics and burn-ins and checks, and everything’s going to come online, and you can get notified or connect in additional systems to make that process run end to end. And that’s really important.
The other thing that’s critical here is that it needs to be reliable. And so we’ve had cases where customers were running configuration or reprovisioning processes that had a 20% failure rate. And then you replace that with something that has a 99% success rate and, even more importantly, that when it fails, tells you why it failed so you can diagnose and correct it. So the failures are not random. They’re actually, “Hey, there’s a cause, you know.” Then it changes the whole way you look at building these systems, and you can start leaning on the provisioning infrastructure much more effectively, right? You can count on the idea that I could reprovision [inaudible 00:09:49], go for coffee, come back and say, “Oh, now I’m ready to do work again.” This is transformative in how people look at running infrastructure.
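Two ideas from this answer, running independent steps in parallel and making every failure report its cause, can be sketched together in Python. This is a minimal illustration, not how Digital Rebar implements its workflows: the three step functions are hypothetical stand-ins, and the failure message is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_credentials():
    # Stand-in for a slow call to a credential service.
    return "credentials ready"


def run_diagnostics():
    # Stand-in for a hardware burn-in that fails with a specific, reportable cause.
    raise RuntimeError("DIMM slot 3 failed memory test")


def update_firmware():
    # Stand-in for a firmware check that succeeds.
    return "firmware current"


def run_parallel(steps):
    """Run independent steps concurrently; return {name: (ok, detail)}."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = {name: pool.submit(fn) for name, fn in steps.items()}
        for name, future in futures.items():
            try:
                results[name] = (True, future.result())
            except Exception as exc:
                # Keep the cause so the failure can be diagnosed and corrected.
                results[name] = (False, str(exc))
    return results


results = run_parallel({
    "credentials": fetch_credentials,
    "diagnostics": run_diagnostics,
    "firmware": update_firmware,
})

for name, (ok, detail) in results.items():
    print(f"{name}: {'ok' if ok else 'FAILED'} ({detail})")
```

One failing step does not hide the others: each step’s outcome is recorded separately, and the failed one carries the reason it failed instead of a bare error.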
Swapnil Bhartiya: From an outsider’s view, rebuilding a data center would mean, as we said earlier, I said a herculean task, but who all get involved in that? Because is this just a job? Because traditionally you talk about network engineer, you are talking about storage of course, and all those teams of it, but now we are talking infrastructure as a code. So maybe one guy, two guys, one team, DevOps, they handle it, or there are still teams involved. And also once again, back to the point of somebody deploying something in a Singapore data center, but they’re provisioning, or as you said, they are putting a seed here. So can you also talk about how the teams work together? How do you make it easier for them for this collaboration and coordination as well?
Rob Hirschfeld: This is one of the things that I found really remarkable about infrastructure-as-code pipelines: there’s a whole bunch of technical pieces getting things to work end to end, but fundamentally it breaks back down to helping teams collaborate. So you can use the pipeline to coordinate all of these handoffs between the teams. The teams aren’t going away. The expertise that your networking team has and your storage team has and your compute team and your data center operations team, those are real deep expertise. The thing that has been missing is a way for them to coordinate the work between each action, and a lot of times this is what causes delays. The compute team does something, the network team has to do something, the compute team has to do something else. The storage team has to do something, the network, right?
It’s these handoffs back and forth that really cause errors and slow, manual-step process problems. What we found with Digital Rebar is the infrastructure pipeline pieces allow the teams to coordinate in very concrete infrastructure-as-code ways: clear, transparent ways that enable collaboration and actually accelerate the whole process. So then once you have all these steps coordinated, the teams can come back and say, “Oh, if I change my networking topology, I can eliminate this step, or I can enable the security flag that I couldn’t before.” Right. You talked about TPMs. If I want to go back and say, “All right, I want to turn on trusted boot for all of my infrastructure,” that involves all of these teams collaborating together because there are networking and storage and boot provisioning actions that have to happen together. And once they do, once you have a way for them to collaborate, that’s really transformative.
That’s where the benefits are coming from. We show up with a lot of great tech and things that work out of the box for people. But that is really just the springboard for the teams around the data center infrastructure to work together better and actually have a way to say, “I’m going to do my thing. I’m going to take care of it”, and then hand it off to you. And the infrastructure pipeline handles those handoffs. It makes them normalized. And fundamentally that’s what… We talked about repaving a data center. Repaving a data center is about that collaboration, making all those teams do that work together and then rehearse it, repeat it and standardize it. And that’s really where the value’s coming from.
Swapnil Bhartiya: Of course, they are repaving it, but is that just moving slowly, everything goes from nothing to production or there’s also testing [inaudible 00:13:15] also because it depends on what they’re doing. How do you kind of enable that as well?
Rob Hirschfeld: Yeah. One of the things about infrastructure as code is it really is about bringing development processes into infrastructure and automation. And I can’t emphasize that enough. That’s what we’re trying to do. And so the thing that makes our customers successful is not that they can push a button and repave the data center. They need to do that. That’s the requirement, but the thing that makes them successful at doing it is that they actually have DEV, TEST, pre-prod, PROD, right? And then they have a strategy to migrate data center rollouts in a consistent way. So it’s not just a factor of the fact that you can now do it. It’s actually that we can reliably repeat all of those instructions, contain it together as infrastructure as code and then manage it. That’s really the key here: those teams practice, rehearse, DEV, TEST. If they have an issue, a lot of times we can replicate it with totally different gear and we can figure out what’s going on.
And what’s been amazing is that the process that we’ve now standardized, to such an extent that it’s basically a stamp that just follows the infrastructure around, is something that other customers are now picking up. They change it a little bit to meet their needs, but they’re now creating reproducible results out of basically the same standardized pipelines over and over again. And it’s now taking… The first times we went through this process, it was months of work to get the processes right and working. And now companies are coming up, taking those standard pipelines, making a couple of adjustments and in a day or two replicating that out-of-the-box experience for repaving.
And it’s been absolutely stunning for that. And our customers are not worried about people chasing them down the repaving process, because they’ve gone up stack and they’re using infrastructure pipelines to pull more and more things together, right? There’s a lot of value in starting from one or two teams and then connecting your process horizontally, just like a CI/CD pipeline. The more people you have collaborating around that pipeline, the more value you get out of it. It’s exactly the same process. We get people started very quickly and then they have to do the work to connect their teams together.
Swapnil Bhartiya: There may also be cases where they don’t have to rebuild their whole data center. There might be a few components they might have to add, some regulatory lead or so how much is Digital Rebar kind of capable of handling that, or this is a non-issue that is just, of course, that’s built in.
Rob Hirschfeld: That’s a really important point. It’s a feature that we enabled in our last release called Work Orders to address this. We’re big fans of this immutability. The repaving implies that I can just reset everything and start from zero and build it all the way back up. And there’s a lot of times when that’s not the best option, right? You just want to make a configuration change or you just want to run a script, do a patch, and you want to do that surgically. You don’t want to take down whole systems. A lot of the provisioning tools that we used to use, and the way Digital Rebar started, were designed to do this workflow, a build process. And if you wanted to make a small change, you would rerun that process and hope that it was idempotent enough to not break anything in that. When we added Work Orders to the system, it addressed exactly what you’re describing, which is, “I want to run this task. That’s part of a bigger workflow, but I just want to run that task.”
And so we added the ability to schedule jobs, either ad hoc or on a timer, where you can say, “I actually just need to run this one little task against a system,” and queue that up and allow it to run. And that has huge implications for maintaining running systems. And you can go in not as a [inaudible 00:17:16] person has to type a command, but actually say, “Here’s a standard task as part of your normal processes, schedule it, run it, do it as infrastructure as code against an API and make that operate.”
So that’s a hugely important piece and it’s worth discussing as we’re talking about repaving that sometimes it’s just a nudge, right, or a small action on things. And you want both ends of that spectrum. And I was really excited to see when we added that into Digital Rebar, there were a lot of customers that had been saying, “I just want to run this one thing, do a patch, run an upgrade, take this one action on our system.” And so this was a really customer-driven feature, but the implications for infrastructure as code more broadly are really dramatic.
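The two ideas in this answer, an idempotent task and a queue of single tasks aimed at single machines, can be sketched as follows. This is an illustration only and does not reflect the actual Work Orders API in Digital Rebar: `WorkOrderQueue`, `ensure_setting`, the machine names, and the `ntp_server` setting are all hypothetical.

```python
from collections import deque


def ensure_setting(config, key, value):
    """Idempotent task: change the setting only if it differs; return True if changed."""
    if config.get(key) == value:
        return False  # already in the desired state, nothing to do
    config[key] = value
    return True


class WorkOrderQueue:
    """Hypothetical queue holding single tasks targeted at single machines."""

    def __init__(self):
        self._pending = deque()

    def submit(self, machine, task, **params):
        # Queue one task for one machine instead of rerunning a whole build workflow.
        self._pending.append((machine, task, params))

    def run_all(self, machines):
        """Run queued tasks in order; return (machine, task name, changed) records."""
        log = []
        while self._pending:
            machine, task, params = self._pending.popleft()
            changed = task(machines[machine], **params)
            log.append((machine, task.__name__, changed))
        return log


# Made-up state: node-1 already has the setting, node-2 does not.
machines = {"node-1": {"ntp_server": "10.0.0.5"}, "node-2": {}}

queue = WorkOrderQueue()
queue.submit("node-1", ensure_setting, key="ntp_server", value="10.0.0.5")
queue.submit("node-2", ensure_setting, key="ntp_server", value="10.0.0.5")

log = queue.run_all(machines)
for machine, task_name, changed in log:
    print(f"{machine}: {task_name} changed={changed}")
```

Because the task checks the current state before acting, running it against a machine that is already correct is a no-op, so the same small task can be queued ad hoc or on a schedule without taking systems down.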
Swapnil Bhartiya: Rob, once again, thank you so much for taking time out today to talk about this very interesting and important topic, because things can become overwhelming and challenging, but the way Digital Rebar helps customers is incredible. Once again, thank you for your time today. And I look forward to our next discussion already. Thank you.
Rob Hirschfeld: Thanks Swap, appreciate it.