TPI, or Terraform Provider Iterative, is the first product of its kind built on HashiCorp's Terraform technology stack. Dmitry Petrov, Co-Founder and CEO of Iterative.ai, joined us on TFiR Newsroom to talk about TPI and how the product helps machine learning (ML) teams manage their computing resources more efficiently. It offers full lifecycle management of computing resources (including GPUs and automatic recovery of spot instances) across several cloud vendors (AWS, Azure, GCP, K8s) without requiring users to be cloud experts.
Key highlights from this video interview are:
- Petrov discusses how TPI is different from traditional IT tools that data scientists and machine learning engineers use.
- He discusses how Iterative is helping with resource orchestration by putting it into the traditional software development stack.
- Petrov explains why they chose Terraform and how it helps teams collaborate better and save costs on GPU and CPU instances.
- TPI simplifies machine learning training, which can be time-consuming and expensive. Petrov elaborates on the potential cost, in computational resources alone, of training a single model and the challenges of infrastructure optimization.
- Petrov explains how spot instances work and how TPI handles them automatically. With TPI, the cloud takes care of recovering your instances, which saves resources.
- TPI is open source, meaning you can download the software onto your machine and work directly with your resources, without third-party services or additional infrastructure. Petrov explains how people can get started with TPI and make use of it.
- Petrov discusses some of the use cases of TPI and how it is helping to close the gap between training and productization.
- Petrov shares their future plans to build more features for machine learning engineers, how they are working towards helping people iterate 1,000 times on a single model, and the steps they need to take to achieve this.
Guest: Dmitry Petrov (LinkedIn, Twitter)
Company: Iterative.ai (Twitter)
Show: TFiR Newsroom
Keywords: Terraform Provider Iterative, TPI
About Dmitry Petrov: Creator of DVC. Ex-Data Scientist at Microsoft. PhD in Computer Science.
About Iterative.ai: Iterative.ai, the company behind Iterative Studio and popular open-source tools DVC and CML, enables data science teams to build models faster and collaborate better with data-centric machine learning tools. Iterative’s developer-first approach to MLOps delivers model reproducibility, governance, and automation across the ML lifecycle, all integrated tightly with software development workflows. Iterative is a remote-first company, backed by True Ventures, Afore Capital, and 468 Capital.
The summary of the show is written by Emily Nicholls.
Here is the transcript of the recording, lightly edited for clarity.
Swapnil Bhartiya: Hi, this is your host, Swapnil Bhartiya, and welcome to another episode of TFiR Newsroom. Today we have with us, once again, Dmitry Petrov, co-founder and CEO of Iterative.ai. Dmitry, it’s great to have you back on the show.
Dmitry Petrov: Excited to be here.
Swapnil Bhartiya: And today we are going to talk about Terraform Provider Iterative, or TPI. It is the first product of its kind on HashiCorp’s Terraform technology stack. There are so many things to talk about, but I would like to start with the basics: tell us, what is Terraform Provider Iterative?
Dmitry Petrov: This is a product that helps data scientists create their infrastructure, to basically allocate resources for machine learning. It’s especially designed for machine learning because it handles spot instances and spot instance recovery. For example, when you train for a long time on spot instances, an instance can be terminated by the cloud provider. The Terraform provider helps you move the job to another instance and continue the training.
Dmitry Petrov: And also, it helps you terminate your instances at the right time, because when you run a machine learning job for a long time, very often people just forget to terminate them, and you end up with an instance running for a long time. At the end of the day, it helps machine learning teams save their computational resources, their GPU instances.
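For context, a minimal TPI task definition looks roughly like the sketch below. It is based on the provider’s public documentation (the `iterative_task` resource); exact field names and supported values may differ between provider versions.

```hcl
terraform {
  required_providers {
    iterative = {
      source = "iterative/iterative"
    }
  }
}

provider "iterative" {}

# One training task: TPI provisions the machine, runs the script,
# and shuts the instance down when the job finishes.
resource "iterative_task" "train" {
  cloud   = "aws"   # also supported: "az", "gcp", "k8s"
  machine = "m+k80" # generic size plus an optional GPU suffix

  script = <<-END
    #!/bin/bash
    pip install -r requirements.txt
    python train.py
  END
}
```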
Swapnil Bhartiya: In addition to helping them save resources, what else does it do? You did touch upon that, of course, but I just want to help people understand: what are the other things it does to make life easier for them?
Dmitry Petrov: The biggest thing we have done in this project is that it works on top of Terraform; it is part of a Terraform provider. It’s not just another tool handed to data scientists, it’s Terraform: a regular API and a regular configuration file that many MLOps engineers and some ML engineers are already familiar with. This is the biggest change in resource orchestration.
Swapnil Bhartiya: Since it is running on top of Terraform, I quickly want to talk about Terraform, or infrastructure as code in general. What role is it playing in helping data scientists and machine learning engineers? Because it has a totally different approach than traditional IT.
Dmitry Petrov: Yeah. Traditionally, people just create some library or tools to help machine learning engineers. What we are saying is that, yes, data scientists need tools and need help with resource orchestration; we just put this into the traditional software development stack.
Dmitry Petrov: Which is what our company does in general. We are saying, “Yes, you need an AI platform for your machine learning engineers and data scientists, but it does not make sense to build a separate platform outside of your development stack.” The AI capabilities should be built on top of the software development stack, and the Terraform provider is just another step in that direction.
Swapnil Bhartiya: Tell us, why did you choose Terraform?
Dmitry Petrov: Oh, Terraform is the de facto standard among DevOps and software engineers for provisioning resources. What we have done is create a provider that helps with specific machine learning use cases. It helps with cost saving on your GPU or CPU instances.
Dmitry Petrov: It helps teams collaborate better: the data science team with DevOps. When DevOps asks, “How do you provision resources?” data scientists shouldn’t have to say, “You know what? We have a bunch of scripts,” or, “We have some separate tool.”
Dmitry Petrov: Now people can say, “We use Terraform,” and everyone understands what it’s about. That’s the goal: a simplification of life, a simplification of collaboration between departments.
Swapnil Bhartiya: How does TPI simplify machine learning training and help machine learning teams, whether data scientists or engineers, save time, resources, and money, as you already talked about? If you pay attention to resources, it does translate into a dollar amount that you’re spending.
Dmitry Petrov: When machine learning engineers need to train their models, it usually takes a lot of time. Sometimes we are talking about hours, sometimes days. In many cases it involves very expensive GPUs, sometimes a few dollars per hour. You can easily end up spending a few thousand dollars to train a single model on computational resources alone.
Dmitry Petrov: And with this amount of money that people spend, there is a lot of opportunity for optimization. I’m not talking about optimization at the model architecture level; data scientists already have to do that. I am talking about infrastructure optimization. It’s a very common problem; it happens to almost every data scientist at some point. You train your job and you just forget to shut down your instance, and it runs for another hour or a day. Sometimes you forget it for a few days, over a weekend. All those resources are simply wasted.
Dmitry Petrov: And sometimes we are talking about real money. Last time this happened in my team, we spent $4,000 for nothing just because an instance was not terminated. It’s important to include this functionality in your infrastructure: when your training is done, it should automatically shut down. That’s one straightforward feature we have in the Terraform provider.
Dmitry Petrov: The second is spot instances. When you train for a long time, let’s say you need one or two days of training for some neural net, people use spot instances because they are cheaper. At the same time, the cloud provider can terminate your spot instance at any time, and you need to write your own recovery logic: how to send the data back, how to request a new spot instance, wait for it, and then send the data to that instance to continue the training.
Dmitry Petrov: And this is what TPI does automatically. You just specify where your data is and what instance you need, and it runs the training on cloud spot instances and takes care of recovery automatically.
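The spot-recovery behavior Petrov describes maps to a couple of fields on the same `iterative_task` resource. A rough sketch, again following the provider’s documented schema (the `--resume` flag on the training script is hypothetical; resuming from the last checkpoint is the script’s own responsibility):

```hcl
resource "iterative_task" "train" {
  cloud   = "aws"
  machine = "l+v100"
  spot    = 0 # 0 = spot instance at the automatic price;
              # -1 (the default) means a regular on-demand instance

  # The working directory is synced to cloud storage, so checkpoints
  # survive a spot termination and the replacement instance can resume.
  storage {
    workdir = "."
    output  = "results"
  }

  script = <<-END
    #!/bin/bash
    # --resume is hypothetical: restarting from the last checkpoint
    # must be implemented by the training script itself.
    python train.py --resume
  END
}
```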
Dmitry Petrov: And the last item, which is technically the most interesting one, actually. When you need recovery logic, you usually need a manager machine that monitors your infrastructure and says, “Okay, that instance failed. I need to recover it.” It’s called a master machine, or sometimes a master node.
Dmitry Petrov: We designed the Terraform provider so that this is not needed, because we use cloud features like Auto Scaling groups, and we use cloud storage to save intermediate models, checkpoints, and dataset changes. You don’t need to run an additional instance; the cloud will take care of recovering your instances automatically. TPI just sets this up for you and then tears it down when the job is done. There is a lot of opportunity for cost saving on the resource orchestration side, and TPI uses those opportunities to help you save resources.
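For illustration, the “no master node” pattern Petrov describes corresponds, on AWS, to an Auto Scaling group pinned at exactly one instance: the cloud itself replaces a reclaimed spot instance. This hand-written sketch shows the underlying cloud pattern, not TPI’s actual generated configuration (the AMI ID is a placeholder):

```hcl
# Illustrative only: roughly the cloud-side pattern TPI automates on AWS.
resource "aws_launch_template" "trainer" {
  image_id      = "ami-0123456789abcdef0" # placeholder AMI
  instance_type = "p3.2xlarge"

  # Request spot capacity; when AWS reclaims the instance,
  # the Auto Scaling group below launches a replacement.
  instance_market_options {
    market_type = "spot"
  }
}

# An Auto Scaling group pinned at one instance makes the cloud itself
# replace a terminated spot instance -- no separate master node needed.
resource "aws_autoscaling_group" "trainer" {
  availability_zones = ["us-east-1a"]
  min_size           = 1
  max_size           = 1
  desired_capacity   = 1

  launch_template {
    id      = aws_launch_template.trainer.id
    version = "$Latest"
  }
  # On boot, the new instance pulls the latest checkpoint from cloud
  # storage and continues training where the previous one stopped.
}
```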
Swapnil Bhartiya: In typical open source fashion, there are two aspects. One is that you do all the work yourself: you update it, maintain it, manage it, secure it. On the other hand, many teams are only interested in using the tool; they don’t want to care about its whole lifecycle management. Can you talk about that aspect as well?
Dmitry Petrov: Our Terraform provider, TPI, is open source, so you can go to the website, download or install it, and use it. It uses the same approach as Terraform itself. What does that mean? It means you don’t need a separate service or master node. You just download the software onto your machine, specify a configuration, run the command, and it provisions resources for you based on that configuration.
Dmitry Petrov: There is no managing part or third-party service. You work directly with your resources; you manage them directly from your machine. That is exactly how TPI works, even for the recovery logic. You might ask, “All right, but if my spot instance is terminated by the cloud provider, who will recover it?” That’s where the cloud comes in, because we set up the cloud in a way that says, “There is an instance, and if it fails, please recover it.”
Dmitry Petrov: You can issue your command, and it starts the ML training. Then you can close your laptop and forget about it. If the spot instance fails, the cloud will recover it and the training will continue. When the job is done, the instance will be shut down.
Dmitry Petrov: Basically, it releases all the computational resources it uses. That’s where the cost saving comes from. Later, when you open your laptop and say, “Okay, update,” you will get the result of your training: your model with checkpoints and everything. That’s the beauty of this approach. It’s like serverless, if you wish. Probably not the best word, but this is a way to organize your infrastructure without third-party services, additional infrastructure, or additional moving parts.
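The day-to-day flow he describes is the standard Terraform CLI workflow. Per the provider’s documentation, it looks roughly like this (`terraform destroy` is the step that downloads the outputs and releases the resources):

```console
$ terraform init     # download the TPI provider
$ terraform apply    # provision the instance and start the training task
  # ...close your laptop; the cloud handles spot recovery on its own...
$ terraform refresh  # later: sync the task's current status into local state
$ terraform show     # inspect the task's status and logs
$ terraform destroy  # download the outputs (model, checkpoints)
                     # and release all cloud resources
```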
Swapnil Bhartiya: Can you also share some use cases where it is being used? Of course, you may or may not be able to name customers, but just give us an idea of who is using it and how.
Dmitry Petrov: Oh, absolutely. TPI is used for ML training, but because of its nature as Terraform, it gives the most value in close-to-production use cases. When you need to retrain your model on a regular basis and a DevOps or MLOps team organizes this workflow, they can benefit from TPI a lot.
Dmitry Petrov: If you go closer to the model development and training side, people use this for training too, but it was not optimized for those model training use cases. The Terraform provider [inaudible 00:11:35] TPI is mostly about productization and closing the gap between your training and productization.
Dmitry Petrov: Usually what happens in a big company is that there are two different sets of folks: ML engineers who do a lot of iterations on a model, hundreds, sometimes thousands of iterations, and then the model is there and it needs to be productionized. At the same time, that’s not enough. You need to update the model periodically, as the data changes or the code changes.
Dmitry Petrov: You need to have a flow for updating your model on a regular basis and putting it into a production environment. This is where Terraform and this more formal, organized DevOps approach fit best: you can optimize this model retraining part with the help of the DevOps, MLOps, and data science teams.
Swapnil Bhartiya: We talk often, and there are certain things that you cannot discuss at this point. But what’s next in your pipeline? What are you working on, and what problems are you trying to solve next?
Dmitry Petrov: Yeah. Right now, TPI is mostly designed for this productionization part, I would say: for MLOps engineers, for DevOps engineers. We need to implement more features for ML engineers. We need to better support the scenario where people iterate 1,000 times on a single model. That’s one big step we need to take.
Dmitry Petrov: The next step would be distributed training, because right now, when we say, “Okay, we can get resources for you,” we are talking about one single machine. It can be a huge, expensive GPU instance, but it’s still only one machine. Some teams use distributed training: four machines, six machines at the same time to train a single model. That would be the following step.
Swapnil Bhartiya: Dmitry, thank you so much for talking about TPI and also sharing how you are helping data scientists and machine learning teams. And as usual, I would love to have you back on the show. Thank you.
Dmitry Petrov: Thank you.