Role Of Distributed Computing In The Data Science Working Environment

Guest: Matthew Powers (LinkedIn, Twitter)
Company: Coiled (Twitter)
Show: Let’s Talk

Coiled is an easy way to run Dask, which makes it possible to run computations on a cluster of computers. With it, companies can be empowered to do things like training machine learning (ML) models. And given how complicated data science analytics can be, this could be a real boon for some companies.

In conjunction with Linode, Coiled wrote a whitepaper about the role of distributed computing in the data science working environment. Matthew Powers, Tech Evangelist at Coiled, says of the paper, “There are two ways to scale data science analysis: You can scale out, which is scaling to a variety of different machines in a cluster environment, or you can scale up, which is using all of the cores on your existing machine.” Powers continues that the whitepaper is “really talking about how you can scale your analysis using Dask. Of course, Dask works very well with NumPy and Pandas, but with NumPy and Pandas, you hit limits and that’s when you need to start using technologies like Dask to start training your models.”

According to Powers, the target audience for this whitepaper is anybody looking to do advanced analytics. On this, Powers says, “It’s a wide audience I would say, it’s definitely more data scientists, but also data engineers. There’s a lot of overlap between the two as you know.”

Data scientists and engineers face specific challenges when working at scale. Powers notes that a typical data scientist often trains a model on their local machine. This may work at a small scale but, as Powers says, “then the data size grows, and now the model takes eight hours to train, and they don’t have any sort of iterative workflow where they can do this experimentation because it takes eight hours and then they can’t do their job.”

Another issue data scientists face is workflow and getting everything up and running. Powers indicates distributed computing solves this by making sure they don’t have to “think about the underlying communication between the machines, or how the tasks are being allocated to the different machines.”
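To make the idea concrete, here is a minimal sketch of the split, compute, and combine pattern that a framework like Dask handles on the user's behalf. It uses only Python's standard library rather than Dask's API, and the function names (split_into_chunks, partial_sum_and_count, parallel_mean) are hypothetical, chosen for illustration. Threads are used only to show the shape of the pattern; for pure Python code they do not give real CPU parallelism, which is one reason tools like Dask exist.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of near-equal size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_and_count(chunk):
    """The independent unit of work (one task) run on a single chunk."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_workers=4):
    """Compute a mean by splitting the data, mapping over chunks, and combining."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split_into_chunks(values, n_workers)
    # The pool decides which worker runs which task; the caller never does.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


data = list(range(1, 1001))
result = parallel_mean(data, n_workers=4)
print(result)
```

The caller only asks for a mean; how the data is chunked and which worker handles which chunk stays hidden behind parallel_mean. A distributed framework applies the same pattern across machines and adds the scheduling and communication this sketch leaves out.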

The summary of the show is written by Jack Wallen

[expander_maker]

Swapnil Bhartiya: Hi. This is your host, Swapnil Bhartiya, and welcome to TFiR Let’s Talk, and today we have with us Matthew Powers, Tech Evangelist at Coiled. And today we are going to talk about the necessity of distributed computing in data science. This discussion is based on the white paper that Coiled wrote along with Linode. But before we go into that, Matthew, it’s great to have you on the show.

Matthew Powers: Yeah, thanks for having me. Great to be here.

Swapnil Bhartiya: Before we get started, I want some background. So first of all, tell us a bit about Coiled, and what do you do at the company as a tech evangelist?

Matthew Powers: Yeah, sure. So Coiled is basically a very easy way to run Dask. Dask makes it very easy to run computations on a cluster of computers. And as a tech evangelist, I’m trying to help people understand about how to use Dask and how to easily train their machine learning models and hopefully make their life a little bit better, because to be honest, running data science analysis can be kind of tough.

Swapnil Bhartiya: Now let’s talk about this white paper that Coiled wrote along with Linode. The topic is, of course, data science projects and the role of distributed computing in the data science working environment. If I may ask you, tell us a bit about the white paper. What is it all about?

Matthew Powers: Basically when you’re trying to train these big machine learning models on large datasets, it’s really challenging. It’s challenging from a DevOps perspective, and it’s challenging from just getting your models to run on these large data sets. So the white paper is about, we have these huge data sets, we want to train machine learning models to further our business objectives, and how do we get these models to run on these huge data sets? And Dask is a great solution to make that happen.

Swapnil Bhartiya: Can you share some key highlights of this white paper?

Matthew Powers: There’s two ways to scale data science analysis. You can scale out, which is scaling to a variety of different machines in a cluster environment, or you can scale up, which is using all of the cores on your existing machine. So we talked about how Dask can be used to scale up and scale out our analysis, and obviously you can only scale up to the number of cores on a given machine, and then once you hit a certain limit, that’s when you need to use a cluster of machines and start distributing the computations. So it’s really talking about how you can scale your analysis using Dask. Of course, Dask works very well with NumPy and Pandas, but with NumPy and Pandas, you hit limits and that’s when you need to start using technologies like Dask to start training your models.

Swapnil Bhartiya: Right. And if I may ask you, from Dask’s point of view or from the white paper’s point of view, when we do talk about data science, ML, deep learning, it doesn’t matter what term we use. Of course, traditionally they are data scientists, but a lot of engineers are also using these tools out there. So who would you say is the target audience for this white paper?

Matthew Powers: I think the target audience for this white paper is really anybody who’s looking to do advanced analytics. So like Dask interops with a lot of different technologies. So even if you want to do like a cluster of GPUs for example, Dask is good for that. If you want to just scale out your Pandas or NumPy analysis, Dask is good for that, and that’s more just like your basic data frame analysis. So it’s good for data engineers as well. And then obviously if you’re running kind of scikit-learn type models and you’re having trouble scaling those, Dask is good for that too. So it’s a wide audience I would say, it’s definitely more data scientists, but also data engineers. There’s a lot of overlap between the two as you know.

Swapnil Bhartiya: Right, yeah. Everybody uses this. You talked about one of the challenges, scaling, and that’s very unique to clusters. Can you also talk about the typical challenges that data scientists or data engineers, as you said, face when they deal with scaling and have to go across clusters?

Matthew Powers: Yeah. So it’s really tough being a data scientist because you need to know all the statistics, you need to know all the math, you need to know the models, and then you have all this huge DevOps and scaling stuff. And it’s interesting because data scientists don’t usually care about the DevOps and these low-level details. They want their models to run. And it’s unfortunate because they need to spend so much time thinking about this.

So I’d say that a lot of times a typical data scientist they’ll be training something on their local machine, it’ll work maybe, and then the data size grows, and now the model takes eight hours to train, and they don’t have any sort of iterative workflow where they can do this experimentation because it’s like impractical, because it takes eight hours and then they can’t do their job. So being a data scientist is kind of painful, unless you’re using the right tools, to be honest.

Swapnil Bhartiya: Now, once again you mentioned workflow. So can you talk about the workflow challenges that data scientists or data engineers face, and how distributed computing solves them?

Matthew Powers: Basically, it’s kind of interesting, but one of the main workflow challenges a data scientist faces is getting their environment set up properly. Installing the right software is really hard, especially now that some people are using different machines with different chips. So just getting all the dependencies set up properly with GPUs and C++ and all these things, that’s tough. Then you need to think about executing the commands, if that’s going to be locally or in the cloud. Then you need to think about provisioning cloud-based resources, or if you’re in an HPC environment, how you’re going to use all the cores of your machine. So data scientists have a ton of workflow challenges before they even get to the model training part. I mean that’s even before they’ve written any code.

Swapnil Bhartiya: If you look at the white paper, I want to ask you: how critical are iteration and speed in data science, because you mentioned iterations, and why is that?

Matthew Powers: So I’d say iteration is very critical when it comes to data science, because it’s a lot of experimentation. You’re just looking at the data, and you’re saying, what works. You’re like, “I don’t know what works, I need to check which variables give me results.” So in order to run those experiments, you need some fast iteration times. And if your model takes eight hours to run or 10 hours to run, or even 30 minutes to run, that’s pretty painful. So you really do need a setup where you can train your model, and get some quick feedback and then tinker, and then train it again, and then see which is going to give you the best results. So I’d say it’s kind of interesting. I’ve seen this where it’s like if your model takes a few hours to train, it just dramatically decreases the data scientist’s productivity.

Swapnil Bhartiya: We also learn in this white paper that parallelism and distributed systems support data science at scale, and scale was the challenge that we were discussing earlier. Can you tell us more about that?

Matthew Powers: Basically, a data scientist wants to be able to run their computations on a big cluster of machines. And they don’t want to think about the underlying communication between the machines, or how the tasks are being allocated to the different machines. They just want to run these computations and then have the underlying framework manage all those low-level details. So basically what Dask does is, it’s going to take the instructions from the data scientist, and then split them among all of the machines in the cluster, and run the computations in parallel. The data scientist doesn’t need to think about that. And that’s a classic abstraction, and it’s critical for data science because they have so many other things to be thinking about, which is like how to make the best model for the business.

Swapnil Bhartiya: Since you have been in this industry space for so long, if I ask you, what are the trends that you are seeing in the data science or data engineering space? Because it will be fair to say we live in a data-driven world; without all the analytics, most of what we are doing won’t even matter. We are also moving a lot toward automation, and in security, too, AI and ML are playing a big role. So talk about what kind of changes you are seeing and where the adoption is growing.

Matthew Powers: So at least for me, the reason I got into this data world is because I think that’s the future, right? Businesses that want to get more profitable, the way they’re going to do that is with data, and training advanced analytics models, and then deploying them to production. So certain trends I’m seeing in this industry are versioned data, that’s a big one. Just managing the data, and managing your data warehouse, querying it in an effective manner, that’s hard, and you really want a fixed version of your data, so when you’re training your models, your results aren’t being conflated by the new data that’s being added. That’s one trend I’m seeing, so versioned data.

Another one is getting better at productionalizing your models. That’s like MLOps, and of course training models and giving data scientists an environment where they can be productive. I think that that’s a big one because there are a lot of different places you can train your models, but if you don’t like that environment and you’re not productive, it doesn’t matter. And that’s probably the reason why I joined Dask, is just because I feel like it’s a very productive environment. It feels very Pythonic, and data scientists like that. They feel comfortable in the Dask world. So that’s why they’re happy to be building their models there.

Swapnil Bhartiya: You talked about data warehouses. We also hear a lot of discussion these days about data warehouses versus data lakes. They both have their own capabilities and also limitations. But what are the trends that you’re seeing, and where are people moving more towards? Because one big challenge with data, when you’re doing analytics, is that you cannot move that much data around. So once you get some data somewhere… So can you talk about that also, or is that not something that you worry about too much?

Matthew Powers: Yeah, I think about it a lot. I’m seeing a big trend toward people storing data in Parquet files in cloud-based storage systems. And that’s great, but then Parquet files have limitations, like they’re immutable and stuff like that. So let’s say you have some GDPR compliance stuff and all of your data is stored in Parquet files and then you need to delete certain records. That’s actually a surprisingly expensive operation with Parquet lakes.

So we’re seeing these trends towards things that solve certain problems, and then all of a sudden, problems that were easy to solve with a relational database are really hard to solve with the new Parquet files in the cloud-based storage system. So it’s interesting how the industry is grappling with the… We have new solutions that are more scalable, but then they can’t solve what used to be a basic problem, which is deleting a few records.

Swapnil Bhartiya: Matthew, thank you so much for taking time out today to talk not only about your own journey and the company, but also about the white paper, and to share your insights on where we are moving with data in the data-driven world, and the challenges facing data scientists or data engineers, whatever you call them. The roles are changing, evolving, but thanks for the insights and I’d love to have you back on the show. Thank you for your time today.

Matthew Powers: Thank you so much for having me.

[/expander_maker]
