
Role Of Distributed Computing In The Data Science Working Environment


Guest: Matthew Powers (LinkedIn, Twitter)
Company: Coiled (Twitter)
Show: Let’s Talk

Coiled is an easy way to run Dask, which makes it possible to run computations on a cluster of computers. With it, companies can do things like train machine learning (ML) models at scale. And given how complicated data science analytics can be, this could be a real boon for some companies.

In conjunction with Linode, Coiled wrote a whitepaper about the role of distributed computing in the data science working environment. Matthew Powers, Tech Evangelist at Coiled, says of the paper, “There are two ways to scale data science analysis: You can scale out, which is scaling to a variety of different machines in a cluster environment, or you can scale up, which is using all of the cores on your existing machine.” Powers continues that the whitepaper is “really talking about how you can scale your analysis using Dask. Of course, Dask works very well with NumPy and Pandas, but with NumPy and Pandas, you hit limits and that’s when you need to start using technologies like Dask to start training your models.”
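The whitepaper's Dask examples aren't reproduced here, but the scale-up idea Powers describes — spreading work across all the cores of one machine — can be sketched with Python's standard library. Notably, Dask's distributed `Client` mirrors this `concurrent.futures` interface (`submit`, `result`), which is part of what makes the jump from one machine to a cluster feel natural. The chunked-mean workload below is a hypothetical stand-in for a real analysis, not anything from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_mean(chunk):
    """Compute the mean of one chunk of data (stand-in for real work)."""
    return sum(chunk) / len(chunk)

def scaled_up_mean(chunks, max_workers=4):
    """Fan chunk-level work out across a pool of workers, then combine.

    Dask's distributed Client exposes the same submit()/result() pattern,
    which is what lets this style of code later 'scale out' to a cluster.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(chunk_mean, c) for c in chunks]
        partial_means = [f.result() for f in futures]
    # Averaging the partial means assumes equally sized chunks, for simplicity.
    return sum(partial_means) / len(partial_means)
```

For example, `scaled_up_mean([[1, 2, 3], [4, 5, 6]])` combines per-chunk means of 2.0 and 5.0 into 3.5 — the same answer as computing the mean serially, just split across workers.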

According to Powers, the target audience for this whitepaper is anybody looking to do advanced analytics. On this, Powers says, “It’s a wide audience I would say, it’s definitely more data scientists, but also data engineers. There’s a lot of overlap between the two as you know.”

Data scientists and engineers face very specific challenges when working at scale. Powers speaks about this issue by addressing how a lot of times a typical data scientist trains something on their local machine. This method may work on a small scale but, as Powers says, “then the data size grows, and now the model takes eight hours to train, and they don’t have any sort of iterative workflow where they can do this experimentation because it takes eight hours and then they can’t do their job.”

Another issue data scientists face is workflow and getting everything up and running. Powers indicates distributed computing solves this by making sure they don’t have to “think about the underlying communication between the machines, or how the tasks are being allocated to the different machines.”
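A deliberately naive toy makes the point about what such a framework hides: if you had to allocate tasks to workers yourself, even the simplest scheme means extra bookkeeping. None of this reflects Dask's actual scheduler internals — a real scheduler also weighs data locality, worker load, and failures — it is only a sketch of the work users are spared.

```python
from concurrent.futures import ThreadPoolExecutor

def assign_round_robin(tasks, n_workers):
    """Toy task allocation: deal tasks out to workers like cards.

    This is exactly the kind of bookkeeping Powers says users of a
    distributed framework shouldn't have to think about.
    """
    buckets = [[] for _ in range(n_workers)]
    for i, task in enumerate(tasks):
        buckets[i % n_workers].append(task)
    return buckets

def run_all(tasks, n_workers=2):
    """Execute each worker's bucket of no-argument tasks in its own thread."""
    buckets = assign_round_robin(tasks, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_bucket = pool.map(lambda bucket: [t() for t in bucket], buckets)
    # Flatten the per-worker result lists back into one list.
    return [result for bucket in per_bucket for result in bucket]
```

With Dask, the user just expresses the computation; the scheduler decides where each task runs and how intermediate results move between machines.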

The summary of the show is written by Jack Wallen
