Voltron Data aims to make the data science ecosystem more efficient by leveraging the unified tools and user interfaces of the Apache Arrow project. One of the challenges with big data was the lack of a common standard for connectivity between programming languages and computing engines. The Apache Arrow project grew out of the need to improve interoperability and connectivity, while also providing a technology that would serve as a foundation for developing next-generation, accelerated, in-memory computing.
In this episode of TFiR Let’s Talk, Swapnil Bhartiya sits down with Wes McKinney, Co-Founder and CTO of Voltron Data, to discuss the challenges of big data, which led to the creation of the Apache Arrow project. He explains how Voltron Data is using Apache Arrow to empower data science developers to accelerate getting data into applications.
Key highlights from this video interview are:
- Voltron Data aims to create unified tools and user interfaces, accelerating the delivery of value to the corporate enterprise world through the Arrow project. McKinney discusses the Apache Arrow open source project and ecosystem that he has been building over the past seven years and how this led to the creation of Voltron Data.
- One of the challenges with big data was the lack of a common standard for connectivity between programming languages and computing engines. McKinney explains why he created the Apache Arrow project and what he had been doing at the time that led to the creation of the project.
- Arrow started out as a technology for interoperability and connectivity and was initially successful in that role. It has been adopted across data warehouses and database systems to accelerate getting data into applications. McKinney describes how widely it has been adopted and by what sort of companies.
- McKinney discusses the value Voltron Data is bringing to the Arrow ecosystem, to which it is the largest corporate contributor. He explains the company’s involvement in the project and what they hope to achieve from it.
- The Data Thread is an upcoming event, which aims to showcase the diversity of the ecosystem and the different ways people are using Arrow. McKinney goes into detail about what to expect from the event.
- The Data Thread will have several different themes for the talks, with some focused on interoperability, connectivity, improving usability, and the efficiency of plugging systems together. McKinney explains that there will be lots of different use cases and describes who would be interested in the event.
The summary of the show is written by Emily Nicholls.
Here is the automated and unedited transcript of the recording. Please note that the transcript has not been edited or reviewed.
Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya and welcome to another episode of TFiR Let’s Talk. And today we have with us Wes McKinney, co-founder and CTO of Voltron Data. Wes, it’s great to have you on our show.
Wes McKinney: Thanks for having me.
Swapnil Bhartiya: And the focus of today’s discussion is a new event that you folks are organizing, The Data Thread. So of course, we’ll talk a lot about the event, but before that, I would love to know a bit about Voltron Data, because you are a co-founder. Tell me what specific problem you saw in this space that you wanted to solve, which led to the creation of this company.
Wes McKinney: Yes. There are many pieces to the puzzle, but I’ve been working on building the Apache Arrow open source project and ecosystem for the last seven years or so. And given the progress that the open source community has made over the last six to seven years, we saw an opportunity to bring together key innovators from the Arrow ecosystem. So Josh Patterson, who built and led the RAPIDS GPU-accelerated analytics project at Nvidia; BlazingSQL; and myself and my organization, Ursa Computing, came together to create a single analytical computing company building on the success of the Apache Arrow ecosystem. So we saw this opportunity to build an organization that can create unified tools and user interfaces and accelerate bringing value to the corporate enterprise world through the Arrow project.
Swapnil Bhartiya: Since, as you mentioned, you co-created the Apache Arrow project, I would also like to hear the origin story of Apache Arrow. Once again, why did you create it, and what were you doing at that time that led to the creation of the project?
Wes McKinney: Yes. So at the time I had spent the previous six or seven years working primarily in the Python data science ecosystem. I had created the Python pandas project and had been working to build out the tools for doing data science in Python.
And in 2014, my startup DataPad was acquired by Cloudera, and I started working with the big data ecosystem. So Apache Spark, Apache Impala, other systems in the Hadoop ecosystem. I started to look at building interfaces between Python and the big data ecosystem and found that there wasn’t a common standard for connectivity between programming languages and computing engines. And (given all of the changes that were happening around the same time, with networking getting a lot faster, hard drives and storage getting a lot faster) we were searching for solutions to improve interoperability and connectivity while also having a technology that would serve as a foundation for developing next-generation, accelerated, in-memory computing.
Swapnil Bhartiya: Can you talk about what kind of adoption you are seeing of the Apache Arrow project and what kind of use cases are there?
Wes McKinney: Yes. So Arrow serves to solve a number of problems. It started out primarily as a technology for interoperability and connectivity. So being able to move large quantities of data very efficiently, much more efficiently than previous solutions, between programming languages and between computing engines.
So you can have different computing engines as part of a data pipeline, able to process the data with almost no conversion overhead between steps in a data processing pipeline. We designed the Arrow data format and the project also to serve as a place to develop accelerated in-memory computing and query processing. So Arrow has initially been very successful as an interoperability and connectivity technology, accelerating database drivers. It’s been adopted across data warehouses and database systems to get data into applications faster. So you can query Snowflake and get data out as Arrow. And that’s provided more than an order of magnitude performance speedup by incorporating Arrow.
But in the last couple of years, the ecosystem’s focus has shifted to building accelerated computing engines that are native to the Arrow data format so that we can provide for embedded, accelerated analytical query processing within a variety of different applications.
And so what we’re seeing across the industry is more and more systems supporting Arrow as a preferred input and output format, which gives them significantly faster throughput and the ability to export and ingest data at much higher speeds. Once you have the data in Arrow format, having requested it from a database or a storage system, you have these very high-performance query processing engines that can process the data wherever it needs to go.
And so it’s really unlocked new levels of performance and efficiency across a variety of applications, as well as achieving broad adoption among the titans of industry. It’s being used within Google, Microsoft, Meta, Netflix, and many other organizations.
Swapnil Bhartiya: I think we kind of live in a data-driven world. So when you’re naming these companies, almost every company uses data; the difference is at what scale they are using it. Also, despite all the work, there is still a lot needed to further democratize access and make it more accessible.
But going back to the point of Apache Arrow, talk about the relationship that Voltron Data has with the project. Also, we have seen the typical open-source story where you have an open-source project, and then you have a commercial vendor who kind of supports the ecosystem through commercial offerings, because open source can very easily solve day-one problems, but not day-two needs: additional functionality and features, updates, maintenance. That’s where commercial support comes into the picture. So talk about what value Voltron Data is bringing to the Arrow ecosystem or community.
Wes McKinney: Sure. Yeah. So we are the largest corporate contributor to the Arrow project. It’s a very large community. There’s been nearly a thousand unique contributors to the project, and many, many organizations have been contributing very actively.
So we’re contributing a lot of code and pushing on many new initiatives in the Arrow project. But one of the first product offerings that we rolled out as a company is called the Voltron Data Enterprise Subscription for Arrow. And our goal there is to provide essentially an open source partnership with organizations that are building on Apache Arrow, so that we can increase the probability that their Arrow projects are successful and advise them on architecture and design matters to get the most out of what the Arrow project can provide them, as well as doing strategic support and feature development.
So if an organization is building a production software project on Arrow and they run into a problem in production, or a bug where they don’t have the bandwidth to dig into the code, fix it, and upstream the patch, we can take on that burden for them. And since so many of the experts in the project work at Voltron Data, we can provide that support structure to make more and more organizations, software vendors, and large enterprises successful building on Arrow and modernizing their data stack and their data platform. So we’ve been really thrilled at the response to this offering.
And over the last six years, since the Arrow project formally started as an open source project, we’ve seen the initial stage of the early adopters and the maverick organizations that have adopted Arrow. So there’s a lot of pent-up demand for integrating Arrow into more and more systems, and having an organization that companies can rely on to help them through that journey is essential for many companies to take that first step and commit to building Arrow integration into their systems.
Swapnil Bhartiya: Excellent. So, as you said, there’s a big community around Arrow, which brings me to my next stage of this discussion, which is around the upcoming event. So, if I’m not wrong, the event will bring together a lot of these community members. And when we talk about community, it could be users, could be vendors, could be of course the maintainers of the project itself. So talk about the idea behind this event and also the name.
Wes McKinney: Yeah, I mean, it’s really an interesting community because there’s such a diversity of applications and use cases and types of projects where people are using Arrow. So you take a sampling of any five or 10 users of the project, and they may be using it in different ways. And that’s definitely a lot different from, say, an open source database or some kind of storage or big data system, where the applications tend to be somewhat more homogeneous and a lot more similar to each other.
And so I think what we’ve sought to do with this event, The Data Thread, is to showcase the diversity of the ecosystem and the different ways that people are adopting and using Arrow, and the different libraries and components and standards within it, to make their systems faster, more efficient, and more interoperable. So I’m very, very excited for the attendees to see that diversity and the different perspectives that have been brought. Because we have diversity across not just types of applications, but programming languages and so many different types of end users.
Swapnil Bhartiya: Can you also share what’s going to be a kind of common theme? You did talk about getting all the diverse use cases. But if you can just tell us what kind of sessions will be there, who’ll be talking there, what kind of things folks should be looking forward to?
Wes McKinney: Yes. So there are a few different themes to the talks. I will say that some talks are more focused on interoperability, connectivity, and improving the usability and the efficiency of systems being plugged together. Some talks are more focused on improving performance and getting the most out of modern hardware. Whereas other talks will discuss improving the usability and the user interfaces for interacting with analytical systems. So one of the goals, I would say a principal goal, of Arrow is to make big data systems significantly easier to use. And one of the objectives of Voltron Data is to enable the whole ecosystem to be substantially more modular.
So that components can be interchanged more and more easily, whether that’s using different types of hardware or different programming languages, being able to reuse and repurpose systems across different environments. And so to have real-world users and developers showing how they’ve successfully taken advantage of these ideas to create systems that are more maintainable and more usable, I think is very interesting.
So there’s a lot to offer end users, data scientists, and data analysts who are more focused on the user interface and doing real world data science projects. But there’s also a lot to interest more of the data engineers, the backend infrastructure developers who are really focused on doing work at the systems level.
Swapnil Bhartiya: Yes. Thank you so much for taking time out today to talk about not only Voltron Data, but also the event and the community. I’m looking forward to this event. Hopefully we can do coverage of this event in person in the future, but good luck with it. And I’d love to have you back on the show again. Thank you.
Wes McKinney: Thank you so much. Look forward to seeing everyone at the event next week.