
Datafold Open Sources Data Diff Tool For Comparing Datasets Across Databases


Datafold, a data reliability company, has launched a new open source cross-database diffing package called data-diff. This new product is an open source extension to Datafold’s original Data Diff tool for comparing data sets. Open source data-diff validates the consistency of data across databases using high-performance algorithms.

In the modern data stack, companies extract data from sources, load it into a warehouse, and transform it so that it can be used for analysis, activation, or data science. Datafold's original Data Diff focuses on automated testing during the transformation step, ensuring that a change to a data model does not break a dashboard or feed the wrong data to a predictive algorithm. With the launch of open source data-diff, Datafold can now help with the extract and load parts of the process as well.

Open source data-diff verifies that the data loaded into the destination matches the source from which it was extracted. Data engineers need testing across every part of the data stack to create reliable data products, and Datafold now gives them coverage throughout the extract, load, transform (ELT) process.

Available today, data-diff uses checksums to verify 100% consistency between two different data sources quickly and efficiently. This method allows a row-level comparison of 100 million records to complete in just a few seconds, without sacrificing the granularity of the resulting comparison.
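As a rough illustration of how checksum-based comparison can find row-level differences without comparing every row, consider the Python sketch below. It is not Datafold's implementation: the function names, the in-memory dictionaries standing in for database tables, and the strategy of splitting a mismatched key range in half are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Tuple[str, ...]
Table = Dict[int, Row]  # hypothetical stand-in for a table: primary key -> row values


def segment_checksum(table: Table, lo: int, hi: int) -> str:
    """Hash every row whose key falls in [lo, hi), in key order."""
    digest = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(repr((key, table[key])).encode("utf-8"))
    return digest.hexdigest()


def diff_keys(source: Table, target: Table, lo: int, hi: int,
              threshold: int = 4) -> List[int]:
    """Return the keys in [lo, hi) whose rows differ between the two tables."""
    if segment_checksum(source, lo, hi) == segment_checksum(target, lo, hi):
        return []  # checksums match: the whole segment is skipped
    if hi - lo <= threshold:
        # Segment is small enough to compare row by row.
        return [k for k in range(lo, hi) if source.get(k) != target.get(k)]
    mid = (lo + hi) // 2  # otherwise split the key range and recurse
    return (diff_keys(source, target, lo, mid, threshold)
            + diff_keys(source, target, mid, hi, threshold))


source = {i: (f"user{i}", "active") for i in range(1000)}
target = dict(source)
target[417] = ("user417", "deleted")  # a row that changed after loading
del target[900]                        # a row that was never loaded

print(diff_keys(source, target, 0, 1000))  # [417, 900]
```

In a real deployment, the checksum of each segment would presumably be computed inside each database, so that only the checksums, and not the rows themselves, cross the network until a small mismatched segment has been isolated.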

Datafold has released data-diff under the MIT license.