Sandeep Singh Sambyal
2 min readAug 23, 2022

--

Data Validation between 2 database tables using pandas

Photo by Mika Baumeister on Unsplash

Validating data is an important part of the application development process. Whether it’s an upgrade/migration or a sync job, we have to ensure that the data is in the right format and flowing correctly between both systems.

ETL testers often use Microsoft Excel when data is less or has fewer columns to test data between the source and destination databases. However, the real problem arises when there are millions of rows with hundreds of columns. Usually I’ve seen organisations use paid tools for such tasks, though I’m going to share a pandas utility code that can help you to validate data quickly and free.

Let’s say we are migrating data from one database to another and now need to validate data between the tables after the data flow.

Code covering scenarios :

As shown above, you can choose whether to check if several columns match or to exclude certain columns based on your preference.

A report will be generated if you have a primary key, named “datacompy_report.txt”, and it will include all the mismatches. Unless the table has a distinct/primary key, we’ll have to filter “merge_report.csv” file and find fields that have mismatches manually. As the number of rows increases, skip those fields with mismatches and keep excluding those columns until you find all columns with mismatches.

Please feel free to contact me if you need any help. Thanks :-) Happy Coding :-)

--

--