Home > OS >  How to find differences between the same table but from 2 databases(snowflake/sql)
How to find differences between the same table but from 2 databases(snowflake/sql)

Time:12-22

So I have a table that is in Testing and Production and I want to check to see whether there is any discrepancy on "code_num" column. I need to do this to validate that the table has been migrated correctly.

Is there an easy way to do this through SQL

CodePudding user response:

You should be able to access both prod and dev from same web interface. Depending on your settings, you can access prod data like select * from prod.myschema.mytable and test by select * from test.myschema.mytable.

Then you can easily compare both of them like below-

--code_num in test but not in prod
select * from test.myschema.mytable where code_num not in (select code_num from prod.myschema.mytable);
--code_num in prod but not in test
select * from prod.myschema.mytable where code_num not in (select code_num from test.myschema.mytable);

CodePudding user response:

The problem statement that you have mentioned is also referred as "Load Assurance" or "Audit, Balance and Control" or "Reconciliation". The solution for this is usually designed and developed as a framework. The framework is developed in such a way that it should be able to take in different reconciliation rules, execute those rules and then persist for visualization later.

Now, coming to the techniques to reconcile. Reconciliation between source and target in a Data Lake or DWH scenario becomes challenging due to the volume of data. It would be computationally cost prohibitive if you would like to do a row by row comparison. So, the technique that is usually used is based on aggregation. I have earlier developed multiple such solutions which calculates the below aggregations on both source and target and then compares the result to find differences in migration. The aggregation functions are as below

Numeric Columns: MAX(), MIN(), Count of Distinct Values, Count of Null and Count(*) Non-Numeric: Count of Distinct Values, Count of Null

For row comparison, take a sample of rows(ensuring that it represents the population) and then created a message digest(MD5) of the rows to compare between source and target

  • Related