Joining two datasets by multiple columns and marking the column where the mismatch happens

Time:12-08

I have two datasets:

df_1 = 

 my_id      col_1    col_2       value
ABC111       null      289      374578
ABC113        456      279      335368
ADC110        757      289      374568
ABC145        366      299      374578
ACC122       null      289      374220

df_2 =

 my_id      col_1    col_2       value_new
ABC000       null      289          374578
ABC113        456      279          330008
ADC110        757     null          374568
ABC145        366      299          374578
ACC122       null      289          374229
ACC999       null      289          374229

To see which rows are missing from df_1 or df_2, I did a full join on all 4 columns. This shows me which rows don't match, and how many. The problem is that I also want to see which column causes the mismatch.

Desired outputs:

missing_keys_from_df_1 =

     my_id      col_1    col_2       value_new  my_id_check col_1_check col_2_check val_check   
    ABC000       null      289          374578         No          Yes         Yes       Yes
    ABC113        456      279          330008         Yes         Yes         Yes       No
    ADC110        757     null          374568         Yes         Yes         No        Yes
    ABC145        366      299          374578         Yes         Yes         Yes       Yes
    ACC122       null      289          374229         Yes         No          No        No
    ACC999       null      289          374229         No          No          No        No

So, basically, I want to copy df_2 and add 4 boolean columns that check whether that column value is in df_1. Is this possible?

CodePudding user response:

If you join on the ID, this can be achieved with the query below. If the ID itself has to be checked as well, the join would have to be on the other columns, which would not give the expected results.

SELECT COALESCE(df_1.my_id, df_2.my_id) AS my_id
    ,COALESCE(df_1.col_1, df_2.col_1) AS col_1
    ,COALESCE(df_1.col_2, df_2.col_2) AS col_2
    ,COALESCE(df_1.value, df_2.value_new) AS value
    -- A comparison with NULL is never true, so a NULL on either side yields 'NO'
    ,CASE WHEN df_1.col_1 = df_2.col_1 THEN 'YES' ELSE 'NO' END AS col_1_check
    ,CASE WHEN df_1.col_2 = df_2.col_2 THEN 'YES' ELSE 'NO' END AS col_2_check
    ,CASE WHEN df_1.value = df_2.value_new THEN 'YES' ELSE 'NO' END AS value_check
FROM df_1
FULL OUTER JOIN df_2 ON df_1.my_id = df_2.my_id
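
If the goal is literally "copy df_2 and add the check columns", a LEFT JOIN from df_2 may be closer to what was asked. The following is only a sketch under some assumptions: my_id is unique in df_1, a my_id that is missing from df_1 gets 'No' in every check column, and two NULLs count as a match. The desired output in the question does not follow one single rule for every row, so the result here will not reproduce it exactly. It uses Python's built-in sqlite3 module purely to make the query runnable; the null-safe `IS` comparison is SQLite syntax, and other engines spell it differently (for example `IS NOT DISTINCT FROM` or `<=>`).

```python
import sqlite3

# In-memory database loaded with the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE df_1 (my_id TEXT, col_1 INTEGER, col_2 INTEGER, value INTEGER);
CREATE TABLE df_2 (my_id TEXT, col_1 INTEGER, col_2 INTEGER, value_new INTEGER);
INSERT INTO df_1 VALUES
  ('ABC111', NULL, 289, 374578),
  ('ABC113', 456,  279, 335368),
  ('ADC110', 757,  289, 374568),
  ('ABC145', 366,  299, 374578),
  ('ACC122', NULL, 289, 374220);
INSERT INTO df_2 VALUES
  ('ABC000', NULL, 289,  374578),
  ('ABC113', 456,  279,  330008),
  ('ADC110', 757,  NULL, 374568),
  ('ABC145', 366,  299,  374578),
  ('ACC122', NULL, 289,  374229),
  ('ACC999', NULL, 289,  374229);
""")

# Keep every df_2 row; df_1 columns are NULL when the my_id has no partner.
# "a IS b" is SQLite's null-safe equality (NULL IS NULL is true).
QUERY = """
SELECT df_2.my_id, df_2.col_1, df_2.col_2, df_2.value_new,
       CASE WHEN df_1.my_id IS NOT NULL THEN 'Yes' ELSE 'No' END AS my_id_check,
       CASE WHEN df_1.my_id IS NOT NULL AND df_1.col_1 IS df_2.col_1
            THEN 'Yes' ELSE 'No' END AS col_1_check,
       CASE WHEN df_1.my_id IS NOT NULL AND df_1.col_2 IS df_2.col_2
            THEN 'Yes' ELSE 'No' END AS col_2_check,
       CASE WHEN df_1.my_id IS NOT NULL AND df_1.value IS df_2.value_new
            THEN 'Yes' ELSE 'No' END AS val_check
FROM df_2
LEFT JOIN df_1 ON df_1.my_id = df_2.my_id
ORDER BY df_2.rowid
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With this rule, ADC110 is flagged only in col_2_check and ABC113 only in val_check, while ABC000 and ACC999 get 'No' everywhere because their my_id does not exist in df_1.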