I have two datasets:
df_1 =
my_id   col_1  col_2  value
ABC111  null   289    374578
ABC113  456    279    335368
ADC110  757    289    374568
ABC145  366    299    374578
ACC122  null   289    374220
df_2 =
my_id   col_1  col_2  value_new
ABC000  null   289    374578
ABC113  456    279    330008
ADC110  757    null   374568
ABC145  366    299    374578
ACC122  null   289    374229
ACC999  null   289    374229
To see which rows are missing from df_1 or df_2, I did a full join on all 4 columns. That shows me which rows don't match and how many there are. The problem is that I also want to see which column causes each mismatch.
Desired output:
missing_keys_from_df_1 =
my_id   col_1  col_2  value_new  my_id_check  col_1_check  col_2_check  val_check
ABC000  null   289    374578     No           Yes          Yes          Yes
ABC113  456    279    330008     Yes          Yes          Yes          No
ADC110  757    null   374568     Yes          Yes          No           Yes
ABC145  366    299    374578     Yes          Yes          Yes          Yes
ACC122  null   289    374229     Yes          No           No           No
ACC999  null   289    374229     No           No           No           No
So, basically, I want to copy df_2 and add 4 boolean columns, each one indicating whether the value in the corresponding column is in df_1. Is this possible?
CodePudding user response:
If you join on my_id, you can compare the remaining columns like this. If you want my_id itself to be checked as well, you would have to join on the other columns instead, and that will not give the expected results.
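-- Join on my_id, then compare the remaining columns pairwise.
-- Note: "=" evaluates to NULL when either side is NULL, so a NULL on
-- either side (including NULL on both sides) ends up as 'NO'.
SELECT COALESCE(df_1.my_id, df_2.my_id) AS my_id
      ,COALESCE(df_1.col_1, df_2.col_1) AS col_1
      ,COALESCE(df_1.col_2, df_2.col_2) AS col_2
      ,COALESCE(df_1.value, df_2.value_new) AS value
      ,CASE WHEN df_1.col_1 = df_2.col_1 THEN 'YES' ELSE 'NO' END AS col_1_check
      ,CASE WHEN df_1.col_2 = df_2.col_2 THEN 'YES' ELSE 'NO' END AS col_2_check
      ,CASE WHEN df_1.value = df_2.value_new THEN 'YES' ELSE 'NO' END AS value_check
FROM df_1
FULL OUTER JOIN df_2 ON df_1.my_id = df_2.my_id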
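
If what you want is literally "is this value present anywhere in that column of df_1" (which is how I read the last sentence of the question), no join is needed at all. Below is a minimal sketch in plain Python, not a definitive implementation. The assumptions are mine: rows are held as dicts, null is modelled as None, value_new is checked against df_1's value column, and the function name add_membership_checks and the "_check" suffix are made up for illustration. On the sample data this reproduces the first four rows of the desired output. For ACC122 and ACC999 it gives Yes for col_1 and col_2 where the desired output shows No, so the rule you have in mind for those rows may be different.

```python
# Sketch: per-column membership check of df_2 values against df_1.
# Assumptions (not from the question): rows are plain dicts, null is None,
# and df_2's "value_new" is compared against df_1's "value".

def add_membership_checks(df_1_rows, df_2_rows, column_map):
    """Return copies of the df_2 rows with a Yes/No '<col>_check' per column.

    column_map maps a df_2 column name to the df_1 column it is checked against.
    """
    # Collect the distinct values of each relevant df_1 column once, up front.
    seen = {
        df_2_col: {row[df_1_col] for row in df_1_rows}
        for df_2_col, df_1_col in column_map.items()
    }
    result = []
    for row in df_2_rows:
        checked_row = dict(row)  # copy, so the input rows are left untouched
        for col in column_map:
            checked_row[f"{col}_check"] = "Yes" if row[col] in seen[col] else "No"
        result.append(checked_row)
    return result


df_1_rows = [
    {"my_id": "ABC111", "col_1": None, "col_2": 289, "value": 374578},
    {"my_id": "ABC113", "col_1": 456, "col_2": 279, "value": 335368},
    {"my_id": "ADC110", "col_1": 757, "col_2": 289, "value": 374568},
    {"my_id": "ABC145", "col_1": 366, "col_2": 299, "value": 374578},
    {"my_id": "ACC122", "col_1": None, "col_2": 289, "value": 374220},
]
df_2_rows = [
    {"my_id": "ABC000", "col_1": None, "col_2": 289, "value_new": 374578},
    {"my_id": "ABC113", "col_1": 456, "col_2": 279, "value_new": 330008},
    {"my_id": "ADC110", "col_1": 757, "col_2": None, "value_new": 374568},
    {"my_id": "ABC145", "col_1": 366, "col_2": 299, "value_new": 374578},
    {"my_id": "ACC122", "col_1": None, "col_2": 289, "value_new": 374229},
    {"my_id": "ACC999", "col_1": None, "col_2": 289, "value_new": 374229},
]

# df_2 column -> df_1 column it should be looked up in.
column_map = {
    "my_id": "my_id",
    "col_1": "col_1",
    "col_2": "col_2",
    "value_new": "value",
}

checked = add_membership_checks(df_1_rows, df_2_rows, column_map)
for row in checked:
    print(row)
```

Note that None counts as "present" here whenever df_1 has a None in that column. If nulls should never match, skip None when building the sets.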