Home > database >  How to compare two python pandas dataframe and find unmatched column name as status
How to compare two python pandas dataframe and find unmatched column name as status

Time:02-01

I want to compare two pandas dataframe and i have to print row which are not matching along with column anme which is not matching as remark column. my dataframe look like

            df1 
id first_name last_name salary
 1        AAA       FFF   1000
 2        BBB       GGG   1000
 3        CCC       HHH   1000
 4        DDD       III   1000
 5        EEE       JJJ   1000
 7        PPP       QQQ   5000
             df2
 id first_name last_name salary
 1        AAA       FFF   2000
 2        BBB       GGG   1000
 3        CCC       HHH   1000
 4        OOO       III   1000
 5        EEE       JJJ   1000
 6        YYY       ZZZ   5000
           expected df
 id first_name last_name salary remark
 1        AAA       FFF   1000
 1        AAA       FFF   2000  salary
 4        DDD       III   1000
 4        OOO       III   1000  first_name
 6        YYY       ZZZ   5000  not present in df1
 7        PPP       QQQ   5000  not present in df2

I tried so much but i did not find expected soulution.

CodePudding user response:

I would use:

tmp1 = pd.concat([df1.set_index('id'), df2.set_index('id')],
                 keys=['df1', 'df2'], axis=1)

tmp2 = tmp1['df1'].ne(tmp1['df2'])

m1 = tmp2.any(axis=1)
m2 = tmp2.all(axis=1)

out = tmp1[m1].stack(0).reset_index(1)

out['remark'] = tmp2[m1].dot(df1.columns.difference(['id']) ', ').str[:-2]

out.loc[m2, 'remark'] = 'Only present in '   out.pop('level_1')[m2]
# or for "not present in"
# out.loc[m2, 'remark'] = 'Not present in '   out.pop('level_1')[m2].map({'df1': 'df2', 'df2': 'df1'})

Output:

   first_name last_name  salary               remark
id                                                  
1         AAA       FFF  1000.0               salary
1         AAA       FFF  2000.0               salary
4         DDD       III  1000.0           first_name
4         OOO       III  1000.0           first_name
7         PPP       QQQ  5000.0  Only present in df1
6         YYY       ZZZ  5000.0  Only present in df2
  • Related