I want to compare two pandas dataframe and i have to print row which are not matching along with column anme which is not matching as remark column. my dataframe look like
df1
id first_name last_name salary
1 AAA FFF 1000
2 BBB GGG 1000
3 CCC HHH 1000
4 DDD III 1000
5 EEE JJJ 1000
7 PPP QQQ 5000
df2
id first_name last_name salary
1 AAA FFF 2000
2 BBB GGG 1000
3 CCC HHH 1000
4 OOO III 1000
5 EEE JJJ 1000
6 YYY ZZZ 5000
expected df
id first_name last_name salary remark
1 AAA FFF 1000
1 AAA FFF 2000 salary
4 DDD III 1000
4 OOO III 1000 first_name
6 YYY ZZZ 5000 not present in df1
7 PPP QQQ 5000 not present in df2
I tried so much but i did not find expected soulution.
CodePudding user response:
I would use:
tmp1 = pd.concat([df1.set_index('id'), df2.set_index('id')],
keys=['df1', 'df2'], axis=1)
tmp2 = tmp1['df1'].ne(tmp1['df2'])
m1 = tmp2.any(axis=1)
m2 = tmp2.all(axis=1)
out = tmp1[m1].stack(0).reset_index(1)
out['remark'] = tmp2[m1].dot(df1.columns.difference(['id']) ', ').str[:-2]
out.loc[m2, 'remark'] = 'Only present in ' out.pop('level_1')[m2]
# or for "not present in"
# out.loc[m2, 'remark'] = 'Not present in ' out.pop('level_1')[m2].map({'df1': 'df2', 'df2': 'df1'})
Output:
first_name last_name salary remark
id
1 AAA FFF 1000.0 salary
1 AAA FFF 2000.0 salary
4 DDD III 1000.0 first_name
4 OOO III 1000.0 first_name
7 PPP QQQ 5000.0 Only present in df1
6 YYY ZZZ 5000.0 Only present in df2