I have two dataframes that I want to compare, but first I want to assert that the first column (that I'm using as index) is the same for both.
df1
A B C D E
0 a 10 5 18 20
1 b 9 18 11 13
2 c 8 7 12 5
3 z 6 5 3 90
df2
A B C D E
0 a 10 45 10 22
1 b 99 18 11 13
2 e 8 7 12 5
3 f 6 5 3 90
I want to keep only the rows where the value in column A is in both dataframes. So I expect an output like that for df1 and df2.
df3
A B C D E
0 a 10 5 18 20
1 b 9 18 11 13
df4
A B C D E
0 a 10 45 10 22
1 b 99 18 11 13
I'd also like to retrieve the deleted rows.
deleted_df
A B C D E
0 c 8 7 12 5
1 z 6 5 3 90
2 e 8 7 12 5
3 f 6 5 3 90
I've tried this for now:
df3 = df1[df1['A'].isin(df2['A'])]
df4 = df2[df2['A'].isin(df1['A'])]
which seems to work but I'm not sure, but I still want to retrieve the difference between df3 & df1 (and espectively, df4 & df2)
CodePudding user response:
One thing you can do is outer
merge with passing True
for indicator
:
>>> df1.merge(df2, on='A', indicator=True, how='outer', suffixes=('1', '2',))
A B1 C1 D1 E1 B2 C2 D2 E2 _merge
0 a 10.0 5.0 18.0 20.0 10.0 45.0 10.0 22.0 both
1 b 9.0 18.0 11.0 13.0 99.0 18.0 11.0 13.0 both
2 c 8.0 7.0 12.0 5.0 NaN NaN NaN NaN left_only
3 z 6.0 5.0 3.0 90.0 NaN NaN NaN NaN left_only
4 e NaN NaN NaN NaN 8.0 7.0 12.0 5.0 right_only
5 f NaN NaN NaN NaN 6.0 5.0 3.0 90.0 right_only
This way, you will have the information if the row has been derived from both the columns, or only one of the left/right.
CodePudding user response:
use isin
:
df1.loc[df1.A.isin(df2.A)]
A B C D E
0 a 10 5 18 20
1 b 9 18 11 13
isin
returns a boolean Series which you use to filter :
df1.A.isin(df2.A)
0 True
1 True
2 False
3 False
Name: A, dtype: bool
For deleted rows:
df1 = df1.set_index('A')
df2 = df2.set_index('A')
deleted = df1.index.symmetric_difference(df2.index)
pd.concat([df1, df2]).loc[deleted]
B C D E
A
c 8 7 12 5
e 8 7 12 5
f 6 5 3 90
z 6 5 3 90