How to only keep dataframe rows that shares same value on a specific column-CodePudding

I have two dataframes that I want to compare, but first I want to assert that the first column (that I'm using as index) is the same for both.

df1

    A   B   C   D   E
0   a   10  5   18  20
1   b   9   18  11  13
2   c   8   7   12  5
3   z   6   5   3   90

df2

    A   B   C   D   E
0   a   10  45  10  22
1   b   99  18  11  13
2   e   8   7   12  5
3   f   6   5   3   90

I want to keep only the rows where the value in column A is in both dataframes. So I expect an output like that for df1 and df2.

df3

    A   B   C   D   E
0   a   10  5   18  20
1   b   9   18  11  13

df4

    A   B   C   D   E
0   a   10  45  10  22
1   b   99  18  11  13

I'd also like to retrieve the deleted rows.

deleted_df

    A   B   C   D   E 
0   c   8   7   12  5
1   z   6   5   3   90
2   e   8   7   12  5
3   f   6   5   3   90

I've tried this for now:

df3 = df1[df1['A'].isin(df2['A'])]
df4 = df2[df2['A'].isin(df1['A'])]

which seems to work but I'm not sure, but I still want to retrieve the difference between df3 & df1 (and espectively, df4 & df2)

CodePudding user response：

One thing you can do is outer merge with passing True for indicator:

>>> df1.merge(df2, on='A', indicator=True, how='outer', suffixes=('1',  '2',))
 
   A    B1    C1    D1    E1    B2    C2    D2    E2      _merge
0  a  10.0   5.0  18.0  20.0  10.0  45.0  10.0  22.0        both
1  b   9.0  18.0  11.0  13.0  99.0  18.0  11.0  13.0        both
2  c   8.0   7.0  12.0   5.0   NaN   NaN   NaN   NaN   left_only
3  z   6.0   5.0   3.0  90.0   NaN   NaN   NaN   NaN   left_only
4  e   NaN   NaN   NaN   NaN   8.0   7.0  12.0   5.0  right_only
5  f   NaN   NaN   NaN   NaN   6.0   5.0   3.0  90.0  right_only

This way, you will have the information if the row has been derived from both the columns, or only one of the left/right.

CodePudding user response：

use isin:

df1.loc[df1.A.isin(df2.A)]

   A   B   C   D   E
0  a  10   5  18  20
1  b   9  18  11  13

isin returns a boolean Series which you use to filter :

df1.A.isin(df2.A)
0     True
1     True
2    False
3    False
Name: A, dtype: bool

For deleted rows:

df1 = df1.set_index('A')
df2 = df2.set_index('A')
deleted = df1.index.symmetric_difference(df2.index)
pd.concat([df1, df2]).loc[deleted]
   B  C   D   E
A              
c  8  7  12   5
e  8  7  12   5
f  6  5   3  90
z  6  5   3  90