Columns changing names when I try to filter rows from a DataFrame that aren't in another DataFr-CodePudding

Following this answer:

https://stackoverflow.com/a/47107164/11462274

I try to create a DataFrame that is only the lines not found in another DataFrame, however, not according to all columns, but according to only some specific columns, so I tried to do it this way:

import pandas as pd

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10],
                           'col3' : [1,5,7,9,6,7]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12],
                           'col3' : [1,5,8]})

df_merge = df1.merge(df2.drop_duplicates(), on=['col1','col3'], 
                   how='left', indicator=True)

df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
print(df_merge)

But note that when not using all the columns, they change their name like col2 to col2_x:

   col1  col2_x  col3  col2_y     _merge
0     1      10     1    10.0       both
1     2      11     5    11.0       both
2     3      12     7     NaN  left_only
3     4      13     9     NaN  left_only
4     5      14     6     NaN  left_only
5     3      10     7     NaN  left_only

So when I try to create the final DataFrame without the unnecessary columns, the unused columns are not found to generate the desired filter:

KeyError(f"{not_found} not in index")
KeyError: "['col2'] not in index"

CodePudding user response：

You can use the suffixes parameter of pandas.DataFrame.merge :

df_merge = df1.merge(df2.drop_duplicates(), on=['col1','col3'], 
                   how='left', indicator=True, suffixes=("", "_"))

df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]

Output :

print(df_merge)
   col1  col2  col3
2     3    12     7
3     4    13     9
4     5    14     6
5     3    10     7

CodePudding user response：

Another option is that considering that it's left join you can just drop the columns from other df that you know would overlap (thereby making a smaller merge result):

df_merge = df1.merge(df2.drop_duplicates().drop(columns=['col2']), 
                     on=['col1','col3'], how='left', indicator=True)

df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
print(df_merge)

  col1  col2  col3
2     3    12     7
3     4    13     9
4     5    14     6
5     3    10     7