Following this answer:
https://stackoverflow.com/a/47107164/11462274
I try to create a DataFrame that is only the lines not found in another DataFrame, however, not according to all columns, but according to only some specific columns, so I tried to do it this way:
import pandas as pd
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10],
'col3' : [1,5,7,9,6,7]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12],
'col3' : [1,5,8]})
df_merge = df1.merge(df2.drop_duplicates(), on=['col1','col3'],
how='left', indicator=True)
df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
print(df_merge)
But note that when not using all the columns, they change their name like col2
to col2_x
:
col1 col2_x col3 col2_y _merge
0 1 10 1 10.0 both
1 2 11 5 11.0 both
2 3 12 7 NaN left_only
3 4 13 9 NaN left_only
4 5 14 6 NaN left_only
5 3 10 7 NaN left_only
So when I try to create the final DataFrame without the unnecessary columns, the unused columns are not found to generate the desired filter:
KeyError(f"{not_found} not in index")
KeyError: "['col2'] not in index"
CodePudding user response:
You can use the suffixes
parameter of pandas.DataFrame.merge
:
df_merge = df1.merge(df2.drop_duplicates(), on=['col1','col3'],
how='left', indicator=True, suffixes=("", "_"))
df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
Output :
print(df_merge)
col1 col2 col3
2 3 12 7
3 4 13 9
4 5 14 6
5 3 10 7
CodePudding user response:
Another option is that considering that it's left
join you can just drop the columns from other df that you know would overlap (thereby making a smaller merge result):
df_merge = df1.merge(df2.drop_duplicates().drop(columns=['col2']),
on=['col1','col3'], how='left', indicator=True)
df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
print(df_merge)
col1 col2 col3
2 3 12 7
3 4 13 9
4 5 14 6
5 3 10 7