I want to find rows that are missing in the first dataframe.
So I have two dataframes:
data1 = {'first_column': ['2', '3'],
'second_column': ['2', '2'],
'third_column':['2', '1'],
}
data2 = {'first_column': ['2', '2'],
'second_column': ['2', '2'],
'third_column':['2', '2'],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
expected output:
first_column second_column third_column
3 2 1
CodePudding user response:
merge
with indicator=True
and use it to select the left_only
rows:
out = (df1
.merge(df2, how='left', indicator=True)
.loc[lambda d: d.pop('_merge').eq('left_only')]
)
NB. if you have duplicates in df2
, use df2.drop_duplicates()
in the merge to improve efficiency like shown by @AlwaysSunny.
output:
first_column second_column third_column
2 3 2 1
CodePudding user response:
If I understood your requirement properly then, One way to do it using drop_duplicates
and merge
. Finally, pick only the left_only
.
import pandas as pd
data1 = {'first_column': ['2', '3'],
'second_column': ['2', '2'],
'third_column':['2', '1'],
}
data2 = {'first_column': ['2', '2'],
'second_column': ['2', '2'],
'third_column':['2', '2'],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
how='left', indicator=True)
final_result = df.loc[df._merge=='left_only',df.columns!='_merge']
print(final_result.reset_index())
Output:
first_column second_column third_column
0 3 2 1
CodePudding user response:
here is one way to do it
# form a key by combining the column values in each row and assign to keys_list
keys_list= df2.apply(lambda x: ''.join(x.values), axis=1).to_list()
# form a key in DF1 and check for non-existence in df2
df1[df1.apply(lambda x: ''.join(x.values) not in keys_list , axis=1)]
first_column second_column third_column
1 3 2 1