I have two dataframes as per below:
df1 = [['tom', 10],['nick',15], ['juli',14]]
df2= [['juli', 14],['daniel',15], ['tom',10], ['tom',10]]
Please note that the dataframes might not have the same index order, but I would like to receive the differences between them as an output.
So I would expect the output to be:
['nick',15]
['daniel',15]
['tom',10]
If possible, I would like to know which data frame contains each value. In the example provided above, the first item (['nick',15]
) belongs to df1
and the others to df2
.
Bonus: Is there a way to export the output to Excel?
CodePudding user response:
Assuming this input:
df1 = pd.DataFrame([['tom', 10],['nick',15], ['juli',14]])
df2 = pd.DataFrame([['juli', 14],['daniel',15], ['tom',10], ['tom',10]])
You could use merge
with the indicator
option.
The rationale here is to create an additional column with an index per group to identify the duplicates.
cols = list(df1.columns)
(df1.assign(idx=df1.groupby(cols).cumcount())
.merge(df2.assign(idx=df2.groupby(cols).cumcount()),
on=list(df1.columns) ['idx'],
indicator=True,
how='outer')
.drop('idx', axis=1)
.query('_merge != "both"')
#.to_excel('output.xlsx') ## uncomment to export as xlsx
)
output:
0 1 _merge
1 nick 15 left_only
3 daniel 15 right_only
4 tom 10 right_only