I want to compare 2 different sized dataframes, which both can include duplicates, so df.Compare(df1) does not work, neither does merging and removing duplicates. The indexes can be out of place, so those are also not very helpful I believe. Using iterrows() or something similar to compare each row and deleting them also seems like a barbaric/slow solution, so I bet there are more sophisticated solutions out there.
The best description I can think of is an OUTER JOIN with NO intersection, with the difference in the amount of duplicates. So if an item is 3x in df, and 1x in df1, it should add the the item in df 2x if they are equal. Only the index is allowed to be different, as these can be scrambled.
Here are two example dataframes and my expected output:
df:
Animal Name
0 Cat Peter
1 Dog Kate
2 Rabbit Bas
3 Dog Kate
4 Dog Robert
df1:
Animal Name
0 Snake Nate
1 Cat Peter
2 Dog Kate
3 Rabbit Bas
expected output:
Animal Name
0 Dog Kate
1 Snake Nate
2 Dog Robert
Order does not matter, I just need the differences. Could also be two dataframes, with each containing the elements not in the other.
Does anyone know how to go about this? Thanks in advance!
Here is the code to create the DataFrames above:
import pandas as pd
df = pd.DataFrame({'Animal': ['Cat', 'Dog', 'Rabbit', 'Dog', 'Dog'], 'Name': ['Peter', 'Kate', 'Bas', 'Kate', 'Robert']})
df1 = pd.DataFrame({'Animal': ['Snake', 'Cat', 'Dog', 'Rabbit'], 'Name': ['Nate', 'Peter', 'Kate', 'Bas']})
CodePudding user response:
Use a merge
with indicator after de-duplication, and filtering out the "both":
(df.assign(key=df.groupby(list(df)).cumcount())
.merge(df1.assign(key=df1.groupby(list(df)).cumcount()),
how='outer', indicator=True)
.query('_merge != "both"')
)
Output:
Animal Name key _merge
3 Dog Kate 1 left_only
4 Dog Robert 0 left_only
5 Snake Nate 0 right_only
CodePudding user response:
IIUC the problem, you could use value_counts
take difference between series and turn it to frame.
s = df.value_counts().sub(df1.value_counts()).fillna(1)
out = s[s.ne(0)].reset_index()[['Animal', 'Name']]
print(out)
Animal Name
0 Dog Kate
1 Dog Robert
2 Snake Nate