Home > front end >  How to compare 2 different sized pandas DataFrames including duplicates?
How to compare 2 different sized pandas DataFrames including duplicates?

Time:01-13

I want to compare 2 different sized dataframes, which both can include duplicates, so df.Compare(df1) does not work, neither does merging and removing duplicates. The indexes can be out of place, so those are also not very helpful I believe. Using iterrows() or something similar to compare each row and deleting them also seems like a barbaric/slow solution, so I bet there are more sophisticated solutions out there.

The best description I can think of is an OUTER JOIN with NO intersection, with the difference in the amount of duplicates. So if an item is 3x in df, and 1x in df1, it should add the the item in df 2x if they are equal. Only the index is allowed to be different, as these can be scrambled.

Here are two example dataframes and my expected output:

df:
     Animal        Name
0    Cat           Peter
1    Dog           Kate
2    Rabbit        Bas
3    Dog           Kate
4    Dog           Robert

df1:
     Animal        Name
0    Snake         Nate
1    Cat           Peter
2    Dog           Kate
3    Rabbit        Bas

expected output:
     Animal        Name
0    Dog           Kate
1    Snake         Nate
2    Dog           Robert

Order does not matter, I just need the differences. Could also be two dataframes, with each containing the elements not in the other.

Does anyone know how to go about this? Thanks in advance!

Here is the code to create the DataFrames above:

import pandas as pd

df = pd.DataFrame({'Animal': ['Cat', 'Dog', 'Rabbit', 'Dog', 'Dog'], 'Name': ['Peter', 'Kate', 'Bas', 'Kate', 'Robert']})
df1 = pd.DataFrame({'Animal': ['Snake', 'Cat', 'Dog', 'Rabbit'], 'Name': ['Nate', 'Peter', 'Kate', 'Bas']})

CodePudding user response:

Use a merge with indicator after de-duplication, and filtering out the "both":

(df.assign(key=df.groupby(list(df)).cumcount())
   .merge(df1.assign(key=df1.groupby(list(df)).cumcount()),
          how='outer', indicator=True)
   .query('_merge != "both"')
)

Output:

  Animal    Name  key      _merge
3    Dog    Kate    1   left_only
4    Dog  Robert    0   left_only
5  Snake    Nate    0  right_only

CodePudding user response:

IIUC the problem, you could use value_counts take difference between series and turn it to frame.

s = df.value_counts().sub(df1.value_counts()).fillna(1)
out = s[s.ne(0)].reset_index()[['Animal', 'Name']]

print(out)

  Animal    Name
0    Dog    Kate
1    Dog  Robert
2  Snake    Nate
  • Related