Pandas better method to compare two dataframes and find entries that only exist in one


I have a DataFrame/spreadsheet with columns for employee info (name, worksite) as well as columns for total hours worked. My main goal is to find employees who exist in one file but not the other.

DataFrame ORIGINAL:

 Name      Site    ....other columns
 Anne      A
 Bob       B
 Charlie   A

DataFrame NEW:

 Name      Site    ....other columns
 Anne      A
 Bob       B
 Doug      B

DataFrame NEW is very similar to ORIGINAL, with a few differences; these are the details I am looking to show:

  1. Charlie/A was only in ORIGINAL
  2. Doug/B was only in NEW

I found a solution that works, but I need to perform it twice: once to find records in one DataFrame but not the other, and then again vice versa.

Here is the code I have:

COLS = ['Name','Site'] # Columns to group by to find a 'unique' record

# Records in New, not in Original
df_right = ORIGINAL.merge(NEW.drop_duplicates(), on=COLS, how='right', indicator=True)
df_right = df_right[df_right._merge != 'both'] # Filter out records that exist in both.

# Records in Original, not in New
df_left = ORIGINAL.merge(NEW.drop_duplicates(), on=COLS, how='left', indicator=True)
df_left = df_left[df_left._merge != 'both']

df = pd.concat([df_left,df_right])
# df now contains Name/Site records that exist in one DataFrame but not the other

Is there a better way to perform this check instead of doing it twice and concatenating?

CodePudding user response:

You can convert the key columns of each DataFrame into a MultiIndex, and then simply use isin to check whether the whole rows are present in the other DataFrame (a plain pd.Index cannot be built from a multi-column DataFrame, so use pd.MultiIndex.from_frame):

cols = ['Name', 'Site']
originalI = pd.MultiIndex.from_frame(ORIGINAL[cols])
newI = pd.MultiIndex.from_frame(NEW[cols])

out = pd.concat([
    ORIGINAL[~originalI.isin(newI)].assign(from_df='ORIGINAL'),
    NEW[~newI.isin(originalI)].assign(from_df='NEW'),    
])

Output:

>>> out
      Name Site   from_df
2  Charlie    A  ORIGINAL
2     Doug    B       NEW
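For reference, here is a runnable version of this approach using the sample data from the question (a sketch; the DataFrame construction is assumed from the tables above):

```python
import pandas as pd

# Sample data from the question
ORIGINAL = pd.DataFrame({'Name': ['Anne', 'Bob', 'Charlie'],
                         'Site': ['A', 'B', 'A']})
NEW = pd.DataFrame({'Name': ['Anne', 'Bob', 'Doug'],
                    'Site': ['A', 'B', 'B']})

cols = ['Name', 'Site']
# Build a MultiIndex from the key columns of each frame
originalI = pd.MultiIndex.from_frame(ORIGINAL[cols])
newI = pd.MultiIndex.from_frame(NEW[cols])

# Keep rows whose (Name, Site) pair is absent from the other frame
out = pd.concat([
    ORIGINAL[~originalI.isin(newI)].assign(from_df='ORIGINAL'),
    NEW[~newI.isin(originalI)].assign(from_df='NEW'),
])
```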

CodePudding user response:

Looks like using 'outer' as the `how` argument was the solution:

z = pd.merge(ORIGINAL, NEW, on=COLS, how='outer', indicator=True)
z = z[z._merge != 'both'] # Keep only records that are not in both

Output looks like this (showing only the columns I care about):

  Name       Site   _merge
  Charlie    A     left_only
  Doug       B     right_only
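If the left_only/right_only labels feel opaque, the indicator values can be mapped to the DataFrame names; a runnable sketch with the question's sample data (the `source` column name is my own choice, not from the question):

```python
import pandas as pd

# Sample data from the question
ORIGINAL = pd.DataFrame({'Name': ['Anne', 'Bob', 'Charlie'],
                         'Site': ['A', 'B', 'A']})
NEW = pd.DataFrame({'Name': ['Anne', 'Bob', 'Doug'],
                    'Site': ['A', 'B', 'B']})
COLS = ['Name', 'Site']

z = pd.merge(ORIGINAL, NEW, on=COLS, how='outer', indicator=True)
z = z[z._merge != 'both'].copy()  # keep rows present in only one frame

# Map the indicator values to readable DataFrame names
z['source'] = z['_merge'].map({'left_only': 'ORIGINAL', 'right_only': 'NEW'})
```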

CodePudding user response:

If you only want the "Name"/"Site" pairs, then I think this could work:

out = (pd.concat((ORIGINAL[['Name','Site']].assign(From='ORIGINAL'), 
                  NEW[['Name','Site']].assign(From='NEW')))
       .drop_duplicates(subset=['Name', 'Site'], keep=False))

Output:

      Name Site      From
2  Charlie    A  ORIGINAL
2     Doug    B       NEW
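A runnable version with the question's sample data. One caveat worth noting: keep=False drops every repeated pair, so if the same Name/Site pair appeared more than once within a single DataFrame, it would be excluded from the output as well:

```python
import pandas as pd

# Sample data from the question
ORIGINAL = pd.DataFrame({'Name': ['Anne', 'Bob', 'Charlie'],
                         'Site': ['A', 'B', 'A']})
NEW = pd.DataFrame({'Name': ['Anne', 'Bob', 'Doug'],
                    'Site': ['A', 'B', 'B']})

# Stack both key-column frames, then drop every pair that occurs twice
out = (pd.concat((ORIGINAL[['Name', 'Site']].assign(From='ORIGINAL'),
                  NEW[['Name', 'Site']].assign(From='NEW')))
       .drop_duplicates(subset=['Name', 'Site'], keep=False))
```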