I have a DataFrame/spreadsheet that has columns for employee info (name, worksite) and also columns for total hours worked. My main goal is to find employees that exist in one file but not the other.
DataFrame ORIGINAL
:
Name Site ....other columns
Anne A
Bob B
Charlie A
Dataframe NEW
:
Name Site ....other columns
Anne A
Bob B
Doug B
DataFrame NEW
is very similar to ORIGINAL
with a few differences and these are the details I am looking to show
- Charlie/A was only in
ORIGINAL
- Doug/B was only in
NEW
I found this solution, which works okay but I need to perform it twice to find records in one DataFrame and not the other, and then again but vice-versa.
Here is the code I have:
COLS = ['Name','Site'] # Columns to group by to find a 'unique' record
# Records in New, not in Original
df_right = ORIGINAL.merge(NEW.drop_duplicates(), on=COLS, how='right', indicator=True)
df_right = df_right[df_right._merge != 'both'] # Filter out records that exist in both.
# Records in Original, not in New
df_left = ORIGINAL.merge(NEW.drop_duplicates(), on=COLS, how='left', indicator=True)
df_left = df_left[df_left._merge != 'both']
df = pd.concat([df_left,df_right])
# df now contains Name/Site records that exist in one DataFrame but not the other
Is there a better way to perform this check instead of doing it twice and concatenating?
CodePudding user response:
You can actually convert the dataframes into Index
es, and then simply use isin
to check if the whole rows are in the other dataframe:
cols = ['Name', 'Site']
originalI = pd.Index(ORIGINAL[cols])
newI = pd.Index(NEW[cols])
out = pd.concat([
ORIGINAL[~originalI.isin(newI)].assign(from_df='ORIGINAL'),
NEW[~newI.isin(originalI)].assign(from_df='NEW'),
])
Output:
>>> out
Name Site from_df
2 Charlie A ORIGINAL
2 Doug B NEW
CodePudding user response:
Looks like using 'outer' as the how
was the solution
z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records from both
Output looks like this (after only showing the columns I care about)
Name Site _merge
Charlie A left_only
Doug B right_only
CodePudding user response:
If you only want the "Name"-"Site" pairs; then I think this could work:
out = (pd.concat((ORIGINAL[['Name','Site']].assign(From='ORIGINAL'),
NEW[['Name','Site']].assign(From='NEW')))
.drop_duplicates(subset=['Name', 'Site'], keep=False))
Output:
Name Site From
2 Charlie A ORIGINAL
2 Doug B NEW