I have two dataframes, and I want to merge them by comparing the date on different columns.
This is the dataframe with the work unit and date listed. Cum work denotes the number of jobs completed by the worker before. This workunit's leader is identified as leader name.
Work_unit Date leader_name Cum_work
unit1 11/12/2017 Bob 2
unit2 12/14/2018 David 1
unit3 12/13/2019 Ada 3
unit4 1/1/2019 Bob 3
unit5 12/13/2019 Ada 4
This is another dataframe that displays collaborators for each task. The date indicates the day that this collaborator worked on the task.
Work_unit Date collaborators
unit2 12/13/2018 Bob
unit2 12/30/2018 Ada
unit3 1/10/2019 Bob
unit3 2/3/2019 Casey
unit4 12/30/2020 Ada
unit4 12/31/2020 David
I'd like to combine two dataframes to demonstrate work collaboration and shows the collaborators' seniority.
For example, Bob is the workunit 1 leader, but he is also a collaborator in unit 2. I'd like to map Bob's cumulative work from dataframe1 (the number of leaderworks he's completed) to dataframe2 by comparing the date, so the Cumworks_sofar would be 2 for Bob in unit2.
Thus the merged Dataframe would be
Work_unit Date collaborators Cumworks_sofar
unit2 12/13/2018 Bob 2
unit2 12/30/2018 Ada 3
unit3 1/10/2019 Bob 3
unit3 2/3/2019 Casey 0
unit4 12/30/2020 Ada 4
unit4 12/31/2020 David 1
Thank you for your help
CodePudding user response:
merge_asof
with backward direction,
df1.sort_values(by=['Date'], inplace=True)
output = pd.merge_asof(df2, df1, on='Date', left_by='collaborators', right_by='leader_name', direction='backward')
output.drop(['Work_unit_y', 'leader_name'], axis=1, inplace=True)
output.rename(columns={'Work_unit_x': 'Work_unit'}, inplace=True)
output
###
Work_unit Date collaborators Cum_work
0 unit2 2018-12-13 Bob 2.0
1 unit2 2018-12-30 Ada NaN
2 unit3 2019-01-10 Bob 3.0
3 unit3 2019-02-03 Casey NaN
4 unit4 2020-12-30 Ada 4.0
5 unit4 2020-12-31 David 1.0
I think the first record from Ada should be no record, cause she didn't work before 2018-12-30
since Cum_work
is a cumulative metric.
If Ada issue is resolved, the last part is to fill NaN:
output.fillna(0, inplace=True)