Home > Software design >  Merge dataframes by comparing the date, Pandas
Merge dataframes by comparing the date, Pandas

Time:09-27

I have two dataframes, and I want to merge them by comparing the date on different columns.

This is the dataframe with the work unit and date listed. Cum work denotes the number of jobs completed by the worker before. This workunit's leader is identified as leader name.

Work_unit    Date      leader_name  Cum_work

unit1       11/12/2017      Bob           2
unit2       12/14/2018      David         1
unit3       12/13/2019      Ada           3
unit4       1/1/2019        Bob           3
unit5       12/13/2019      Ada           4

This is another dataframe that displays collaborators for each task. The date indicates the day that this collaborator worked on the task.

Work_unit     Date      collaborators 

unit2       12/13/2018      Bob
unit2       12/30/2018      Ada
unit3       1/10/2019       Bob
unit3       2/3/2019        Casey
unit4       12/30/2020      Ada
unit4       12/31/2020      David

I'd like to combine two dataframes to demonstrate work collaboration and shows the collaborators' seniority.

For example, Bob is the workunit 1 leader, but he is also a collaborator in unit 2. I'd like to map Bob's cumulative work from dataframe1 (the number of leaderworks he's completed) to dataframe2 by comparing the date, so the Cumworks_sofar would be 2 for Bob in unit2.

Thus the merged Dataframe would be

Work_unit     Date   collaborators   Cumworks_sofar

unit2       12/13/2018      Bob              2
unit2       12/30/2018      Ada              3
unit3       1/10/2019       Bob              3
unit3       2/3/2019        Casey            0
unit4       12/30/2020      Ada              4
unit4       12/31/2020      David            1

Thank you for your help

CodePudding user response:

merge_asof with backward direction,

df1.sort_values(by=['Date'], inplace=True)

output = pd.merge_asof(df2, df1, on='Date', left_by='collaborators', right_by='leader_name', direction='backward')
output.drop(['Work_unit_y', 'leader_name'], axis=1, inplace=True)
output.rename(columns={'Work_unit_x': 'Work_unit'}, inplace=True)
output
###
  Work_unit       Date collaborators  Cum_work
0     unit2 2018-12-13           Bob       2.0
1     unit2 2018-12-30           Ada       NaN
2     unit3 2019-01-10           Bob       3.0
3     unit3 2019-02-03         Casey       NaN
4     unit4 2020-12-30           Ada       4.0
5     unit4 2020-12-31         David       1.0

I think the first record from Ada should be no record, cause she didn't work before 2018-12-30 since Cum_work is a cumulative metric.



If Ada issue is resolved, the last part is to fill NaN:

output.fillna(0, inplace=True)

  • Related