Merging two dataframes while seeking forward-CodePudding

I have two Pandas DataFrames that i need to merge to get the full history of a group of employees.

Both frames contains the employee ID and a timestamp for when the data was valid. But the timestamps in one table are sometimes a bit "late" meaning that the second dataframe has a timestamp is either on or after the timestamp in the first dataframe.

Example Data

import pandas as pd

df = pd.DataFrame(
    {'ID': [1000, 1000, 1002, 1003, 1004],
        'Start Date': [
            pd.Timestamp('2021-01-01'),
            pd.Timestamp('2021-01-01'),
            pd.Timestamp('2021-04-01'),
            pd.Timestamp('2021-01-01'),
            pd.Timestamp('2021-01-01')
    ],
        'Name': ['Joe', 'Bill', 'Bill', 'Martin', 'Jane']
    }
)

df2 = pd.DataFrame(
    {'ID': [1000, 1002, 1002, 1003, 1004],
        'Start Date': [
            pd.Timestamp('2021-01-01'),
            pd.Timestamp('2021-01-01'),
            pd.Timestamp('2021-04-05'),
            pd.Timestamp('2021-01-01'),
            pd.Timestamp('2021-01-10')
    ],
        'Position': ['Operator', 'Operator', 'Manager', 'Operator', 'Operator']})

df = df.merge(df2, how='left', on=['ID', 'Start Date'])
print(df)

Outputs

     ID Start Date    Name
0  1000 2021-01-01     Joe
1  1000 2021-01-01    Bill
2  1002 2021-04-01    Bill
3  1003 2021-01-01  Martin
4  1004 2021-01-01    Jane

df2

     ID Start Date  Position
0  1000 2021-01-01  Operator
1  1002 2021-01-01  Operator
2  1002 2021-04-05   Manager
3  1003 2021-01-01  Operator
4  1004 2021-01-10  Operator

merged df

     ID Start Date    Name  Position
0  1000 2021-01-01     Joe  Operator
1  1000 2021-01-01    Bill  Operator
2  1002 2021-04-01    Bill       NaN
3  1003 2021-01-01  Martin  Operator
4  1004 2021-01-01    Jane       NaN

I want to either arrange the code, or modify the data, so that the data for bill and Jane in one dataframe is merged with the shifted lines from the other data frame.

Some solutions that I have thought about (but don't know how to do) are:

Fill the data in the second dataframe so that all all lines are represented within a given date range (e.g. make sure that Bills "Manager" title exist on all dates from 2021-01-02 and onwards and so on for all IDs)
Exchange the merge command with one that is allowed to "seek forward" in the second dataframe until it finds a match.

CodePudding user response：

You can try with merge_asof

out = pd.merge_asof(df.sort_values('Start Date'),
                    df2.sort_values('Start Date'),
                    by = 'ID',
                    on = 'Start Date',
                    direction = 'forward')
Out[215]: 
     ID Start Date    Name  Position
0  1000 2021-01-01     Joe  Operator
1  1000 2021-01-01    Bill  Operator
2  1003 2021-01-01  Martin  Operator
3  1004 2021-01-01    Jane  Operator
4  1002 2021-04-01    Bill   Manager

CodePudding user response：

I'm am not sure I understood the same as BENY, I feel it makes more sense to keep only the latest per ID?

(df.merge(df2, how='left', on='ID', suffixes=('_drop', ''))
   .loc[lambda d: ~d['ID'].duplicated(keep='last')]
   .drop('Start Date_drop', axis=1)
)

output:

     ID    Name Start Date  Position
1  1000    Bill 2021-01-01  Operator
3  1002    Bill 2021-04-05   Manager
4  1003  Martin 2021-01-01  Operator
5  1004    Jane 2021-01-10  Operator