I have two Pandas
DataFrames that i need to merge to get the full history of a group of employees.
Both frames contains the employee ID and a timestamp for when the data was valid. But the timestamps in one table are sometimes a bit "late" meaning that the second dataframe has a timestamp is either on or after the timestamp in the first dataframe.
Example Data
import pandas as pd
df = pd.DataFrame(
{'ID': [1000, 1000, 1002, 1003, 1004],
'Start Date': [
pd.Timestamp('2021-01-01'),
pd.Timestamp('2021-01-01'),
pd.Timestamp('2021-04-01'),
pd.Timestamp('2021-01-01'),
pd.Timestamp('2021-01-01')
],
'Name': ['Joe', 'Bill', 'Bill', 'Martin', 'Jane']
}
)
df2 = pd.DataFrame(
{'ID': [1000, 1002, 1002, 1003, 1004],
'Start Date': [
pd.Timestamp('2021-01-01'),
pd.Timestamp('2021-01-01'),
pd.Timestamp('2021-04-05'),
pd.Timestamp('2021-01-01'),
pd.Timestamp('2021-01-10')
],
'Position': ['Operator', 'Operator', 'Manager', 'Operator', 'Operator']})
df = df.merge(df2, how='left', on=['ID', 'Start Date'])
print(df)
Outputs
df
ID Start Date Name
0 1000 2021-01-01 Joe
1 1000 2021-01-01 Bill
2 1002 2021-04-01 Bill
3 1003 2021-01-01 Martin
4 1004 2021-01-01 Jane
df2
ID Start Date Position
0 1000 2021-01-01 Operator
1 1002 2021-01-01 Operator
2 1002 2021-04-05 Manager
3 1003 2021-01-01 Operator
4 1004 2021-01-10 Operator
merged df
ID Start Date Name Position
0 1000 2021-01-01 Joe Operator
1 1000 2021-01-01 Bill Operator
2 1002 2021-04-01 Bill NaN
3 1003 2021-01-01 Martin Operator
4 1004 2021-01-01 Jane NaN
I want to either arrange the code, or modify the data, so that the data for bill and Jane in one dataframe is merged with the shifted lines from the other data frame.
Some solutions that I have thought about (but don't know how to do) are:
- Fill the data in the second dataframe so that all all lines are represented within a given date range (e.g. make sure that Bills "Manager" title exist on all dates from 2021-01-02 and onwards and so on for all IDs)
- Exchange the merge command with one that is allowed to "seek forward" in the second dataframe until it finds a match.
CodePudding user response:
You can try with merge_asof
out = pd.merge_asof(df.sort_values('Start Date'),
df2.sort_values('Start Date'),
by = 'ID',
on = 'Start Date',
direction = 'forward')
Out[215]:
ID Start Date Name Position
0 1000 2021-01-01 Joe Operator
1 1000 2021-01-01 Bill Operator
2 1003 2021-01-01 Martin Operator
3 1004 2021-01-01 Jane Operator
4 1002 2021-04-01 Bill Manager
CodePudding user response:
I'm am not sure I understood the same as BENY, I feel it makes more sense to keep only the latest per ID?
(df.merge(df2, how='left', on='ID', suffixes=('_drop', ''))
.loc[lambda d: ~d['ID'].duplicated(keep='last')]
.drop('Start Date_drop', axis=1)
)
output:
ID Name Start Date Position
1 1000 Bill 2021-01-01 Operator
3 1002 Bill 2021-04-05 Manager
4 1003 Martin 2021-01-01 Operator
5 1004 Jane 2021-01-10 Operator