I am trying to merge two pandas dataframes each consisting two string columns and one date column.
df1
a b date
100 200 2022-01-03
100 200 2022-01-04
101 200 2022-01-05
101 200 2022-01-06
101 200 2022-01-07
df2
a b date
100 200 2022-01-04
100 200 2022-01-06
101 200 2022-01-03
101 200 2022-01-06
101 200 2022-01-09
The goal is to merge them on a, b, date and take the closest date (forward direction). Desired output:
df
a b date_x date_y
100 200 2022-01-03 2022-01-04
100 200 2022-01-04 2022-01-04
101 200 2022-01-05 2022-01-06 (not 2022-01-03 because it is behind not forward)
101 200 2022-01-06 2022-01-06
101 200 2022-01-07 2022-01-09
CodePudding user response:
We can merge on a
and b
, filter by the min difference between date_y
and date_x
taking into account forward direction
new_df = df1.merge(df2, on=['a', 'b'], how='inner')\
.assign(diff_date=lambda df: df['date_y']
.sub(df['date_x'])
.where(lambda x: df['date_y'].ge(df['date_x'])),
mask=lambda df: df['diff_date']
.eq(df.groupby(['a', 'b', 'date_x'])['diff_date']
.transform('min')))\
.loc[lambda df: df['mask']]\
.drop(['diff_date', 'mask'], axis=1)
print(new_df)
a b date_x date_y
0 100 200 2022-01-03 2022-01-04
2 100 200 2022-01-04 2022-01-04
5 101 200 2022-01-05 2022-01-06
8 101 200 2022-01-06 2022-01-06
12 101 200 2022-01-07 2022-01-09
CodePudding user response:
You can also try
# merge on a,b and sort based on date
m = df1.merge(df2, on=['a', 'b'], how='left').sort_values(['date_x', 'date_y'])
# only keep dates that are <= df2 date
df = m[m['date_x'] <= m['date_y']]
# drop duplicates and filter
final_df = df.loc[df[['a', 'b', 'date_x']].drop_duplicates(keep='first').index]
a b date_x date_y
0 100 200 2022-01-03 2022-01-04
2 100 200 2022-01-04 2022-01-04
5 101 200 2022-01-05 2022-01-06
8 101 200 2022-01-06 2022-01-06
12 101 200 2022-01-07 2022-01-09
CodePudding user response:
import pandas as pd
df1 = pd.DataFrame({'a': ['100', '100', '101', '101', '101'],
'b': ['200', '200', '200', '200', '200'],
'date': ['2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07']})
df2 = pd.DataFrame({'a': ['100', '100', '101', '101', '101'],
'b': ['200', '200', '200', '200', '200'],
'date': ['2022-01-04', '2022-01-06', '2022-01-03', '2022-01-06', '2022-01-09']})
df3 = pd.merge(df1,df2,how='left',left_on=['a','b'],right_on=['b','a']).drop(['a_y','b_y'], axis=1)
df3['date_y'] = df2['date']