Home > other >  Cannot use pandas.merge_asof with multiple columns
Cannot use pandas.merge_asof with multiple columns

Time:01-19

I am trying to merge two pandas dataframes each consisting two string columns and one date column.

df1
a    b      date
100  200    2022-01-03
100  200    2022-01-04
101  200    2022-01-05
101  200    2022-01-06
101  200    2022-01-07

df2
a    b      date
100  200    2022-01-04
100  200    2022-01-06
101  200    2022-01-03
101  200    2022-01-06
101  200    2022-01-09

The goal is to merge them on a, b, date and take the closest date (forward direction). Desired output:

df
a    b      date_x      date_y
100  200    2022-01-03  2022-01-04
100  200    2022-01-04  2022-01-04
101  200    2022-01-05  2022-01-06 (not 2022-01-03 because it is behind not forward)
101  200    2022-01-06  2022-01-06
101  200    2022-01-07  2022-01-09

CodePudding user response:

We can merge on a and b, filter by the min difference between date_y and date_x taking into account forward direction

new_df = df1.merge(df2, on=['a', 'b'], how='inner')\
            .assign(diff_date=lambda df: df['date_y']
                        .sub(df['date_x'])
                        .where(lambda x: df['date_y'].ge(df['date_x'])), 
                    mask=lambda df: df['diff_date']
                        .eq(df.groupby(['a', 'b', 'date_x'])['diff_date']
                        .transform('min')))\
            .loc[lambda df: df['mask']]\
            .drop(['diff_date', 'mask'], axis=1)
print(new_df)


    a    b     date_x     date_y
0   100  200 2022-01-03 2022-01-04
2   100  200 2022-01-04 2022-01-04
5   101  200 2022-01-05 2022-01-06
8   101  200 2022-01-06 2022-01-06
12  101  200 2022-01-07 2022-01-09

CodePudding user response:

You can also try

# merge on a,b and sort based on date
m = df1.merge(df2, on=['a', 'b'], how='left').sort_values(['date_x', 'date_y'])
# only keep dates that are <= df2 date
df = m[m['date_x'] <= m['date_y']]
# drop duplicates and filter
final_df = df.loc[df[['a', 'b', 'date_x']].drop_duplicates(keep='first').index]

      a    b     date_x     date_y
0   100  200 2022-01-03 2022-01-04
2   100  200 2022-01-04 2022-01-04
5   101  200 2022-01-05 2022-01-06
8   101  200 2022-01-06 2022-01-06
12  101  200 2022-01-07 2022-01-09

CodePudding user response:

import pandas as pd

df1 = pd.DataFrame({'a': ['100', '100', '101', '101', '101'],
                    'b': ['200', '200', '200', '200', '200'],
                    'date': ['2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07']})

df2 = pd.DataFrame({'a': ['100', '100', '101', '101', '101'],
                    'b': ['200', '200', '200', '200', '200'],
                    'date': ['2022-01-04', '2022-01-06', '2022-01-03', '2022-01-06', '2022-01-09']})
  
df3 = pd.merge(df1,df2,how='left',left_on=['a','b'],right_on=['b','a']).drop(['a_y','b_y'], axis=1) 
df3['date_y'] = df2['date']
  • Related