Home > Back-end >  Python Pandas - How to replicate pandas merge_asof on multiple columns or groupbys
Python Pandas - How to replicate pandas merge_asof on multiple columns or groupbys

Time:01-23

I have two time series tables in pandas dataframes, table_a and table_b, that I am trying to merge into table_joined that effectively replicates an excel xlookup exact or greater than match OR a pandas merge_asof method. Example tables below:

table_a =

lookup_id date a_value
A 2021-01-03 50
A 2021-01-04 51
A 2021-01-05 52
B 2021-01-03 20
B 2021-01-04 21
B 2021-01-05 22

table_b =

lookup_id date c_value
A 2015-01-01 A_1
A 2021-01-05 A_2
B 2015-01-01 B_1
B 2021-01-05 B_2

table_joined =

lookup_id date a_value b_value
A 2021-01-03 50 A_1
A 2021-01-04 51 A_1
A 2021-01-05 52 A_2
B 2021-01-03 20 B_1
B 2021-01-04 21 B_1
B 2021-01-05 22 B_2

From my understanding, pandas merge_asof can only be done on one column and cannot do something like:

import pandas as pd
df1 = table_a
df2 = table_b
df3 = pd.merge_asof(
    table_a, table_b, left_on=['date', 'lookup_id'], 
    right_on=['date', 'lookup_id'], allow_exact_matches=True,
    direction='backwards')

I have also tried a simple pd.merge case thinking I could fill down to the next value with ffill, but because dates in tableb can start before the first date in tablea OR could be on a date between two dates in table_a, the merge doesn't always capture the first occurrence.

Because that isn't possible, I'm having to filter to lookup_id subsets and loop through merge_asof operations and string them together which is insanely slow on the size of tables I am working with. Is there a vectorized solution to this problem? Perhaps using a groupby or apply method? Alternatively, would this be better done with a query out of psql?

CodePudding user response:

Let us fix your code:

pd.merge_asof(
    df1.sort_values('date'), 
    df2.sort_values('date'), 
    on='date', by=['lookup_id'], direction='backward'
).sort_values(['lookup_id', 'date'])

Result

  lookup_id       date  a_value c_value
0         A 2021-01-03       50     A_1
2         A 2021-01-04       51     A_1
4         A 2021-01-05       52     A_2
1         B 2021-01-03       20     B_1
3         B 2021-01-04       21     B_1
5         B 2021-01-05       22     B_2

Couple of notes about merge_asof:

  • on parameter is a single column label as opposed to list of columns
  • Both the dataframes must be sorted by the on label before performing the asof merge
  • To ensure the exact matches on lookup_id you can specify the by parameter
  • Related