I have two time series tables in pandas dataframes, table_a and table_b, that I am trying to merge into table_joined that effectively replicates an excel xlookup exact or greater than match OR a pandas merge_asof method. Example tables below:
table_a =
lookup_id | date | a_value |
---|---|---|
A | 2021-01-03 | 50 |
A | 2021-01-04 | 51 |
A | 2021-01-05 | 52 |
B | 2021-01-03 | 20 |
B | 2021-01-04 | 21 |
B | 2021-01-05 | 22 |
table_b =
lookup_id | date | c_value |
---|---|---|
A | 2015-01-01 | A_1 |
A | 2021-01-05 | A_2 |
B | 2015-01-01 | B_1 |
B | 2021-01-05 | B_2 |
table_joined =
lookup_id | date | a_value | b_value |
---|---|---|---|
A | 2021-01-03 | 50 | A_1 |
A | 2021-01-04 | 51 | A_1 |
A | 2021-01-05 | 52 | A_2 |
B | 2021-01-03 | 20 | B_1 |
B | 2021-01-04 | 21 | B_1 |
B | 2021-01-05 | 22 | B_2 |
From my understanding, pandas merge_asof can only be done on one column and cannot do something like:
import pandas as pd
df1 = table_a
df2 = table_b
df3 = pd.merge_asof(
table_a, table_b, left_on=['date', 'lookup_id'],
right_on=['date', 'lookup_id'], allow_exact_matches=True,
direction='backwards')
I have also tried a simple pd.merge case thinking I could fill down to the next value with ffill, but because dates in tableb can start before the first date in tablea OR could be on a date between two dates in table_a, the merge doesn't always capture the first occurrence.
Because that isn't possible, I'm having to filter to lookup_id subsets and loop through merge_asof operations and string them together which is insanely slow on the size of tables I am working with. Is there a vectorized solution to this problem? Perhaps using a groupby or apply method? Alternatively, would this be better done with a query out of psql?
CodePudding user response:
Let us fix your code:
pd.merge_asof(
df1.sort_values('date'),
df2.sort_values('date'),
on='date', by=['lookup_id'], direction='backward'
).sort_values(['lookup_id', 'date'])
Result
lookup_id date a_value c_value
0 A 2021-01-03 50 A_1
2 A 2021-01-04 51 A_1
4 A 2021-01-05 52 A_2
1 B 2021-01-03 20 B_1
3 B 2021-01-04 21 B_1
5 B 2021-01-05 22 B_2
Couple of notes about merge_asof
:
on
parameter is a single column label as opposed to list of columns- Both the dataframes must be sorted by the
on
label before performing theasof
merge - To ensure the exact matches on
lookup_id
you can specify theby
parameter