How to join in pandas on a.date<=b.date and then taking the only the row where a.date is max?

Time:05-26

I am trying to join two dataframes on ID and date. The date criterion is a.date &lt;= b.date, and when several rows in A qualify, I want to keep only the one with the maximum a.date (but still &lt;= b.date). How would I do that?

Dataframe A (cumulative sales table)
ID| date       | cumulative_sales
1 | 2020-01-01 | 10
1 | 2020-01-03 | 15
1 | 2021-01-02 | 20

Dataframe B
ID| date       | cumulative_sales (up to this date, how much was purchased for a given ID?)
1 | 2020-05-01 | 15

In SQL, I would join on a.date&lt;=b.date, then use dense_rank() and take the max value within each ID partition. I'm not sure how to approach this with pandas. Any suggestions?

CodePudding user response:

We can do a merge:

out = (df1.merge(df2, on='ID', suffixes=('', '_x'))
          .query('date <= date_x')
          .sort_values('date')
          .drop_duplicates('ID', keep='last')[df1.columns])
Out[272]: 
   ID          date  cumulative_sales
1   1   2020-01-03                 15
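For reference, here is that merge approach as a self-contained script, using the sample data from the question. The dates are left as strings here, which happens to work because zero-padded ISO dates (YYYY-MM-DD) compare correctly as text; with any other format, convert with pd.to_datetime first.

```python
import pandas as pd

# Dataframe A: cumulative sales history per ID
df1 = pd.DataFrame({'ID': [1, 1, 1],
                    'date': ['2020-01-01', '2020-01-03', '2021-01-02'],
                    'cumulative_sales': [10, 15, 20]})
# Dataframe B: one cutoff date per ID
df2 = pd.DataFrame({'ID': [1],
                    'date': ['2020-05-01'],
                    'cumulative_sales': [15]})

out = (df1.merge(df2, on='ID', suffixes=('', '_x'))  # pair every A row with its B row by ID
          .query('date <= date_x')                   # keep A rows at or before B's date
          .sort_values('date')
          .drop_duplicates('ID', keep='last')        # last remaining row per ID = max a.date
          [df1.columns])                             # restore A's original columns
print(out)
```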

CodePudding user response:

Here's a way to do what your question asks:

dfA = (dfA.sort_values(['ID', 'date'])
          .join(dfB.set_index('ID'), on='ID', rsuffix='_b')
          .query('date <= date_b')
          .drop(columns='date_b')
          .groupby('ID')
          .last()
          .reset_index())

Explanation:

  • sort dfA by ID, date
  • use join to join with dfB on ID and bring the columns from dfB in with the suffix _b
  • use query to keep only rows where dfA.date <= dfB.date
  • use groupby on ID and then last to select the row with the highest remaining value of dfA.date (i.e., the highest dfA.date that is <= dfB.date for each ID)
  • use reset_index to convert ID from an index level back into a column label

Full test code:

import pandas as pd
dfA = pd.DataFrame({'ID':[1,1,1,2,2,2], 'date':['2020-01-01','2020-01-03','2020-01-02','2020-01-01','2020-01-03','2020-01-02'], 'cumulative_sales':[10,15,20,30,40,50]})
dfB = pd.DataFrame({'ID':[1,2], 'date':['2020-05-01','2020-01-01'], 'cumulative_sales':[15,30]})
print(dfA)
print(dfB)

dfA = (dfA.sort_values(['ID', 'date'])
          .join(dfB.set_index('ID'), on='ID', rsuffix='_b')
          .query('date <= date_b')
          .drop(columns='date_b')
          .groupby('ID')
          .last()
          .reset_index())
print(dfA)
print(dfA)

Input:

dfA:

   ID        date  cumulative_sales
0   1  2020-01-01                10
1   1  2020-01-03                15
2   1  2020-01-02                20
3   2  2020-01-01                30
4   2  2020-01-03                40
5   2  2020-01-02                50

dfB:

   ID        date  cumulative_sales
0   1  2020-05-01                15
1   2  2020-01-01                30

Output:

   ID        date  cumulative_sales  cumulative_sales_b
0   1  2020-01-03                15                  15
1   2  2020-01-01                30                  30

Note: I have left cumulative_sales_b in place in case you want it. If it's not needed, it can be dropped by replacing drop(columns='date_b') with drop(columns=['date_b', 'cumulative_sales_b']).
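Putting that variant together, a minimal self-contained sketch (same dfA/dfB as the test code above) that drops both helper columns:

```python
import pandas as pd

dfA = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                    'date': ['2020-01-01', '2020-01-03', '2020-01-02',
                             '2020-01-01', '2020-01-03', '2020-01-02'],
                    'cumulative_sales': [10, 15, 20, 30, 40, 50]})
dfB = pd.DataFrame({'ID': [1, 2],
                    'date': ['2020-05-01', '2020-01-01'],
                    'cumulative_sales': [15, 30]})

out = (dfA.sort_values(['ID', 'date'])
          .join(dfB.set_index('ID'), on='ID', rsuffix='_b')
          .query('date <= date_b')
          .drop(columns=['date_b', 'cumulative_sales_b'])  # drop both _b helper columns
          .groupby('ID')
          .last()
          .reset_index())
print(out)
```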

UPDATE:

For fun, if your version of Python is 3.8 or later and thus has the walrus operator := (the "assignment expression" operator), you can do this instead of using query:

dfA = (dfA := dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b'))[dfA.date <= dfA.date_b].drop(
    columns='date_b').groupby(['ID']).last().reset_index()

CodePudding user response:

Looks like you simply want a merge_asof:

dfA['date'] = pd.to_datetime(dfA['date'])
dfB['date'] = pd.to_datetime(dfB['date'])

out = pd.merge_asof(dfB.sort_values(by='date'),
                    dfA.sort_values(by='date'),
                    on='date', by='ID')
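A runnable sketch of the merge_asof approach with the question's sample data. merge_asof's default direction='backward' picks, for each dfB row, the last dfA row with date &lt;= dfB.date, which is exactly the requirement; the suffixes argument here is just to keep dfA's cumulative_sales column name unsuffixed.

```python
import pandas as pd

dfA = pd.DataFrame({'ID': [1, 1, 1],
                    'date': ['2020-01-01', '2020-01-03', '2021-01-02'],
                    'cumulative_sales': [10, 15, 20]})
dfB = pd.DataFrame({'ID': [1],
                    'date': ['2020-05-01'],
                    'cumulative_sales': [15]})

# merge_asof needs datetime (or numeric) keys, sorted on the 'on' column
dfA['date'] = pd.to_datetime(dfA['date'])
dfB['date'] = pd.to_datetime(dfB['date'])

out = pd.merge_asof(dfB.sort_values(by='date'),
                    dfA.sort_values(by='date'),
                    on='date', by='ID',
                    suffixes=('_b', ''))  # left (dfB) columns get _b
print(out)
```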