I am trying to join two dataframes on ID and Date. The date criterion is that a.date <= b.date, and when multiple rows in A qualify, I want the one with the max a.date (but still <= b.date). How would I do that?
Dataframe A (cumulative sales table)
ID | date       | cumulative_sales
 1 | 2020-01-01 | 10
 1 | 2020-01-03 | 15
 1 | 2021-01-02 | 20
Dataframe B (up to this date, how much was purchased for a given ID?)
ID | date       | cumulative_sales
 1 | 2020-05-01 | 15
In SQL, I would join on a.date <= b.date, then use dense_rank() and take the max value within each ID partition. Not sure how to approach this with Pandas. Any suggestions?
CodePudding user response:
We can do a merge, filter with query, then keep the latest remaining row per ID:
out = (df1.merge(df2, on='ID', suffixes=('', '_x'))
          .query('date <= date_x')
          .sort_values('date')
          .drop_duplicates('ID', keep='last')[df1.columns])
Out[272]:
ID date cumulative_sales
1 1 2020-01-03 15
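For reference, here is a self-contained sketch of this approach, using sample frames built from the question's data (the names df1/df2 follow the answer, not the question):

```python
import pandas as pd

# Sample data mirroring the question
df1 = pd.DataFrame({'ID': [1, 1, 1],
                    'date': ['2020-01-01', '2020-01-03', '2021-01-02'],
                    'cumulative_sales': [10, 15, 20]})
df2 = pd.DataFrame({'ID': [1], 'date': ['2020-05-01'], 'cumulative_sales': [15]})

# Cross-join within each ID, keep rows where df1.date <= df2.date,
# then take the latest remaining df1.date per ID
out = (df1.merge(df2, on='ID', suffixes=('', '_x'))
          .query('date <= date_x')
          .sort_values('date')
          .drop_duplicates('ID', keep='last')[df1.columns])
print(out)
```

Note that the date comparison in query works here even on strings because ISO dates (YYYY-MM-DD) sort lexicographically; with other formats you would convert with pd.to_datetime first.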
CodePudding user response:
Here's a way to do what your question asks:
dfA = (dfA.sort_values(['ID', 'date'])
          .join(dfB.set_index('ID'), on='ID', rsuffix='_b')
          .query('date <= date_b')
          .drop(columns='date_b')
          .groupby(['ID']).last()
          .reset_index())
Explanation:
- sort dfA by ID, date
- use join to join with dfB on ID and bring the columns from dfB in with the suffix _b
- use query to keep only rows where dfA.date <= dfB.date
- use groupby on ID and then last to select the row with the highest remaining value of dfA.date (i.e., the highest dfA.date that is <= dfB.date for each ID)
- use reset_index to convert ID from an index level back into a column label
Full test code:
import pandas as pd
dfA = pd.DataFrame({'ID':[1,1,1,2,2,2], 'date':['2020-01-01','2020-01-03','2020-01-02','2020-01-01','2020-01-03','2020-01-02'], 'cumulative_sales':[10,15,20,30,40,50]})
dfB = pd.DataFrame({'ID':[1,2], 'date':['2020-05-01','2020-01-01'], 'cumulative_sales':[15,30]})
print(dfA)
print(dfB)
dfA = (dfA.sort_values(['ID', 'date'])
          .join(dfB.set_index('ID'), on='ID', rsuffix='_b')
          .query('date <= date_b')
          .drop(columns='date_b')
          .groupby(['ID']).last()
          .reset_index())
print(dfA)
Input:
dfA:
ID date cumulative_sales
0 1 2020-01-01 10
1 1 2020-01-03 15
2 1 2020-01-02 20
3 2 2020-01-01 30
4 2 2020-01-03 40
5 2 2020-01-02 50
dfB:
ID date cumulative_sales
0 1 2020-05-01 15
1 2 2020-01-01 30
Output:
ID date cumulative_sales cumulative_sales_b
0 1 2020-01-03 15 15
1 2 2020-01-01 30 30
Note: I have left cumulative_sales_b in place in case you want it. If it's not needed, it can be dropped by replacing drop(columns='date_b') with drop(columns=['date_b', 'cumulative_sales_b']).
UPDATE:
For fun, if your version of Python has the walrus operator := (the "assignment expression" operator, available in Python 3.8+), you can do this instead of using query:
dfA = ((dfA := dfA.sort_values(['ID', 'date'])
                  .join(dfB.set_index('ID'), on='ID', rsuffix='_b'))
       [dfA.date <= dfA.date_b]
       .drop(columns='date_b')
       .groupby(['ID']).last()
       .reset_index())
CodePudding user response:
Looks like you simply want a merge_asof:
dfA['date'] = pd.to_datetime(dfA['date'])
dfB['date'] = pd.to_datetime(dfB['date'])

out = pd.merge_asof(dfB.sort_values(by='date'),
                    dfA.sort_values(by='date'),
                    on='date', by='ID')
By default merge_asof matches backward: for each dfB row it takes the last dfA row whose date is <= the dfB date within the same ID, which is exactly the "max a.date that is still <= b.date" requirement.
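A quick runnable sanity check of the merge_asof approach against the question's sample data (the frames below are rebuilt from the question, not part of the original answer):

```python
import pandas as pd

dfA = pd.DataFrame({'ID': [1, 1, 1],
                    'date': ['2020-01-01', '2020-01-03', '2021-01-02'],
                    'cumulative_sales': [10, 15, 20]})
dfB = pd.DataFrame({'ID': [1], 'date': ['2020-05-01']})

# merge_asof requires real datetimes and sorted 'on' keys
dfA['date'] = pd.to_datetime(dfA['date'])
dfB['date'] = pd.to_datetime(dfB['date'])

# direction='backward' (the default): for each dfB row, take the last dfA row
# whose date is <= the dfB date, within the same ID
out = pd.merge_asof(dfB.sort_values(by='date'),
                    dfA.sort_values(by='date'),
                    on='date', by='ID')
print(out)
```

For the 2020-05-01 row this picks the dfA row dated 2020-01-03 (cumulative_sales 15), skipping the later 2021-01-02 row, as required.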