I am new to python and i'm not sure how to achive what i'm trying to do. I have two pandas dataframes and I need to combine them. One dataframe is called 'cust'
and the other one is called 'items'
, Below is the code to create both the datasets:
import pandas as pd
cust = {'id': [212175, 286170, 361739, 297438, 415712],
'date_start': ['20/05/2022', '18/05/2022', '10/08/2021', '20/01/2022', '30/07/2021']}
items = {'id': [212175, 212175, 212175, 212175, 212175, 286170, 286170, 286170, 286170, 286170],
'item': ['PX002','PY005','PX003','PX003','NX002','PX002','PY005','PX003','PX003','NX002'],
'date_sent': ['21/05/2022','10/05/2022','01/06/2022','01/07/2021','15/08/2022','19/05/2022','10/07/2022','15/07/2022','10/04/2022','15/06/2022']}
# Create DataFrame
cust_df = pd.DataFrame(cust)
items_df = pd.DataFrame(items)
Cust_df:
id date_start
212175 20/05/2022
286170 18/05/2022
361739 10/08/2021
297438 20/01/2022
415712 30/07/2021
items_df:
id item date_sent
212175 PX002 21/05/2022
212175 PY005 10/05/2022
212175 PX003 01/06/2022
212175 PX003 01/07/2021
212175 NX002 15/08/2022
286170 PX002 19/05/2022
286170 PY005 10/07/2022
286170 PX003 15/07/2022
286170 PX003 10/04/2022
286170 NX002 15/06/2022
I need to find out all the items that were sent to every customer between two dates. The start date for each customer is different and should be taken from the date_start field present in the cust_df data frame. The end date for every customer is the same which is '12/08/2022'
. So, the date range for customer 212175
will be 20/05/2022 - 12/08/2022
, and i need to find out all the items sent to this customer between these dates. My desired output dataframe should look like the below:
Output_dataframe:
id item date_sent date_start
212175 PX002 21/05/2022 20/05/2022
212175 PX003 01/06/2022 20/05/2022
212175 PX005 10/05/2022 20/05/2022
286170 PX002 19/05/2022 18/05/2022
286170 PY005 10/07/2022 18/05/2022
286170 PX003 15/07/2022 18/05/2022
286170 NX002 15/06/2022 18/05/2022
In the output table we can see that for ID = 212175
, only 3 items are showing. That's because only 3 items were sent during the window 20/05/2022 - 12/08/2022
.
For ID = 286170
, only 1 item 'PX003'
isn't showing in the desired output and that's because the date this item was sent was on '10/04/2022'
, which is before the date_start (18/05/2022)
for customer 286170
.
Any help appreciated. Thanks
CodePudding user response:
IIUC, use a merge_asof
:
cust_df['date_start'] = pd.to_datetime(cust_df['date_start'], dayfirst=True)
items_df['date_sent'] = pd.to_datetime(items_df['date_sent'], dayfirst=True)
out = (pd
.merge_asof(items_df.sort_values(by='date_sent'),
cust_df.sort_values(by='date_start'),
by='id', left_on='date_sent', right_on='date_start')
.dropna(subset='date_start')
.loc[lambda d: d['date_sent'] < pd.Timestamp("12/08/2022")]
.sort_values(by=['id', 'item']) # optional
)
output:
id item date_sent date_start
9 212175 NX002 2022-08-15 2022-05-20
4 212175 PX002 2022-05-21 2022-05-20
5 212175 PX003 2022-06-01 2022-05-20
6 286170 NX002 2022-06-15 2022-05-18
3 286170 PX002 2022-05-19 2022-05-18
8 286170 PX003 2022-07-15 2022-05-18
7 286170 PY005 2022-07-10 2022-05-18
CodePudding user response:
Merge and filter
m = pd.merge(items_df, cust_df, how='inner', on='id')
m.date_start = pd.to_datetime(m.date_start, dayfirst=True)
m.date_sent = pd.to_datetime(m.date_sent, dayfirst=True)
m.loc[ (m.date_start < m.date_sent) * ( m.date_sent < pd.to_datetime("12/08/2022", dayfirst=False))].reset_index()
# id item date_sent date_start
#0 212175 PX002 2022-05-21 2022-05-20
#1 212175 PX003 2022-06-01 2022-05-20
#2 212175 NX002 2022-08-15 2022-05-20
#3 286170 PX002 2022-05-19 2022-05-18
#4 286170 PY005 2022-07-10 2022-05-18
#5 286170 PX003 2022-07-15 2022-05-18
#6 286170 NX002 2022-06-15 2022-05-18