I have a dataframe df
,
ID purchase_date duration end_date 1_month
1 2021-04-03 4 2021-04-07 2021-05-07
1 2021-05-01 7 2021-05-08 2021-06-08
2 2021-07-01 10 2021-07-11 2021-08-11
1 2021-07-05 5 2021-07-10 2021-08-10
What I want to do is have an additional column called left
whereby if the customer has a subsequent purchase within the date of the 1_month
they are assigned 0 and if they do not make a purchase they are assigned 1. Like the below df
.
ID purchase_date duration end_date 1_month left
1 2021-04-03 4 2021-04-07 2021-05-07 0
1 2021-05-01 7 2021-05-08 2021-06-08 1
2 2021-07-01 10 2021-07-11 2021-08-11 1
1 2021-07-05 5 2021-07-10 2021-08-10 1
Effectively I need to take the 1_month
column, groupby ID and see if they have any future purchases within the '1 month' window since this row of data, this needs to work with customers who both have no subsequent purchases and those who have multiple purchases.
A primary issue with this question is that almost 50% of customers only have 1 row of data as they only make 1 purchase. These IDs should also be given a left
value =1
as they do not return within the 1_month
window. As purchases only range from 2021-01-01 to 2022-01-01 all purchases by customers, including their last purchase needs to be evaluated to see if there is ever a gap in purchases greater than 30 days.
CodePudding user response:
df['left']=np.where(df['purchase_date'] < df['1_month'].shift(-1), 0, 1)
df
ID purchase_date duration end_date 1_month left
0 1 2021-04-03 4 2021-04-07 2021-05-07 0
1 1 2021-05-01 7 2021-05-08 2021-06-08 1
CodePudding user response:
You need to use a merge_asof
to be able to identify a future date per ID.
I provided a more complex input to better see the result:
df['left'] = (pd
.merge_asof(df.sort_values(by='1_month'),
df.sort_values(by='purchase_date'),
left_on='1_month',
right_on='purchase_date',
suffixes=(None, '_next')
)
.eval('~(purchase_date_next>purchase_date)')
.astype(int)
)
output:
ID purchase_date duration end_date 1_month left
0 1 2021-04-03 4 2021-04-07 2021-05-07 0
1 1 2021-05-01 7 2021-05-08 2021-06-08 1
2 1 2022-05-01 x 2022-05-08 2022-06-08 1
3 1 2022-06-01 x 2022-06-08 2022-07-08 0
4 2 2021-05-01 x 2021-05-08 2021-06-08 1
used input:
ID purchase_date duration end_date 1_month
0 1 2021-04-03 4 2021-04-07 2021-05-07
1 1 2021-05-01 7 2021-05-08 2021-06-08
2 1 2022-05-01 x 2022-05-08 2022-06-08
3 1 2022-06-01 x 2022-06-08 2022-07-08
4 2 2021-05-01 x 2021-05-08 2021-06-08
CodePudding user response:
df_sort = df.sort_values(by=['ID', 'purchase_date', '1_month'])
def check(x, d):
y = d.loc[x.name]
if not math.isnan(y['ID']) and x['end_date'] <= y['purchase_date'] <= x['1_month'] and y['ID'] == x['ID']:
return 0
return 1
df_sort['left'] = df_sort.apply(lambda x, d=df_sort.shift(-1): check(x, d), axis=1)
print(df_sort.sort_index())
output:
ID purchase_date duration end_date 1_month left
0 1 2021-04-03 4 2021-04-07 2021-05-07 0
1 1 2021-05-01 7 2021-05-08 2021-06-08 1
2 2 2021-07-01 10 2021-07-11 2021-08-11 1
3 1 2021-07-05 5 2021-07-10 2021-08-10 1