Find subsequent ID rows within certain date-CodePudding

I have a dataframe df,

ID        purchase_date       duration       end_date      1_month    
1          2021-04-03            4          2021-04-07    2021-05-07     
1          2021-05-01            7          2021-05-08    2021-06-08
2          2021-07-01            10         2021-07-11    2021-08-11
1          2021-07-05            5          2021-07-10    2021-08-10

What I want to do is have an additional column called left whereby if the customer has a subsequent purchase within the date of the 1_month they are assigned 0 and if they do not make a purchase they are assigned 1. Like the below df.

ID        purchase_date       duration       end_date      1_month      left
1          2021-04-03            4          2021-04-07    2021-05-07     0   
1          2021-05-01            7          2021-05-08    2021-06-08     1
2          2021-07-01            10         2021-07-11    2021-08-11     1
1          2021-07-05            5          2021-07-10    2021-08-10     1

Effectively I need to take the 1_month column, groupby ID and see if they have any future purchases within the '1 month' window since this row of data, this needs to work with customers who both have no subsequent purchases and those who have multiple purchases.

A primary issue with this question is that almost 50% of customers only have 1 row of data as they only make 1 purchase. These IDs should also be given a left value =1 as they do not return within the 1_month window. As purchases only range from 2021-01-01 to 2022-01-01 all purchases by customers, including their last purchase needs to be evaluated to see if there is ever a gap in purchases greater than 30 days.

CodePudding user response：

df['left']=np.where(df['purchase_date'] < df['1_month'].shift(-1), 0, 1)
df

    ID  purchase_date   duration    end_date    1_month     left
0   1   2021-04-03            4     2021-04-07  2021-05-07  0
1   1   2021-05-01            7     2021-05-08  2021-06-08  1

CodePudding user response：

You need to use a merge_asof to be able to identify a future date per ID.

I provided a more complex input to better see the result:

df['left'] = (pd
 .merge_asof(df.sort_values(by='1_month'),
             df.sort_values(by='purchase_date'),
             left_on='1_month',
             right_on='purchase_date',
             suffixes=(None, '_next')
             )
 .eval('~(purchase_date_next>purchase_date)')
 .astype(int)
)

output:

   ID purchase_date duration   end_date    1_month  left
0   1    2021-04-03        4 2021-04-07 2021-05-07     0
1   1    2021-05-01        7 2021-05-08 2021-06-08     1
2   1    2022-05-01        x 2022-05-08 2022-06-08     1
3   1    2022-06-01        x 2022-06-08 2022-07-08     0
4   2    2021-05-01        x 2021-05-08 2021-06-08     1

used input:

   ID purchase_date duration   end_date    1_month
0   1    2021-04-03        4 2021-04-07 2021-05-07
1   1    2021-05-01        7 2021-05-08 2021-06-08
2   1    2022-05-01        x 2022-05-08 2022-06-08
3   1    2022-06-01        x 2022-06-08 2022-07-08
4   2    2021-05-01        x 2021-05-08 2021-06-08

CodePudding user response：

df_sort = df.sort_values(by=['ID', 'purchase_date', '1_month'])


def check(x, d):
    y = d.loc[x.name]
    if not math.isnan(y['ID']) and x['end_date'] <= y['purchase_date'] <= x['1_month'] and y['ID'] == x['ID']:
        return 0
    return 1


df_sort['left'] = df_sort.apply(lambda x, d=df_sort.shift(-1): check(x, d), axis=1)

print(df_sort.sort_index())

output:

    ID purchase_date  duration    end_date     1_month  left
0   1    2021-04-03         4  2021-04-07  2021-05-07     0
1   1    2021-05-01         7  2021-05-08  2021-06-08     1
2   2    2021-07-01        10  2021-07-11  2021-08-11     1
3   1    2021-07-05         5  2021-07-10  2021-08-10     1