How to make a retention calculation in pandas more efficient?

I am trying to calculate 7-day retention (did the user come back within 7 days?) per user ID. Currently, I am using this code:

df_retention['seven_day_retention'] = df_retention.groupby('user_id')['date'].transform(lambda x: ((x.shift(-1) - x).dt.days < 8).astype(int))

Across 10M rows this takes hours, which is not feasible. Is there a better way to do this within Databricks?

CodePudding user response：

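Your code is slow because `transform` calls the Python lambda once for every user, so I think you should change your approach. First, sort your DataFrame by user ID and date. You could then use a for loop to compare each row with the next one. That pass is O(n) after the sort, but a plain Python loop over 10M rows is still slow. A faster way is to reuse the logic from your sample code for the second step, without `groupby` and `transform`: compute the difference between each row and the next row over the whole sorted column at once, and count the result only when the next row belongs to the same user.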
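Here is a minimal sketch of the sort-then-shift idea, not a definitive implementation. It assumes `date` is already a datetime column and reuses the column names from the question. The function name `add_seven_day_retention` and the sample data are made up for illustration. It keeps the question's `< 8` threshold on whole days.

```python
import pandas as pd


def add_seven_day_retention(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: flag rows whose next visit by the same user is within 7 days."""
    # Sort so that each user's rows are contiguous and in date order.
    out = df.sort_values(["user_id", "date"]).copy()

    # Shift the whole columns once instead of once per group.
    next_date = out["date"].shift(-1)
    next_user = out["user_id"].shift(-1)

    # Whole days until the next row; NaN for the very last row.
    gap_days = (next_date - out["date"]).dt.days

    # Only count the gap when the next row belongs to the same user.
    # Comparisons against NaN are False, so the last row gets 0.
    same_user = next_user == out["user_id"]
    out["seven_day_retention"] = (same_user & (gap_days < 8)).astype(int)
    return out


# Made-up sample data, deliberately unsorted.
df_retention = pd.DataFrame(
    {
        "user_id": [2, 1, 1, 2, 1],
        "date": pd.to_datetime(
            ["2023-01-04", "2023-01-01", "2023-01-20", "2023-01-03", "2023-01-05"]
        ),
    }
)
result = add_seven_day_retention(df_retention)
print(result)
```

The last row of each user gets 0, which matches what the original `groupby` version produces for a missing next date. The result is returned sorted by `user_id` and `date`. Call `.sort_index()` on it if you need the original row order back.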