With the following dataframe:
import pandas as pd
df = pd.DataFrame(
{
'user_id': ['1', '2', '3'],
'promo_date': ['01012023','01012023','01012023'],
'logins': [['10242022', '11242022', '04122023'], ['10242022', '04122023'], []]
}
)
Which looks like:
user_id promo_date logins
0 1 01012023 [10242022, 11242022, 04122023]
1 2 01012023 [10242022, 04122023]
2 3 01012023 []
I am trying to count all the times a person logged in within 3 months before the promo date. I have a function to do this that I call using apply
, but it is way too slow for the numbers of records I have.
from dateutil.relativedelta import relativedelta
EXPECTED_DATE_FORMAT = '%m%d%Y'
def calculate_NTimesLoggedInXMonths(x_months, promo_date_str, login_dates):
login_count = 0
promo_date = pd.to_datetime(promo_date_str, format=EXPECTED_DATE_FORMAT)
x_month_back = promo_date - relativedelta(months=x_months)
for date in login_dates:
if x_month_back < pd.to_datetime(date, format=EXPECTED_DATE_FORMAT) < promo_date:
login_count = 1
return login_count
# Start the calculation
start = datetime.now()
print("Start time is ", start)
df[f'NTimesLoggedIn3Months'] = df.apply(
lambda row: calculate_NTimesLoggedInXMonths(3, row['promo_date'], row['logins']),
axis=1
)
end = datetime.now()
print("Run time:", end - start)
And the expected result is:
promo_date logins NTimesLoggedIn3Months
0 01012023 [10242022, 11242022, 04122023] 2
1 01012023 [10242022, 04122023] 1
2 01012023 [] 0
I think the best solution would be to take advantage of the Series.dt accessor, but I am not sure how to do this with a list. Even if I break apart the list so each id is repeated for each login, I am still unsure how to use the dt accessor to get this count.
CodePudding user response:
I would recommend expanding out of a list to make better use of pandas optimizations to work on Series.
exp = df.explode('logins')
exp['promo_date'] = pd.to_datetime(exp['promo_date'], format='%m%d%Y')
exp['logins'] = pd.to_datetime(exp['logins'], format='%m%d%Y')
exp['within_3mo'] = ((exp['promo_date'] - pd.DateOffset(months=3) <= exp['logins']) &
(exp['logins'] <= exp['promo_date']))
Then, you can calculate the sums with a groupby on the user_id.
>>> exp.groupby('user_id')['within_3mo'].sum()
user_id
1 2
2 1
3 0
Name: within_3mo, dtype: int64