Count number of times a date appeared within x months of another date-CodePudding

With the following dataframe:

import pandas as pd

df = pd.DataFrame(
{
    'user_id': ['1', '2', '3'],
    'promo_date': ['01012023','01012023','01012023'],
    'logins': [['10242022', '11242022', '04122023'], ['10242022', '04122023'], []]
 }

)

Which looks like:

  user_id promo_date                          logins
0       1   01012023  [10242022, 11242022, 04122023]
1       2   01012023            [10242022, 04122023]
2       3   01012023                              []

I am trying to count all the times a person logged in within 3 months before the promo date. I have a function to do this that I call using apply, but it is way too slow for the numbers of records I have.

from dateutil.relativedelta import relativedelta
EXPECTED_DATE_FORMAT = '%m%d%Y'


def calculate_NTimesLoggedInXMonths(x_months, promo_date_str, login_dates):
    login_count = 0
    promo_date = pd.to_datetime(promo_date_str, format=EXPECTED_DATE_FORMAT)
    x_month_back = promo_date - relativedelta(months=x_months)
    for date in login_dates:
        if x_month_back < pd.to_datetime(date, format=EXPECTED_DATE_FORMAT) < promo_date:
            login_count  = 1
    return login_count

# Start the calculation
start = datetime.now()
print("Start time is ", start)
df[f'NTimesLoggedIn3Months'] = df.apply(
    lambda row: calculate_NTimesLoggedInXMonths(3, row['promo_date'], row['logins']),
    axis=1
)

end = datetime.now()
print("Run time:", end - start)

And the expected result is:

  promo_date                          logins  NTimesLoggedIn3Months
0   01012023  [10242022, 11242022, 04122023]                      2
1   01012023            [10242022, 04122023]                      1
2   01012023                              []                      0

I think the best solution would be to take advantage of the Series.dt accessor, but I am not sure how to do this with a list. Even if I break apart the list so each id is repeated for each login, I am still unsure how to use the dt accessor to get this count.

CodePudding user response：

I would recommend expanding out of a list to make better use of pandas optimizations to work on Series.

exp = df.explode('logins')
exp['promo_date'] = pd.to_datetime(exp['promo_date'], format='%m%d%Y')
exp['logins'] = pd.to_datetime(exp['logins'], format='%m%d%Y')

exp['within_3mo'] = ((exp['promo_date'] - pd.DateOffset(months=3) <= exp['logins']) & 
    (exp['logins'] <= exp['promo_date']))

Then, you can calculate the sums with a groupby on the user_id.

>>> exp.groupby('user_id')['within_3mo'].sum()
user_id
1    2
2    1
3    0
Name: within_3mo, dtype: int64