Calculate how many touch points the customer had in X months-CodePudding

I have a problem. I want to calculate from a date for example 2022-06-01 how many touches the customer with the customerId == 1 had in the last 6 months. He had two touches 2022-05-25 and 2022-05-20. I have now calculated the date up to which the data should be taken into account. However, I don't know how to group the customer and say the date you have is up to count_from_date how many touches the customer has had.

Dataframe

  customerId    fromDate
0           1  2022-06-01
1           1  2022-05-25
2           1  2022-05-25
3           1  2022-05-20
4           1  2021-09-05
5           2  2022-06-02
6           3  2021-03-01
7           3  2021-02-01

import pandas as pd


d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
     'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
                  "2022-06-02", "2021-03-01", "2021-02-01"]
    }
df = pd.DataFrame(data=d)
print(df)

from datetime import date
from dateutil.relativedelta import relativedelta
def find_last_date(date):
  six_months = date   relativedelta(months=-6)
  return six_months

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['count_from_date'] = df['fromDate'].apply(lambda x: find_last_date(x))
print(df)

What I have

   customerId   fromDate count_from_date
0           1 2022-06-01      2021-12-01
1           1 2022-05-25      2021-11-25
2           1 2022-05-25      2021-11-25
3           1 2022-05-20      2021-11-20
4           1 2021-09-05      2021-03-05
5           2 2022-06-02      2021-12-02
6           3 2021-03-01      2020-09-01
7           3 2021-02-01      2020-08-01

What I want

   customerId   fromDate count_from_date  occur_last_6_months
0           1 2022-06-01      2021-12-01  3   # 2022-05-25, 2022-05-20, 2022-05-20 = 3
1           1 2022-05-25      2021-11-25  1   # 2022-05-20 = 1
2           1 2022-05-25      2021-11-25  1   # 2022-05-20 = 1
3           1 2022-05-20      2021-11-20  0   # No in the last 6 months
4           1 2021-09-05      2021-03-05  0   # No in the last 6 months
5           2 2022-06-02      2021-12-02  0   # No in the last 6 months
6           3 2021-03-01      2020-09-01  1   # 2021-02-01 = 1
7           3 2021-02-01      2020-08-01  0   # No in the last 6 months

CodePudding user response：

You can try groupby customerId and loop through the rows in subgroup to count number of fromDate between fromDate and count_from_date

def count(g):
    m = pd.concat([g['fromDate'].between(d1, d2, 'neither')
                   for d1, d2 in zip(g['count_from_date'], g['fromDate'])], axis=1)
    g = g.assign(occur_last_6_months=m.sum().tolist())
    return g

out = df.groupby('customerId').apply(count)

print(out)

   customerId   fromDate count_from_date  occur_last_6_months
0           1 2022-06-01      2021-12-01                    3
1           1 2022-05-25      2021-11-25                    1
2           1 2022-05-25      2021-11-25                    1
3           1 2022-05-20      2021-11-20                    0
4           1 2021-09-05      2021-03-05                    0
5           2 2022-06-02      2021-12-02                    0
6           3 2021-03-01      2020-09-01                    1
7           3 2021-02-01      2020-08-01                    0

CodePudding user response：

For this problem, the challenge for a performant solution is to manipulate the data as to have an appropriate structure to run rolling window operations on it.

First of all, we need to avoid having duplicate indices. In your case, this means aggregating multiple touch points in a single day:

>>> df = df.groupby(['customerId', 'fromDate'], as_index=False).count() 

   customerId   fromDate  count_from_date
0           1 2021-09-05                1
1           1 2022-05-20                1
2           1 2022-05-25                2
3           1 2022-06-01                1
4           2 2022-06-02                1
5           3 2021-02-01                1
6           3 2021-03-01                1

Now, we can set the index to fromDate, sort it and groupby customerId as to be able to use rolling windows. I here use a 180D rolling window (6 months):

>>> roll_df = df.set_index(['fromDate'])
                .sort_index()
                .groupby('customerId')
                .apply(lambda s: s['count_from_date'].rolling('180D').sum())

The sort_index step is important to ensure your data is monotonically increasing.

However, this also counts the touch on the day itself, which seems not what you want, so we remove 1 from the result:

>>> roll_df - 1

customerId  fromDate  
1           2021-09-05    0.0
            2022-05-20    0.0
            2022-05-25    2.0
            2022-06-01    3.0
2           2022-06-02    0.0
3           2021-02-01    0.0
            2021-03-01    1.0
Name: count_from_date, dtype: float64

Finally, we divide by the initial counts to get back to the original structure:

 >>> roll_df / df.set_index(['customerId', 'fromDate'])['count_from_date']

  customerId   fromDate  count_from_date
0           1 2021-09-05              0.0
1           1 2022-05-20              0.0
2           1 2022-05-25              1.0
3           1 2022-06-01              3.0
4           2 2022-06-02              0.0
5           3 2021-02-01              0.0
6           3 2021-03-01              1.0

You can always .reset_index() at the end.

The one liner solution is

(df.set_index(['fromDate'])
   .sort_index()
   .groupby('customerId')
   .apply(lambda s: s['count_from_date'].rolling('180D').sum())
   - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']