Calculation of how many touch points the customer has had in the last 6 months-CodePudding

I have a problem. I want to calculate from a date for example 2022-06-01 how many touches the customer with the customerId == 1 had in the last 6 months. He had two touches 2022-05-25 and 2022-05-20. However, I don't know how to group the customer and say the date you have is up to count_from_date how many touches the customer has had. I also got an KeyError.

Dataframe

  customerId    fromDate
0           1  2022-06-01
1           1  2022-05-25
2           1  2022-05-25
3           1  2022-05-20
4           1  2021-09-05
5           2  2022-06-02
6           3  2021-03-01
7           3  2021-02-01

import pandas as pd


d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
     'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
                  "2022-06-02", "2021-03-01", "2021-02-01"]
    }
df = pd.DataFrame(data=d)
print(df)

df_new  = df.groupby(['customerId', 'fromDate'], as_index=False)['fromDate'].count()
df_new['count_from_date'] = df_new['fromDate']
df = df.merge(df_new['count_from_date'], how='inner', left_index=True, right_index=True)
(df.set_index(['fromDate']).sort_index().groupby('customerId').apply(lambda s: s['count_from_date'].rolling('180D').sum())- 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']

[OUT] KeyError: 'count_from_date'

What I want

   customerId   fromDate      occur_last_6_months
0           1 2022-06-01      3   # 2022-05-25, 2022-05-20, 2022-05-20 = 3
1           1 2022-05-25      1   # 2022-05-20 = 1
2           1 2022-05-25      1   # 2022-05-20 = 1
3           1 2022-05-20      0   # No in the last 6 months
4           1 2021-09-05      0   # No in the last 6 months
5           2 2022-06-02      0   # No in the last 6 months
6           3 2021-03-01      1   # 2021-02-01 = 1
7           3 2021-02-01      0   # No in the last 6 months

CodePudding user response：

If possible sum duplicated values like second and third row count matched values in mask by sum only True values:

df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)

def f(x):
    
    d1 = x["fromDate"].to_numpy()
    d2 = x["last_month"].to_numpy()
    x['occur_last_6_months'] = ((d2[:, None]<= d1) & (d1  <= d1[:, None])).sum(axis=1) - 1
   
    return x

df = df.groupby('customerId').apply(f)

print(df)
   customerId   fromDate last_month  occur_last_6_months
0           1 2022-06-01 2021-12-01                    3
1           1 2022-05-25 2021-11-25                    2
2           1 2022-05-25 2021-11-25                    2
3           1 2022-05-20 2021-11-20                    0
4           1 2021-09-05 2021-03-05                    0
5           2 2022-06-02 2021-12-02                    0
6           3 2021-03-01 2020-09-01                    1
7           3 2021-02-01 2020-08-01                    0

If need subtract by all count per duplciated dates instead subtract 1 use GroupBy.transform with size:

df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)

def f(x):
    
    d1 = x["fromDate"].to_numpy()
    d2 = x["last_month"].to_numpy()
    x['occur_last_6_months'] = ((d2[:, None]<= d1) & (d1  <= d1[:, None])).sum(axis=1)
   
    return x

df = df.groupby('customerId').apply(f)

s = df.groupby(['customerId', 'fromDate'])['customerId'].transform('size')
df['occur_last_6_months'] -= s

print(df)
   customerId   fromDate last_month  occur_last_6_months
0           1 2022-06-01 2021-12-01                    3
1           1 2022-05-25 2021-11-25                    1
2           1 2022-05-25 2021-11-25                    1
3           1 2022-05-20 2021-11-20                    0
4           1 2021-09-05 2021-03-05                    0
5           2 2022-06-02 2021-12-02                    0
6           3 2021-03-01 2020-09-01                    1
7           3 2021-02-01 2020-08-01                    0