I have a problem. I want to calculate from a date for example 2022-06-01
how many touches the customer with the customerId == 1
had in the last 6 months
. He had two touches 2022-05-25
and 2022-05-20
. However, I don't know how to group the customer and say the date you have is up to count_from_date
how many touches the customer has had. I also got an KeyError
.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
"2022-06-02", "2021-03-01", "2021-02-01"]
}
df = pd.DataFrame(data=d)
print(df)
df_new = df.groupby(['customerId', 'fromDate'], as_index=False)['fromDate'].count()
df_new['count_from_date'] = df_new['fromDate']
df = df.merge(df_new['count_from_date'], how='inner', left_index=True, right_index=True)
(df.set_index(['fromDate']).sort_index().groupby('customerId').apply(lambda s: s['count_from_date'].rolling('180D').sum())- 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
[OUT] KeyError: 'count_from_date'
What I want
customerId fromDate occur_last_6_months
0 1 2022-06-01 3 # 2022-05-25, 2022-05-20, 2022-05-20 = 3
1 1 2022-05-25 1 # 2022-05-20 = 1
2 1 2022-05-25 1 # 2022-05-20 = 1
3 1 2022-05-20 0 # No in the last 6 months
4 1 2021-09-05 0 # No in the last 6 months
5 2 2022-06-02 0 # No in the last 6 months
6 3 2021-03-01 1 # 2021-02-01 = 1
7 3 2021-02-01 0 # No in the last 6 months
CodePudding user response:
If possible sum duplicated values like second and third row count matched values in mask by sum
only True values:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
d1 = x["fromDate"].to_numpy()
d2 = x["last_month"].to_numpy()
x['occur_last_6_months'] = ((d2[:, None]<= d1) & (d1 <= d1[:, None])).sum(axis=1) - 1
return x
df = df.groupby('customerId').apply(f)
print(df)
customerId fromDate last_month occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 2
2 1 2022-05-25 2021-11-25 2
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
If need subtract by all count per duplciated dates instead subtract 1
use GroupBy.transform
with size
:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
d1 = x["fromDate"].to_numpy()
d2 = x["last_month"].to_numpy()
x['occur_last_6_months'] = ((d2[:, None]<= d1) & (d1 <= d1[:, None])).sum(axis=1)
return x
df = df.groupby('customerId').apply(f)
s = df.groupby(['customerId', 'fromDate'])['customerId'].transform('size')
df['occur_last_6_months'] -= s
print(df)
customerId fromDate last_month occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0