I have a problem. I want to answer some question (see below). Unfortunately I got an error ValueError: Wrong number of items passed 0, placement implies 1
. How could I determine the questions?
- When was the last interactivity how many days ago (from today)?
- How many orders has the customer placed ?
- When was the fastest interactivity ?
- When was the shortest interactivity ?
- How much was the average of the interactivity ? (For that I calculated the days)
From 2021-02-10 to 2021-02-22 = 12
From 2021-02-22 to 2021-03-18 = 24
From 2021-03-18 to 2021-03-22 = 4
From 2021-03-22 to 2021-09-07 = 109
From 2021-09-07 to 2022-01-18 = 193
-------
68,4 (average days)
Dataframe
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
Code
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17']
}
df = pd.DataFrame(data=d)
display(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
#df = df['fromDate'].groupby('customerId', group_keys=False)
df = df.sort_values(['customerId','fromDate'],ascending=False)#.groupby('customerId')
df_new = pd.DataFrame()
df_new['average'] = df.groupby('customerId').mean()
[OUT] AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
df_new = pd.DataFrame()
df_new['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate'].max()
[OUT] TypeError: '>=' not supported between instances of 'datetime.date' and 'float'
What I want
customerId lastInteractivity howMany shortest Longest Average
2 371 1 None None None
1 125 5 4 193 68,4
# shortes, longest, average is None because customer with the Id 2 had only 1 date
CodePudding user response:
Use:
#converting to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
.set_index('fromDate')
.groupby('customerId')['lastInteractivity']
.apply(lambda x: x.asfreq('d'))
.reset_index())
#count how many missing dates
m = df['lastInteractivity'].notna()
df1 = (df[~m].groupby(['customerId', m.cumsum()])['customerId']
.size()
.add(1)
.reset_index(name='Count'))
print (df1)
customerId lastInteractivity Count
0 1 1 12
1 1 2 24
2 1 3 4
3 1 4 169
4 1 5 133
df1 = df1.groupby('customerId').agg(howMany=('Count','size'),
shortest=('Count','min'),
Longest=('Count','max'),
Average=('Count','mean'))
#get last lastInteractivity and joined df1
df = (df.groupby('customerId')['lastInteractivity']
.last()
.dt.days
.sort_index(ascending=False)
.to_frame()
.join(df1)
.reset_index())
print (df)
customerId lastInteractivity howMany shortest Longest Average
0 2 371 NaN NaN NaN NaN
1 1 125 5.0 4.0 169.0 68.4