Home > database >  Calculate some date questions
Calculate some date questions

Time:05-23

I have a problem. I want to answer some question (see below). Unfortunately I got an error ValueError: Wrong number of items passed 0, placement implies 1. How could I determine the questions?

  • When was the last interactivity how many days ago (from today)?
  • How many orders has the customer placed ?
  • When was the fastest interactivity ?
  • When was the shortest interactivity ?
  • How much was the average of the interactivity ? (For that I calculated the days)

From 2021-02-10 to 2021-02-22 = 12
From 2021-02-22 to 2021-03-18 = 24
From 2021-03-18 to 2021-03-22 = 4
From 2021-03-22 to 2021-09-07 = 109
From 2021-09-07 to 2022-01-18 = 193
-------
68,4 (average days)

Dataframe

    customerId  fromDate
0   1   2021-02-22
1   1   2021-03-18
2   1   2021-03-22
3   1   2021-02-10
4   1   2021-09-07
5   1   None
6   1   2022-01-18
7   2   2021-05-17

Code

import pandas as pd


d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22', 
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17']
    }
df = pd.DataFrame(data=d)
display(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
#df = df['fromDate'].groupby('customerId', group_keys=False)
df = df.sort_values(['customerId','fromDate'],ascending=False)#.groupby('customerId')

df_new = pd.DataFrame()
df_new['average'] = df.groupby('customerId').mean()

[OUT] AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'

df_new = pd.DataFrame()
df_new['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate'].max()

[OUT] TypeError: '>=' not supported between instances of 'datetime.date' and 'float'

What I want

customerId  lastInteractivity  howMany shortest Longest  Average
2           371                1        None    None      None
1           125                5        4        193      68,4
# shortes, longest, average is None because customer with the Id 2 had only 1 date

CodePudding user response:

Use:

#converting to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])

#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']

#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())

#count how many missing dates
m = df['lastInteractivity'].notna()
df1 = (df[~m].groupby(['customerId', m.cumsum()])['customerId']
             .size()
             .add(1)
             .reset_index(name='Count'))
print (df1)
   customerId  lastInteractivity  Count
0           1                  1     12
1           1                  2     24
2           1                  3      4
3           1                  4    169
4           1                  5    133

df1 = df1.groupby('customerId').agg(howMany=('Count','size'),
                                  shortest=('Count','min'),
                                  Longest=('Count','max'),
                                  Average=('Count','mean'))

#get last lastInteractivity and joined df1
df = (df.groupby('customerId')['lastInteractivity']
        .last()
        .dt.days
        .sort_index(ascending=False)
        .to_frame()
        .join(df1)
        .reset_index())
print (df)

   customerId  lastInteractivity  howMany  shortest  Longest  Average
0           2                371      NaN       NaN      NaN      NaN
1           1                125      5.0       4.0    169.0     68.4
  • Related