Home > Mobile >  Calculate the difference to the last date in days grouped by customer
Calculate the difference to the last date in days grouped by customer

Time:06-03

I have a problem. I want to get the difference of the last date. For example 2021-03-22 until the next date (2021-03-18) it is 4 days. I want to calculate the difference in days between the row date and the last date for the customerId. So the complete calculation should be for each customer. And the last date should be None because I dont have any older date.

Dataframe

   customerId    fromDate otherInformation
0           1  2021-02-22              Cat
1           1  2021-03-18              Dog
2           1  2021-03-22              Cat
3           1  2021-02-10              Cat
4           1  2021-09-07              Cat
5           1        None          Elefant
6           1  2022-01-18             Fish
7           2  2021-05-17             Fish

Code

import pandas as pd


d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22', 
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
     'otherInformation': ['Cat', 'Dog', 'Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish']
    }
df = pd.DataFrame(data=d)
print(df)

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])
df = df.drop_duplicates(subset=['customerId', 'fromDate'], keep='first')
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#print(True in df.index.duplicated())
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())

print(df)

What I have

     customerId   fromDate lastInteractivity
0             1 2021-02-10          477 days
1             1 2021-02-11               NaT
2             1 2021-02-12               NaT
3             1 2021-02-13               NaT
4             1 2021-02-14               NaT
..          ...        ...               ...
339           1 2022-01-15               NaT
340           1 2022-01-16               NaT
341           1 2022-01-17               NaT
342           1 2022-01-18          135 days
343           2 2021-05-17          381 days

[344 rows x 3 columns]

What I want

   customerId    fromDate otherInformation    lastInDays
0           1  2021-02-22              Cat          12 #last date 2021-02-10         
1           1  2021-03-18              Dog          36 #last date 2021-02-22
2           1  2021-03-22              Cat           4 #last date 2021-03-18
3           1  2021-02-10              Cat        None #last date not found
4           1  2021-09-07              Cat         169 #last date 2021-03-22
5           1        None          Elefant        None #was None
6           1  2022-01-18             Fish         133 #last date 2021-09-07
7           2  2021-05-17             Fish        None #last date not found

CodePudding user response:

Sort the dataframe by date column then group by customerId and shift the date column then subtract this from the original date column to get the difference in days

df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()

   customerId   fromDate otherInformation lastindays
0           1 2021-02-22              Cat    12 days
1           1 2021-03-18              Dog    24 days
2           1 2021-03-22              Cat     4 days
3           1 2021-02-10              Cat        NaT
4           1 2021-09-07              Cat   169 days
5           1        NaT          Elefant        NaT
6           1 2022-01-18             Fish   133 days
7           2 2021-05-17             Fish        NaT
  • Related