Calculate difference to the last date-CodePudding

I have a problem. I want to get the difference of the last date. For example 2021-03-22 until the next date (2021-03-18) it is 4 days. I want to calculate the difference in days between the row date and the last date for the customerId. So the complete calculation should be for each customer. And the last date should be None because I dont have any older date. The problem is that if the same date occurs more than once, the second date becomes 0. It should look again at when the previous date was and not take the current date.

Dataframe

    customerId    fromDate otherInformation
0            1  2021-02-22              Cat
1            1  2021-02-22              Dog
2            1  2021-03-18          Elefant
3            1  2021-03-18              Cat
4            1  2021-03-18              Cat
5            1  2021-03-22              Cat
6            1  2021-02-10              Cat
7            1  2021-09-07              Cat
8            1        None          Elefant
9            1  2022-01-18             Fish
10           2  2021-05-17             Fish

Code

import pandas as pd


d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
     'fromDate': ['2021-02-22','2021-02-22', '2021-03-18','2021-03-18', '2021-03-18', '2021-03-22', 
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
     'otherInformation': ['Cat', 'Dog', 'Elefant', 'Cat', 'Cat','Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish']
    }
df = pd.DataFrame(data=d)
print(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()
print(df)

What I have

    customerId   fromDate otherInformation lastindays
0            1 2021-02-22              Cat    12 days
1            1 2021-02-22              Dog     0 days
2            1 2021-03-18          Elefant    24 days
3            1 2021-03-18              Cat     0 days
4            1 2021-03-18              Cat     0 days
5            1 2021-03-22              Cat     4 days
6            1 2021-02-10              Cat        NaT
7            1 2021-09-07              Cat   169 days
8            1        NaT          Elefant        NaT
9            1 2022-01-18             Fish   133 days
10           2 2021-05-17             Fish        NaT

What I want

    customerId   fromDate otherInformation lastindays
0            1 2021-02-22              Cat    12 days
1            1 2021-02-22              Dog    12 days # from 0 -> 12
2            1 2021-03-18          Elefant    24 days
3            1 2021-03-18              Cat    24 days # from 0 -> 24
4            1 2021-03-18              Cat    24 days # from 0 -> 24
5            1 2021-03-22              Cat     4 days
6            1 2021-02-10              Cat        NaT
7            1 2021-09-07              Cat   169 days
8            1        NaT          Elefant        NaT
9            1 2022-01-18             Fish   133 days
10           2 2021-05-17             Fish        NaT

CodePudding user response：

Add DataFrame.drop_duplicates with GroupBy.ffill:

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').drop_duplicates(['customerId','fromDate']).groupby('customerId')['fromDate'].shift()
df['lastindays'] = df.groupby(['customerId','fromDate'])['lastindays'].ffill()
print(df)
    customerId   fromDate otherInformation lastindays
0            1 2021-02-22              Cat    12 days
1            1 2021-02-22              Dog    12 days
2            1 2021-03-18          Elefant    24 days
3            1 2021-03-18              Cat    24 days
4            1 2021-03-18              Cat    24 days
5            1 2021-03-22              Cat     4 days
6            1 2021-02-10              Cat        NaT
7            1 2021-09-07              Cat   169 days
8            1        NaT          Elefant        NaT
9            1 2022-01-18             Fish   133 days
10           2 2021-05-17             Fish        NaT

Another idea is create helper DataFrame without duplicates, use DataFrameGroupBy.diff and for new column left join:

df1 = df.sort_values('fromDate').drop_duplicates(['customerId','fromDate'])
df1['lastindays'] = df1.groupby('customerId')['fromDate'].diff()
df = df.merge(df1[['lastindays','customerId','fromDate']], on=['customerId','fromDate'], how='left')
print(df)
    customerId   fromDate otherInformation lastindays
0            1 2021-02-22              Cat    12 days
1            1 2021-02-22              Dog    12 days
2            1 2021-03-18          Elefant    24 days
3            1 2021-03-18              Cat    24 days
4            1 2021-03-18              Cat    24 days
5            1 2021-03-22              Cat     4 days
6            1 2021-02-10              Cat        NaT
7            1 2021-09-07              Cat   169 days
8            1        NaT          Elefant        NaT
9            1 2022-01-18             Fish   133 days
10           2 2021-05-17             Fish        NaT