I have a problem. I want to get the difference of the last date. For example 2021-03-22
until the next date (2021-03-18
) it is 4 days. I want to calculate the difference in days between the row date and the last date for the customerId
. So the complete calculation should be for each customer. And the last date should be None
because I dont have any older date.
Dataframe
customerId fromDate otherInformation
0 1 2021-02-22 Cat
1 1 2021-03-18 Dog
2 1 2021-03-22 Cat
3 1 2021-02-10 Cat
4 1 2021-09-07 Cat
5 1 None Elefant
6 1 2022-01-18 Fish
7 2 2021-05-17 Fish
Code
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
'otherInformation': ['Cat', 'Dog', 'Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish']
}
df = pd.DataFrame(data=d)
print(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])
df = df.drop_duplicates(subset=['customerId', 'fromDate'], keep='first')
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#print(True in df.index.duplicated())
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
.set_index('fromDate')
.groupby('customerId')['lastInteractivity']
.apply(lambda x: x.asfreq('d'))
.reset_index())
print(df)
What I have
customerId fromDate lastInteractivity
0 1 2021-02-10 477 days
1 1 2021-02-11 NaT
2 1 2021-02-12 NaT
3 1 2021-02-13 NaT
4 1 2021-02-14 NaT
.. ... ... ...
339 1 2022-01-15 NaT
340 1 2022-01-16 NaT
341 1 2022-01-17 NaT
342 1 2022-01-18 135 days
343 2 2021-05-17 381 days
[344 rows x 3 columns]
What I want
customerId fromDate otherInformation lastInDays
0 1 2021-02-22 Cat 12 #last date 2021-02-10
1 1 2021-03-18 Dog 36 #last date 2021-02-22
2 1 2021-03-22 Cat 4 #last date 2021-03-18
3 1 2021-02-10 Cat None #last date not found
4 1 2021-09-07 Cat 169 #last date 2021-03-22
5 1 None Elefant None #was None
6 1 2022-01-18 Fish 133 #last date 2021-09-07
7 2 2021-05-17 Fish None #last date not found
CodePudding user response:
Sort
the dataframe by date column then group by customerId
and shift
the date column then subtract this from the original date column to get the difference in days
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-03-18 Dog 24 days
2 1 2021-03-22 Cat 4 days
3 1 2021-02-10 Cat NaT
4 1 2021-09-07 Cat 169 days
5 1 NaT Elefant NaT
6 1 2022-01-18 Fish 133 days
7 2 2021-05-17 Fish NaT