I have a problem. I want to get the difference of the last date. For example 2021-03-22
until the next date (2021-03-18
) it is 4 days. I want to calculate the difference in days between the row date and the last date for the customerId
. So the complete calculation should be for each customer. And the last date should be None
because I dont have any older date. The problem is that if the same date occurs more than once, the second date becomes 0
. It should look again at when the previous date was and not take the current date.
Dataframe
customerId fromDate otherInformation
0 1 2021-02-22 Cat
1 1 2021-02-22 Dog
2 1 2021-03-18 Elefant
3 1 2021-03-18 Cat
4 1 2021-03-18 Cat
5 1 2021-03-22 Cat
6 1 2021-02-10 Cat
7 1 2021-09-07 Cat
8 1 None Elefant
9 1 2022-01-18 Fish
10 2 2021-05-17 Fish
Code
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
'fromDate': ['2021-02-22','2021-02-22', '2021-03-18','2021-03-18', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
'otherInformation': ['Cat', 'Dog', 'Elefant', 'Cat', 'Cat','Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish']
}
df = pd.DataFrame(data=d)
print(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()
print(df)
What I have
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-02-22 Dog 0 days
2 1 2021-03-18 Elefant 24 days
3 1 2021-03-18 Cat 0 days
4 1 2021-03-18 Cat 0 days
5 1 2021-03-22 Cat 4 days
6 1 2021-02-10 Cat NaT
7 1 2021-09-07 Cat 169 days
8 1 NaT Elefant NaT
9 1 2022-01-18 Fish 133 days
10 2 2021-05-17 Fish NaT
What I want
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-02-22 Dog 12 days # from 0 -> 12
2 1 2021-03-18 Elefant 24 days
3 1 2021-03-18 Cat 24 days # from 0 -> 24
4 1 2021-03-18 Cat 24 days # from 0 -> 24
5 1 2021-03-22 Cat 4 days
6 1 2021-02-10 Cat NaT
7 1 2021-09-07 Cat 169 days
8 1 NaT Elefant NaT
9 1 2022-01-18 Fish 133 days
10 2 2021-05-17 Fish NaT
CodePudding user response:
Add DataFrame.drop_duplicates
with GroupBy.ffill
:
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').drop_duplicates(['customerId','fromDate']).groupby('customerId')['fromDate'].shift()
df['lastindays'] = df.groupby(['customerId','fromDate'])['lastindays'].ffill()
print(df)
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-02-22 Dog 12 days
2 1 2021-03-18 Elefant 24 days
3 1 2021-03-18 Cat 24 days
4 1 2021-03-18 Cat 24 days
5 1 2021-03-22 Cat 4 days
6 1 2021-02-10 Cat NaT
7 1 2021-09-07 Cat 169 days
8 1 NaT Elefant NaT
9 1 2022-01-18 Fish 133 days
10 2 2021-05-17 Fish NaT
Another idea is create helper DataFrame without duplicates, use DataFrameGroupBy.diff
and for new column left join:
df1 = df.sort_values('fromDate').drop_duplicates(['customerId','fromDate'])
df1['lastindays'] = df1.groupby('customerId')['fromDate'].diff()
df = df.merge(df1[['lastindays','customerId','fromDate']], on=['customerId','fromDate'], how='left')
print(df)
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-02-22 Dog 12 days
2 1 2021-03-18 Elefant 24 days
3 1 2021-03-18 Cat 24 days
4 1 2021-03-18 Cat 24 days
5 1 2021-03-22 Cat 4 days
6 1 2021-02-10 Cat NaT
7 1 2021-09-07 Cat 169 days
8 1 NaT Elefant NaT
9 1 2022-01-18 Fish 133 days
10 2 2021-05-17 Fish NaT