I have a problem. I have a dataframe that contains the customerId
and a date fromDate
. Now I want to calculate for each customer individually when the next delivery is. For example, I have the customer with the customerId = 1
and he has bought something on 2021-03-18
I would now like to find the next date and output this distance in days e.g. 2021-03-22
and 4 days
. In simple terms I want to calculate the next date in the future - from Date
or n - (n-1)
. Unless the date has a next date, it should be None
e.g. 2022-01-18
should be None
.
I have a problem, I get a lot of None
values, moreover, I should look at each customer separately. How can I do this?
Mathematical with an example
n - (n-1) = next_day_in_days
e.g.
2021-03-22 - 2021-03-18 = 4
[OUT]
customerId fromDate next_day_in_days
1 1 2021-03-18 4
Dataframe
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
8 3 2021-05-17
9 3 2021-07-17
10 3 2021-02-22
11 3 2021-02-22
Code
import pandas as pd
import datetime
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17', '2021-05-17', '2021-07-17', '2021-02-22', '2021-02-22']
}
df = pd.DataFrame(data=d)
print(df)
def nearest(items, pivot):
try:
return min(items, key=lambda x: abs(x - pivot))
except:
return None
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
df["next_day_in_days"] = df['fromDate'].apply(lambda x: nearest(df['fromDate'], x))
Output
[OUT]
customerId fromDate next_in_days
0 1 2021-02-22 None
1 1 2021-03-18 None
2 1 2021-03-22 None
3 1 2021-02-10 None
4 1 2021-09-07 None
5 1 NaT None
6 1 2022-01-18 None
7 2 2021-05-17 None
8 3 2021-05-17 None
9 3 2021-07-17 None
10 3 2021-02-22 None
11 3 2021-02-22 None
Name: next_in_days, dtype: object
What I want
customerId fromDate next_day_in_days
0 1 2021-02-22 24
1 1 2021-03-18 4
2 1 2021-03-22 109
3 1 2021-02-10 12
4 1 2021-09-07 133
5 1 NaT None
6 1 2022-01-18 None
7 2 2021-05-17 None
8 3 2021-05-17 61
9 3 2021-07-17 None
10 3 2021-02-22 133
11 3 2021-02-22 133
CodePudding user response:
First sorting columns per customerId
and fromDate
, because possible duplicates remove them by same columns, so possible use DataFrameGroupBy.diff
with Series.dt.days
:
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df = df.sort_values(['customerId','fromDate'])
df['next_day_in_days'] = (df.drop_duplicates(['customerId','fromDate'])
.groupby('customerId')['fromDate']
.diff(-1)
.dt.days
.abs())
Get original ordering of index if necessary.
df = df.sort_index()
Last repeat duplicated values per ['customerId', 'fromDate']
, here last value 84.0
by GroupBy.ffill
:
df['next_day_in_days'] = df.groupby(['customerId', 'fromDate'])['next_day_in_days'].ffill()
print (df)
customerId fromDate next_day_in_days
0 1 2021-02-22 24.0
1 1 2021-03-18 4.0
2 1 2021-03-22 169.0
3 1 2021-02-10 12.0
4 1 2021-09-07 133.0
5 1 NaT NaN
6 1 2022-01-18 NaN
7 2 2021-05-17 NaN
8 3 2021-05-17 61.0
9 3 2021-07-17 NaN
10 3 2021-02-22 84.0
11 3 2021-02-22 84.0