Home > Net >  Find the closest date for each customerId and calculate the time span
Find the closest date for each customerId and calculate the time span

Time:05-25

I have a problem. I have a dataframe that contains the customerId and a date fromDate. Now I want to calculate for each customer individually when the next delivery is. For example, I have the customer with the customerId = 1 and he has bought something on 2021-03-18 I would now like to find the next date and output this distance in days e.g. 2021-03-22 and 4 days. In simple terms I want to calculate the next date in the future - from Date or n - (n-1). Unless the date has a next date, it should be None e.g. 2022-01-18 should be None.

I have a problem, I get a lot of None values, moreover, I should look at each customer separately. How can I do this?

Mathematical with an example

n - (n-1) = next_day_in_days
e.g.
2021-03-22 - 2021-03-18 = 4
[OUT]
    customerId    fromDate next_day_in_days
1            1  2021-03-18         4

Dataframe

    customerId    fromDate
0            1  2021-02-22
1            1  2021-03-18
2            1  2021-03-22
3            1  2021-02-10
4            1  2021-09-07
5            1        None
6            1  2022-01-18
7            2  2021-05-17
8            3  2021-05-17
9            3  2021-07-17
10           3  2021-02-22
11           3  2021-02-22

Code

import pandas as pd
import datetime

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22', 
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17', '2021-05-17', '2021-07-17', '2021-02-22', '2021-02-22']
    }
df = pd.DataFrame(data=d)


print(df)
def nearest(items, pivot):
  try:
    return min(items, key=lambda x: abs(x - pivot))
  except:
    return None

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
df["next_day_in_days"] = df['fromDate'].apply(lambda x: nearest(df['fromDate'], x)) 

Output

[OUT]
    customerId    fromDate next_in_days
0            1  2021-02-22         None
1            1  2021-03-18         None
2            1  2021-03-22         None
3            1  2021-02-10         None
4            1  2021-09-07         None
5            1         NaT         None
6            1  2022-01-18         None
7            2  2021-05-17         None
8            3  2021-05-17         None
9            3  2021-07-17         None
10           3  2021-02-22         None
11           3  2021-02-22         None
Name: next_in_days, dtype: object

What I want

    customerId    fromDate next_day_in_days
0            1  2021-02-22         24
1            1  2021-03-18         4
2            1  2021-03-22         109
3            1  2021-02-10         12
4            1  2021-09-07         133
5            1         NaT         None
6            1  2022-01-18         None
7            2  2021-05-17         None
8            3  2021-05-17         61
9            3  2021-07-17         None
10           3  2021-02-22         133
11           3  2021-02-22         133

CodePudding user response:

First sorting columns per customerId and fromDate, because possible duplicates remove them by same columns, so possible use DataFrameGroupBy.diff with Series.dt.days:

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
    
df = df.sort_values(['customerId','fromDate'])
df['next_day_in_days'] = (df.drop_duplicates(['customerId','fromDate'])
                            .groupby('customerId')['fromDate']
                            .diff(-1)
                            .dt.days
                            .abs())

Get original ordering of index if necessary.

df = df.sort_index()

Last repeat duplicated values per ['customerId', 'fromDate'], here last value 84.0 by GroupBy.ffill:

df['next_day_in_days'] = df.groupby(['customerId', 'fromDate'])['next_day_in_days'].ffill()
print (df)
    customerId   fromDate  next_day_in_days
0            1 2021-02-22              24.0
1            1 2021-03-18               4.0
2            1 2021-03-22             169.0
3            1 2021-02-10              12.0
4            1 2021-09-07             133.0
5            1        NaT               NaN
6            1 2022-01-18               NaN
7            2 2021-05-17               NaN
8            3 2021-05-17              61.0
9            3 2021-07-17               NaN
10           3 2021-02-22              84.0
11           3 2021-02-22              84.0
  • Related