I am interested in calculating the time difference (in days) of time between orders from the same provider
sorce data frame
OrderNumber | CatID | CustomerID | CreateDate |
---|---|---|---|
6543 | 4901 | 1111 | 2008-01-01 |
6543 | 4901 | 2222 | 2008-01-03 |
1113 | 4910 | 3333 | 2008-01-03 |
1234 | 5436 | 1111 | 2008-01-05 |
1245 | 6434 | 2222 | 2008-01-10 |
1456 | 2453 | 1111 | 2008-01-12 |
1567 | 64355 | 3333 | 2008-01-13 |
** the OrderNumber and CatID not interesting **
i want calc the diffrence between date in same costumerID
TARGET DateFrame
OrderNumber | CatID | CustomerID | CreateDate | calc days |
---|---|---|---|---|
6543 | 4901 | 1111 | 2008-01-01 | 4 days |
6543 | 4901 | 2222 | 2008-01-03 | 7 days |
1113 | 4910 | 3333 | 2008-01-03 | 10 days |
1234 | 5436 | 1111 | 2008-01-05 | 7 days |
1245 | 6434 | 2222 | 2008-01-10 | 5 days |
1456 | 2453 | 1111 | 2008-01-12 | 1 days |
1567 | 64355 | 3333 | 2008-01-13 | - |
6436 | 64355 | 1111 | 2008-01-13 | - |
1253 | 64355 | 2222 | 2008-01-15 | - |
** the OrderNumber and CatID not interesting **
orginal data
OrderNumber CustomerID CreateDate
0 27986 712841200 2008-01-01
1 28917 712841200 2008-04-16
2 28451 712630100 2008-02-27
3 29836 712630100 2008-08-21
4 30158 712630100 2008-10-27
... ... ... ...
21621 44318 712841600 2014-12-03
21622 44322 712841600 2014-12-03
21623 42829 712847200 2014-03-05
21624 43547 712847200 2014-06-26
21625 42830 712847200 2014-03-05
CodePudding user response:
You must first convert your CreateDate
column to datetime
format:
df['CreateDate'] = pd.to_datetime(df['CreateDate'])
Then, you can find the location of the same CustomerID
using the .loc
method:
customer_ID = df.loc[df['CustomerID'] == '1111']
Finally, you can compute the difference days between two row:
abs((customer_ID['CreateDate'].loc[0] - customer_ID['CreateDate'][1]).days)
The abs
is for ensuring to return the positive value.
CodePudding user response:
As @Mohammadreza Riahi said, it would be best if you convert your CreateDate column to datetime format (but save the old date):
df['CreateDate_dt'] = pd.to_datetime(df['CreateDate'])
(At this point, it's recommended to confirm the dates were parsed correctly).
Then I would order by customer id and date:
df = df.sort_values([CustomerID,CreateDate_dt])
Then add this column:
df['date_diff'] = np.where(df['CustomerID']==df['CustomerID'].shift(),
df['CreateDate_dt'] df['CreateDate_dt'].shift(),
np.nan)
The cool thing here is that there's NaN next to the first order of the customer and values in days between all the rest.
if you want days in INT:
df['date_diff_days']=df['date_diff'].astype('timedelta64[D]')
Now you can do this for example:
df.groupby('CustomerID').agg(purchases=('OrderNumber', 'nunique'),
first_purchase=('CreateDate_dt', 'min'),
mean_days_between_purchases=('date_diff_days', 'mean'))
CodePudding user response:
the answer is
data = data.sort_values(by=['CustomerID', 'CreateDate'], ignore_index=True)
data['DaysFromPrevCustOrder'] = np.where(
data['CustomerID'].shift(1) == data['CustomerID'],
(data['CreateDate'] - data['CreateDate'].shift(1)).dt.days,np.nan)
** ## output ## **
CustomerID CreateDate OrderNumber DaysFromPrevCustOrder
0 712013005 2008-02-13 28327 NaN
1 712013005 2008-02-13 28327 0.0
2 712013005 2008-02-13 28327 0.0
3 712013005 2009-03-22 31051 403.0
4 712013005 2009-03-22 31051 0.0
... ... ... ... ...
21621 712970100 2009-06-23 31662 9.0
21622 712970100 2010-04-08 33420 289.0
21623 712970100 2011-01-06 35184 273.0
21624 712970100 2012-05-28 38712 508.0
21625 7122973100 2008-09-28 30020 NaN