I have a dataframe with a few million rows where I want to calculate the difference on a daily basis between two columns which are in datetime format.
There are stack overflow questions which answer this question computing the difference on a timestamp basis (see here
Doing it on the timestamp basis felt quite fast:
df["Differnce"] = (df["end_date"] - df["start_date"]).dt.days
But doing it on a daily basis felt quite slow:
df["Differnce"] = (df["end_date"].dt.date - df["start_date"].dt.date).dt.days
I was wondering if there is a easy but better/faster way to achieve the same result?
Example Code:
import pandas as pd
import numpy as np
data = {'Condition' :["a", "a", "b"],
'start_date': [pd.Timestamp('2022-01-01 23:00:00.000000'), pd.Timestamp('2022-01-01 23:00:00.000000'), pd.Timestamp('2022-01-01 23:00:00.000000')],
'end_date': [pd.Timestamp('2022-01-02 01:00:00.000000'), pd.Timestamp('2022-02-01 23:00:00.000000'), pd.Timestamp('2022-01-02 01:00:00.000000')]}
df = pd.DataFrame(data)
df["Right_Difference"] = np.where((df["Condition"] == "a"), ((df["end_date"].dt.date - df["start_date"].dt.date).dt.days), np.nan)
df["Wrong_Difference"] = np.where((df["Condition"] == "a"), ((df["end_date"] - df["start_date"]).dt.days), np.nan)
CodePudding user response:
Use Series.dt.to_period
, faster is Series.dt.normalize
or Series.dt.floor
:
#300k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [286]: %timeit (df["end_date"].dt.date - df["start_date"].dt.date).dt.days
1.14 s ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [287]: %timeit df["end_date"].dt.to_period('d').astype('int') - df["start_date"].dt.to_period('d').astype('int')
64.1 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [288]: %timeit (df["end_date"].dt.normalize() - df["start_date"].dt.normalize()).dt.days
27.7 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [289]: %timeit (df["end_date"].dt.floor('d') - df["start_date"].dt.floor('d')).dt.days
27.7 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)