I'm looking for an efficient way to take a column of date time objects and for each element of the column, determine which period that date falls within.
Perhaps this is best illustrated with an example. The code below does the job, in that it will give me a period in the "Period" column that contains the time in the "Dates" column. The problem is that for a dataset of any size this is really painfully slow. I'm sure there must be a more efficient way to do this but couldn't find an example of that anywhere.
import pandas as pd
import datetime
today = datetime.datetime.now().date()
dates = pd.date_range(start=today,freq='1H',periods=100)
df=pd.DataFrame({'Dates':dates})
periods = pd.period_range(start=today, freq='d',periods = 10)
df['Period'] = df.Dates.apply(lambda x: [p for p in periods if (p.start_time<=x)&(p.end_time>x)][0])
CodePudding user response:
I changed your set-up to give more interesting results.
import pandas as pd
import datetime
today = datetime.datetime.now().date()
dates = pd.date_range(start=today, freq='d', periods=10)
df = pd.DataFrame({'Dates':dates})
periods = pd.period_range(start=today-pd.to_timedelta('1d'), freq='3d', periods=5)
df
& periods
Dates
0 2022-07-20
1 2022-07-21
2 2022-07-22
3 2022-07-23
4 2022-07-24
5 2022-07-25
6 2022-07-26
7 2022-07-27
8 2022-07-28
9 2022-07-29
PeriodIndex(['2022-07-19', '2022-07-22', '2022-07-25', '2022-07-28',
'2022-07-31'],
dtype='period[3D]')
Doing:
df['interval'] = pd.arrays.IntervalArray(pd.cut(df.Dates, bins=periods.to_timestamp(how='s')))
print(df)
Output:
Dates interval
0 2022-07-20 (2022-07-19, 2022-07-22]
1 2022-07-21 (2022-07-19, 2022-07-22]
2 2022-07-22 (2022-07-19, 2022-07-22]
3 2022-07-23 (2022-07-22, 2022-07-25]
4 2022-07-24 (2022-07-22, 2022-07-25]
5 2022-07-25 (2022-07-22, 2022-07-25]
6 2022-07-26 (2022-07-25, 2022-07-28]
7 2022-07-27 (2022-07-25, 2022-07-28]
8 2022-07-28 (2022-07-25, 2022-07-28]
9 2022-07-29 (2022-07-28, 2022-07-31]
We can get that left value by doing:
df['start'] = df['interval'].array.left
print(df)
Dates interval start
0 2022-07-20 (2022-07-19, 2022-07-22] 2022-07-19
1 2022-07-21 (2022-07-19, 2022-07-22] 2022-07-19
2 2022-07-22 (2022-07-19, 2022-07-22] 2022-07-19
3 2022-07-23 (2022-07-22, 2022-07-25] 2022-07-22
4 2022-07-24 (2022-07-22, 2022-07-25] 2022-07-22
5 2022-07-25 (2022-07-22, 2022-07-25] 2022-07-22
6 2022-07-26 (2022-07-25, 2022-07-28] 2022-07-25
7 2022-07-27 (2022-07-25, 2022-07-28] 2022-07-25
8 2022-07-28 (2022-07-25, 2022-07-28] 2022-07-25
9 2022-07-29 (2022-07-28, 2022-07-31] 2022-07-28