Home > Enterprise >  efficient way to identify the period corresponding to a column of dates
efficient way to identify the period corresponding to a column of dates

Time:07-21

I'm looking for an efficient way to take a column of date time objects and for each element of the column, determine which period that date falls within.

Perhaps this is best illustrated with an example. The code below does the job, in that it will give me a period in the "Period" column that contains the time in the "Dates" column. The problem is that for a dataset of any size this is really painfully slow. I'm sure there must be a more efficient way to do this but couldn't find an example of that anywhere.

import pandas as pd
import datetime



today = datetime.datetime.now().date()


dates = pd.date_range(start=today,freq='1H',periods=100)

df=pd.DataFrame({'Dates':dates})

periods = pd.period_range(start=today, freq='d',periods = 10)

df['Period'] = df.Dates.apply(lambda x: [p for p in periods if (p.start_time<=x)&(p.end_time>x)][0])

CodePudding user response:

I changed your set-up to give more interesting results.

import pandas as pd
import datetime
today = datetime.datetime.now().date()
dates = pd.date_range(start=today, freq='d', periods=10)
df = pd.DataFrame({'Dates':dates})
periods = pd.period_range(start=today-pd.to_timedelta('1d'), freq='3d', periods=5)

df & periods

       Dates
0 2022-07-20
1 2022-07-21
2 2022-07-22
3 2022-07-23
4 2022-07-24
5 2022-07-25
6 2022-07-26
7 2022-07-27
8 2022-07-28
9 2022-07-29

PeriodIndex(['2022-07-19', '2022-07-22', '2022-07-25', '2022-07-28',
             '2022-07-31'],
            dtype='period[3D]')

Doing:

df['interval'] = pd.arrays.IntervalArray(pd.cut(df.Dates, bins=periods.to_timestamp(how='s')))
print(df)

Output:

       Dates                  interval
0 2022-07-20  (2022-07-19, 2022-07-22]
1 2022-07-21  (2022-07-19, 2022-07-22]
2 2022-07-22  (2022-07-19, 2022-07-22]
3 2022-07-23  (2022-07-22, 2022-07-25]
4 2022-07-24  (2022-07-22, 2022-07-25]
5 2022-07-25  (2022-07-22, 2022-07-25]
6 2022-07-26  (2022-07-25, 2022-07-28]
7 2022-07-27  (2022-07-25, 2022-07-28]
8 2022-07-28  (2022-07-25, 2022-07-28]
9 2022-07-29  (2022-07-28, 2022-07-31]

We can get that left value by doing:

df['start'] = df['interval'].array.left
print(df)

       Dates                  interval      start
0 2022-07-20  (2022-07-19, 2022-07-22] 2022-07-19
1 2022-07-21  (2022-07-19, 2022-07-22] 2022-07-19
2 2022-07-22  (2022-07-19, 2022-07-22] 2022-07-19
3 2022-07-23  (2022-07-22, 2022-07-25] 2022-07-22
4 2022-07-24  (2022-07-22, 2022-07-25] 2022-07-22
5 2022-07-25  (2022-07-22, 2022-07-25] 2022-07-22
6 2022-07-26  (2022-07-25, 2022-07-28] 2022-07-25
7 2022-07-27  (2022-07-25, 2022-07-28] 2022-07-25
8 2022-07-28  (2022-07-25, 2022-07-28] 2022-07-25
9 2022-07-29  (2022-07-28, 2022-07-31] 2022-07-28
  • Related