Time periods to evenly-spaced time series

I need to prepare data containing time periods for machine learning so that the resulting timestamps are equally spaced. For example, with 3-hour spacing I would like the timestamps 00:00, 03:00, 06:00, 09:00, 12:00, 15:00, and so on. Take this input:

import pandas as pd

df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
# Convert the string columns to datetime
for col in ['Start', 'End']:
    df[col] = pd.to_datetime(df[col])
print(df)

Output:

                Start                 End Val
0 2022-07-01 11:30:00 2022-07-01 18:30:00   a
1 2022-07-01 22:30:00 2022-07-02 03:30:00   b

I tried to generate the timestamps like this:

df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
df['Datetime'] = df['Datetime'].dt.round('H')
print(df[['Datetime', 'Val']])

Output:

             Datetime Val
0 2022-07-01 12:00:00   a
0 2022-07-01 14:00:00   a
0 2022-07-01 18:00:00   a
1 2022-07-01 22:00:00   b
1 2022-07-02 02:00:00   b

As you can see, those timestamps are not equally spaced. My expected result:

            Datetime  Val
4 2022-07-01 12:00:00    a
5 2022-07-01 15:00:00    a
6 2022-07-01 18:00:00    a
7 2022-07-01 21:00:00  NaN
8 2022-07-02 00:00:00    b
9 2022-07-02 03:00:00    b

CodePudding user response：

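We can use pd.merge_asof to match each point of a regular 3-hour grid to the most recent generated timestamp, within a tolerance of 3 hours: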

df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
date_min, date_max = df['Datetime'].dt.date.min(), df['Datetime'].dt.date.max() + pd.Timedelta('1D')
time_range = pd.date_range(date_min, date_max, freq='3H').to_series(name='Datetime')
df = pd.merge_asof(time_range, df, tolerance=pd.Timedelta('3H'))
df.truncate(df['Val'].first_valid_index(), df['Val'].last_valid_index())

Output:

            Datetime  Val
4 2022-07-01 12:00:00    a
5 2022-07-01 15:00:00    a
6 2022-07-01 18:00:00    a
7 2022-07-01 21:00:00  NaN
8 2022-07-02 00:00:00    b
9 2022-07-02 03:00:00    b
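
To see what the tolerance does in isolation, here is a minimal, self-contained sketch. It is not the code above: it builds the per-period timestamps with pd.concat instead of explode, passes on='Datetime' explicitly, and hard-codes the grid bounds for brevity. The names periods, obs, grid and aligned are only illustrative.

```python
import pandas as pd

# Example periods from the question: (start, end, value).
periods = [
    ("2022-07-01 11:30", "2022-07-01 18:30", "a"),
    ("2022-07-01 22:30", "2022-07-02 03:30", "b"),
]

# One row per generated timestamp inside each period (3-hour steps from its start).
obs = pd.concat(
    [pd.DataFrame({"Datetime": pd.date_range(s, e, freq="3h"), "Val": v})
     for s, e, v in periods],
    ignore_index=True,
)

# Regular grid on the 3-hour marks (bounds hard-coded for this example).
grid = pd.DataFrame(
    {"Datetime": pd.date_range("2022-07-01 12:00", "2022-07-02 03:00", freq="3h")}
)

# The default direction="backward" takes the last row of obs at or before each
# grid point; tolerance drops matches that lie more than 3 hours back.
aligned = pd.merge_asof(grid, obs, on="Datetime", tolerance=pd.Timedelta(hours=3))
print(aligned)
```

The 21:00 grid point stays empty because its closest earlier timestamp (17:30) is 3.5 hours away, which is outside the tolerance.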

CodePudding user response：

Annotated code

# Find min and max date of interval
s, e = df['Start'].min(), df['End'].max()

# Create a date range with freq=3H
# Create an output dataframe by assigning the date range to the datetime column
df_out = pd.DataFrame({'datetime': pd.date_range(s.ceil('H'), e, freq='3H')})

# Create interval index from start and end date
idx = pd.IntervalIndex.from_arrays(df['Start'], df['End'], closed='both')

# Set the index of df to interval index and select Val column to create mapping series
# Then use this mapping series to substitute values in output dataframe
df_out['Val'] = df_out['datetime'].map(df.set_index(idx)['Val'])

Result

             datetime  Val
0 2022-07-01 12:00:00    a
1 2022-07-01 15:00:00    a
2 2022-07-01 18:00:00    a
3 2022-07-01 21:00:00  NaN
4 2022-07-02 00:00:00    b
5 2022-07-02 03:00:00    b
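
The lookup that drives the mapping can be sketched on its own. This is a small illustration, assuming non-overlapping periods as in the question; the variable names (starts, ends, points, positions, mapped) are made up for the example.

```python
import pandas as pd

starts = pd.to_datetime(["2022-07-01 11:30", "2022-07-01 22:30"])
ends = pd.to_datetime(["2022-07-01 18:30", "2022-07-02 03:30"])

# One closed interval per period.
idx = pd.IntervalIndex.from_arrays(starts, ends, closed="both")

points = pd.to_datetime(["2022-07-01 12:00", "2022-07-01 21:00", "2022-07-02 00:00"])

# Position of the interval containing each point, or -1 if no interval contains it.
positions = idx.get_indexer(points)
print(positions)

# Mapping through a Series indexed by the intervals uses the same lookup,
# so points outside every interval become NaN.
vals = pd.Series(["a", "b"], index=idx)
mapped = pd.Series(points).map(vals)
print(mapped)
```

Here 21:00 falls between the two periods, so it gets position -1 and a missing value after mapping.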

CodePudding user response：

For this problem I really like to use pd.DataFrame.reindex. In particular, you can specify method='nearest' and tolerance='90m' to ensure you leave gaps where you need them.

You can create your regularly spaced time series using pd.date_range, with the start and end arguments computed by the .floor('3H') and .ceil('3H') methods, respectively.

import pandas as pd
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = df[col].apply(pd.to_datetime)

df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)

result = pd.DataFrame()

for name, group in df.groupby('Val'):
    group = group.set_index('Datetime')
    group.index = group.index.ceil('1H')
    idx = pd.date_range(group.index.min().floor('3H'), group.index.max().ceil('3H'), freq='3H')
    group = group.reindex(idx, tolerance = '90m', method='nearest')
    result = pd.concat([result, group])

result = result.sort_index()

which returns:

                     Val
2022-07-01 12:00:00    a
2022-07-01 15:00:00    a
2022-07-01 18:00:00    a
2022-07-01 21:00:00  NaN
2022-07-02 00:00:00    b
2022-07-02 03:00:00    b
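
As a stripped-down illustration of how method='nearest' and the tolerance interact, here is a sketch on a single small Series. The data is made up for the example and is not the question's data; s, grid and out are illustrative names.

```python
import pandas as pd

# Irregular observations (illustrative data); the index must be sorted.
s = pd.Series(
    ["a", "a", "b"],
    index=pd.to_datetime(
        ["2022-07-01 12:00", "2022-07-01 15:00", "2022-07-01 23:00"]
    ),
)

# Regular 3-hour grid covering the observations.
grid = pd.date_range(s.index.min().floor("3h"), s.index.max().ceil("3h"), freq="3h")

# Each grid point takes the nearest observation, but only if it lies
# within 90 minutes; otherwise the grid point is left empty (NaN).
out = s.reindex(grid, method="nearest", tolerance=pd.Timedelta(minutes=90))
print(out)
```

The 18:00 and 21:00 grid points have no observation within 90 minutes, so they stay empty, while 00:00 picks up the value observed at 23:00.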

CodePudding user response：

Another method would be to simply add the time step to the start time in a loop.

from datetime import datetime, timedelta

import pandas as pd

# Take the current time as the start time, just as an example
start = datetime.now()
# Take the start time + 15 hours as the end time, just as an example
end = start + timedelta(hours=15)

times = []
while end > start:
    # Advance by 3 hours; the first stored value is start + 3 hours
    start = start + timedelta(hours=3)
    print(start)
    times.append(start)

df = pd.DataFrame({'Times': times})

Output

Times
0   2022-07-15 01:28:56.912013
1   2022-07-15 04:28:56.912013
2   2022-07-15 07:28:56.912013
3   2022-07-15 10:28:56.912013
4   2022-07-15 13:28:56.912013
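
Note that the loop above keeps the minutes and seconds of the start time. If the timestamps should instead land on the 3-hour marks (00:00, 03:00, ...), one possible variation is to move the start up to the next mark first. The sketch below uses only the standard library; aligned_range is a hypothetical helper, and it assumes the step divides 24 hours evenly.

```python
from datetime import datetime, timedelta


def aligned_range(start, end, step_hours=3):
    """Return the timestamps in [start, end] that fall on multiples of
    step_hours past midnight (hypothetical helper; assumes step_hours
    divides 24 evenly)."""
    step = timedelta(hours=step_hours)
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Ceiling division: whole steps needed to reach or pass `start`.
    n_steps = -(-(start - midnight) // step)
    current = midnight + n_steps * step
    times = []
    while current <= end:
        times.append(current)
        current += step
    return times


times = aligned_range(datetime(2022, 7, 1, 11, 30), datetime(2022, 7, 1, 18, 30))
print(times)
```

For the first period in the question this gives 12:00, 15:00 and 18:00; the list can then be placed in a DataFrame column as in the code above.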