I need to prepare data with time periods for machine learning so that the timestamps are equally spaced. For example, with 3-hour spacing I would like the timestamps 00:00, 03:00, 06:00, 09:00, 12:00, 15:00, and so on. Sample data:
import pandas as pd
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
# Parse the string columns into datetimes
for col in ['Start', 'End']:
    df[col] = pd.to_datetime(df[col])
print(df)
Output:
Start End Val
0 2022-07-01 11:30:00 2022-07-01 18:30:00 a
1 2022-07-01 22:30:00 2022-07-02 03:30:00 b
My attempt at generating the timestamps:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
df['Datetime'] = df['Datetime'].dt.round('H')
print(df[['Datetime', 'Val']])
Output:
Datetime Val
0 2022-07-01 12:00:00 a
0 2022-07-01 14:00:00 a
0 2022-07-01 18:00:00 a
1 2022-07-01 22:00:00 b
1 2022-07-02 02:00:00 b
As you can see, those timestamps are not equally spaced. My expected result:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
CodePudding user response:
We can use the function pd.merge_asof:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
date_min = df['Datetime'].dt.date.min()
date_max = df['Datetime'].dt.date.max() + pd.Timedelta('1D')
time_range = pd.date_range(date_min, date_max, freq='3H').to_series(name='Datetime')
df = pd.merge_asof(time_range, df, tolerance=pd.Timedelta('3H'))
df.truncate(df['Val'].first_valid_index(), df['Val'].last_valid_index())
Output:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
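To see what merge_asof does here in isolation, a minimal sketch on toy data may help. The frames obs and grid and their values are made up for illustration and are not taken from the question. With the default direction='backward', each grid point picks up the most recent observation at or before it, and tolerance turns matches older than 3 hours into NaN:

```python
import pandas as pd

step = pd.Timedelta(hours=3)

# Hypothetical irregular observations (illustrative values only).
# Both key columns are cast to the same resolution, since merge_asof
# expects the merge keys to share a dtype.
obs = pd.DataFrame({
    'Datetime': pd.to_datetime(
        ['2022-07-01 11:30', '2022-07-01 14:30', '2022-07-01 23:00']
    ).astype('datetime64[ns]'),
    'Val': ['a', 'a', 'b'],
})

# Regular 3-hour grid; both frames must be sorted by the key column
grid = pd.DataFrame({
    'Datetime': pd.date_range(
        '2022-07-01 12:00', '2022-07-02 00:00', freq=step
    ).astype('datetime64[ns]'),
})

# Backward search: last observation at or before each grid point,
# discarded if it is more than `step` older than the grid point
aligned = pd.merge_asof(grid, obs, on='Datetime', tolerance=step)
print(aligned)
```

In this sketch the grid points 18:00 and 21:00 have no observation within the preceding 3 hours, so their Val should come out as NaN, while 12:00, 15:00 and 00:00 are filled.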
CodePudding user response:
Annotated code
# Find min and max date of interval
s, e = df['Start'].min(), df['End'].max()
# Create a date range with freq=3H, starting at the next full hour,
# and assign it to the datetime column of an output dataframe
df_out = pd.DataFrame({'datetime': pd.date_range(s.ceil('H'), e, freq='3H')})
# Create interval index from start and end date
idx = pd.IntervalIndex.from_arrays(df['Start'], df['End'], closed='both')
# Set the index of df to interval index and select Val column to create mapping series
# Then use this mapping series to substitute values in output dataframe
df_out['Val'] = df_out['datetime'].map(df.set_index(idx)['Val'])
Result
datetime Val
0 2022-07-01 12:00:00 a
1 2022-07-01 15:00:00 a
2 2022-07-01 18:00:00 a
3 2022-07-01 21:00:00 NaN
4 2022-07-02 00:00:00 b
5 2022-07-02 03:00:00 b
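The map call above relies on the IntervalIndex doing a containment lookup for each timestamp. Below is a small sketch of that lookup on its own, using the intervals from the question (the names starts, ends, grid and pos are just for illustration). As far as I know, this kind of lookup needs non-overlapping intervals, which is_overlapping can confirm:

```python
import pandas as pd

starts = pd.to_datetime(['2022-07-01 11:30', '2022-07-01 22:30'])
ends = pd.to_datetime(['2022-07-01 18:30', '2022-07-02 03:30'])

# One closed interval [Start, End] per original row
idx = pd.IntervalIndex.from_arrays(starts, ends, closed='both')

# Regular 3-hour grid spanning both intervals
grid = pd.date_range('2022-07-01 12:00', '2022-07-02 03:00',
                     freq=pd.Timedelta(hours=3))

# Position of the interval containing each grid point, -1 when none does
pos = idx.get_indexer(grid)

print(idx.is_overlapping)
print(pos.tolist())
```

Grid points with position -1 are the ones that end up as NaN after the map step.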
CodePudding user response:
For this problem I really like to use pd.DataFrame.reindex. In particular, you can specify method='nearest' and tolerance='90m' to ensure you leave gaps where you need them.
You can create your regularly spaced time series using pd.date_range, with the start and end arguments computed using the .floor('3H') and .ceil('3H') methods, respectively.
import pandas as pd
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = pd.to_datetime(df[col])
# Expand each Start/End pair into 3-hourly timestamps, one row per timestamp
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
result = pd.DataFrame()
for name, group in df.groupby('Val'):
    group = group.set_index('Datetime')
    # Move each timestamp up to the next full hour
    group.index = group.index.ceil('1H')
    # Regular 3-hour grid covering this group's timestamps
    idx = pd.date_range(group.index.min().floor('3H'), group.index.max().ceil('3H'), freq='3H')
    # Keep a value only where an original timestamp lies within 90 minutes of the grid point
    group = group.reindex(idx, tolerance='90m', method='nearest')
    result = pd.concat([result, group])
result = result.sort_index()
which returns:
Val
2022-07-01 12:00:00 a
2022-07-01 15:00:00 a
2022-07-01 18:00:00 a
2022-07-01 21:00:00 NaN
2022-07-02 00:00:00 b
2022-07-02 03:00:00 b
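A stripped-down sketch of the reindex step, on a made-up two-point series (s, grid and out are illustrative names, not part of the code above), shows how method='nearest' and tolerance interact: a grid point only receives a value when an original timestamp lies within 90 minutes of it.

```python
import pandas as pd

# Hypothetical irregular series; the index must be sorted for method='nearest'
s = pd.Series(
    ['a', 'b'],
    index=pd.to_datetime(['2022-07-01 12:20', '2022-07-01 17:00']),
)

# Regular 3-hour grid: 12:00, 15:00, 18:00
grid = pd.date_range('2022-07-01 12:00', '2022-07-01 18:00',
                     freq=pd.Timedelta(hours=3))

# Take the nearest original value, but only if it is at most 90 minutes away
out = s.reindex(grid, method='nearest', tolerance=pd.Timedelta(minutes=90))
print(out)
```

Here 15:00 is two hours away from its nearest neighbour (17:00), so it should be left as NaN, while 12:00 and 18:00 are filled.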
CodePudding user response:
Another method would be to simply add the time step (in hours) to the start time in a loop.
from datetime import datetime, timedelta
import pandas as pd
# Take the current time as the start time, just for the example
start = datetime.now()
# Take the current time + 15 hours as the end time, just for the example
end = datetime.now() + timedelta(hours=15)
times = []
while end > start:
    # Advance by one 3-hour step and record the new timestamp
    start = start + timedelta(hours=3)
    print(start)
    times.append(start)
df = pd.DataFrame(columns=['Times'])
df['Times'] = times
Output
Times
0 2022-07-15 01:28:56.912013
1 2022-07-15 04:28:56.912013
2 2022-07-15 07:28:56.912013
3 2022-07-15 10:28:56.912013
4 2022-07-15 13:28:56.912013
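Note that the loop above steps from whatever the start time happens to be, so the results land on 01:28, 04:28 and so on rather than on 00:00, 03:00, ... If alignment to the clock matters, one possible sketch is to first round the start up to the next multiple of the step since midnight. Here aligned_times is a hypothetical helper, and it assumes the step divides 24 hours evenly:

```python
from datetime import datetime, timedelta

def aligned_times(start, end, step_hours=3):
    """Return datetimes in [start, end] that fall on multiples of
    step_hours counted from midnight of the start day.

    Assumes step_hours divides 24, so the grid stays aligned across days.
    """
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    step = timedelta(hours=step_hours)
    # Ceiling division: whole steps needed to reach or pass `start`
    n_steps = -((start - midnight) // -step)
    current = midnight + n_steps * step
    times = []
    while current <= end:
        times.append(current)
        current += step
    return times

# Intervals taken from the question
first = aligned_times(datetime(2022, 7, 1, 11, 30), datetime(2022, 7, 1, 18, 30))
second = aligned_times(datetime(2022, 7, 1, 22, 30), datetime(2022, 7, 2, 3, 30))
print(first)
print(second)
```

For the two intervals from the question this should give 12:00, 15:00, 18:00 and then 00:00, 03:00 on the following day.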