What is the fastest way to repeatedly resample timeseries data of the same shape?
Problem: I have 30 years of an hourly timeseries which I want to resample to calendar years (resample rule 'AS'). I need to find both the mean and the sum for each year. There are no missing hours. I then need to do this over 10,000 times. For the script I am writing, this resampling step takes by far the most time and is the limiting factor for optimising the run time. Because of leap years, one cannot resample by a consistent 8760 hours, as every fourth year has 8784 hours.
Example code:
import pandas as pd
import numpy as np
import time
hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100):  # setting as 100 so it runs faster, this is 10,000 in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean()  # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
>>> Ran in 3.0516 seconds
Solutions I have explored:
- I have made some speed improvements by aggregating multiple timeseries into a single dataframe and resampling them at the same time (see the short sketch after this list); however, due to the restrictions of the set-up of the wider problem I am solving, I am limited to having 10 timeseries in each dataframe. Therefore, the problem still stands: is there a way to dramatically speed up resampling of timeseries data if you know the shape of the array will always be the same?
- I also looked into using numba, but this does not make pandas functions quicker.
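For context, a minimal sketch of the batching approach mentioned above, assuming the individual series share the same DatetimeIndex (the ten series here are hypothetical placeholders for the real data):
import pandas as pd
import numpy as np

idx = pd.date_range(pd.Timestamp(2020, 1, 1, 0, 0),
                    pd.Timestamp(2050, 12, 31, 23, 30),
                    freq="60min")
# hypothetical batch of 10 independent hourly series sharing one index
series_list = [pd.Series(np.random.rand(len(idx)), index=idx, name=f"ts_{i}") for i in range(10)]
batched = pd.concat(series_list, axis=1)       # one DataFrame with up to 10 columns
yearly_mean = batched.resample('AS').mean()    # a single resample covers all columns
yearly_sum = batched.resample('AS').sum()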
Possible solutions which sound reasonable but I cannot find after researching:
- resample a 3D array of timeseries data with numpy
- Cache the index that is being resampled and then somehow make every resample after the first one much faster (a rough sketch combining both ideas follows this list)
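A rough sketch of what these two ideas could look like combined, assuming the DatetimeIndex is sorted so that each calendar year is a contiguous block of rows (the 10-column values array is a hypothetical stand-in for the grouped timeseries):
import pandas as pd
import numpy as np

idx = pd.date_range(pd.Timestamp(2020, 1, 1, 0, 0),
                    pd.Timestamp(2050, 12, 31, 23, 30),
                    freq="60min")
values = np.random.rand(len(idx), 10)  # hypothetical stack of 10 hourly series as columns

# cache the start position of each calendar year once (the index is sorted,
# so every year occupies a contiguous block of rows)
years = idx.year.to_numpy()
starts = np.flatnonzero(np.r_[True, years[1:] != years[:-1]])
counts = np.diff(np.r_[starts, len(years)])

# reuse the cached boundaries for every subsequent "resample"
yearly_sum = np.add.reduceat(values, starts, axis=0)   # shape (n_years, 10)
yearly_mean = yearly_sum / counts[:, None]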
Thanks for your help :)
CodePudding user response:
As I wrote in a comment, I prepared indices for each year once and used them to calculate the sum for each year much faster.
Next, I removed the unnecessary recomputation of the sum inside mean, instead calculating the mean as sum / length_of_indices for each year.
For N=1000 it's ~9x faster.
import pandas as pd
import numpy as np
import time
hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100):  # setting as 100 so it runs faster, this is 10,000 in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean()  # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
start_time = time.perf_counter()
# cache the positions of each calendar year once, outside the loop
events_years = hourly_timeseries.index.year
unique_years = np.sort(np.unique(events_years))
indices_per_year = [np.where(events_years == year)[0] for year in unique_years]
len_indices_per_year = np.array([len(year_indices) for year_indices in indices_per_year])
for num in range(100):  # setting as 100 so it runs faster, this is 10,000 in practice
    temp = hourly_timeseries.values
    # reuse the cached per-year indices instead of resampling from scratch
    yearly_timeseries_sum2 = np.array([np.sum(temp[year_indices]) for year_indices in indices_per_year])
    yearly_timeseries_mean2 = yearly_timeseries_sum2 / len_indices_per_year
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
assert np.allclose(yearly_timeseries_sum.values.flatten(), yearly_timeseries_sum2)
assert np.allclose(yearly_timeseries_mean.values.flatten(), yearly_timeseries_mean2)
Ran in 0.9950 seconds
Ran in 0.1386 seconds
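One caveat, assuming the same approach is applied to a frame with several columns: np.sum(temp[year_indices]) collapses all columns into a single scalar, so for the multi-column case you would keep one sum per column by passing axis=0, for example:
yearly_timeseries_sum2 = np.array([np.sum(temp[year_indices], axis=0) for year_indices in indices_per_year])
yearly_timeseries_mean2 = yearly_timeseries_sum2 / len_indices_per_year[:, None]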