What is the fastest way to repeatedly resample timeseries data of the same shape?
Problem: I have 30 years of an hourly timeseries which I want to resample to calendar years (resample rule 'AS'). I need to find both the mean and the sum for each year. There are no missing hours. I then need to do this over 10,000 times. For the script I am writing, this resampling step takes by far the most time and is the limiting factor for optimising the run time. Because of leap years, one cannot resample by a consistent 8760 hours, as every fourth year has 8784 hours.
Example code:
import pandas as pd
import numpy as np
import time
hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100):  # setting as 100 so it runs faster, this is 10,000 in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean()  # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
>>> Ran in 3.0516 seconds
Solutions I have explored:
- I have made some speed improvements by aggregating multiple timeseries into a single dataframe and resampling them at the same time (see the short sketch after this list); however, due to the restrictions of the set-up of the wider problem I am solving, I am limited to having 10 timeseries in each dataframe. Therefore, the problem still stands: is there a way to dramatically speed up resampling of timeseries data if you know the shape of the array will always be the same?
- I also looked into using numba, but this does not make pandas functions quicker.
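For context, a minimal sketch of the batching approach mentioned above, assuming the individual series share the same DatetimeIndex (the ten series here are hypothetical placeholders for the real data):
import pandas as pd
import numpy as np

idx = pd.date_range(pd.Timestamp(2020, 1, 1, 0, 0),
                    pd.Timestamp(2050, 12, 31, 23, 30),
                    freq="60min")
# hypothetical batch of 10 independent hourly series sharing one index
series_list = [pd.Series(np.random.rand(len(idx)), index=idx, name=f"ts_{i}") for i in range(10)]
batched = pd.concat(series_list, axis=1)       # one DataFrame with up to 10 columns
yearly_mean = batched.resample('AS').mean()    # a single resample covers all columns
yearly_sum = batched.resample('AS').sum()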
Possible solutions which sound reasonable but I cannot find after researching:
- resample a 3D array of timeseries data with numpy
- Cache the index that is being resampled and then somehow make every resample after the first one much faster (a rough sketch combining both ideas follows this list)
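A rough sketch of what these two ideas could look like combined, assuming the DatetimeIndex is sorted so that each calendar year is a contiguous block of rows (the 10-column values array is a hypothetical stand-in for the grouped timeseries):
import pandas as pd
import numpy as np

idx = pd.date_range(pd.Timestamp(2020, 1, 1, 0, 0),
                    pd.Timestamp(2050, 12, 31, 23, 30),
                    freq="60min")
values = np.random.rand(len(idx), 10)  # hypothetical stack of 10 hourly series as columns

# cache the start position of each calendar year once (the index is sorted,
# so every year occupies a contiguous block of rows)
years = idx.year.to_numpy()
starts = np.flatnonzero(np.r_[True, years[1:] != years[:-1]])
counts = np.diff(np.r_[starts, len(years)])

# reuse the cached boundaries for every subsequent "resample"
yearly_sum = np.add.reduceat(values, starts, axis=0)   # shape (n_years, 10)
yearly_mean = yearly_sum / counts[:, None]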
Thanks for your help :)
CodePudding user response:
As I wrote in a comment, I prepared indices for each year once and used them to calculate the sum for each year much faster.
Next, I removed the unnecessary recomputation of the sum inside mean, instead calculating the mean as sum / length_of_indices for each year.
For N=1000 it's ~9x faster.
import pandas as pd
import numpy as np
import time
hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100):  # setting as 100 so it runs faster, this is 10,000 in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean()  # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
start_time = time.perf_counter()
# cache the positions of each calendar year once, outside the loop
events_years = hourly_timeseries.index.year
unique_years = np.sort(np.unique(events_years))
indices_per_year = [np.where(events_years == year)[0] for year in unique_years]
len_indices_per_year = np.array([len(year_indices) for year_indices in indices_per_year])
for num in range(100):  # setting as 100 so it runs faster, this is 10,000 in practice
    temp = hourly_timeseries.values
    # reuse the cached per-year indices instead of resampling from scratch
    yearly_timeseries_sum2 = np.array([np.sum(temp[year_indices]) for year_indices in indices_per_year])
    yearly_timeseries_mean2 = yearly_timeseries_sum2 / len_indices_per_year
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
assert np.allclose(yearly_timeseries_sum.values.flatten(), yearly_timeseries_sum2)
assert np.allclose(yearly_timeseries_mean.values.flatten(), yearly_timeseries_mean2)
Ran in 0.9950 seconds
Ran in 0.1386 seconds
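One caveat, assuming the same approach is applied to a frame with several columns: np.sum(temp[year_indices]) collapses all columns into a single scalar, so for the multi-column case you would keep one sum per column by passing axis=0, for example:
yearly_timeseries_sum2 = np.array([np.sum(temp[year_indices], axis=0) for year_indices in indices_per_year])
yearly_timeseries_mean2 = yearly_timeseries_sum2 / len_indices_per_year[:, None]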