I have been playing around with a pandas data frame with 414,000 rows.
Built into pandas is an exponential moving average computed by:
series.ewm(span=period).mean()
The above executes in < 0.3 seconds. However, I would like to use a weighted moving average (which applies a linear weighting to each element). I came across the following function:
def WMA(self, s, period):
    # Weights 1..period are rebuilt on every window, which is what makes this slow
    return s.rolling(period).apply(lambda x: ((np.arange(period) + 1) * x).sum() / (np.arange(period) + 1).sum(), raw=True)
The above function took 27 seconds to execute. I noticed that the result of the arange call could be cached, which produced the following:
def WMA(self, s, period):
    weights = np.arange(period) + 1
    weights_sum = weights.sum()
    return s.rolling(period).apply(lambda x: (weights * x).sum() / weights_sum, raw=True)
The above function took 11 seconds, which is a noticeable improvement.
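For reference, the linear weighting used by both versions above can be checked by hand on a single window. This is only a minimal sketch: the three sample values are made up for illustration, and the window is assumed to be ordered oldest to newest, as pandas passes it to apply.

```python
import numpy as np

period = 3
window = np.array([10.0, 20.0, 30.0])  # made-up values, oldest first, newest last

weights = np.arange(period) + 1        # [1, 2, 3]: the newest value gets the largest weight
wma = (weights * window).sum() / weights.sum()

print(wma)  # (1*10 + 2*20 + 3*30) / (1 + 2 + 3)
```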
What I'm trying to figure out is whether there is some way to optimize this further (ideally by replacing the apply call), but I'm genuinely not sure how to go about it.
Any ideas would be appreciated!
CodePudding user response:
You can use NumPy's sliding_window_view function (see the docs); it then looks like this:
import numpy as np
import pandas as pd
d1 = pd.DataFrame(np.random.randint(0, 10, size=(500_000))) # x=500_000
p = 50
w = np.arange(p) + 1
w_s = w.sum()
########## for comparison purpose ##########
# 1.47 s ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
r = d1.rolling(p).apply(lambda x: (w*x).sum()/w_s, raw=True)
# 62.1 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
swv = np.lib.stride_tricks.sliding_window_view(d1.values.flatten(), window_shape=p)
sw = (swv*w).sum(axis=1) / w_s
########## for comparison purpose ##########
np.array_equal(r.iloc[p - 1:].values.flatten(), sw) # True
So, an overall speedup of ~23.67x. However, you need to adjust the result to your desired shape afterwards: sw starts at index 0 and has a shape of x - p + 1, whereas r has a shape of x, with its first p - 1 values being NaN and its first valid value at index p - 1.
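The shape difference described above can be seen on a tiny array. The following is a rough sketch only: the sizes x = 10 and p = 4 are arbitrary, and a plain mean stands in for the weighted sum to keep it short.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

x, p = 10, 4                                # arbitrary illustrative sizes
values = np.arange(x, dtype=float)

windows = sliding_window_view(values, window_shape=p)
rolled = pd.Series(values).rolling(p).mean()

print(windows.shape)        # one row per complete window: (x - p + 1, p)
print(rolled.isna().sum())  # leading positions without a complete window: p - 1

# Left-pad the window-based result with NaN so it lines up with the rolling output.
aligned = np.concatenate((np.full(p - 1, np.nan), windows.mean(axis=1)))
```

After padding, aligned has the same length as rolled, and the two can be compared position by position.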
CodePudding user response:
Skeletor above was right on the money, and I adapted it slightly to handle the issues with NaN:
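# THIS USES LOWER LEVEL NUMPY TO GREATLY SPEED IT UP!
from numpy.lib.stride_tricks import sliding_window_view

def WMA(self, s, period):
    w = np.arange(period) + 1
    w_s = w.sum()
    swv = sliding_window_view(s.values.flatten(), window_shape=period)
    sw = (swv * w).sum(axis=1) / w_s
    # Pad the first period - 1 positions with NaN so the length matches the input
    sw = np.concatenate((np.full(period - 1, np.nan), sw))
    # Note: this returns a Series with a fresh default index, not the index of s
    return pd.Series(sw)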
This dropped it from 11 seconds down to 1.5 seconds, which is much better!
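As a possible variation that is not taken from the answers above, the same weighted sum can be written as a convolution, which avoids building the intermediate (windows x period) product array. This is only a sketch under assumptions: wma_convolve is a hypothetical helper name, the series is assumed to have at least period elements, and it has not been timed against the versions above. It also keeps the index of the input series.

```python
import numpy as np
import pandas as pd

def wma_convolve(s: pd.Series, period: int) -> pd.Series:
    """Linearly weighted moving average via np.convolve (hypothetical helper)."""
    w = np.arange(period) + 1
    values = s.to_numpy(dtype=float)
    # np.convolve flips its second argument, so the weights are passed reversed
    # to keep the largest weight on the newest value of each window.
    # Assumes len(values) >= period.
    valid = np.convolve(values, w[::-1], mode="valid") / w.sum()
    # Pad the first period - 1 positions with NaN, as rolling() would.
    out = np.concatenate((np.full(period - 1, np.nan), valid))
    return pd.Series(out, index=s.index)
```

A quick way to gain confidence in it is to compare its output with the rolling(...).apply(...) version on a small random series using np.allclose(..., equal_nan=True).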