Improving Weighted Moving Average Performance

Time:11-22

I have been playing around with a pandas data frame with 414,000 rows.

Built into pandas is an exponential moving average computed by:

series.ewm(span=period).mean()

The above executes in < 0.3 seconds. However, I want to use a weighted moving average (which applies a linear weighting to each element). I came across the following function:

def WMA(self, s, period):
    return s.rolling(period).apply(lambda x: ((np.arange(period) + 1)*x).sum()/(np.arange(period) + 1).sum(), raw=True)

The above function took 27 seconds to execute. I noticed the result of arange could be computed once and reused, and produced the following:

def WMA(self, s, period):
    weights = np.arange(period) + 1
    weights_sum = weights.sum()
    return s.rolling(period).apply(lambda x: (weights*x).sum()/weights_sum, raw=True)

The above function took 11 seconds, which is a noticeable improvement.
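
To make the linear weighting concrete, here is a minimal sketch on a single made-up window of three values (the data is purely illustrative). The oldest value gets weight 1 and the newest gets weight 3, which is what the lambda above computes for each rolling window.

```python
import numpy as np

# Hypothetical window of 3 values, oldest first (illustrative data only)
window = np.array([10.0, 20.0, 30.0])
period = len(window)

# Linear weights 1, 2, ..., period: the most recent value weighs the most
weights = np.arange(period) + 1

# (1*10 + 2*20 + 3*30) / (1 + 2 + 3) = 140 / 6
wma = (weights * window).sum() / weights.sum()
print(wma)
```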

What I'm trying to figure out is whether there is some way to optimize this further (ideally replacing the apply call), but I'm genuinely not sure how to go about it.

Any ideas would be appreciated!

CodePudding user response:

You can use NumPy's sliding_window_view function (see the NumPy docs); it then looks like this:

import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.random.randint(0, 10, size=(500_000))) # x=500_000

p = 50
w = np.arange(p) + 1
w_s = w.sum()

########## for comparison purpose ##########
# 1.47 s ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
r = d1.rolling(p).apply(lambda x: (w*x).sum()/w_s, raw=True)

# 62.1 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
swv = np.lib.stride_tricks.sliding_window_view(d1.values.flatten(), window_shape=p)
sw = (swv*w).sum(axis=1) / w_s

########## for comparison purpose ##########
np.array_equal(r.iloc[p - 1:].values.flatten(), sw) # True

So, an overall speedup of ~23.67x. However, you need to adjust the result to your desired shape afterwards: sw starts at index 0 and has x - p + 1 entries, whereas r has x entries, with its first valid value at position p - 1 and the p - 1 values before it set to NaN.
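
The shape difference is easiest to see on a tiny input. This is a minimal sketch with six made-up values and p = 3; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

values = np.arange(6, dtype=float)   # x = 6 illustrative values
p = 3

# One row per full window, so the shape is (x - p + 1, p)
windows = np.lib.stride_tricks.sliding_window_view(values, window_shape=p)
print(windows.shape)                 # (4, 3)

# rolling keeps all x rows and fills the first p - 1 with NaN
rolled = pd.Series(values).rolling(p).sum()
print(rolled.isna().sum(), len(rolled))   # 2 6
```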

CodePudding user response:

Skeletor above was right on the money, and I adapted it slightly to handle the issues with NaN:

    # THIS USES LOWER LEVEL NUMPY TO GREATLY SPEED IT UP!
    def WMA(self, s, period):
        w = np.arange(period) + 1
        w_s = w.sum()
        swv = np.lib.stride_tricks.sliding_window_view(s.values.flatten(), window_shape=period)
        sw = (swv * w).sum(axis=1) / w_s

        # Need to now return it as a normal series
        sw = np.concatenate((np.full(period - 1, np.nan), sw))
        return pd.Series(sw)

This dropped it from 11 seconds down to 1.5 seconds, which is much better!
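
As a further variation that is not from the answers above, the same weighted sums can be sketched with np.correlate, which avoids building the intermediate windows-times-weights array. The helper name wma_correlate is hypothetical, the sketch assumes len(s) >= period, and it has not been timed on the original data. It also keeps the input's index, which pd.Series(sw) on its own does not.

```python
import numpy as np
import pandas as pd

def wma_correlate(s: pd.Series, period: int) -> pd.Series:
    """Linearly weighted moving average (hypothetical helper, assumes len(s) >= period)."""
    w = np.arange(period) + 1            # weights 1..period, newest value heaviest
    values = s.to_numpy(dtype=float)
    # 'valid' mode yields one weighted sum per full window: len(s) - period + 1 values
    sums = np.correlate(values, w, mode="valid")
    # Pad the front with NaN so the result lines up with s.rolling(period)
    out = np.concatenate((np.full(period - 1, np.nan), sums / w.sum()))
    return pd.Series(out, index=s.index)

# Quick check against the rolling/apply version on random illustrative data
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 10, size=1_000).astype(float))
w = np.arange(5) + 1
expected = s.rolling(5).apply(lambda x: (w * x).sum() / w.sum(), raw=True)
print(np.allclose(wma_correlate(s, 5), expected, equal_nan=True))
```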
