Home > Blockchain >  How to turn a pandas for loop into a vectorized solution
How to turn a pandas for loop into a vectorized solution

Time:03-02

What I am trying to achieve here is to compute the NetChgAvg for each row, based on a previous row value. Here is the formula:

 NetChgAvg = NetChgAvg[-1]   SF * ( Returns - NetChgAvg[-1] )

and here is a part of the data set:

(The index starts at 15 and the value 15 is computed using the previous n=length entries. I removed those indexes for convenience)

df.head(5)
Out[11]: 
          Date   Close  NetChgAvg  TotChgAvg        RSI  Returns
15  09/22/2021  342.50  -0.789286   1.953571  29.798903     3.39
16  09/23/2021  347.56        NaN        NaN        NaN     5.06
17  09/24/2021  347.87        NaN        NaN        NaN     0.31
18  09/27/2021  348.61        NaN        NaN        NaN     0.74
19  09/28/2021  342.92        NaN        NaN        NaN    -5.69

Currently I'm computing this with the following for loop:

for i in range((length   2), df.index[-1]   1):
    df.loc[i, 'NetChgAvg'] = df.loc[i-1, 'NetChgAvg']   smoothing_factor * (df.loc[i, 'Returns'] - df.loc[i-1, 'NetChgAvg'])
df

Which results in the following data frame:

df.head(5)
Out[11]: 
          Date   Close  NetChgAvg  TotChgAvg        RSI  Returns
15  09/22/2021  342.50  -0.789286   1.953571  29.798903     3.39
16  09/23/2021  347.56  -0.371480        NaN        NaN     5.06
17  09/24/2021  347.87  -0.322802        NaN        NaN     0.31
18  09/27/2021  348.61  -0.246888        NaN        NaN     0.74
19  09/28/2021  342.92  -0.635682        NaN        NaN    -5.69

The values are correct, however this computation is slow. Considering the size of data sets I intend to run through this, I can expect computation time to lead into hours. Is there any way I can vectorize this, since each NetChgAvg is based on the previous value? If not, what alternatives do I have to speed up this for loop (maybe Numba)?

CodePudding user response:

To reference the previous row in a vectorized operation, you can use df.shift. The formula you described initially could be written as:

NetChgAvg_shifted = df['NetChgAvg'].shift(-1)
df['NetChgAvg'] = NetChgAvg_shifted   smoothing_factor * (df['Returns'] - NetChgAvg_shifted)

CodePudding user response:

I found that, given the incremental nature of the calculation, it looks like I can't vectorize. I had luck speeding up the loop by replacing .loc lookups with .at after seeing that it's preferred for scalar lookups. Here's the speed improvements for anyone interested:

%timeit netChgAvg() 19.7 ms ± 41.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit at_netChgAvg() 1.82 ms ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  • Related