Home > Software design >  Pandas Rolling Filter
Pandas Rolling Filter

Time:11-25

I have a df that looks like this for example:

df = pd.DataFrame({'Change': [-0.5, -0.25, 1, 1.5, 0.25,-0.75, -1, 2, .... })

What I'm looking to do is to filter this df['Change'] column for above and below 0, calculate the mean for both, and then add the two results together. But I want to be able to do this on a rolling basis over a window of 100.

I can do this in a very blunt way by just looping through the df as I do in the below:

df['diff'] = np.NAN
for i in range(0, len(df)):
    if i > 100:
        X = df.iloc[i - 100:i]
        ups = X[X['change'] > 0]['change'].mean()
        downs = X[X['change'] < 0]['change'].mean()
        df['diff'].iloc[i] = (ups   downs)

However, it's quite slow and was wondering if there was a more efficient way of doing this?

CodePudding user response:

You can use df.rolling(window=100).apply(foo) where foo is your function. for example:

import pandas as pd
import numpy as np
from time import monotonic

df = pd.DataFrame({'Change': np.random.randint(-5, 5, size=(10000,))})

def foo(x):
    ups = x[x > 0].mean()
    downs = x[x < 0].mean()
    return ups   downs


start = monotonic()
res = df.Change.rolling(100).apply(foo)
print(f'time took: {monotonic() - start:.3f} s.')
Out:
    time took: 2.544 s.

With raw=True keyword argument it will be faster

start = monotonic()
res = df.Change.rolling(100).apply(foo, raw=True)
print(f'with raw=True time took: {monotonic() - start:.3f} s.')

Out:
     with raw=True time took: 0.107 s.

Also yoy can parallelize apply method with parallel-pandas library. It's very simple:

import pandas as pd
import numpy as np
from time import monotonic
from parallel_pandas import ParallelPandas

ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)

df = pd.DataFrame({'Change': np.random.randint(-5, 5, size=(10000,))})


def foo(x):
    ups = x[x > 0].mean()
    downs = x[x < 0].mean()
    return ups   downs
#p_apply - is parallel analogue of apply method
start = monotonic()
res = df.Change.rolling(100).p_apply(foo, raw=True)
print(f'parallel with raw=True time took: {monotonic() - start:.3f} s.')
Out:
    parallel with raw=True time took: 0.047 s.

Total speedup is 2.5 / 0.05 = 50!

  • Related