Home > Software design >  rolling regression with a simple apply in pandas
rolling regression with a simple apply in pandas

Time:11-15

Consider this simple example

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
                   'b':[3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})

I am trying to perform a rolling regression of a on b. I am trying to use the simplest pandas tool available: apply. I want to use apply because I want to keep the flexibility of returning any parameter of the regression.

However, the simple code below does not work

df.rolling(10).apply(lambda x: smf.ols('a ~ b', data = x).fit())

  File "<string>", line 1, in <module>

PatsyError: Error evaluating factor: NameError: name 'b' is not defined
    a ~ b
    ^

What is the issue? Thanks!

CodePudding user response:

rolling apply is not capable of interacting with multiple columns simultaneously, nor is it able to produce non-numeric values. We instead need to take advantage of the iterable nature of rolling objects. We also need to account for handling min_periods ourselves, since the iterable rolling object generates all windows results regardless of other rolling arguments.

We can then create some function to produce each row in the results from the regression results to do something like:

def process(x):
    if len(x) >= 10:
        reg = smf.ols('a ~ b', data=x).fit()
        print(reg.params)
        return [
            # b from params
            reg.params['b'],
            # b from tvalues
            reg.tvalues['b'],
            # Both lower and upper b from conf_int()
            *reg.conf_int().loc['b', :].tolist()
        ]
    # Return NaN in the same dimension as the results
    return [np.nan] * 4


df = df.join(
    # join new DataFrame back to original
    pd.DataFrame(
        (process(x) for x in df.rolling(10)),
        columns=['coef', 't', 'lower', 'upper']
    )
)

df:

    a  b      coef         t     lower     upper
0   1  3       NaN       NaN       NaN       NaN
1   3  5       NaN       NaN       NaN       NaN
2   5  6       NaN       NaN       NaN       NaN
3   7  2       NaN       NaN       NaN       NaN
4   4  4       NaN       NaN       NaN       NaN
5   5  6       NaN       NaN       NaN       NaN
6   6  2       NaN       NaN       NaN       NaN
7   4  5       NaN       NaN       NaN       NaN
8   7  7       NaN       NaN       NaN       NaN
9   8  1 -0.216802 -0.602168 -1.047047  0.613442
10  9  9  0.042781  0.156592 -0.587217  0.672778
11  1  5  0.032086  0.097763 -0.724742  0.788913
12  3  3  0.113475  0.329006 -0.681872  0.908822
13  5  2  0.198582  0.600297 -0.564258  0.961421
14  7  5  0.203540  0.611002 -0.564646  0.971726
15  4  4  0.236599  0.686744 -0.557872  1.031069
16  5  3  0.293651  0.835945 -0.516403  1.103704
17  6  6  0.314286  0.936382 -0.459698  1.088269
18  4  4  0.276316  0.760812 -0.561191  1.113823
19  7  1  0.346491  1.028220 -0.430590  1.123572
20  8  1 -0.492424 -1.234601 -1.412181  0.427332
21  9  9  0.235075  0.879433 -0.381326  0.851476

Setup:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
    'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9]
})

CodePudding user response:

Rolling.apply applies the rolling operation to each column separately (Related question)

Following Following user3226167's answer of this thread, it seems that easiest way to accomplish what you want is to use RollingOLS.from_formula from statsmodels.regression.rolling.

from statsmodels.regression.rolling import RollingOLS

df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
                   'b':[3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})

model = RollingOLS.from_formula('a ~ b', data = df, window=10)
reg_obj = model.fit()

where reg_obj is a RollingRegressionResults which holds many information about the regression (see the different attributes in the docs)

>>> type(reg_obj)
<class 'statsmodels.regression.rolling.RollingRegressionResults'>

>>> reg_obj.params

    Intercept         b
0         NaN       NaN
1         NaN       NaN
2         NaN       NaN
3         NaN       NaN
4         NaN       NaN
5         NaN       NaN
6         NaN       NaN
7         NaN       NaN
8         NaN       NaN
9    5.888889 -0.216802
10   5.598930  0.042781
11   5.449198  0.032086
12   4.900709  0.113475
13   4.326241  0.198582
14   4.584071  0.203540
15   4.382625  0.236599
16   4.007937  0.293651
17   4.085714  0.314286
18   4.039474  0.276316
19   3.644737  0.346491
20   6.674242 -0.492424
21   4.906716  0.235075

# add the params to df 
>>> df = df.join(reg_obj.params.rename(columns={'b':'coeff'}))
>>> df

    a  b  Intercept     coeff
0   1  3        NaN       NaN
1   3  5        NaN       NaN
2   5  6        NaN       NaN
3   7  2        NaN       NaN
4   4  4        NaN       NaN
5   5  6        NaN       NaN
6   6  2        NaN       NaN
7   4  5        NaN       NaN
8   7  7        NaN       NaN
9   8  1   5.888889 -0.216802
10  9  9   5.598930  0.042781
11  1  5   5.449198  0.032086
12  3  3   4.900709  0.113475
13  5  2   4.326241  0.198582
14  7  5   4.584071  0.203540
15  4  4   4.382625  0.236599
16  5  3   4.007937  0.293651
17  6  6   4.085714  0.314286
18  4  4   4.039474  0.276316
19  7  1   3.644737  0.346491
20  8  1   6.674242 -0.492424
21  9  9   4.906716  0.235075
  • Related