Rolling window produces no effect on dataframe-CodePudding

So I have to perform a rolling window to a set of rows inside a dataframe. The problem is that when I do full_df = full_df.rolling(window=5).mean() the output of full_df.head(2000) shows all NaN values. Does anyone know why this happens? I have to perform a time series exercise with this.

This is the dataset:

CodePudding user response：

I'm not quite sure what you are trying to do. Could you explain in more detail, what the goal of your operation is? I assume you try to build a moinving (rolling) average with a 5 day intervall across each asset and calculate the mean prices for each intervall.

But first, let me answer why you see all the NaNs: What you are doing with this code below, is thare you are just doing the same operation over and over again and the result of it is always NaNs. That is, because you are doing something weird with the dict and the first rows all have NaNs so average will also be NaNs. And since you overwrite the variable full_df by the result of this computation, your dataframe shows only NaNs.

for i in full_df:
    full_df = full_df.rolling(window=5).mean()

Let me explain in more detail. You were (probably) trying to iterate over the dataframe (using a window of 5 days) and compute the mean. The function full_df.rolling(window=5).mean() already does exactly that, and the output is a new dataframe, with the mean of each window over the entire datafrane full_df. By running this function in a loop, without additional indexing you are only running the same function across the entire dataframe over an over again.

Maybe this will get you what you want:

import pandas as pd

df = pd.read_csv("all_stocks_5yr.csv", index_col=[0,6])

means = df.rolling(window=5).mean()

CodePudding user response：

First off, your loop for i in full_df is not doing what you think; instead of running the rolling mean in each row, you're running it over and over again on the whole dataframe, averaging along columns.

If we just do the rolling average once the way you're implemting it:

full_df = full_df.rolling(window=5).mean()
print(full_df)
          0         1         2         3     ...     1255    1256     1257     1258
A          NaN       NaN       NaN       NaN  ...      NaN     NaN      NaN      NaN
AAL        NaN       NaN       NaN       NaN  ...      NaN     NaN      NaN      NaN
AAP        NaN       NaN       NaN       NaN  ...      NaN     NaN      NaN      NaN
AAPL       NaN       NaN       NaN       NaN  ...      NaN     NaN      NaN      NaN
ABBV  48.56684  48.37228  47.95056  48.07312  ...  102.590  98.768  101.212  100.510
...        ...       ...       ...       ...  ...      ...     ...      ...      ...
XYL   45.58400  45.60000  45.74000  45.96200  ...   64.504  61.854   61.596   61.036
YUM   51.14200  51.01800  51.17400  51.28400  ...   66.902  64.420   63.914   63.668
ZBH   48.59000  48.49200  48.57000  48.75000  ...   75.154  73.112   72.704   72.436
ZION  44.84400  44.76600  44.89400  45.08200  ...   73.972  71.734   71.516   71.580
ZTS   45.08600  45.02600  45.27400  45.39200  ...   83.002  80.224   80.000   80.116

[505 rows x 1259 columns]

The first four rows are all NaN because the rolling mean isn't defined for fewer than 5 rows.

If we do it again (making a total of two times):

full_df = full_df.rolling(window=5).mean()
print(full_df.head(9))
           0          1          2     ...      1256      1257      1258
A           NaN        NaN        NaN  ...       NaN       NaN       NaN
AAL         NaN        NaN        NaN  ...       NaN       NaN       NaN
AAP         NaN        NaN        NaN  ...       NaN       NaN       NaN
AAPL        NaN        NaN        NaN  ...       NaN       NaN       NaN
ABBV        NaN        NaN        NaN  ...       NaN       NaN       NaN
ABC         NaN        NaN        NaN  ...       NaN       NaN       NaN
ABT         NaN        NaN        NaN  ...       NaN       NaN       NaN
ACN         NaN        NaN        NaN  ...       NaN       NaN       NaN
ADBE  49.619072  49.471424  49.192048  ...  108.3420  110.4848  110.4976

You can see the first 8 rows are all NaN since the fourth row reaches down to the eighth in the rolling mean. Given the size of your data frame (505 rows) if you ran the rolling mean 127 times, the entire df would be consumed withNaN values, and your for loop is doing it even more times than that, which is why your df is filled with NaN values.

Also, note that you're averaging across different stock tickers, which doesn't make sense. What I believe you want to be doing is averaging the rows, not the columns in which case you simply need to do

full_df = full_df.rolling(axis = 'columns', window=5).mean()
print(full_df)
      0     1     2     3         4        5     ...     1253     1254     1255     1256     1257     1258
A      NaN   NaN   NaN   NaN  44.72600  44.1600  ...   73.926   73.720   73.006   71.744   70.836   69.762
AAL    NaN   NaN   NaN   NaN  14.42600  14.3760  ...   53.142   53.308   53.114   52.530   52.248   51.664
AAP    NaN   NaN   NaN   NaN  78.74000  78.7600  ...  120.742  120.016  118.074  115.468  114.054  112.642
AAPL   NaN   NaN   NaN   NaN  67.32592  66.9025  ...  168.996  168.330  166.128  163.834  163.046  161.468
ABBV   NaN   NaN   NaN   NaN  35.87200  36.1380  ...  116.384  117.992  116.384  113.824  112.888  113.168
...    ...   ...   ...   ...       ...      ...  ...      ...      ...      ...      ...      ...      ...
XYL    NaN   NaN   NaN   NaN  27.84600  28.0840  ...   73.278   73.598   73.848   73.698   73.350   73.256
YUM    NaN   NaN   NaN   NaN  64.58000  64.3180  ...   85.504   85.168   84.454   83.118   82.316   81.424
ZBH    NaN   NaN   NaN   NaN  75.85600  75.8660  ...  126.284  126.974  126.886  126.044  125.316  124.048
ZION   NaN   NaN   NaN   NaN  24.44200  24.4820  ...   53.838   54.230   54.256   53.748   53.466   53.464
ZTS    NaN   NaN   NaN   NaN  33.37400  33.5600  ...   78.720   78.434   77.772   76.702   75.686   75.112

Again, your first four columns are not managed here.

To correct for that, we add one more term:

full_df = full_df.rolling(axis = 'columns', window=5, min_periods = 1).mean()
print(full_df)
         0        1          2        3         4        5     ...     1253     1254     1255     1256     1257     1258
A     45.0800  44.8400  44.766667  44.7625  44.72600  44.1600  ...   73.926   73.720   73.006   71.744   70.836   69.762
AAL   14.7500  14.6050  14.493333  14.5350  14.42600  14.3760  ...   53.142   53.308   53.114   52.530   52.248   51.664
AAP   78.9000  78.6450  78.630000  78.7150  78.74000  78.7600  ...  120.742  120.016  118.074  115.468  114.054  112.642
AAPL  67.8542  68.2078  67.752800  67.4935  67.32592  66.9025  ...  168.996  168.330  166.128  163.834  163.046  161.468
ABBV  36.2500  36.0500  35.840000  35.6975  35.87200  36.1380  ...  116.384  117.992  116.384  113.824  112.888  113.168
...       ...      ...        ...      ...       ...      ...  ...      ...      ...      ...      ...      ...      ...
XYL   27.0900  27.2750  27.500000  27.6900  27.84600  28.0840  ...   73.278   73.598   73.848   73.698   73.350   73.256
YUM   65.3000  64.9250  64.866667  64.7525  64.58000  64.3180  ...   85.504   85.168   84.454   83.118   82.316   81.424
ZBH   75.8500  75.7500  75.646667  75.7350  75.85600  75.8660  ...  126.284  126.974  126.886  126.044  125.316  124.048
ZION  24.1400  24.1750  24.280000  24.3950  24.44200  24.4820  ...   53.838   54.230   54.256   53.748   53.466   53.464
ZTS   33.0500  33.1550  33.350000  33.4000  33.37400  33.5600  ...   78.720   78.434   77.772   76.702   75.686   75.112

In the above data frame the first column is just the value at time 0, the second is the average of times 0 and 1, the third is the average of times 0, 1, and 2, etc. The window size continues growing until you get to your value of window=5, at which point the window moves along with your rolling average. Note that you can also center the rolling mean if you want to rather than have a trailing window. You can see the documentation here.