Standard Deviation Per Group With Condition-CodePudding

I have a dataset that looks as follows:

id year value
1  2000  1
1  2002  3
2  1999  10
2  2000  8
2  2001  9
3  2000  12
3  2001  5
3  2003  6
3  2004  7
3  2005  13
3  2006  2
3  2007  1
4  2005  5
4  2006  5
4  2007  5
4  2008  4
4  2009  7

I now want to add a column that indicates the recursively calculated standard deviation per ID, including at least 5 prior years of data (including the respective current year; the fact that years might be missing is ignored), i.e.:

id year value SD
1  2000  1    NA
1  2002  3    NA
2  1999  10   NA
2  2000  8    NA
2  2001  9    NA
3  2000  12   NA
3  2001  5    NA
3  2003  6    NA
3  2004  7    NA
3  2005  13   3.26
3  2006  2    3.86
3  2007  1    4.24
4  2005  5    NA
4  2006  5    NA
4  2007  5    NA
4  2008  4    NA
4  2009  7    0.98

The solution to this can rely on the fact that the data is ordered by id and year. However I also appreciate solutions that do not explicity rely on ordering. Any help is appreciated, thanks.

CodePudding user response：

You can use an exanding window of minimum size 5 on your grouped and ordered data:

df['SD'] = df.groupby('id').value.apply(lambda x: x.expanding(5).std(ddof=0))

Result:

    id  year  value        SD
0    1  2000      1       NaN
1    1  2002      3       NaN
2    2  1999     10       NaN
3    2  2000      8       NaN
4    2  2001      9       NaN
5    3  2000     12       NaN
6    3  2001      5       NaN
7    3  2003      6       NaN
8    3  2004      7       NaN
9    3  2005     13  3.261901
10   3  2006      2  3.862210
11   3  2007      1  4.237828
12   4  2005      5       NaN
13   4  2006      5       NaN
14   4  2007      5       NaN
15   4  2008      4       NaN
16   4  2009      7  0.979796

(by default, pandas uses ddf (delta degrees of freedom) of 1, unlike numpy which uses 0 by default)

CodePudding user response：

Use DataFrame.pivot with DataFrame.reindex for all years and then count DataFrame.expanding with std, last resahpe back by DataFrame.stack and add new column in DataFrame.join:

df1 = df.pivot('id','year','value')
s = (df1.reindex(range(df1.columns.min(), df1.columns.max()   1), axis=1)
        .expanding(5, axis=1).std(ddof=0)
        .stack())

df1 = df.join(s.rename('SD'), on=['id','year'])
print (df1)
    id  year  value        SD
0    1  2000      1       NaN
1    1  2002      3       NaN
2    2  1999     10       NaN
3    2  2000      8       NaN
4    2  2001      9       NaN
5    3  2000     12       NaN
6    3  2001      5       NaN
7    3  2003      6       NaN
8    3  2004      7       NaN
9    3  2005     13  3.261901
10   3  2006      2  3.862210
11   3  2007      1  4.237828
12   4  2005      5       NaN
13   4  2006      5       NaN
14   4  2007      5       NaN
15   4  2008      4       NaN
16   4  2009      7  0.979796