I have a dataset that looks as follows:
id year value
1 2000 1
1 2002 3
2 1999 10
2 2000 8
2 2001 9
3 2000 12
3 2001 5
3 2003 6
3 2004 7
3 2005 13
3 2006 2
3 2007 1
4 2005 5
4 2006 5
4 2007 5
4 2008 4
4 2009 7
I now want to add a column that indicates the recursively calculated standard deviation per ID, including at least 5 prior years of data (including the respective current year; the fact that years might be missing is ignored), i.e.:
id year value SD
1 2000 1 NA
1 2002 3 NA
2 1999 10 NA
2 2000 8 NA
2 2001 9 NA
3 2000 12 NA
3 2001 5 NA
3 2003 6 NA
3 2004 7 NA
3 2005 13 3.26
3 2006 2 3.86
3 2007 1 4.24
4 2005 5 NA
4 2006 5 NA
4 2007 5 NA
4 2008 4 NA
4 2009 7 0.98
The solution to this can rely on the fact that the data is ordered by id and year. However I also appreciate solutions that do not explicity rely on ordering. Any help is appreciated, thanks.
CodePudding user response:
You can use an exanding window of minimum size 5
on your grouped and ordered data:
df['SD'] = df.groupby('id').value.apply(lambda x: x.expanding(5).std(ddof=0))
Result:
id year value SD
0 1 2000 1 NaN
1 1 2002 3 NaN
2 2 1999 10 NaN
3 2 2000 8 NaN
4 2 2001 9 NaN
5 3 2000 12 NaN
6 3 2001 5 NaN
7 3 2003 6 NaN
8 3 2004 7 NaN
9 3 2005 13 3.261901
10 3 2006 2 3.862210
11 3 2007 1 4.237828
12 4 2005 5 NaN
13 4 2006 5 NaN
14 4 2007 5 NaN
15 4 2008 4 NaN
16 4 2009 7 0.979796
(by default, pandas uses ddf (delta degrees of freedom) of 1, unlike numpy which uses 0 by default)
CodePudding user response:
Use DataFrame.pivot
with DataFrame.reindex
for all years and then count DataFrame.expanding
with std
, last resahpe back by DataFrame.stack
and add new column in DataFrame.join
:
df1 = df.pivot('id','year','value')
s = (df1.reindex(range(df1.columns.min(), df1.columns.max() 1), axis=1)
.expanding(5, axis=1).std(ddof=0)
.stack())
df1 = df.join(s.rename('SD'), on=['id','year'])
print (df1)
id year value SD
0 1 2000 1 NaN
1 1 2002 3 NaN
2 2 1999 10 NaN
3 2 2000 8 NaN
4 2 2001 9 NaN
5 3 2000 12 NaN
6 3 2001 5 NaN
7 3 2003 6 NaN
8 3 2004 7 NaN
9 3 2005 13 3.261901
10 3 2006 2 3.862210
11 3 2007 1 4.237828
12 4 2005 5 NaN
13 4 2006 5 NaN
14 4 2007 5 NaN
15 4 2008 4 NaN
16 4 2009 7 0.979796