How to create a column that is dependent on the average of previous observed events?-CodePudding

In the data below we observe a virtual GDP growth of a certain country over time. My aim is to create a variable with three categories: 0= no crisis, 1= crisis, 2= severe crisis. That would be identify economic crises as years where the growth rate reaches at least one (crisis) or two (severe) standard deviations below the average of the previous 3-year growth trend.

Could give some guidance please?

growth  year
    5   1990
    4   1991
    0   1992
    -4  1993
    -3  1994
    -1  1995
    2   1996
    4   1997
    7   1998
    10  1999
    8   2000
    -10 2001
    -8  2002
    2   2003
    4   2004
    5   2005
    8   2006
    4   2007
    -10 2008
    -9  2009
    -8  2010
    -3  2011
    0   2012
    -5  2013
    -6  2014
    -2  2015
    4   2016
    5   2017
    5   2018
    8   2019
    2   2020
    -1  2021
    -1  2022

Here is the data:

df=structure(list(gdp_growth = c(5, 4, 0, -4, -3, -1, 2, 4, 7, 10, 
    8, -10, -8, 2, 4, 5, 8, 4, -10, -9, -8, -3, 0, -5, -6, -2, 4, 
    5, 5, 8, 2, -1, -1), year = c(1990, 1991, 1992, 1993, 1994, 1995, 
    1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 
    2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 
    2018, 2019, 2020, 2021, 2022)), row.names = c(NA, -33L), class = "data.frame")

CodePudding user response：

From your description, it sounds as though you first need to calculate the rolling mean of growth, then compare the current year's growth to this:

library(dplyr)

df %>% 
  mutate(mn = zoo::rollmean(gdp_growth, 3, na.pad = TRUE, align = 'right'),
         crisis = ifelse(gdp_growth < (mn - sd(gdp_growth)),
                         ifelse(gdp_growth < (mn - 2 * sd(gdp_growth)),
                                2, 1), 0)) %>%
  select(-mn)

#>    gdp_growth year crisis
#> 1           5 1990     NA
#> 2           4 1991     NA
#> 3           0 1992      0
#> 4          -4 1993      0
#> 5          -3 1994      0
#> 6          -1 1995      0
#> 7           2 1996      0
#> 8           4 1997      0
#> 9           7 1998      0
#> 10         10 1999      0
#> 11          8 2000      0
#> 12        -10 2001      2
#> 13         -8 2002      0
#> 14          2 2003      0
#> 15          4 2004      0
#> 16          5 2005      0
#> 17          8 2006      0
#> 18          4 2007      0
#> 19        -10 2008      1
#> 20         -9 2009      0
#> 21         -8 2010      0
#> 22         -3 2011      0
#> 23          0 2012      0
#> 24         -5 2013      0
#> 25         -6 2014      0
#> 26         -2 2015      0
#> 27          4 2016      0
#> 28          5 2017      0
#> 29          5 2018      0
#> 30          8 2019      0
#> 31          2 2020      0
#> 32         -1 2021      0
#> 33         -1 2022      0

CodePudding user response：

Here's another example, this time using RcppRoll package which has loads of fast rolling functions compatible with dplyr.

library(dplyr)


df %>%
  mutate(
    std3 = RcppRoll::roll_sd(gdp_growth , 3, fill=0, align = "right"),
    crisis = case_when(
      std3 < 1 ~ 'no crisis',
      std3 < 2 ~ 'crisis',
      T ~ 'severe crisis'
    )
  )
#>    gdp_growth year       std3        crisis
#> 1           5 1990  0.0000000     no crisis
#> 2           4 1991  0.0000000     no crisis
#> 3           0 1992  2.6457513 severe crisis
#> 4          -4 1993  4.0000000 severe crisis
#> 5          -3 1994  2.0816660 severe crisis
#> 6          -1 1995  1.5275252        crisis
#> 7           2 1996  2.5166115 severe crisis
#> 8           4 1997  2.5166115 severe crisis
#> 9           7 1998  2.5166115 severe crisis
#> 10         10 1999  3.0000000 severe crisis
#> 11          8 2000  1.5275252        crisis
#> 12        -10 2001 11.0151411 severe crisis
#> 13         -8 2002  9.8657657 severe crisis
#> 14          2 2003  6.4291005 severe crisis
#> 15          4 2004  6.4291005 severe crisis
#> 16          5 2005  1.5275252        crisis
#> 17          8 2006  2.0816660 severe crisis
#> 18          4 2007  2.0816660 severe crisis
#> 19        -10 2008  9.4516313 severe crisis
#> 20         -9 2009  7.8102497 severe crisis
#> 21         -8 2010  1.0000000        crisis
#> 22         -3 2011  3.2145503 severe crisis
#> 23          0 2012  4.0414519 severe crisis
#> 24         -5 2013  2.5166115 severe crisis
#> 25         -6 2014  3.2145503 severe crisis
#> 26         -2 2015  2.0816660 severe crisis
#> 27          4 2016  5.0332230 severe crisis
#> 28          5 2017  3.7859389 severe crisis
#> 29          5 2018  0.5773503     no crisis
#> 30          8 2019  1.7320508        crisis
#> 31          2 2020  3.0000000 severe crisis
#> 32         -1 2021  4.5825757 severe crisis
#> 33         -1 2022  1.7320508        crisis

^{Created on 2022-07-11 by the reprex package (v2.0.1)}

CodePudding user response：

You could use lag, rowwise*, and mutate within dplyr:

library(dplyr)

df |>
  mutate(gdp3_growth_lag1 = lag(gdp_growth, 1),
         gdp3_growth_lag2 = lag(gdp_growth, 2),
         gdp3_growth_lag3 = lag(gdp_growth, 3)) |>
  rowwise() |>
    mutate(
      gdp3_growth_mean = mean(c_across(starts_with("gdp3_growth_lag"))),
      gdp3_growth_sd = sd(c_across(starts_with("gdp3_growth_lag")))
    ) |>
  ungroup() |>
  mutate(crisis = case_when(gdp_growth <= gdp3_growth_mean - 2 * gdp3_growth_sd ~ 2,
                            gdp_growth <= gdp3_growth_mean - gdp3_growth_sd ~ 1,
                            is.na(gdp3_growth_mean) ~ NA_real_,
                            TRUE ~ 0)) |>
  select(-starts_with("gdp3"))

Output:

# A tibble: 33 × 3
   gdp_growth  year crisis
        <dbl> <dbl>  <dbl>
 1          5  1990     NA
 2          4  1991     NA
 3          0  1992     NA
 4         -4  1993      2
 5         -3  1994      0
 6         -1  1995      0
 7          2  1996      0
 8          4  1997      0
 9          7  1998      0
10         10  1999      0
11          8  2000      0
12        -10  2001      2
13         -8  2002      0
14          2  2003      0
15          4  2004      0
16          5  2005      0
17          8  2006      0
18          4  2007      0
19        -10  2008      2
20         -9  2009      1
21         -8  2010      0
22         -3  2011      0
23          0  2012      0
24         -5  2013      0
25         -6  2014      1
26         -2  2015      0
27          4  2016      0
28          5  2017      0
29          5  2018      0
30          8  2019      0
31          2  2020      2
32         -1  2021      2
33         -1  2022      0

Updated with full output.

(*) There is also rowSds in matrixStats