How to create a column that is dependent on several events over time with a certain threshold?-CodePudding

In the data below we observe an indexed GDP rate of a certain country over time. My aim is to create a binary variable: 0= no crisis, 1= crisis. If the index is below the past 5 year average, then the variable is coded as 1=crisis until it reaches back the initial 5 year average.

In the example below, the GDP average between 1990 and 1994 is 98. In 1995, the GDP was 96 and hence the variable is coded as 1 until it went up to 99 (beyond the level of 98) in 1999. This scenario happened again between 2002 and 2005 until the GDP level goes back above the 5 year average level of 101.6 of the years 1997 to 2001. Can someone help with this please?

GDP year Expected output
100 1990    0
99  1991    0
98  1992    0
97  1993    0
98  1994    0
96  1995    1
94  1996    1
95  1997    1
97  1998    1
99  1999    0
110 2000    0
107 2001    0
100 2002    1
98  2003    1
99  2004    1
97  2005    1
102 2006    0
103 2007    0
102 2008    0

Here is the data:

df= structure(list(`Index 100` = c(100, 99, 98, 97, 98, 96, 94, 95, 
97, 99, 110, 107, 100, 98, 99, 97, 102, 103, 102), year = c(1990, 
1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 
2002, 2003, 2004, 2005, 2006, 2007, 2008), `Expected output` = c(0, 
0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0)), row.names = c(NA, 
-19L), class = "data.frame")

CodePudding user response：

You could use slider's slide_dbl to get the mean of the prior 5 years per below:

A few notes:

Without .complete = TRUE, it would calculate a mean for partial windows, e.g. prior 3 years in year 4.
If your real dataset has multiple countries, then don't forget to group_by(country) |> first.
You could consider using janitor's clean_names to ensure the variable names are syntactic, e.g. would not then need backticks.

library(tidyverse)
library(slider)

# Sample data
df <- structure(list(`Index 100` = c(
  100, 99, 98, 97, 98, 96, 94, 95,
  97, 99, 110, 107, 100, 98, 99, 97, 102, 103, 102
), year = c(
  1990,
  1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
  2002, 2003, 2004, 2005, 2006, 2007, 2008
), `Expected output` = c(
  0,
  0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0
)), row.names = c(
  NA,
  -19L
), class = "data.frame")

# Code
df |>
  arrange(year) |>
  mutate(
    sliding_mean = slide_dbl(`Index 100`, mean, .before = 5, .complete = TRUE),
    crisis = case_when(
      `Index 100` < sliding_mean ~ 1,
      `Index 100` >= sliding_mean & `Index 100` < first(na.omit(sliding_mean)) ~ 1,
      TRUE ~ 0
    )
  )
#>    Index 100 year Expected output sliding_mean crisis
#> 1        100 1990               0           NA      0
#> 2         99 1991               0           NA      0
#> 3         98 1992               0           NA      0
#> 4         97 1993               0           NA      0
#> 5         98 1994               0           NA      0
#> 6         96 1995               1     98.00000      1
#> 7         94 1996               1     97.00000      1
#> 8         95 1997               1     96.33333      1
#> 9         97 1998               1     96.16667      1
#> 10        99 1999               0     96.50000      0
#> 11       110 2000               0     98.50000      0
#> 12       107 2001               0    100.33333      0
#> 13       100 2002               1    101.33333      1
#> 14        98 2003               1    101.83333      1
#> 15        99 2004               1    102.16667      1
#> 16        97 2005               1    101.83333      1
#> 17       102 2006               0    100.50000      0
#> 18       103 2007               0     99.83333      0
#> 19       102 2008               0    100.16667      0

^{Created on 2022-07-12 by the reprex package (v2.0.1)}

CodePudding user response：

Try this

library(dplyr)

df |> mutate(crisis = case_when(lag(c( rep(0 , 4),
zoo::rollmean(df$GDP , 5)),1)> GDP ~ 1 , TRUE ~ 0))

output

   GDP year crisis
1  100 1990      0
2   99 1991      0
3   98 1992      0
4   97 1993      0
5   98 1994      0
6   96 1995      1
7   94 1996      1
8   95 1997      1
9   97 1998      0
10  99 1999      0
11 110 2000      0
12 107 2001      0
13 100 2002      1
14  98 2003      1
15  99 2004      1
16  97 2005      1
17 102 2006      0
18 103 2007      0
19 102 2008      0

output to improve understanding about the solution

   GDP year crisis Five_years_avg
1  100 1990      0             NA
2   99 1991      0            0.0
3   98 1992      0            0.0
4   97 1993      0            0.0
5   98 1994      0            0.0
6   96 1995      1           98.4
7   94 1996      1           97.6
8   95 1997      1           96.6
9   97 1998      0           96.0
10  99 1999      0           96.0
11 110 2000      0           96.2
12 107 2001      0           99.0
13 100 2002      1          101.6
14  98 2003      1          102.6
15  99 2004      1          102.8
16  97 2005      1          102.8
17 102 2006      0          100.2
18 103 2007      0           99.2
19 102 2008      0           99.8

in 1998 GDP is 97 and the average of the past 5 years was 96 so the is no crisis.