In the data below we observe an indexed GDP rate of a certain country over time. My aim is to create a binary variable: 0= no crisis, 1= crisis. If the index is below the past 5 year average, then the variable is coded as 1=crisis until it reaches back the initial 5 year average.
In the example below, the GDP average between 1990 and 1994 is 98. In 1995, the GDP was 96 and hence the variable is coded as 1 until it went up to 99 (beyond the level of 98) in 1999. This scenario happened again between 2002 and 2005 until the GDP level goes back above the 5 year average level of 101.6 of the years 1997 to 2001. Can someone help with this please?
GDP year Expected output
100 1990 0
99 1991 0
98 1992 0
97 1993 0
98 1994 0
96 1995 1
94 1996 1
95 1997 1
97 1998 1
99 1999 0
110 2000 0
107 2001 0
100 2002 1
98 2003 1
99 2004 1
97 2005 1
102 2006 0
103 2007 0
102 2008 0
Here is the data:
df= structure(list(`Index 100` = c(100, 99, 98, 97, 98, 96, 94, 95,
97, 99, 110, 107, 100, 98, 99, 97, 102, 103, 102), year = c(1990,
1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
2002, 2003, 2004, 2005, 2006, 2007, 2008), `Expected output` = c(0,
0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0)), row.names = c(NA,
-19L), class = "data.frame")
CodePudding user response:
You could use slider's slide_dbl
to get the mean
of the prior 5 years per below:
A few notes:
- Without
.complete = TRUE
, it would calculate a mean for partial windows, e.g. prior 3 years in year 4. - If your real dataset has multiple countries, then don't forget to
group_by(country) |>
first. - You could consider using janitor's
clean_names
to ensure the variable names are syntactic, e.g. would not then need backticks.
library(tidyverse)
library(slider)
# Sample data
df <- structure(list(`Index 100` = c(
100, 99, 98, 97, 98, 96, 94, 95,
97, 99, 110, 107, 100, 98, 99, 97, 102, 103, 102
), year = c(
1990,
1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
2002, 2003, 2004, 2005, 2006, 2007, 2008
), `Expected output` = c(
0,
0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0
)), row.names = c(
NA,
-19L
), class = "data.frame")
# Code
df |>
arrange(year) |>
mutate(
sliding_mean = slide_dbl(`Index 100`, mean, .before = 5, .complete = TRUE),
crisis = case_when(
`Index 100` < sliding_mean ~ 1,
`Index 100` >= sliding_mean & `Index 100` < first(na.omit(sliding_mean)) ~ 1,
TRUE ~ 0
)
)
#> Index 100 year Expected output sliding_mean crisis
#> 1 100 1990 0 NA 0
#> 2 99 1991 0 NA 0
#> 3 98 1992 0 NA 0
#> 4 97 1993 0 NA 0
#> 5 98 1994 0 NA 0
#> 6 96 1995 1 98.00000 1
#> 7 94 1996 1 97.00000 1
#> 8 95 1997 1 96.33333 1
#> 9 97 1998 1 96.16667 1
#> 10 99 1999 0 96.50000 0
#> 11 110 2000 0 98.50000 0
#> 12 107 2001 0 100.33333 0
#> 13 100 2002 1 101.33333 1
#> 14 98 2003 1 101.83333 1
#> 15 99 2004 1 102.16667 1
#> 16 97 2005 1 101.83333 1
#> 17 102 2006 0 100.50000 0
#> 18 103 2007 0 99.83333 0
#> 19 102 2008 0 100.16667 0
Created on 2022-07-12 by the reprex package (v2.0.1)
CodePudding user response:
Try this
library(dplyr)
df |> mutate(crisis = case_when(lag(c( rep(0 , 4),
zoo::rollmean(df$GDP , 5)),1)> GDP ~ 1 , TRUE ~ 0))
- output
GDP year crisis
1 100 1990 0
2 99 1991 0
3 98 1992 0
4 97 1993 0
5 98 1994 0
6 96 1995 1
7 94 1996 1
8 95 1997 1
9 97 1998 0
10 99 1999 0
11 110 2000 0
12 107 2001 0
13 100 2002 1
14 98 2003 1
15 99 2004 1
16 97 2005 1
17 102 2006 0
18 103 2007 0
19 102 2008 0
- output to improve understanding about the solution
GDP year crisis Five_years_avg
1 100 1990 0 NA
2 99 1991 0 0.0
3 98 1992 0 0.0
4 97 1993 0 0.0
5 98 1994 0 0.0
6 96 1995 1 98.4
7 94 1996 1 97.6
8 95 1997 1 96.6
9 97 1998 0 96.0
10 99 1999 0 96.0
11 110 2000 0 96.2
12 107 2001 0 99.0
13 100 2002 1 101.6
14 98 2003 1 102.6
15 99 2004 1 102.8
16 97 2005 1 102.8
17 102 2006 0 100.2
18 103 2007 0 99.2
19 102 2008 0 99.8
in 1998 GDP is 97 and the average of the past 5 years was 96 so the is no crisis.