I have a data frame containing information for each country-year on natural disasters. The structure is similar to the following:
test <- data.frame(year = rep(c(1990:1995) , times = 3) ,
country = rep(c("aaa" , "bbb" , "ccc") , each = 6) ,
incident = c(NA , "Drought" , rep(NA , times = 7) , "Flood" , "Flood" , NA ,
rep(NA , times = 6)) )
Giving:
year country incident
1 1990 aaa <NA>
2 1991 aaa Drought
3 1992 aaa <NA>
4 1993 aaa <NA>
5 1994 aaa <NA>
6 1995 aaa <NA>
7 1990 bbb <NA>
8 1991 bbb <NA>
9 1992 bbb <NA>
10 1993 bbb Flood
11 1994 bbb Flood
12 1995 bbb <NA>
13 1990 ccc <NA>
14 1991 ccc <NA>
15 1992 ccc <NA>
16 1993 ccc <NA>
17 1994 ccc <NA>
18 1995 ccc <NA>
I would like to introduce a new variable into this data frame that counts the number of years for each country since the last incident, and replace each string encountered with 0. Something like this:
year country incident years_since
1 1990 aaa <NA> NA
2 1991 aaa Drought 0
3 1992 aaa <NA> 1
4 1993 aaa <NA> 2
5 1994 aaa <NA> 3
6 1995 aaa <NA> 4
7 1990 bbb <NA> NA
8 1991 bbb <NA> NA
9 1992 bbb <NA> NA
10 1993 bbb Flood 0
11 1994 bbb Flood 0
12 1995 bbb <NA> 1
13 1990 ccc <NA> NA
14 1991 ccc <NA> NA
15 1992 ccc <NA> NA
16 1993 ccc <NA> NA
17 1994 ccc <NA> NA
18 1995 ccc <NA> NA
Is there a way to do this efficiently using one line of code?
CodePudding user response:
1) Create a years_since function based on counting the number of positions since the indicated cumsum and apply it by country.
library(dplyr)
years_since <- function(x) {
cs <- cumsum(!is.na(x))
ifelse(cs, seq_along(cs) - match(cs, cs), NA)
}
test %>%
group_by(country) %>%
mutate(years_since = years_since(incident)) %>%
ungroup
giving:
# A tibble: 18 × 4
year country incident years_since
<int> <chr> <chr> <int>
1 1990 aaa <NA> NA
2 1991 aaa Drought 0
3 1992 aaa <NA> 1
4 1993 aaa <NA> 2
5 1994 aaa <NA> 3
6 1995 aaa <NA> 4
7 1990 bbb <NA> NA
8 1991 bbb <NA> NA
9 1992 bbb <NA> NA
10 1993 bbb Flood 0
11 1994 bbb Flood 0
12 1995 bbb <NA> 1
13 1990 ccc <NA> NA
14 1991 ccc <NA> NA
15 1992 ccc <NA> NA
16 1993 ccc <NA> NA
17 1994 ccc <NA> NA
18 1995 ccc <NA> NA
2) Alternately using only base R:
transform(test, years_since = ave(incident, country, FUN = years_since))
CodePudding user response:
There might be better solutions, but here's one with cumsum
cumany
library(dplyr)
test %>%
group_by(country) %>%
mutate(years_since = ifelse(cumany(!is.na(incident)),
cumsum(cumsum(lag(!is.na(incident), default = 0) & is.na(incident))),
NA))
output
# A tibble: 18 × 4
# Groups: country [3]
year country incident years_since
<int> <chr> <chr> <int>
1 1990 aaa NA NA
2 1991 aaa Drought 0
3 1992 aaa NA 1
4 1993 aaa NA 2
5 1994 aaa NA 3
6 1995 aaa NA 4
7 1990 bbb NA NA
8 1991 bbb NA NA
9 1992 bbb NA NA
10 1993 bbb Flood 0
11 1994 bbb Flood 0
12 1995 bbb NA 1
13 1990 ccc NA NA
14 1991 ccc NA NA
15 1992 ccc NA NA
16 1993 ccc NA NA
17 1994 ccc NA NA
18 1995 ccc NA NA