I have a dataset which contains information on the number of fatalities from state-based conflicts (ged_sb_best_sum_nokgi) for different countries for the years 1989-2021.
The dataset is called conflict and looks something like this:
isoname year ged_sb_best_sum_nokgi
Afghanistan 1989 5174
Afghanistan 1990 5143
Afghanistan 1991 763
Afghanistan 1992 15
...
Afghanistan 2021 34
...
Zimbabwe 1989 13
Zimbabwe 1990 57
Zimbabwe 1991 124
...
Zimbabwe 2021 5
I have created a dummy equal to 1 if the number of fatalities is >= 25, like so:
conflict <- conflict %>%
  group_by(isoname) %>%
  mutate(conf_incidence_sb = ifelse(ged_sb_best_sum_nokgi >= 25, 1, 0))
This gives me the variable conf_incidence_sb, which takes the following values:
isoname year ged_sb_best_sum_nokgi conf_incidence_sb
Afghanistan 1989 5174 1
Afghanistan 1990 5143 1
Afghanistan 1991 763 1
Afghanistan 1992 15 0
...
Afghanistan 2021 34 1
...
Zimbabwe 1989 13 0
Zimbabwe 1990 57 1
Zimbabwe 1991 124 1
...
Zimbabwe 2021 5 0
The next thing I would like to do is create a variable called conf_duration_sb, which counts the number of consecutive years a conflict has been ongoing since it started and resets to 0 in years without conflict. The dataset should look something like this:
isoname year ged_sb_best_sum_nokgi conf_incidence_sb conf_duration_sb
Afghanistan 1989 5174 1 1
Afghanistan 1990 5143 1 2
Afghanistan 1991 763 1 3
Afghanistan 1992 15 0 0
...
Afghanistan 2021 34 1 1
...
Zimbabwe 1989 13 0 0
Zimbabwe 1990 57 1 1
Zimbabwe 1991 124 1 2
...
Zimbabwe 2021 5 0 0
CodePudding user response:
Here are two solutions. First I'll make some toy data with 2 countries and a random conflict variable:
library(dplyr)
dat <- data.frame(country=rep(c("UK", "France"), each=10), conflict=rbinom(20,1,0.5))
The first approach uses only dplyr. You use cumsum twice: first to split the data into spells (a new spell starts at every row where conflict is 0), then to compute the cumulative duration within each spell.
dat |>
  group_by(country, cumsum(conflict == 0)) |>
  mutate(duration = cumsum(conflict))
country conflict `cumsum(conflict == 0)` duration
<chr> <int> <int> <int>
1 UK 0 1 0
2 UK 1 1 1
3 UK 1 1 2
4 UK 0 2 0
5 UK 0 3 0
6 UK 1 3 1
7 UK 1 3 2
8 UK 1 3 3
9 UK 1 3 4
10 UK 0 4 0
11 France 1 4 1
12 France 0 5 0
13 France 0 6 0
14 France 0 7 0
15 France 0 8 0
16 France 0 9 0
17 France 1 9 1
18 France 1 9 2
19 France 1 9 3
20 France 0 10 0
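To see why the two cumsum calls in the first approach work, here is a minimal base-R sketch on a single made-up vector. The names conflict_flag, spell and duration are just illustrative, and ave stands in for the group_by/mutate step:

```r
# Made-up conflict indicator for one country, one value per year
conflict_flag <- c(1, 1, 1, 0, 0, 1, 1, 0)

# Step 1: every 0 starts a new spell, so the running count of zeros
# works as a spell id
spell <- cumsum(conflict_flag == 0)
# spell is 0 0 0 1 2 2 2 3

# Step 2: within each spell, the running sum of the indicator is the
# number of consecutive conflict years so far
duration <- ave(conflict_flag, spell, FUN = cumsum)
# duration is 1 2 3 0 0 1 2 0
```

Each spell begins with at most one 0 followed by a run of 1s, so the within-spell cumulative sum starts at 0 in the peace year and then counts the conflict years.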
The second approach uses the cumsum_reset function from the hutilscpp package. It requires a logical vector, so you convert the indicator first with as.logical.
dat |>
  group_by(country) |>
  mutate(duration = hutilscpp::cumsum_reset(as.logical(conflict)))
country conflict duration
<chr> <int> <int>
1 UK 0 0
2 UK 1 1
3 UK 1 2
4 UK 0 0
5 UK 0 0
6 UK 1 1
7 UK 1 2
8 UK 1 3
9 UK 1 4
10 UK 0 0
11 France 1 1
12 France 0 0
13 France 0 0
14 France 0 0
15 France 0 0
16 France 0 0
17 France 1 1
18 France 1 2
19 France 1 3
20 France 0 0
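Applied to the column names from the question, the first approach might look roughly like this. It is only a sketch: the nine rows are the ones shown in the question (the elided years are simply skipped), the helper column spell is my own name, and the arrange call is an added assumption that the rows should be in year order within each country, since cumsum depends on row order.

```r
library(dplyr)

# The rows visible in the question; the elided years are left out
conflict <- data.frame(
  isoname = rep(c("Afghanistan", "Zimbabwe"), times = c(5, 4)),
  year = c(1989:1992, 2021, 1989:1991, 2021),
  ged_sb_best_sum_nokgi = c(5174, 5143, 763, 15, 34, 13, 57, 124, 5)
)

result <- conflict |>
  # cumsum depends on row order, so sort by country and year first
  arrange(isoname, year) |>
  mutate(conf_incidence_sb = as.integer(ged_sb_best_sum_nokgi >= 25)) |>
  # a new spell starts at every non-conflict year (spell is a helper id)
  group_by(isoname, spell = cumsum(conf_incidence_sb == 0)) |>
  mutate(conf_duration_sb = cumsum(conf_incidence_sb)) |>
  ungroup() |>
  select(-spell)

result$conf_duration_sb
# 1 2 3 0 1 0 1 2 0
```

Note that this counts consecutive rows, not calendar years: if a country has missing years in the data, the count carries on across the gap.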
CodePudding user response:
Here is another approach using group_by and mutate twice:
library(dplyr)
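df %>%
  group_by(isoname) %>%
  # flag conflict years, then start a new spell id whenever the flag changes
  mutate(incidence = ifelse(ged_sb_best_sum_nokgi >= 25, 1, 0),
         duration = cumsum(incidence != lag(incidence, default = first(incidence)))) %>%
  group_by(isoname, duration) %>%
  # within each spell, count the rows of conflict spells; peace spells get 0
  mutate(duration = ifelse(incidence == 1, row_number(), 0))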
isoname year ged_sb_best_sum_nokgi incidence duration
<chr> <int> <int> <dbl> <dbl>
1 Afghanistan 1989 5174 1 1
2 Afghanistan 1990 5143 1 2
3 Afghanistan 1991 763 1 3
4 Afghanistan 1992 15 0 0
5 Afghanistan 2021 34 1 1
6 Zimbabwe 1989 13 0 0
7 Zimbabwe 1990 57 1 1
8 Zimbabwe 1991 124 1 2
9 Zimbabwe 2021 5 0 0