How to calculate the duration of a variable?

I have a dataset that contains the number of fatalities from state-based conflicts (ged_sb_best_sum_nokgi) for different countries for the years 1989-2021.

The dataset is called conflict and looks something like this:

isoname        year  ged_sb_best_sum_nokgi
Afghanistan    1989      5174
Afghanistan    1990      5143
Afghanistan    1991      763
Afghanistan    1992       15
...
Afghanistan    2021       34 
...
Zimbabwe       1989       13
Zimbabwe       1990       57
Zimbabwe       1991       124
... 
Zimbabwe       2021       5

I have created a dummy equal to 1 if the number of fatalities is >= 25, like so:

conflict <- conflict %>% 
 group_by(isoname)%>% 
 mutate(conf_incidence_sb = ifelse(ged_sb_best_sum_nokgi >= 25,1,0))

And I get the variable conf_incidence_sb, which takes the following values:

isoname        year  ged_sb_best_sum_nokgi   conf_incidence_sb
Afghanistan    1989      5174                          1
Afghanistan    1990      5143                          1
Afghanistan    1991      763                           1
Afghanistan    1992       15                           0
...
Afghanistan    2021       34                           1
...
Zimbabwe       1989       13                           0
Zimbabwe       1990       57                           1
Zimbabwe       1991       124                          1
... 
Zimbabwe       2021       5                            0

The next thing I would like to do is create a variable called conf_duration_sb, which counts the number of consecutive years a conflict has been ongoing since it started and is 0 in years without conflict. The dataset should look something like this:

isoname        year  ged_sb_best_sum_nokgi   conf_incidence_sb   conf_duration_sb
Afghanistan    1989      5174                          1                   1
Afghanistan    1990      5143                          1                   2
Afghanistan    1991      763                           1                   3
Afghanistan    1992       15                           0                   0
...
Afghanistan    2021       34                           1                   1                
...
Zimbabwe       1989       13                           0                   0
Zimbabwe       1990       57                           1                   1
Zimbabwe       1991       124                          1                   2
... 
Zimbabwe       2021       5                            0                   0

CodePudding user response：

Here are two solutions. First, I'll make some toy data with two countries and a random conflict variable:

library(dplyr)

dat <- data.frame(country=rep(c("UK", "France"), each=10), conflict=rbinom(20,1,0.5))

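The first approach uses only dplyr. You use cumsum twice: first on conflict == 0 to build a run id that increases at every non-conflict year, so each conflict spell gets its own group, and then on conflict within each group to get the running duration.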

dat |> 
  group_by(country, cumsum(conflict==0)) |> 
  mutate(duration=cumsum(conflict))

   country conflict `cumsum(conflict == 0)` duration
   <chr>      <int>                   <int>    <int>
 1 UK             0                       1        0
 2 UK             1                       1        1
 3 UK             1                       1        2
 4 UK             0                       2        0
 5 UK             0                       3        0
 6 UK             1                       3        1
 7 UK             1                       3        2
 8 UK             1                       3        3
 9 UK             1                       3        4
10 UK             0                       4        0
11 France         1                       4        1
12 France         0                       5        0
13 France         0                       6        0
14 France         0                       7        0
15 France         0                       8        0
16 France         0                       9        0
17 France         1                       9        1
18 France         1                       9        2
19 France         1                       9        3
20 France         0                      10        0
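To see what the two cumsum calls above are doing, here is a minimal base R sketch on a single fixed vector. The values are chosen to match the UK rows in the output above, and the names run_id and duration are only illustrative. It uses ave in place of group_by and mutate, so it is an illustration of the idea rather than the dplyr pipeline itself.

```r
# Fixed 0/1 conflict indicator (same pattern as the UK rows above)
conflict <- c(0, 1, 1, 0, 0, 1, 1, 1, 1, 0)

# Step 1: every 0 starts a new run, so a conflict spell shares the id
# of the 0 that precedes it
run_id <- cumsum(conflict == 0)

# Step 2: running sum of the indicator within each run;
# the leading 0 of each run contributes nothing
duration <- ave(conflict, run_id, FUN = cumsum)

print(data.frame(conflict, run_id, duration))
```

Because each run starts with a 0, the cumulative sum restarts from zero at every non-conflict year, which is what produces the reset.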

The second approach uses the cumsum_reset function from the hutilscpp package. It expects a logical vector, so convert the 0/1 variable with as.logical first.

dat |> 
  group_by(country) |>
  mutate(duration=hutilscpp::cumsum_reset(as.logical(conflict)))

   country conflict duration
   <chr>      <int>    <int>
 1 UK             0        0
 2 UK             1        1
 3 UK             1        2
 4 UK             0        0
 5 UK             0        0
 6 UK             1        1
 7 UK             1        2
 8 UK             1        3
 9 UK             1        4
10 UK             0        0
11 France         1        1
12 France         0        0
13 France         0        0
14 France         0        0
15 France         0        0
16 France         0        0
17 France         1        1
18 France         1        2
19 France         1        3
20 France         0        0
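If adding hutilscpp as a dependency is not an option, a similar running count can be sketched in base R with rle and sequence. This is a substitute technique, not how cumsum_reset is implemented. The helper name run_length_count is hypothetical, the toy values are fixed rather than random, and the sketch assumes a 0/1 vector with no missing values.

```r
# Hypothetical helper: position within each run of equal values,
# zeroed out for runs of 0 (assumes x contains only 0 and 1, no NA)
run_length_count <- function(x) {
  runs <- rle(x)
  sequence(runs$lengths) * x
}

# Fixed toy data, five rows per country
dat <- data.frame(
  country  = rep(c("UK", "France"), each = 5),
  conflict = c(0, 1, 1, 0, 1,
               1, 1, 0, 0, 1)
)

# Apply the helper separately within each country
dat$duration <- ave(dat$conflict, dat$country, FUN = run_length_count)

print(dat)
```

Here ave plays the role of group_by(country), so a conflict that is ongoing in the last row of one country does not carry over into the next.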

CodePudding user response：

Here is another approach using group_by and mutate twice:

library(dplyr)

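# df holds the question's data (called conflict there)
df %>% 
  group_by(isoname) %>% 
  mutate(incidence = ifelse(ged_sb_best_sum_nokgi >= 25, 1, 0), 
         # run id: increases each time incidence switches between 0 and 1
         run = cumsum(incidence != lag(incidence, default = first(incidence)))) %>% 
  group_by(isoname, run) %>% 
  # count the years within each run; runs without conflict get 0
  mutate(duration = ifelse(incidence == 1, row_number(), 0)) %>% 
  ungroup() %>% 
  select(-run)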

  isoname      year ged_sb_best_sum_nokgi incidence duration
  <chr>       <int>                 <int>     <dbl>    <dbl>
1 Afghanistan  1989                  5174         1        1
2 Afghanistan  1990                  5143         1        2
3 Afghanistan  1991                   763         1        3
4 Afghanistan  1992                    15         0        0
5 Afghanistan  2021                    34         1        1
6 Zimbabwe     1989                    13         0        0
7 Zimbabwe     1990                    57         1        1
8 Zimbabwe     1991                   124         1        2
9 Zimbabwe     2021                     5         0        0
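All of the approaches above count rows, so they rely on each country's rows being in year order. Below is a minimal sketch that sorts first and uses the variable names from the question. The data frame is a small excerpt of the rows shown in the question, deliberately shuffled, and run_id is a hypothetical helper column. It also assumes there are no missing years within a country, since a gap would still be counted as consecutive.

```r
library(dplyr)

# Excerpt of the question's rows, deliberately out of order
conflict <- data.frame(
  isoname = c("Zimbabwe", "Afghanistan", "Afghanistan", "Zimbabwe",
              "Afghanistan", "Zimbabwe", "Afghanistan"),
  year = c(1991, 1990, 1989, 1989, 1992, 1990, 1991),
  ged_sb_best_sum_nokgi = c(124, 5143, 5174, 13, 15, 57, 763)
)

result <- conflict %>%
  arrange(isoname, year) %>%                      # year order within country
  mutate(conf_incidence_sb = as.integer(ged_sb_best_sum_nokgi >= 25)) %>%
  group_by(isoname) %>%
  mutate(run_id = cumsum(conf_incidence_sb == 0)) %>%  # new run at each 0
  group_by(isoname, run_id) %>%
  mutate(conf_duration_sb = cumsum(conf_incidence_sb)) %>%
  ungroup() %>%
  select(-run_id)                                 # drop the helper column

print(result)
```

Computing run_id after group_by(isoname) keeps the ids local to each country, and dropping it at the end leaves only the columns from the desired output in the question.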