Create Duration Variable In R

Time:12-15

Imagine I have a dataset with observations for a number of individuals across multiple years. Each year, an individual is in one of two statuses, A or B. I know which status each individual was in each year, and I created a dummy variable Status_change, which equals 1 if the status in the current year differs from the previous year's. So my data currently looks something like:

Individual| Year | Status | Status_change |
-------------------------------------------
    1     |  1   |   A    |      NA       |
    1     |  2   |   A    |      0        |
    1     |  3   |   A    |      0        |
    1     |  4   |   B    |      1        |

What I want is to create a new variable which measures how long the individual has remained in the same status - let's call it Duration. In the context of the above example, it would look something like:

Individual| Year | Status | Status_change | Duration |
------------------------------------------------------
    1     |  1   |   A    |      NA       |     0    |
    1     |  2   |   A    |      0        |     1    |
    1     |  3   |   A    |      0        |     2    |
    1     |  4   |   B    |      1        |     0    | 

Essentially, I am looking for a variable which is initially 0 for all individuals in year 1 and grows by 1 unit each period as long as the status remains the same. If the status switches, the variable takes the value 0 again and the whole thing starts over. So far I have attempted:

data%>%
  group_by(Individual)%>%
  arrange(Year, .by_group = TRUE)%>%
  mutate(Duration = ifelse(Year == 1, 0, ifelse(Status_Change == 1, 0, lag(Duration) + 1)))

But this gives me an error:

Error: Problem with `mutate()` column `Duration`.
i `Duration = ifelse(Year == 1, 0, ifelse(Status_Change == 1, 0, lag(Duration) + 1))`.
x could not find function "Duration"
i The error occurred in group 1: Individual = "1"

I would greatly appreciate any help you can give me! Thanks in advance!

CodePudding user response:

This should do it:

library(dplyr)

data |>
  group_by(Individual) |>
  arrange(Year, .by_group = TRUE) |>
  ungroup() |>
  mutate(
    ## Replace the initial NA in Status_change,
    ## which would otherwise propagate through cumsum().
    Status_change = tidyr::replace_na(Status_change, 0),
    ## Create a variable that increases by one every time
    ## the status changes.
    Status_State = cumsum(Status_change)
  ) |>
  ## Duration is the row number within each state of each
  ## individual, minus 1 so that it starts at 0.
  group_by(Individual, Status_State) |>
  mutate(Duration = row_number() - 1) |>
  ungroup()

Note that we can't just group by Individual and Status: we need an intermediary variable that tracks changes of status, so that a transition from A to B and back to A again is treated as three separate states rather than two.
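
To see why the intermediary variable matters, here is a minimal sketch on made-up data in which one individual goes from A to B and back to A. The tibble `toy` and the column name `Run` are illustrative assumptions, not part of the question. The run counter is derived from `Status` directly, so this variant does not rely on `Status_change` or its leading NA.

```r
library(dplyr)

# Hypothetical data: one individual who goes A -> B -> A
toy <- tibble(
  Individual = c(1, 1, 1, 1, 1),
  Year       = 1:5,
  Status     = c("A", "A", "B", "A", "A")
)

result <- toy |>
  arrange(Individual, Year) |>
  group_by(Individual) |>
  # Run increases by one each time Status differs from the previous year;
  # the first year of each individual has no predecessor and counts as "no change"
  mutate(Run = cumsum(Status != lag(Status, default = first(Status)))) |>
  group_by(Individual, Run) |>
  # Years already spent in the current run, starting at 0
  mutate(Duration = row_number() - 1L) |>
  ungroup()

result$Duration
# 0 1 0 0 1
```

Grouping by Individual and Status alone would instead keep counting across the gap and give 0 1 0 2 3 for the same data.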

CodePudding user response:

How about circumventing the Status_change variable altogether?

library(dplyr)

df |>
  group_by(Individual, Status) |>
  arrange(Year) |>
  mutate(Duration = row_number()-1) |>
  ungroup()

Output:

# A tibble: 4 × 5
  Individual  Year Status Status_change Duration
       <dbl> <int> <chr>          <dbl>    <dbl>
1          1     1 A                 NA        0
2          1     2 A                  0        1
3          1     3 A                  0        2
4          1     4 B                  1        0

Data (please add some yourself next time :-))

df <- tibble(Individual = c(1,1,1,1),
             Year = 1:4,
             Status = c("A", "A", "A", "B"),
             Status_change = c(NA, 0, 0, 1))
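
If an individual can return to an earlier status, the counter has to restart for each consecutive run of the same status. A base R sketch of that idea uses `rle()` and `sequence()`; the helper `duration_in_run` and the data frame `toy` are hypothetical names introduced only for illustration, and the rows are assumed to be sorted by Year within each Individual.

```r
# Hypothetical helper: for a status vector already sorted by year,
# return how many consecutive earlier years had the same status
duration_in_run <- function(status) {
  runs <- rle(as.character(status))
  # sequence() yields 1..n for each run length; subtract 1 to start at 0
  sequence(runs$lengths) - 1L
}

duration_in_run(c("A", "A", "A", "B"))       # 0 1 2 0
duration_in_run(c("A", "A", "B", "A", "A"))  # 0 1 0 0 1

# Applying it per individual on made-up data
toy <- data.frame(
  Individual = c(1, 1, 1, 2, 2),
  Year       = c(1, 2, 3, 1, 2),
  Status     = c("A", "B", "B", "A", "A")
)
toy <- toy[order(toy$Individual, toy$Year), ]
# ave() passes each individual's row positions to the helper
toy$Duration <- ave(seq_len(nrow(toy)), toy$Individual,
                    FUN = function(i) duration_in_run(toy$Status[i]))
toy$Duration
# 0 0 1 0 1
```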