Imagine I have a dataset with observations for a number of individuals across multiple years. Each individual is in one of two statuses each year, A or B. I know which status each individual was in each year, and I created a dummy variable, Status_change, which equals 1 if the status in the current year differs from the status in the previous year. My data currently looks something like this:
Individual | Year | Status | Status_change |
--------------------------------------------
1          | 1    | A      | NA            |
1          | 2    | A      | 0             |
1          | 3    | A      | 0             |
1          | 4    | B      | 1             |
What I want is to create a new variable that measures how long the individual has remained in the same status; let's call it Duration. For the example above, it would look something like this:
Individual | Year | Status | Status_change | Duration |
-------------------------------------------------------
1          | 1    | A      | NA            | 0        |
1          | 2    | A      | 0             | 1        |
1          | 3    | A      | 0             | 2        |
1          | 4    | B      | 1             | 0        |
Essentially, I am looking for a variable which is initially 0 for all individuals in year 1 and grows by 1 unit each period as long as the status remains the same. If the status switches, the variable takes the value 0 again and the whole thing starts over. So far I have attempted:
data %>%
  group_by(Individual) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(Duration = ifelse(Year == 1, 0, ifelse(Status_Change == 1, 0, lag(Duration) + 1)))
But this gives me an error:
Error: Problem with `mutate()` column `Duration`.
i `Duration = ifelse(Year == 1, 0, ifelse(Status_Change == 1, 0, lag(Duration) + 1))`.
x could not find function "Duration"
i The error occurred in group 1: Individual = "1"
I would greatly appreciate any help you can give me! Thanks in advance!
CodePudding user response:
This should do it:
library(dplyr)

data |>
  group_by(Individual) |>
  arrange(Year, .by_group = TRUE) |>
  ungroup() |>
  mutate(
    ## Replace the initial NA in Status_change (spelled as in the
    ## question's table), which would otherwise make cumsum() return NA.
    Status_change = tidyr::replace_na(Status_change, 0),
    ## Create a counter that increases by one every time
    ## the status changes.
    Status_State = cumsum(Status_change)
  ) |>
  ## Duration is the row position within each state of each
  ## individual, minus 1 so that every state starts at 0.
  group_by(Individual, Status_State) |>
  mutate(Duration = row_number() - 1) |>
  ungroup()
Note that we can't just group by individual and status. We need an intermediate variable that tracks changes in state, so that a transition from A to B and back to A again is treated as three states rather than two.
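To make the A to B to A point concrete, here is a minimal sketch on made-up data (the tibble below is hypothetical, not the asker's). It assumes the run counter can be derived from Status itself by comparing each row with the previous one, so it does not rely on Status_change; the column name Run is also just an illustrative choice.

```r
library(dplyr)

# Hypothetical data: individual 1 goes A -> B -> back to A.
df <- tibble(
  Individual = c(1, 1, 1, 1, 1, 2, 2, 2),
  Year       = c(1, 2, 3, 4, 5, 1, 2, 3),
  Status     = c("A", "A", "B", "A", "A", "B", "B", "A")
)

result <- df |>
  arrange(Individual, Year) |>
  group_by(Individual) |>
  # Run increases by one whenever Status differs from the previous year.
  # The first year is compared with itself, so it never counts as a change.
  mutate(Run = cumsum(Status != lag(Status, default = first(Status)))) |>
  # Count rows within each run, starting at 0.
  group_by(Individual, Run) |>
  mutate(Duration = row_number() - 1) |>
  ungroup()

result$Duration
# 0 1 0 0 1 0 1 0
```

Individual 1's return to A in year 4 starts again at 0 because it belongs to a new run, which is the behaviour the intermediate variable is there to guarantee.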
CodePudding user response:
How about circumventing the Status_change variable altogether?
library(dplyr)
df |>
  group_by(Individual, Status) |>
  arrange(Year) |>
  mutate(Duration = row_number() - 1) |>
  ungroup()
Output:
# A tibble: 4 × 5
  Individual  Year Status Status_change Duration
       <dbl> <int> <chr>          <dbl>    <dbl>
1          1     1 A                 NA        0
2          1     2 A                  0        1
3          1     3 A                  0        2
4          1     4 B                  1        0
Data (please add some yourself next time :-))
df <- tibble(Individual = c(1,1,1,1),
Year = 1:4,
Status = c("A", "A", "A", "B"),
Status_change = c(NA, 0, 0, 1))
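One caveat with grouping by Individual and Status only: if an individual returns to an earlier status, the count continues from that earlier spell instead of restarting at 0. A possible base R sketch that resets on every change uses rle() to find runs of identical status and sequence() to number the rows within each run. The data frame df2 and the helper run_position() are hypothetical names introduced for illustration.

```r
# Hypothetical data: the individual returns to status A in year 4.
df2 <- data.frame(
  Individual = c(1, 1, 1, 1, 1),
  Year       = 1:5,
  Status     = c("A", "A", "B", "A", "A")
)

# Make sure rows are ordered by individual and year first.
df2 <- df2[order(df2$Individual, df2$Year), ]

# Position within each run of identical status, starting at 0.
run_position <- function(status) {
  sequence(rle(as.character(status))$lengths) - 1
}

# ave() applies the helper separately to each individual's rows.
# Row indices are passed in so the result stays numeric.
df2$Duration <- ave(
  seq_along(df2$Status),
  df2$Individual,
  FUN = function(i) run_position(df2$Status[i])
)

df2$Duration
# 0 1 0 0 1
```

With grouping by Individual and Status alone, years 4 and 5 in this example would get 2 and 3 rather than 0 and 1.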