My aim is to create two variables (unemployed and inactive) in an unbalanced panel data set, each flagging a specific event. The status variable is coded 0 = employed, 1 = unemployed and 2 = inactive. First, the unemployed variable should flag a transition from 0 to 1 only. Second, the inactive variable should flag a transition from 0 to 2 only. If an individual starts with 1 or 2 on the first observation (e.g., the individual with id 6), this is not considered an event. Finally, individuals might skip one or more waves, but this should be irrelevant to the variables we want to create. For example, the individual with id 1 dropped out in waves 3 and 4 but came back in wave 5, and we count that as an unemployed event.
Here is the expected output:
   id wave status unemployed inactive
1   1    1      0          0        0
2   1    2      0          0        0
3   1    5      1          1        0
4   2    1      0          0        0
5   2    2      0          0        0
6   3    1      0          0        0
7   3    2      2          0        1
8   3    3      1          0        0
9   4    1      0          0        0
10  4    2      2          0        1
11  4    4      0          0        0
12  5    1      2          0        0
13  5    3      0          0        0
14  5    5      1          1        0
15  5    6      1          0        0
16  6    1      1          0        0
17  6    3      2          0        1
18  6    5      2          0        0
Here is the data:
df=structure(list(id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5,
5, 5, 6, 6, 6), wave = c(1, 2, 5, 1, 2, 1, 2, 3, 1, 2, 4, 1,
3, 5, 6, 1, 3, 5), status = c(0, 0, 1, 0, 0, 0, 2, 1, 0, 2, 0,
2, 0, 1, 1, 1, 2, 2)), row.names = c(NA, -18L), class = c("tbl_df",
"tbl", "data.frame"))
CodePudding user response:
A dplyr solution:
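library(dplyr)

df |>
  group_by(id) |>
  mutate(
    # lag() returns the previous row's status within each id
    # (this assumes rows are already sorted by wave within id)
    unemployed = (status == 1) & (lag(status, default = status[1]) == 0),
    inactive = (status == 2) & (lag(status, default = status[1]) != 2)
  )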
# A tibble: 18 x 5
# Groups:   id [6]
#       id  wave status unemployed inactive
#    <dbl> <dbl>  <dbl> <lgl>      <lgl>
#  1     1     1      0 FALSE      FALSE
#  2     1     2      0 FALSE      FALSE
#  3     1     5      1 TRUE       FALSE
#  4     2     1      0 FALSE      FALSE
#  5     2     2      0 FALSE      FALSE
#  6     3     1      0 FALSE      FALSE
#  7     3     2      2 FALSE      TRUE
#  8     3     3      1 FALSE      FALSE
#  9     4     1      0 FALSE      FALSE
# 10     4     2      2 FALSE      TRUE
# 11     4     4      0 FALSE      FALSE
# 12     5     1      2 FALSE      FALSE
# 13     5     3      0 FALSE      FALSE
# 14     5     5      1 TRUE       FALSE
# 15     5     6      1 FALSE      FALSE
# 16     6     1      1 FALSE      FALSE
# 17     6     3      2 FALSE      TRUE
# 18     6     5      2 FALSE      FALSE
I have left them as logical rather than numeric variables, as I think that is the appropriate data type in this case, but you can change that by wrapping the relevant expression in as.numeric(), e.g. unemployed = as.numeric((status == 1) & (lag(status, default = status[1]) == 0)).
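If you want 0/1 columns that match the expected output in the question, a minimal sketch along the same lines could look like this. The helper column prev_status and the arrange() step are my additions rather than part of the code above; the sort is there only in case the rows are not already ordered by wave within id.

```r
library(dplyr)

# Same data as in the question, rebuilt as a plain data frame
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6),
  wave   = c(1, 2, 5, 1, 2, 1, 2, 3, 1, 2, 4, 1, 3, 5, 6, 1, 3, 5),
  status = c(0, 0, 1, 0, 0, 0, 2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 2)
)

result <- df |>
  arrange(id, wave) |>            # assumption: make the wave order explicit
  group_by(id) |>
  mutate(
    # hypothetical helper column: previous status within each id,
    # with the first observation compared against itself
    prev_status = lag(status, default = status[1]),
    unemployed  = as.integer(status == 1 & prev_status == 0),
    inactive    = as.integer(status == 2 & prev_status != 2)
  ) |>
  ungroup() |>
  select(-prev_status)

print(result, n = 18)
```

Computing the lag once in a helper column avoids repeating the lag() call, at the cost of one extra select() to drop it again.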
I have assumed that:
- A person is unemployed only if they transition from being employed.
- A person is inactive if they move to being inactive from being employed or unemployed.
- A person should not have the transition flag set to TRUE in the first period, even if they are inactive or unemployed in that period. That is what default = status[1] is doing: the first observation is compared with itself, so neither condition can hold.
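To make the effect of default = status[1] concrete, here is a small base R sketch using the status sequences of ids 5 and 6 from the question. The function lag_first() is a hypothetical stand-in I wrote for lag(x, default = x[1]), not a dplyr function. The last line shows a stricter variant, assuming you read the question literally and only want 0 to 2 transitions to count as inactive events; note that this reading would not flag id 6 in wave 3, unlike the expected output in the question.

```r
# Hypothetical base R stand-in for dplyr::lag(x, default = x[1]):
# shift values down by one and reuse the first value as its own "previous" value
lag_first <- function(x) c(x[1], head(x, -1))

status_id5 <- c(2, 0, 1, 1)  # id 5 starts inactive
status_id6 <- c(1, 2, 2)     # id 6 starts unemployed

prev5 <- lag_first(status_id5)
prev6 <- lag_first(status_id6)

# The first observation is compared with itself, so it can never be flagged
unemployed5 <- status_id5 == 1 & prev5 == 0
inactive5   <- status_id5 == 2 & prev5 != 2
inactive6   <- status_id6 == 2 & prev6 != 2

# Stricter variant (assumption): only a move from employed (0) to inactive (2) counts
inactive6_strict <- status_id6 == 2 & prev6 == 0

print(unemployed5)
print(inactive5)
print(inactive6)
print(inactive6_strict)
```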
Also, just for fun, here is a data.table solution:
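library(data.table)

# setDT() converts df to a data.table by reference, so df itself is modified
dt <- setDT(df)
dt[,
  `:=`(
    # shift(..., type = "lag") is the data.table counterpart of dplyr's lag()
    unemployed = (status == 1) & (shift(status, type = "lag", fill = status[1]) == 0),
    inactive = (status == 2) & (shift(status, type = "lag", fill = status[1]) != 2)
  ),
  keyby = id
]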
This should be faster if your data set is very large.
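If you would rather leave the original df untouched and not rely on the rows already being in wave order, a variation on the data.table approach might look like the sketch below. The as.data.table() copy, the setorder() call, and the temporary prev_status column are my assumptions about what you may want, not part of the solution above.

```r
library(data.table)

# Same data as in the question, rebuilt as a plain data frame
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6),
  wave   = c(1, 2, 5, 1, 2, 1, 2, 3, 1, 2, 4, 1, 3, 5, 6, 1, 3, 5),
  status = c(0, 0, 1, 0, 0, 0, 2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 2)
)

dt <- as.data.table(df)  # makes a copy, so df is left unchanged
setorder(dt, id, wave)   # assumption: make the wave order explicit

# Temporary helper column: previous status within each id,
# with the first observation compared against itself
dt[, prev_status := shift(status, type = "lag", fill = status[1]), by = id]

dt[, `:=`(
  unemployed = status == 1 & prev_status == 0,
  inactive   = status == 2 & prev_status != 2
)]

dt[, prev_status := NULL]  # drop the helper column again

print(dt)
```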