How to create two events in unbalanced panel data?

My aim is to create two variables (unemployed and inactive) in unbalanced panel data that flag certain events. The status variable is coded 0 = employed, 1 = unemployed and 2 = inactive. First, the unemployed variable that I would like to create should flag only a transition from 0 to 1. The inactive variable that I would like to create should flag only a transition from 0 to 2. If an individual starts with 1 or 2 at the first observation (e.g., the individual with id 6), this is not considered an event. Finally, individuals might skip one or more waves, but this should be irrelevant to the variables we want to create. For example, the individual with id 1 skipped waves 3 and 4 but came back in wave 5, and we count that as an unemployed event.

Here is the expected output:

   id wave status unemployed inactive
1   1    1      0          0        0
2   1    2      0          0        0
3   1    5      1          1        0
4   2    1      0          0        0
5   2    2      0          0        0
6   3    1      0          0        0
7   3    2      2          0        1
8   3    3      1          0        0
9   4    1      0          0        0
10  4    2      2          0        1
11  4    4      0          0        0
12  5    1      2          0        0
13  5    3      0          0        0
14  5    5      1          1        0
15  5    6      1          0        0
16  6    1      1          0        0
17  6    3      2          0        1
18  6    5      2          0        0

Here is the data:

df=structure(list(id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 
    5, 5, 6, 6, 6), wave = c(1, 2, 5, 1, 2, 1, 2, 3, 1, 2, 4, 1, 
    3, 5, 6, 1, 3, 5), status = c(0, 0, 1, 0, 0, 0, 2, 1, 0, 2, 0, 
    2, 0, 1, 1, 1, 2, 2)), row.names = c(NA, -18L), class = c("tbl_df", 
    "tbl", "data.frame"))

CodePudding user response：

A dplyr solution:

library(dplyr)

df  |>
  group_by(id)  |>
  mutate(
    unemployed = (status == 1) & (lag(status, default = status[1]) == 0),
    inactive = (status == 2) & (lag(status, default = status[1]) != 2),
  )

# A tibble: 18 x 5
# Groups:   id [6]
#       id  wave status unemployed inactive
#    <dbl> <dbl>  <dbl> <lgl>      <lgl>
#  1     1     1      0 FALSE      FALSE
#  2     1     2      0 FALSE      FALSE
#  3     1     5      1 TRUE       FALSE
#  4     2     1      0 FALSE      FALSE
#  5     2     2      0 FALSE      FALSE
#  6     3     1      0 FALSE      FALSE
#  7     3     2      2 FALSE      TRUE
#  8     3     3      1 FALSE      FALSE
#  9     4     1      0 FALSE      FALSE
# 10     4     2      2 FALSE      TRUE
# 11     4     4      0 FALSE      FALSE
# 12     5     1      2 FALSE      FALSE
# 13     5     3      0 FALSE      FALSE
# 14     5     5      1 TRUE       FALSE
# 15     5     6      1 FALSE      FALSE
# 16     6     1      1 FALSE      FALSE
# 17     6     3      2 FALSE      TRUE
# 18     6     5      2 FALSE      FALSE

I have left them as logical rather than numeric variables as I think that is the appropriate data type in this case, but you can change that by wrapping the relevant part in as.numeric(), e.g. unemployed = as.numeric((status == 1) & (lag(status, default = status[1]) == 0)).
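If you would rather convert both flags at once instead of wrapping each expression, something like the following should work. This is only a sketch: it assumes dplyr 1.0 or later (for across()) and rebuilds the question's data as a plain data frame so that it runs on its own; res is just an illustrative name.

```r
library(dplyr)

# Data from the question, rebuilt as a plain data frame
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6),
  wave   = c(1, 2, 5, 1, 2, 1, 2, 3, 1, 2, 4, 1, 3, 5, 6, 1, 3, 5),
  status = c(0, 0, 1, 0, 0, 0, 2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 2)
)

res <- df |>
  group_by(id) |>
  mutate(
    unemployed = (status == 1) & (lag(status, default = status[1]) == 0),
    inactive   = (status == 2) & (lag(status, default = status[1]) != 2)
  ) |>
  ungroup() |>
  # Convert both logical flags to 0/1 integers in one step
  mutate(across(c(unemployed, inactive), as.integer))
```

The ungroup() call is optional here; it just returns an ordinary (ungrouped) tibble for later steps.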

I have assumed that:

A person is unemployed only if they transition from being employed.
A person is inactive if they move to being inactive from being employed or unemployed.
A person should not have the transition flag set to TRUE in the first period, even if they are inactive or unemployed in that period - that is what default = status[1] is doing: the first observation of each person is compared with itself, so neither condition can hold.
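To make these assumptions easier to check, here is a sketch that stores the lagged status in a helper column and compares two readings of "inactive". The names prev and inactive_strict are my own and not part of the question; inactive_strict is a hypothetical variant that only counts moves from employed (0) to inactive (2). The arrange() call is there because lag() follows row order, so the rows need to be sorted by wave within each id.

```r
library(dplyr)

# Data from the question, rebuilt as a plain data frame
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6),
  wave   = c(1, 2, 5, 1, 2, 1, 2, 3, 1, 2, 4, 1, 3, 5, 6, 1, 3, 5),
  status = c(0, 0, 1, 0, 0, 0, 2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 2)
)

res <- df |>
  arrange(id, wave) |>  # lag() follows row order, so sort by wave within id
  group_by(id) |>
  mutate(
    # Previous observed status; the first row of each id is compared with itself
    prev            = lag(status, default = status[1]),
    unemployed      = status == 1 & prev == 0,
    inactive        = status == 2 & prev != 2,  # from employed or unemployed
    inactive_strict = status == 2 & prev == 0   # hypothetical variant: from employed only
  ) |>
  ungroup()

# Rows where the two readings of "inactive" disagree
differs <- filter(res, inactive != inactive_strict)
```

On the example data the two versions differ only for id 6 in wave 3 (a move from 1 to 2). The expected output in the question flags that row, which is why I used != 2 rather than == 0.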

Also just for fun here is a data.table solution:

library(data.table)

dt  <- setDT(df)

dt[, 
   `:=` (
    unemployed = (status == 1) & (shift(status, type = "lag", fill = status[1]) == 0), 
    inactive = (status == 2) & (shift(status, type = "lag", fill = status[1]) != 2)
   ), 
   keyby = id
]

This should be faster if your data set is very large.
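One thing to be aware of: setDT() converts df in place, so the new columns are added to df as well. If you want to leave the original untouched, a variation along these lines may help. It is a sketch rather than a drop-in replacement: prev is a helper column I introduce for readability, and I use by = id here instead of keyby.

```r
library(data.table)

# Data from the question, rebuilt as a plain data frame
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6),
  wave   = c(1, 2, 5, 1, 2, 1, 2, 3, 1, 2, 4, 1, 3, 5, 6, 1, 3, 5),
  status = c(0, 0, 1, 0, 0, 0, 2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 2)
)

# as.data.table() makes a copy, so df itself is left unchanged
dt <- as.data.table(df)
setorder(dt, id, wave)  # shift() follows row order within each id

# Previous observed status per id; the first row is compared with itself
dt[, prev := shift(status, fill = status[1]), by = id]
dt[, `:=`(
  unemployed = status == 1 & prev == 0,
  inactive   = status == 2 & prev != 2
)]
dt[, prev := NULL]  # drop the helper column
```

Computing the lag once and reusing it also avoids calling shift() twice per group.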