Estimating likelihood of survey attrition relative to treatment

I have panel survey data in which each row represents an individual, their interview date, and their labor market status during that period. The panel is unbalanced: some individuals appear more often than others because some stopped responding to the survey's organizers. Data were collected on individuals before and after some of them were randomly given a cash assistance benefit.

I am interested in knowing whether some individuals stopped responding to our survey specifically after they received the cash benefit (the treatment date is 2019-09-03). In other words, I want to test the probability of leaving the survey relative to the "date" variable, but I am not sure how to do that.

Here is a data example. Some individuals, like Cartman, received the treatment in Sept 2019 and stopped responding to the survey in the following years, so their labor market status is recorded as "N/A". Other individuals in the control group, like Mackey, did not receive the treatment and continued responding to the survey in the following years.

individual  date        labor_status  cash_benefit
Kenny       2018-09-02  unemployed    0
Kenny       2019-09-03  unemployed    1
Kenny       2020-09-07  employed      1
Kenny       2021-09-13  employed      1
Cartman     2018-09-03  unemployed    0
Cartman     2019-09-06  unemployed    1
Cartman     2020-09-08  N/A           1
Cartman     2021-09-08  N/A           1
Mackey      2018-09-03  employed      0
Mackey      2019-09-04  unemployed    0
Mackey      2020-09-08  employed      0
Mackey      2021-09-13  employed      0
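
For anyone who wants to reproduce this, the table above could be entered in R roughly as follows. This is a minimal sketch, not the original dataset: the object name `dat` is an assumption, `date` is stored as a `Date`, and the "N/A" entries are stored as `NA`.

```r
# Sketch: build the example table as a data frame.
# Assumptions (not from the original data file): the object name `dat`,
# a Date-typed `date` column, and "N/A" stored as a missing value (NA).
dat <- data.frame(
  individual = rep(c("Kenny", "Cartman", "Mackey"), each = 4),
  date = as.Date(c(
    "2018-09-02", "2019-09-03", "2020-09-07", "2021-09-13",  # Kenny
    "2018-09-03", "2019-09-06", "2020-09-08", "2021-09-08",  # Cartman
    "2018-09-03", "2019-09-04", "2020-09-08", "2021-09-13"   # Mackey
  )),
  labor_status = c(
    "unemployed", "unemployed", "employed", "employed",      # Kenny
    "unemployed", "unemployed", NA, NA,                      # Cartman
    "employed", "unemployed", "employed", "employed"         # Mackey
  ),
  cash_benefit = c(0, 1, 1, 1,  0, 1, 1, 1,  0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Inspect column types and the first few values.
str(dat)
```

If the data were instead read from a file, something like `read.csv(..., na.strings = "N/A")` would turn the "N/A" entries into `NA` at import time.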

CodePudding user response:

If you’re looking to test this statistically, you should ask on Cross Validated. But if you just want the probability of dropout after 2019 conditional on receiving the benefit:

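library(dplyr)
library(lubridate)

# Assumes `date` is a Date and "N/A" in labor_status has been read in as NA.
dat %>%
  group_by(individual) %>%
  summarize(
    # TRUE if the individual ever received the cash benefit
    benefit = any(cash_benefit == 1),
    # TRUE if every row after 2019 is a non-response
    dropout_after_2019 = all(
      year(date) < 2019 |
      (year(date) == 2019 & !is.na(labor_status)) |
      is.na(labor_status)
    )
  ) %>%
  group_by(benefit) %>%
  # share of individuals in each group who dropped out after 2019
  summarize(p_dropout_after_2019 = mean(dropout_after_2019))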

# A tibble: 2 × 2
  benefit p_dropout_after_2019
  <lgl>                  <dbl>
1 FALSE                    0  
2 TRUE                     0.5
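
If a rough statistical check is wanted before taking the question to Cross Validated, one possible option (not part of the approach above) is Fisher's exact test on the 2 × 2 table of benefit by dropout. The following is a base-R sketch under stated assumptions: the names `per_person`, `tab`, and `res` are made up for illustration, the toy data are retyped from the question with "N/A" stored as `NA`, and dropout is defined as above (no response recorded after 2019). With only three individuals the result says nothing useful; the point is only the mechanics.

```r
# Toy data retyped from the question (non-response stored as NA).
dat <- data.frame(
  individual = rep(c("Kenny", "Cartman", "Mackey"), each = 4),
  date = as.Date(c(
    "2018-09-02", "2019-09-03", "2020-09-07", "2021-09-13",
    "2018-09-03", "2019-09-06", "2020-09-08", "2021-09-08",
    "2018-09-03", "2019-09-04", "2020-09-08", "2021-09-13"
  )),
  labor_status = c(
    "unemployed", "unemployed", "employed", "employed",
    "unemployed", "unemployed", NA, NA,
    "employed", "unemployed", "employed", "employed"
  ),
  cash_benefit = c(0, 1, 1, 1,  0, 1, 1, 1,  0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# One row per individual: ever treated, and no response after 2019.
per_person <- do.call(rbind, lapply(split(dat, dat$individual), function(d) {
  post <- d[as.integer(format(d$date, "%Y")) > 2019, ]
  data.frame(
    individual = d$individual[1],
    benefit    = any(d$cash_benefit == 1),
    # all() on an empty set is TRUE, so someone with no rows after 2019
    # is also counted as a dropout.
    dropout    = all(is.na(post$labor_status))
  )
}))

# Fix the factor levels so the table is 2 x 2 even if a cell is empty.
tab <- table(
  benefit = factor(per_person$benefit, levels = c(FALSE, TRUE)),
  dropout = factor(per_person$dropout, levels = c(FALSE, TRUE))
)

# Fisher's exact test of association between benefit and dropout.
res <- fisher.test(tab)
print(tab)
print(res$p.value)
```

On real data, a model that accounts for the timing of dropout (for example a logistic regression or a survival model) may be more appropriate; which one fits is a question for Cross Validated.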