Home > other >  How to track changes in a long format with dplyr?
How to track changes in a long format with dplyr?

Time:01-23

let's say I had the following dataset in long format, where the id variable represents the participants (group variable):

id wave car household
1 1 0 1
1 2 1 1
1 3 0 1
1 4 1 2
2 1 0 1
2 2 1 2
2 3 1 3
2 4 0 1
3 1 0 1
3 2 0 1
3 3 1 2
3 4 1 1
4 1 0 1
4 2 1 1
4 3 1 1
4 4 1 2

The variable "car" tells whether someone owns a car or not. The variable "household" indicates how many people live in the same household. As you can see, all participants start without owning a car and living alone in the household.

I now want to determine the changes longitudinally so that I end up with a) only those subjects who own a car and b) only those subjects who own a car live with only one other person (not two or more people) in the household.

In each case, only the first change should be counted and as soon as the car is sold (or more than two people live in the household), the data points should be excluded.

So condition a) would be fulfilled for example with proband id 1 to Wave 2. However, only this should be counted, since proband id 1 in Wave 3 no longer owns a car and the subsequent car purchase in Wave 4 represents the second purchase.

Condition b) would be fulfilled, for example, for proband id 2 at Wave 2, but from Wave 3 onwards there are also three people in the household, which is why the data points from Wave 3 onwards are to be excluded. Similarly, if another person moves into the household and you already have a car, a missing value in condition B should arise.

Whether condition a) and/ or condition b) apply is to be calculated in two separate binary variables (yes/no), named, for instance, "cond-a" and "cond-b".

Does anyone know how to do this most cleverly, for example with dplyr (or other R packages)? I would be extremely grateful for an answer!

I know, that I probably can do this with the group_by() function from dplyr, right?

Here is the code of the data.frame used in this example:

id <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
wave <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
car <- c(0,1,0,1,0,1,1,0,0,0,1,1,0,1,0,1)
household <- c(1,1,1,2,1,2,3,1,1,1,2,1,1,2,1,2)

df <- data.frame(id,wave,car,household)

The expected output should be like:

id wave car household Cond-A Cond-B
1 1 0 1 0 0
1 2 1 1 1 0
1 3 0 1 0 0
1 4 1 2 NA NA
2 1 0 1 0 0
2 2 1 2 0 1
2 3 1 3 NA NA
2 4 0 1 0 0
3 1 0 1 0 0
3 2 0 1 0 0
3 3 1 2 0 1
3 4 1 1 1 NA
4 1 0 1 0 0
4 2 1 1 1 0
4 3 1 1 1 0
4 4 1 2 NA NA

Edit: Subject 1, Wave 4 is NA because she/ he had already owned a car before (see Wave 2). If car = 1 before and then car = 0 again in the meantime, the data points should be excluded from the second time car = 1 (for both Cond-A and Cond-B). Id 2, Wave 2 shows: If the change from car = 0 is not car = 1 & household = 1, but directly car = 1 & household = 2, then Cond-B shall apply, but not Cond-A. So Cond-A shall only apply if a change from car = 0 & household = 1 is to car = 1 & household = 1. I know this is a tricky question, but if anyone knows the answer, it's probably here! :)

CodePudding user response:

I revised my approach, I think it now comes very close to the desired outcome.

The last piece of logic that I don't understand is id 4, wave 4, why double NA?

The core logic is build into a tempporary variable called car_id which basically shows either NA if someone doesn't have a car at wave t or the id of the car (1, 2 etc.).

library(dplyr)

df %>% 
  group_by(id) %>% 
  mutate(car_id = rank(
           ifelse(car == 1,
                 data.table::rleid(car == 0),
                 NA),
           ties.method = "min",
           na.last = "keep"),
         
         condition_a = case_when(
           car_id == 1 & household == 1 ~ 1,
           car_id > 1 | household > 2 ~ NA_real_,
           (lag(car) == 0) & car_id == 1 & 
             (lag(household) == 1) & household == 2 ~ 0,
           TRUE ~ 0
         ),

         condition_b =
           case_when(
             lag(household) != 2 & household == 2 & car_id == 1 ~ 1,
             car_id > 1 | household > 2 ~ NA_real_,
             lag(household == 2) &  household != 2 ~ NA_real_,
             household != 0 ~ 0
           )
         ) %>% 
  select(!car_id)

#> # A tibble: 16 × 6
#> # Groups:   id [4]
#>       id  wave   car household condition_a condition_b
#>    <dbl> <dbl> <dbl>     <dbl>       <dbl>       <dbl>
#>  1     1     1     0         1           0           0
#>  2     1     2     1         1           1           0
#>  3     1     3     0         1           0           0
#>  4     1     4     1         2          NA          NA
#>  5     2     1     0         1           0           0
#>  6     2     2     1         2           0           1
#>  7     2     3     1         3          NA          NA
#>  8     2     4     0         1           0           0
#>  9     3     1     0         1           0           0
#> 10     3     2     0         1           0           0
#> 11     3     3     1         2           0           1
#> 12     3     4     1         1           1          NA
#> 13     4     1     0         1           0           0
#> 14     4     2     1         1           1           0
#> 15     4     3     1         1           1           0
#> 16     4     4     1         2           0           1

Created on 2023-01-22 with reprex v2.0.2

Data used from the actual table not the df object:

id <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
wave <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
car <- c(0,1,0,1,0,1,1,0,0,0,1,1,0,1,1,1)
household <- c(1,1,1,2,1,2,3,1,1,1,2,1,1,1,1,2)

df <- data.frame(id,wave,car,household)
  • Related