Home > Back-end >  Create lagged variables for consecutive time points only using R
Create lagged variables for consecutive time points only using R

Time:02-18

I have an unbalanced panel (with unequally spaced measurement points) and would like to create a lagged variable of x by group (Variable: id) but only for consecutive time points. My data looks like this:

# simple example with an unbalanced panel
base <- data.frame(id = rep(1:2, each = 7),
                  time = c(1, 2, 3, 4, 7, 8, 10, 3, 4, 6, 9, 10, 11, 14), 
                  x = rnorm(14, mean = 3, sd = 1))

I already tried this code using dplyr:

base_lag <- base %>% # Add lagged column
  group_by(id) %>%
  dplyr::mutate(lag1_x = dplyr::lag(x, n = 1, default = NA)) %>% 
  as.data.frame()
base_lag # Print updated data

However, this way I get a lagged variable regardless of the fact that in some cases it is not two consecutive time points.

My final data set should look like this:

   id time        x   lag1_x
1   1    1 3.437416       NA
2   1    2 2.300553 3.437416
3   1    3 2.374212 2.300553
4   1    4 4.374009 2.374212
5   1    7 1.177433       NA
6   1    8 1.543353 1.177433
7   1   10 3.222358       NA
8   2    3 3.763765       NA
9   2    4 3.881182 3.763765
10  2    6 4.754420       NA
11  2    9 4.518227       NA
12  2   10 2.512486 4.518227
13  2   11 3.129230 2.512486
14  2   14 2.152509       NA

Does anyone here have a tip for me on how to create this lagged variable? Many thanks in advance!

CodePudding user response:

You could use ifelse, testing whether diff(time) is equal to 1. If so, write the lag. If not, write an NA.

base %>%
  group_by(id) %>%
  mutate(lag1_x = ifelse(c(0, diff(time)) == 1, lag(x, n = 1, default = NA), NA)) %>% 
  as.data.frame()
#>    id time        x   lag1_x
#> 1   1    1 1.852343       NA
#> 2   1    2 2.710538 1.852343
#> 3   1    3 2.700785 2.710538
#> 4   1    4 2.588489 2.700785
#> 5   1    7 3.252223       NA
#> 6   1    8 2.108079 3.252223
#> 7   1   10 3.435683       NA
#> 8   2    3 1.762462       NA
#> 9   2    4 2.775732 1.762462
#> 10  2    6 3.377396       NA
#> 11  2    9 3.133336       NA
#> 12  2   10 3.804190 3.133336
#> 13  2   11 2.942893 3.804190
#> 14  2   14 3.503608       NA

CodePudding user response:

An option is also to create a grouping based on the difference

library(dplyr)
base %>%
   group_by(id, grp = cumsum(c(TRUE, diff(time) != 1))) %>%
   mutate(lag1_x = lag(x)) %>%
   ungroup %>%
   select(-grp)

-output

# A tibble: 14 × 4
      id  time     x lag1_x
   <int> <dbl> <dbl>  <dbl>
 1     1     1  3.81  NA   
 2     1     2  2.79   3.81
 3     1     3  3.04   2.79
 4     1     4  1.76   3.04
 5     1     7  1.72  NA   
 6     1     8  2.68   1.72
 7     1    10  3.31  NA   
 8     2     3  2.92  NA   
 9     2     4  2.02   2.92
10     2     6  1.71  NA   
11     2     9  2.56  NA   
12     2    10  1.62   2.56
13     2    11  3.30   1.62
14     2    14  3.69  NA   
  • Related