Home > Software design >  R create lag variable in unbalance time series dataframe
R create lag variable in unbalance time series dataframe

Time:04-04

I want to create a lag variable in my dataframe.

I have a list of time c("20110127", "20110128", "20110129", "20110130", "20110131", "20110201", "20110202").

My dataframe look like:

ID TIME VALUE
1 20110127 3.23
1 20110128 4.57
1 20110130 3.22
1 20110131 4.33
1 20110202 5.23
2 20110127 0.86
2 20110128 7.55
2 20110129 1.24
2 20110201 3.69

The year in the table is not consecutive. And I want to group_by ID then create a one-day lag variable. Which look like this:

ID TIME VALUE LAG
1 20110127 3.23 NA
1 20110128 4.57 3.23
1 20110130 3.22 NA
1 20110131 4.33 3.22
1 20110201 1.22 4.33
1 20110202 5.23 1.22
2 20110127 0.86 NA
2 20110128 7.55 0.86
2 20110129 1.24 7.55
2 20110201 3.69 NA

I tried

alltime = c("20110127", "20110128", "20110129", "20110130", "20110131", "20110201", "20110202")

last_day = data %>%
    arrange(ID, TIME) %>%
    group_by(ID) %>%
    mutate(lagtime = alltime[which(alltime == TIME) 1], lagoutput = value ,value = NULL)
data %>%
    left_join(last_day)

But it seems like the mutate function wouldn't work here. Maybe the problem is alltime[which(alltime == TIME) 1] and I cannot create a lag.

Appreciate any comment.

CodePudding user response:

You can use diff() and lubridate::ymd() to make a temporary column that indicates if there are gaps in the dates; then leverage that gap to get conditional lag of VALUE

data %>%
  group_by(ID) %>% 
  mutate(g = c(0,diff(lubridate::ymd(TIME))),
         LAG = if_else(g==1,lag(VALUE),as.double(NA))) %>% 
  select(-g)

Output:

     ID     TIME VALUE   LAG
  <int>    <int> <dbl> <dbl>
1     1 20110127  3.23 NA   
2     1 20110128  4.57  3.23
3     1 20110130  3.22 NA   
4     1 20110131  4.33  3.22
5     1 20110202  5.23 NA   
6     2 20110127  0.86 NA   
7     2 20110128  7.55  0.86
8     2 20110129  1.24  7.55
9     2 20110201  3.69 NA  

CodePudding user response:

We can use an ifelse statement. So, if the rows are 1 day apart (here I convert to a time format using lubridate), then return lag value and if not return NA.

library(tidyverse)
library(lubridate)

df %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  mutate(LAG = ifelse(ymd(TIME) - lag(ymd(TIME)) == 1, lag(VALUE), NA))

Output

      ID     TIME VALUE   LAG
   <int>    <int> <dbl> <dbl>
 1     1 20110127  3.23 NA   
 2     1 20110128  4.57  3.23
 3     1 20110130  3.22 NA   
 4     1 20110131  4.33  3.22
 5     1 20110201  1.22  4.33
 6     1 20110202  5.23  1.22
 7     2 20110127  0.86 NA   
 8     2 20110128  7.55  0.86
 9     2 20110129  1.24  7.55
10     2 20110201  3.69 NA   

Data

structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), 
    TIME = c(20110127L, 20110128L, 20110130L, 20110131L, 20110201L, 
    20110202L, 20110127L, 20110128L, 20110129L, 20110201L), VALUE = c(3.23, 
    4.57, 3.22, 4.33, 1.22, 5.23, 0.86, 7.55, 1.24, 3.69)), class = "data.frame", row.names = c(NA, 
-10L))
  • Related