Home > Software engineering >  lagging columns by grouped column (R)
lagging columns by grouped column (R)

Time:01-05

I have created a data frame;

city_name             Date                 temp      hum    
   London       2022-01-01                    0       10
   London       2022-01-02                    7        8
   London       2022-01-03                    3        6
   London       2022-01-04                    7        7     
   London       2022-01-05                   10        8 
    Paris       2022-01-01                   11        3
    Paris       2022-01-02                   NA        2
    Paris       2022-01-03                    9       NA
    Paris       2022-01-04                   10        7
    Paris       2022-01-05                   11        9

what I am trying to do is lag temp and hum by 1 and 2 by city_name and put NA to lagged values if the Value is NA in temp or hum.

city_name             Date    temp       hum  temp_lag1  temp_lag2  hum_lag1   hum_lag2
   London       2022-01-01       0       10         NA         NA        NA         NA
   London       2022-01-02       7        8          0         NA        10         NA
   London       2022-01-03       3        6          7          0         8         10
   London       2022-01-04       7        7          3          7         6          8
   London       2022-01-05      10        8          7          3         7          6
    Paris       2022-01-01      11        3         NA         NA        NA         NA
    Paris       2022-01-02      NA        2         11         NA         3         NA
    Paris       2022-01-03       9       NA         NA         11         2          3
    Paris       2022-01-04      10        7          9         NA        NA          2
    Paris       2022-01-05      11        9         10          9         7         NA

What I have tried doing is;

  df <- df %>% 
      group_by(city_name) %>%                           
      dplyr::mutate(temp_lag1 = dplyr::lag(temp, n = 1, default=NA)) %>% 
      as.data.frame()
    
    df <- df %>% 
      group_by(city_name)                            
      dplyr::mutate(temp_lag2 = dplyr::lag(temp, n = 2, default=NA)) %>% 
      as.data.frame()
    
    df <- df %>% 
      group_by(city_name) %>%                           
      dplyr::mutate(hum_lag1 = dplyr::lag(hum, n = 1, default=NA)) %>% 
      as.data.frame()
    
    df <- df %>%
      group_by(city_name) %>%                             
      dplyr::mutate(rltv_hum_lag2 = dplyr::lag(rltv_hum, n = 2, default=NA)) %>% 
      as.data.frame()

it gives an error of: Error in dplyr::lag(temp, n = 1, default = NA) : object 'temp' not found

CodePudding user response:

Instead of doing this in multiple steps, use across to loop over the columns and return a list of columns

library(dplyr)
df <- df %>% 
   group_by(City_Name) %>% 
   mutate(across(c(temp, hum), list(lag1 = ~lag(.x, n = 1), 
        lag2 = ~ lag(.x, n = 2)))) %>%
    ungroup

-output

# A tibble: 10 × 8
   City_Name Date        temp   hum temp_lag1 temp_lag2 hum_lag1 hum_lag2
   <chr>     <chr>      <int> <int>     <int>     <int>    <int>    <int>
 1 London    2022-01-01     0    10        NA        NA       NA       NA
 2 London    2022-01-02     7     8         0        NA       10       NA
 3 London    2022-01-03     3     6         7         0        8       10
 4 London    2022-01-04     7     7         3         7        6        8
 5 London    2022-01-05    10     8         7         3        7        6
 6 Paris     2022-01-01    11     3        NA        NA       NA       NA
 7 Paris     2022-01-02    NA     2        11        NA        3       NA
 8 Paris     2022-01-03     9    NA        NA        11        2        3
 9 Paris     2022-01-04    10     7         9        NA       NA        2
10 Paris     2022-01-05    11     9        10         9        7       NA

data

df <- structure(list(City_Name = c("London", "London", "London", "London", 
"London", "Paris", "Paris", "Paris", "Paris", "Paris"), Date = c("2022-01-01", 
"2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05", "2022-01-01", 
"2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"), temp = c(0L, 
7L, 3L, 7L, 10L, 11L, NA, 9L, 10L, 11L), hum = c(10L, 8L, 6L, 
7L, 8L, 3L, 2L, NA, 7L, 9L)), class = "data.frame", row.names = c(NA, 
-10L))
  • Related