Home > front end >  lag() with group_by between current and last observation in R
lag() with group_by between current and last observation in R

Time:09-08

Edit: I found the solution with na.locf().

data <- 
  data %>%
  group_by(country) %>%
  arrange(wave) %>%
  mutate(weight.io = na.locf(weight)) %>%
  mutate(lag_weight = weight - lag(weight.io)

I have a dataset below.

set.seed(42000)
data <- data_frame(
  country = sample(letters[1:20], size = 100, replace = TRUE),
  weight = round(runif(100, min = 48, max = 90)))

data <- data %>%
  group_by(country)  %>%
  arrange(weight) %>%
  mutate(wave = seq_along(weight))

n_rows <- nrow(data)
perc_missing <- 10
data[sample(1:n_rows, sample(1:n_rows, round(perc_missing/100 * n_rows, 0))), c("weight")] <- NA 

I would like to obtain the difference between one country's current "weight" and the last observed "weight for each wave.
For country "a" wave 5, I want the value to be 69 - 65 (last observed weight at wave < 5). And for wave 8, 82(weight at wave 8) - 69(weight at wave 5).

enter image description here

My approach was the one below, but it didn't work.

data <- 
  data %>%
  group_by(country) %>%
  arrange(wave) %>%
  mutate(lag_weight = weight - lag(weight, default = first(weight, na.rm = TRUE))) 

Thank you!

CodePudding user response:

I think this is a combination of diff (instead of lag, though that could work just as well) and more important tidyr::fill (or zoo::na.locf, not demonstrated):

BTW, na.rm= is not an argument for first, I've removed it.

library(dplyr)
# library(tidyr) # fill
data %>%
  group_by(country) %>%
  tidyr::fill(weight) %>%
  filter(country == "a") %>%
  mutate(lag_weight = weight - lag(weight, default = first(weight)))
# # A tibble: 10 x 4
# # Groups:   country [1]
#    country weight  wave lag_weight
#    <chr>    <dbl> <int>      <dbl>
#  1 a           54     1          0
#  2 a           55     2          1
#  3 a           65     3         10
#  4 a           65     4          0
#  5 a           69     5          4
#  6 a           69     6          0
#  7 a           69     7          0
#  8 a           82     8         13
#  9 a           82     9          0
# 10 a           85    10          3

The issue here is that weight is over-written with the LOCF (last-observation carried forward) value instead of preserving the NA values. If that's important, then you can make another weight variable for temporary use (and remove it):

data %>%
  mutate(tmpweight = weight) %>%
  group_by(country) %>%
  tidyr::fill(tmpweight) %>%
  filter(country == "a") %>%
  mutate(lag_weight = tmpweight - lag(tmpweight, default = first(tmpweight))) %>%
  select(-tmpweight)
# # A tibble: 10 x 4
# # Groups:   country [1]
#    country weight  wave lag_weight
#    <chr>    <dbl> <int>      <dbl>
#  1 a           54     1          0
#  2 a           55     2          1
#  3 a           65     3         10
#  4 a           NA     4          0
#  5 a           69     5          4
#  6 a           NA     6          0
#  7 a           NA     7          0
#  8 a           82     8         13
#  9 a           82     9          0
# 10 a           85    10          3

FYI, you can use c(0, diff(weight)) instead of weight - lag(weight) for the same effect. Since it returns length of 1 shorter (since it is the gap between each value), we prepend a 0 here:

data %>%
  group_by(country) %>%
  tidyr::fill(weight) %>%
  filter(country == "a") %>%
  mutate(lag_weight = c(0, diff(weight)))

(The filter(country == "a") is purely for demonstration to match your example, not that it is required for this solution.)

  • Related