Creating a variable in longitudinal data dependent on subsequent rows by ID in R-CodePudding

I have the following data:

df<-structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2), day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10), x1 = c(15, 15, 15.2, 15.2, 
15.3, 15.2, 15.3, 15, 15, 15.2, 15.3, 12, 12.1, 12.3, 12.2, 12, 
12.4, 12.5, 12.4, 12.6, 12.7), x2 = c(1, 1, 0, 0, 0, 1, 0, 0, 
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -21L))

And I want to generate a variable that indicates a change from 1 to 0 in x2, but only if the following 4 rows remain 0 (by ID). As in the first occurrence of a change in x2 from 1 to 0 for at least 4 days. To generate the variable in this data:


df2<-structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2), day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10), x1 = c(15, 15, 15.2, 15.2, 
15.3, 15.2, 15.3, 15, 15, 15.2, 15.3, 12, 12.1, 12.3, 12.2, 12, 
12.4, 12.5, 12.4, 12.6, 12.7), x2 = c(1, 1, 0, 0, 0, 1, 0, 0, 
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1), x3 = c(0, 0, 0, 0, 0, 
0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -21L))

Where x3 gets a value of 1 from the first occurrence of when x2 stops for at least 4 days, regardless of re-occurrence

I imagine there is a way to use lag or lead functions in dplyr, but I am not sure how to program the 'at least 4 days' condition. Any suggestions?

CodePudding user response：

We can use zoo::rollapply for a rolling-window calculation.

fun <- function(z)  (length(z) == 6 && z[1] == 1 && z[2] == 0 && all(z[-(1:2)] == 0))
df %>%
  group_by(ID) %>%
  mutate(x3a = cummax(zoo::rollapply(lead(x2), 6, fun, fill = 0))) %>%
  ungroup()
# # A tibble: 21 x 6
#       ID   day    x1    x2    x3   x3a
#    <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#  1     1     1  15       1     0     0
#  2     1     2  15       1     0     0
#  3     1     3  15.2     0     0     0
#  4     1     4  15.2     0     0     0
#  5     1     5  15.3     0     0     0
#  6     1     6  15.2     1     0     0
#  7     1     7  15.3     0     1     1
#  8     1     8  15       0     1     1
#  9     1     9  15       0     1     1
# 10     1    10  15.2     0     1     1
# # ... with 11 more rows

CodePudding user response：

A tidyverse solution could (also) look as follows:

library(dplyr)
library(tidyr)

df %>%
  group_by(ID) %>%
  mutate(grp = cumsum(x2)) %>%
  group_by(ID, grp) %>%
  mutate(fourOrMore = n() > 4,
         x3 =   lag(fourOrMore),
         x3 = replace_na(x3, 0)) %>%
  ungroup() %>%
  select(- c("grp", "fourOrMore"))

# # A tibble: 21 × 5
#      ID   day    x1    x2    x3
#   <dbl> <dbl> <dbl> <dbl> <int>
# 1     1     1  15       1     0
# 2     1     2  15       1     0
# 3     1     3  15.2     0     0
# 4     1     4  15.2     0     0
# 5     1     5  15.3     0     0
# 6     1     6  15.2     1     0
# 7     1     7  15.3     0     1
# 8     1     8  15       0     1
# 9     1     9  15       0     1
# 10    1    10  15.2     0     1
# # … with 11 more rows