How to compare rows in a grouped tibble with different methods for the first and the last row of the-CodePudding

this is my first question here. So please be merciful with me :-). I am grateful for any suggestions for improvement.

Here is my question:

I have a grouped and sorted tibble and I want to go through row by row and make comparisons with both the previous row and the next row. For this I have to treat the first and the last row differently. Each row has an interval defined by "start" and "end" and I want to find out if these intervals overlap.

The "if" function does not do what what I want it to do. I have tried a lot of things but I can't get a result. Maybe there is a much simpler solution.

Thanks in advance for helping!

library(tidyverse)
library(lubridate)
ID <- c(1, 1, 1, 1, 2, 2, 2, 2)
start <- ymd_hms(c("2022-04-15 10:10:00", "2022-04-15 10:15:00", "2022-04-15 10:35:00", "2022-04-15 10:50:00", "2022-04-15 11:20:00", "2022-04-15 11:35:00", "2022-04-15 11:45:00", "2022-04-15 11:50:00"))
end   <- ymd_hms(c("2022-04-15 10:20:00", "2022-04-15 10:30:00", "2022-04-15 10:40:00", "2022-04-15 10:55:00", "2022-04-15 11:30:00", "2022-04-15 11:40:00", "2022-04-15 11:55:00", "2022-04-15 11:55:00"))

data <- tibble(ID, start, end)

data %>% 
  group_by(ID) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(overlap = {
    if (row_number()==1) {lead(start) < end
    } else if (row_number()==n()) {start < lag(end)
    } else {(lead(start) < end) | (start < lag(end))}
  }) 


# A tibble: 8 x 4
# Groups:   ID [2]
     ID start               end                 overlap
  <dbl> <dttm>              <dttm>              <lgl>  
1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 FALSE  
3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 NA     
5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 NA

My expected output would be:

# A tibble: 8 x 4
# Groups:   ID [2]
ID start               end                 overlap
<dbl> <dttm>              <dttm>              <lgl>  
1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE  
3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE     
5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE

CodePudding user response：

if requires that its condition argument be exactly length 1, which is rarely the case in mutate, and especially here since we know we're dealing with more than 1 row at a time. Alternatives often include base::ifelse or dplyr::if_else or similar, though they are not quite right here.

I suggest since you want to look at previous and next rows, we should use dplyr's lag and lead functions, respectively. Since we need to treat the first/last rows specially, we can use vectorized boolean logic with & and |.

For dealing with the first/last rows specially, we have two options:

comparing row_number() with 1 and n() for the first and last rows, respectively:

data %>%
  group_by(ID) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(
    overlap = (row_number() > 1 & start < lag(end)) |
                (row_number() < n() & end > lead(start))
  ) %>%
  ungroup()
# # A tibble: 8 x 4
#      ID start               end                 overlap
#   <dbl> <dttm>              <dttm>              <lgl>  
# 1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
# 2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE   
# 3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
# 4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE  
# 5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
# 6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
# 7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
# 8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE

set default= in lag/lead such that we get the desired intent.

data %>%
  group_by(ID) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(
    overlap = (start < lag(end, default = start[1] - 1)) |
                (end > lead(start, default = end[n()]   1))
  ) %>%
  ungroup()
# # A tibble: 8 x 4
#      ID start               end                 overlap
#   <dbl> <dttm>              <dttm>              <lgl>  
# 1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
# 2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE   
# 3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
# 4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE  
# 5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
# 6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
# 7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
# 8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE

Both methods are using strict-inequality (>/<), though you can change to >= (etc) just as easily depending on your needs.

CodePudding user response：

Here is another strategy using lubridates %within% and interval functions:

library(dplyr)
library(lubridate)

data %>% 
  group_by(ID) %>% 
  mutate(x = interval(start, end),
         y = start %within% lag(x),
         z = lead(y),
         overlap = case_when(y==TRUE | z==TRUE ~TRUE,
                              TRUE ~ FALSE)) %>% 
  select(-c(x,y,z))

     ID start               end                 overlap
  <dbl> <dttm>              <dttm>              <lgl>  
1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE   
3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE  
5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE