Home > database >  How to compare rows in a grouped tibble with different methods for the first and the last row of the
How to compare rows in a grouped tibble with different methods for the first and the last row of the

Time:04-22

this is my first question here. So please be merciful with me :-). I am grateful for any suggestions for improvement.

Here is my question:

I have a grouped and sorted tibble and I want to go through row by row and make comparisons with both the previous row and the next row. For this I have to treat the first and the last row differently. Each row has an interval defined by "start" and "end" and I want to find out if these intervals overlap.

The "if" function does not do what what I want it to do. I have tried a lot of things but I can't get a result. Maybe there is a much simpler solution.

Thanks in advance for helping!

library(tidyverse)
library(lubridate)
ID <- c(1, 1, 1, 1, 2, 2, 2, 2)
start <- ymd_hms(c("2022-04-15 10:10:00", "2022-04-15 10:15:00", "2022-04-15 10:35:00", "2022-04-15 10:50:00", "2022-04-15 11:20:00", "2022-04-15 11:35:00", "2022-04-15 11:45:00", "2022-04-15 11:50:00"))
end   <- ymd_hms(c("2022-04-15 10:20:00", "2022-04-15 10:30:00", "2022-04-15 10:40:00", "2022-04-15 10:55:00", "2022-04-15 11:30:00", "2022-04-15 11:40:00", "2022-04-15 11:55:00", "2022-04-15 11:55:00"))

data <- tibble(ID, start, end)

data %>% 
  group_by(ID) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(overlap = {
    if (row_number()==1) {lead(start) < end
    } else if (row_number()==n()) {start < lag(end)
    } else {(lead(start) < end) | (start < lag(end))}
  }) 


# A tibble: 8 x 4
# Groups:   ID [2]
     ID start               end                 overlap
  <dbl> <dttm>              <dttm>              <lgl>  
1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 FALSE  
3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 NA     
5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 NA   

My expected output would be:

# A tibble: 8 x 4
# Groups:   ID [2]
ID start               end                 overlap
<dbl> <dttm>              <dttm>              <lgl>  
1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE  
3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE     
5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE   

CodePudding user response:

if requires that its condition argument be exactly length 1, which is rarely the case in mutate, and especially here since we know we're dealing with more than 1 row at a time. Alternatives often include base::ifelse or dplyr::if_else or similar, though they are not quite right here.

I suggest since you want to look at previous and next rows, we should use dplyr's lag and lead functions, respectively. Since we need to treat the first/last rows specially, we can use vectorized boolean logic with & and |.

For dealing with the first/last rows specially, we have two options:

  1. comparing row_number() with 1 and n() for the first and last rows, respectively:

    data %>%
      group_by(ID) %>%
      arrange(start, .by_group = TRUE) %>%
      mutate(
        overlap = (row_number() > 1 & start < lag(end)) |
                    (row_number() < n() & end > lead(start))
      ) %>%
      ungroup()
    # # A tibble: 8 x 4
    #      ID start               end                 overlap
    #   <dbl> <dttm>              <dttm>              <lgl>  
    # 1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
    # 2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE   
    # 3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
    # 4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE  
    # 5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
    # 6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
    # 7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
    # 8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE   
    
  • set default= in lag/lead such that we get the desired intent.

    data %>%
      group_by(ID) %>%
      arrange(start, .by_group = TRUE) %>%
      mutate(
        overlap = (start < lag(end, default = start[1] - 1)) |
                    (end > lead(start, default = end[n()]   1))
      ) %>%
      ungroup()
    # # A tibble: 8 x 4
    #      ID start               end                 overlap
    #   <dbl> <dttm>              <dttm>              <lgl>  
    # 1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
    # 2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE   
    # 3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
    # 4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE  
    # 5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
    # 6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
    # 7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
    # 8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE   
    

Both methods are using strict-inequality (>/<), though you can change to >= (etc) just as easily depending on your needs.

CodePudding user response:

Here is another strategy using lubridates %within% and interval functions:

library(dplyr)
library(lubridate)

data %>% 
  group_by(ID) %>% 
  mutate(x = interval(start, end),
         y = start %within% lag(x),
         z = lead(y),
         overlap = case_when(y==TRUE | z==TRUE ~TRUE,
                              TRUE ~ FALSE)) %>% 
  select(-c(x,y,z))
     ID start               end                 overlap
  <dbl> <dttm>              <dttm>              <lgl>  
1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE   
3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE  
5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE
  • Related