this is my first question here. So please be merciful with me :-). I am grateful for any suggestions for improvement.
Here is my question:
I have a grouped and sorted tibble and I want to go through row by row and make comparisons with both the previous row and the next row. For this I have to treat the first and the last row differently. Each row has an interval defined by "start" and "end" and I want to find out if these intervals overlap.
The "if" function does not do what what I want it to do. I have tried a lot of things but I can't get a result. Maybe there is a much simpler solution.
Thanks in advance for helping!
library(tidyverse)
library(lubridate)
ID <- c(1, 1, 1, 1, 2, 2, 2, 2)
start <- ymd_hms(c("2022-04-15 10:10:00", "2022-04-15 10:15:00", "2022-04-15 10:35:00", "2022-04-15 10:50:00", "2022-04-15 11:20:00", "2022-04-15 11:35:00", "2022-04-15 11:45:00", "2022-04-15 11:50:00"))
end <- ymd_hms(c("2022-04-15 10:20:00", "2022-04-15 10:30:00", "2022-04-15 10:40:00", "2022-04-15 10:55:00", "2022-04-15 11:30:00", "2022-04-15 11:40:00", "2022-04-15 11:55:00", "2022-04-15 11:55:00"))
data <- tibble(ID, start, end)
data %>%
group_by(ID) %>%
arrange(start, .by_group = TRUE) %>%
mutate(overlap = {
if (row_number()==1) {lead(start) < end
} else if (row_number()==n()) {start < lag(end)
} else {(lead(start) < end) | (start < lag(end))}
})
# A tibble: 8 x 4
# Groups: ID [2]
ID start end overlap
<dbl> <dttm> <dttm> <lgl>
1 1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE
2 1 2022-04-15 10:15:00 2022-04-15 10:30:00 FALSE
3 1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE
4 1 2022-04-15 10:50:00 2022-04-15 10:55:00 NA
5 2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE
6 2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE
7 2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE
8 2 2022-04-15 11:50:00 2022-04-15 11:55:00 NA
My expected output would be:
# A tibble: 8 x 4
# Groups: ID [2]
ID start end overlap
<dbl> <dttm> <dttm> <lgl>
1 1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE
2 1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE
3 1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE
4 1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE
5 2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE
6 2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE
7 2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE
8 2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE
CodePudding user response:
if
requires that its condition argument be exactly length 1, which is rarely the case in mutate
, and especially here since we know we're dealing with more than 1 row at a time. Alternatives often include base::ifelse
or dplyr::if_else
or similar, though they are not quite right here.
I suggest since you want to look at previous and next rows, we should use dplyr's lag
and lead
functions, respectively. Since we need to treat the first/last rows specially, we can use vectorized boolean logic with &
and |
.
For dealing with the first/last rows specially, we have two options:
comparing
row_number()
with1
andn()
for the first and last rows, respectively:data %>% group_by(ID) %>% arrange(start, .by_group = TRUE) %>% mutate( overlap = (row_number() > 1 & start < lag(end)) | (row_number() < n() & end > lead(start)) ) %>% ungroup() # # A tibble: 8 x 4 # ID start end overlap # <dbl> <dttm> <dttm> <lgl> # 1 1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE # 2 1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE # 3 1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE # 4 1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE # 5 2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE # 6 2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE # 7 2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE # 8 2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE
set
default=
inlag
/lead
such that we get the desired intent.data %>% group_by(ID) %>% arrange(start, .by_group = TRUE) %>% mutate( overlap = (start < lag(end, default = start[1] - 1)) | (end > lead(start, default = end[n()] 1)) ) %>% ungroup() # # A tibble: 8 x 4 # ID start end overlap # <dbl> <dttm> <dttm> <lgl> # 1 1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE # 2 1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE # 3 1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE # 4 1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE # 5 2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE # 6 2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE # 7 2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE # 8 2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE
Both methods are using strict-inequality (>
/<
), though you can change to >=
(etc) just as easily depending on your needs.
CodePudding user response:
Here is another strategy using lubridate
s %within%
and interval
functions:
library(dplyr)
library(lubridate)
data %>%
group_by(ID) %>%
mutate(x = interval(start, end),
y = start %within% lag(x),
z = lead(y),
overlap = case_when(y==TRUE | z==TRUE ~TRUE,
TRUE ~ FALSE)) %>%
select(-c(x,y,z))
ID start end overlap
<dbl> <dttm> <dttm> <lgl>
1 1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE
2 1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE
3 1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE
4 1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE
5 2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE
6 2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE
7 2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE
8 2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE